Cedana

Systems and Reliability Engineer

Job Description

What You Will Do

  • You'll span the stack across the kernel, system, and hypervisor to exploit our unique insights in compute. You'll help increase the reliability of our system all the way from the kernel to our managed Kubernetes cloud offering.
  • You'll interact with customers on a regular basis, triaging issues and developing working relationships with points of contact across multiple organizations.
  • You'll help build internal tooling to measure reliability and success, along with alerting infrastructure that helps us identify problems quickly, from the kernel all the way to Kubernetes.

What We Are Looking For

  • Someone who doesn't fit in with traditional full-stack developers because they are obsessed with understanding how every layer of compute works.
  • Interest in working in multiple domains and wearing multiple hats.
  • Ability and experience in, or a strong interest in learning, the compute stack: hardware, device drivers, OS kernel and system, k8s, and distributed systems. You don't need to know all of these coming in, but you are curious and have the intellectual bandwidth to learn them quickly.
  • Track record of solving challenging problems in systems programming (e.g., compilers, distributed systems, embedded systems, highly available systems at scale, etc.)
  • Creative problem solving, multidisciplinary experience
  • Demonstrated ability to collaborate with others

Required Qualifications

  • Strong understanding of Kubernetes (Controllers, Operators, CRDs)
  • Strong understanding of Linux and UNIX fundamentals (standard libraries, services, networking, kernel/user-space interaction)
  • Strong system-level programming experience (e.g., C/Rust/Go)
  • Experience or familiarity with low-level systems programming concepts.
  • Experience writing Kubernetes controllers or services from scratch.

Preferred Qualifications

  • Experience with different container runtimes (runc, docker, podman, etc.) and container orchestration.
  • Contributions to open source projects, such as participation in the Cloud Native Computing Foundation (CNCF), Apache Software Foundation (ASF), or Open Source Security Foundation (OpenSSF), are a huge plus!
  • Experience with Kubernetes system administration (using Helm, Terraform, etc.)
  • Experience scaling infrastructure out as part of a platform team.
  • Experience productionizing and managing production-level Kubernetes clusters.
  • Familiarity with being on call (our founders have experience being on call, and know how rough it is!)

Nice to Have

  • Experience supporting data teams with data processing infrastructure (BigQuery, OpenTelemetry, etc.) and implementing observability and monitoring best practices.
  • Experience with high performance computing (think SLURM).
  • Experience deploying and scaling ML workloads (training or inference) in production.
  • Familiarity with problems associated with deploying large-scale ML models or batch/scientific compute.

Working at Cedana

We're building a unique and powerful system that transforms compute orchestration. Our team is pushing the boundaries of compute performance across multiple layers of the stack.

On top of building a transformative stack, our engineers dig into the Linux kernel, spend time bushwhacking around Kubernetes and runc source code, investigate novel virtualization techniques, and pore over open source GPU drivers. By moving fast and shipping quickly, they also get the opportunity to improve performance in real-world production systems on behalf of our customers, which include leading companies in Computing & GPU Infrastructure, DevTools, and LLM/Foundation Models.

Our company is led by founders with extensive experience in building and scaling successful startups. Our investors include a co-founder of OpenAI, a former Chief Architect of Slack, founding members of Facebook AI, and leading VC firms.

More Info

Industry: Other

Function: Technology

Job Type: Permanent Job

Date Posted: 27/06/2024

Job ID: 83246349

Last Updated: 29-06-2024 06:05:16 AM