Systems and Reliability Engineer

Cedana

5 months ago

Exp: 0-2 Years

India

Job Description

What You Will Do

You'll span the stack from the kernel to the system and hypervisor to exploit our unique insights in compute. You'll help increase the reliability of our system all the way from the kernel to our managed Kubernetes cloud offering.
You'll interact with customers on a regular basis, triaging issues and developing working relationships with points of contact across multiple organizations.
You'll help build out internal tooling to measure reliability and success, along with alerting infrastructure that helps us identify problems quickly, from the kernel all the way to Kubernetes.

What We Are Looking For

Someone who doesn't fit in with traditional full-stack developers because you are obsessed with understanding how every layer of compute works.
Interest in working in multiple domains and wearing multiple hats.
Ability and experience in, or a strong interest in learning, the compute stack: hardware, device drivers, OS kernel and system, k8s, and distributed systems. You don't need to know all of these coming in, but you should be curious and have the intellectual bandwidth to learn them quickly.
Track record of solving challenging problems in systems programming (e.g., compilers, distributed systems, embedded systems, highly available systems at scale, etc.).
Creative problem solving, multidisciplinary experience
Demonstrated ability to collaborate with others

Required Qualifications

Strong understanding of Kubernetes (Controllers, Operators, CRDs)
Strong understanding of Linux and UNIX fundamentals (standard libraries, services, networking, kernel/user-space interaction)
Strong system-level programming experience (e.g., C/Rust/Go)
Experience or familiarity with low-level systems programming concepts.
Experience writing Kubernetes controllers or services from scratch.

Preferred Qualifications

Experience with different container runtimes (runc, docker, podman, etc.) and container orchestration.
Contributions to open source projects, such as those under the Cloud Native Computing Foundation (CNCF), Apache Software Foundation (ASF), or Open Source Security Foundation (OpenSSF), are a huge plus!
Experience with Kubernetes system administration (using Helm, Terraform, etc.)
Experience scaling infrastructure out as part of a platform team.
Experience productionizing and managing production-level Kubernetes clusters.
Familiarity with being on call (our founders have experience being on call and know how rough it is!)

Nice to Have

Experience supporting data teams with data processing infrastructure (BigQuery, OpenTelemetry, etc.) and implementing observability and monitoring best practices.
Experience with high-performance computing (think Slurm).
Experience deploying and scaling ML workloads (training or inference) in production.
Familiarity with problems associated with deploying large-scale ML models or batch/scientific compute.

Working at Cedana

We're building a unique and powerful system that transforms compute orchestration. Our team is pushing the boundaries of compute performance across multiple layers of the stack.

On top of building a transformative stack, our engineers dig into the Linux kernel, spend time bushwhacking around Kubernetes and runc source code, investigate novel virtualization techniques, and pore over open source GPU drivers. By moving fast and shipping quickly, they also get an opportunity to improve performance in real-world, deployed production systems on behalf of our customers, which include leading companies in Computing & GPU Infrastructure, DevTools, and LLM/Foundation Models.

Our company is led by founders with extensive experience in building and scaling successful startups. Our investors include a co-founder of OpenAI, a former Chief Architect of Slack, founding members of Facebook AI, and leading VC firms.