Set technical direction for the team and build consensus with partners across the organization, both inside and outside the team.
Collaborate with other team leaders to orchestrate large system changes.
Respond to monitoring alerts according to defined playbooks and procedures.
Enhance playbooks and procedures to reduce on-call toil.
Participate in Post Incident Reviews and discussions.
Provide on-call support and incident management. (On-call support and incident management coverage in India is 12 hours a day, 7 days a week; on-call shifts among team members are flexible.)
Ensure stability and performance of production environments.
Deploy software to production environments.
Build effective working relationships with cross-functional team members.
Suggest process improvements that enhance operational efficiency.
Implement process improvements and operational efficiencies.
Design and develop tools to increase product resilience and improve developer experience.
Leverage your reliability engineering skills to increase system reliability, reduce toil, automate processes, manual tests, and other manual tasks, and reduce cloud costs.
Mentor new engineers on the team.
Qualifications:
You have a B.S. in a related field and 12+ years of related experience (or a Master's and 8+ years of related experience, or a PhD and 5+ years of related experience).
You have 5+ years of experience in Systems Administration/SRE in a cloud environment.
You have 7+ years of experience in incident response and major incident management.
You enjoy solving problems in, and analyzing, global-scale distributed systems.
You are collaborative with strong interpersonal and communication skills, both verbal and written.
You remain calm and collected in stressful situations, such as a major service outage.
You demonstrate attention to detail, follow through, and the ability to prioritize quickly.
You demonstrate good judgment on when to solve problems individually and when to involve others.
You have experience with Cloud Computing Platforms, such as AWS and GCP.
You have experience with Kubernetes and Docker.
You have experience with CI/CD frameworks and Pipeline-as-Code tools such as Jenkins, GitLab, and Artifactory.
You have experience with software automation and scripting using modern languages such as Python.
You have experience leading large-scale technical initiatives across multiple teams.
You have excellent problem-solving, triaging, and debugging skills in large-scale distributed systems.
You have knowledge of microservices fundamentals including Service Mesh using Istio, service discovery, deployment strategies, monitoring, scheduling, and load balancing.
Nice to have:
Experience with Infrastructure-as-Code (Terraform, Helm, YAML).
Experience using Splunk to identify operational issues.
Experience handling SaaS applications for a large customer base.