Principal Cloud Operations Engineer - Observability

Splunk

a month ago

Exp: 5-12 Years

Full time

Hyderabad / Secunderabad, Telangana, India

Job Description

Join us as we pursue our ground-breaking vision to make machine data accessible, usable, and valuable to everyone. We are a company filled with people who are passionate about our product and seek to deliver the best experience for our customers. At Splunk, we are committed to our work, customers, having fun, and most significantly to each other's success.

The Splunk Observability Cloud provides full-fidelity monitoring and troubleshooting across infrastructure, applications, and user interfaces, in real time and at any scale, to help our customers keep their services reliable, innovate faster, and deliver great customer experiences.

Role

You will help us run one of the largest and most sophisticated cloud-scale, big data, and microservices platforms in the world. You will be responsible for monitoring and resolving issues that affect the availability and performance of critical components of Splunk Observability Cloud. You will use your Kubernetes, cloud, and infrastructure-as-code knowledge to enhance Splunk Observability Cloud infrastructure while reducing its operational costs. You will use your programming and scripting expertise to develop tools and build automation that enhance our product and developer experience. You will use your leadership experience to lead a team of cloud operations engineers and set their technical direction.

Responsibilities:

Set technical direction for the team and get consensus from internal and external partners within the organization.
Collaborate with other team leaders to orchestrate large system changes.
Respond to monitoring alerts according to defined playbooks and procedures.
Enhance playbooks and procedures to reduce on-call toil.
Participate in Post Incident Reviews and discussions.
Provide on-call support and incident management. (On-call and incident management coverage in India is 12 hours a day, 7 days a week; on-call shifts among team members are flexible.)
Ensure stability and performance of production environments.
Deploy software to production environments.
Build effective working relationships with cross-functional team members.
Suggest process improvements and ways to enhance operational efficiency.
Implement process improvements and operational efficiencies.
Design and develop tools to increase product resilience and improve developer experience.
Leverage your reliability engineering skills to increase system reliability, reduce toil, automate processes, manual tests, and other manual tasks, and reduce cloud cost.
Mentor new engineers on the team.

Qualifications:

You have a B.S. in a related field and 12+ years of related experience (or a Master's and 8+ years of related experience, or a PhD and 5+ years of experience).
You have 5+ years of experience in Systems Administration/SRE in a cloud environment.
You have 7+ years in incident response and major incident management.
You enjoy problem solving and analyzing global scale distributed systems.
You are collaborative with strong interpersonal and communication skills, both verbal and written.
You remain calm and collected in stressful situations, such as a major service outage.
You demonstrate attention to detail, follow through, and the ability to prioritize quickly.
You demonstrate good judgment on when to solve problems individually and when to involve others.
You have experience with Cloud Computing Platforms, such as AWS and GCP.
You have experience with Kubernetes and Docker.
You have experience with CI/CD frameworks and Pipeline-as-Code tools such as Jenkins, GitLab, Artifactory, etc.
You have experience with software automation and scripting using modern languages such as Python.
You have experience leading large-scale technical initiatives across multiple teams.
You have excellent problem-solving, triaging, and debugging skills in large-scale distributed systems.
You have knowledge of microservices fundamentals including Service Mesh using Istio, service discovery, deployment strategies, monitoring, scheduling, and load balancing.

Nice to have:

Experience in infrastructure as code (Terraform, Helm, YAML).
Experience using Splunk to identify operational issues.
Experience handling SaaS applications for a large customer base.

We are an equal-opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.

All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, or any other applicable legally protected characteristics in the location in which the candidate is applying.

Note:

Base Pay Range

India

Base Pay: INR 4,800,000.00 - 6,600,000.00 per year

Splunk provides flexibility and choice in the working arrangement for most roles, including remote and/or in-office roles. We have a market-based pay structure which varies by location. Please note that the base pay range is a guideline; for candidates who receive an offer, the base pay will vary based on factors such as work location, as set out above, as well as the knowledge, skills, and experience of the candidate. In addition to base pay, this role is eligible for incentive compensation and may be eligible for equity or long-term cash awards.

Benefits are an important part of Splunk's Total Rewards package. This role is eligible for a comprehensive, competitive benefits package which may include healthcare and retirement plans, paid time off, wellbeing expense reimbursement, and much more! Learn more about our comprehensive benefits and wellbeing offering at https://splunkbenefits.com.