Job Title : Senior Systems Engineer (DevOps & SRE)
Skills : Site Reliability Engineering, DevOps
Location : Hyderabad, Bangalore, Pune, Gurgaon, Chennai
We are looking for a skilled and driven Site Reliability Engineer (SRE) to become a part of our team.
The chosen candidate will play a key part in safeguarding the reliability, scalability, capacity, and performance of our infrastructure and applications. If you have a strong background in software engineering, system administration, containerization, and cloud technologies, you might be our ideal candidate.
Responsibilities :
- Designing, implementing, and managing scalable, reliable, and secure cloud infrastructure using tools such as Terraform, Kubernetes, and Docker
- Building and maintaining monitoring and alerting systems for application and infrastructure health and performance with tools such as Prometheus, Grafana, and the ELK stack
- Leading response efforts for critical incidents, conducting root cause analysis, and implementing long-term fixes to prevent recurrence
- Developing, maintaining, and optimizing continuous integration and continuous deployment (CI/CD) pipelines using tools like Jenkins, GitLab CI, or CircleCI
- Automating routine tasks and enhancing efficiency through scripting and tools, employing languages such as Python, Bash, or Go
- Implementing and managing security best practices for infrastructure and applications, including vulnerability assessments, penetration testing, and adherence to security standards
- Collaborating closely with development, QA, and operations teams to ensure smooth integration and deployment of new features and updates
- Conducting capacity planning and scaling infrastructure to meet present and future demands
- Creating and maintaining thorough documentation for infrastructure, processes, and procedures
Requirements :
- A minimum of 5 years of experience in a DevOps/SRE role
- Solid experience with cloud platforms such as AWS, GCP, or Azure
- Proficiency in infrastructure as code (IaC) tools such as Terraform or CloudFormation
- Significant experience with containerization and orchestration (Docker, Kubernetes)
- In-depth knowledge of CI/CD tools (Jenkins, GitLab CI, CircleCI)
- Proficiency in scripting languages (Python, Bash)
- Experience with monitoring and logging tools (Prometheus, Grafana, the ELK stack)
- Ability to participate in capacity planning and scalability assessments to support business growth and requirements
- Familiarity with SLI, SLO, SLA, and error budget concepts and their implementation
- Willingness to provide on-call support and participate in incident management and response activities as needed
- Solid grasp of networking and security principles
- Exceptional problem-solving skills and the ability to work under pressure
- Strong communication and collaboration skills