- System Reliability: Ensuring the reliability of software systems by designing, implementing, and maintaining scalable and reliable infrastructure.
- Automation: Developing automation tools and scripts to streamline operational tasks, reduce manual intervention, and improve overall system efficiency.
- Incident Response and Resolution: Monitoring system performance and responding to incidents promptly to minimize downtime and ensure high availability. Leading incident management during incidents.
- Capacity Planning: Analyzing system usage patterns and forecasting future capacity needs to ensure that the infrastructure can handle current and future demands.
- Responsible for driving MTTR in line with the incident SLA.
- Performance Optimization: Identifying and addressing performance bottlenecks in software systems through optimization and tuning.
- Responsible for 100% alert coverage across application, infrastructure, security, flows, etc.
- Infrastructure as Code (IaC): Implementing infrastructure as code practices, using tools like Terraform or Ansible, to define and manage infrastructure in a version-controlled and automated manner.
- Own the availability of one or more services.
- Monitoring and Logging: Implementing and maintaining monitoring and logging solutions to gain insights into system behavior, troubleshoot issues, and proactively address potential problems.
- Collaborate with product managers, designers, and developers in self-sufficient teams to implement and follow SRE best practices.
- Security: Collaborating with security teams to implement and maintain security best practices in infrastructure and applications.
- Provide technical guidance to the team on managing the availability and performance of mission-critical services, building automation to prevent problem recurrence, and building automated responses for non-exceptional service conditions.
- Disaster Recovery Planning: Developing and maintaining disaster recovery plans to ensure that systems can quickly recover from major outages or failures.
- Continuous Improvement: Continuously analyzing system performance, reliability, and incidents to identify areas for improvement and implementing changes to enhance overall system resilience.
Skills
- Programming Languages: Proficiency in one or more programming or scripting languages, such as Python, Go, or shell scripting (e.g., Bash).
- Automation and Scripting: Strong automation skills using tools like Ansible, Puppet, Chef, or custom scripts. Knowledge of Infrastructure as Code (IaC) tools like Terraform.
- Containerization and Orchestration: Experience with containerization technologies like Docker and container orchestration platforms like Kubernetes.
- Cloud Computing: Proficiency in at least one cloud platform, such as AWS, Azure, or Google Cloud Platform, and knowledge of managing infrastructure in the cloud.
- Monitoring and Logging: Familiarity with monitoring tools (e.g., Prometheus, Grafana, ELK stack) and logging frameworks to track system performance and troubleshoot issues.
- Networking: Understanding of networking concepts and protocols, along with network troubleshooting skills.
- Security: Knowledge of security best practices, including encryption, access controls, and vulnerability management.
- Continuous Integration/Continuous Deployment (CI/CD): Understanding of, and experience implementing, CI/CD pipelines for automated testing and deployment.
- Load Balancing: Experience with load balancing.
- Incident Management: Experience in incident response, troubleshooting, and resolution.
- Version Control: Proficient use of version control systems like Git.
Experience and Qualifications
- 9+ years of experience in site reliability engineering.
- B.Tech/M.Tech in computer science, information technology, or a related field.
- Certifications from cloud service providers, such as AWS Certified DevOps Engineer, Google Cloud Professional DevOps Engineer, or a Microsoft Certified credential, are a plus.