Job Description And Requirements
We are seeking a highly skilled professional to join our team. The successful candidate will be responsible for designing, implementing, and maintaining the observability platform that monitors the health of our production systems. The candidate should have a proven background in software development, system administration, and monitoring tools, as well as a passion for building scalable and reliable systems.
Key Responsibilities
- Design and implement the SRE & Observability platform to monitor the status and health of our production systems, providing a holistic view of the environment.
- Partner with other teams to ensure that monitoring tools are effectively integrated with other systems and processes.
- Ensure that the SRE & Observability platform is scalable, reliable, and can handle large volumes of data.
- Implement SRE best practices for the team and identify KPIs for various systems, organizations, and stakeholders.
- Automate the deployment and configuration of monitoring tools to reduce human error and increase efficiency.
- Develop custom scripts and tools to extend the functionality of the monitoring platform, including, but not limited to, proactive remediation and self-healing.
- Perform root cause analysis on incidents, prepare detailed reports to present to the stakeholders, and develop solutions to prevent similar incidents from occurring in the future.
- Optimize and refine the SDLC and the On-Call and escalation processes.
- Create documentation for all the systems, tools, and processes created by the team, and document the lessons learned from incidents and escalations.
- Provide guidance and mentorship to junior members of the team.
- Drive the design and implementation of major SRE initiatives.
- Act as an SME on SRE & Observability, providing guidance to other teams across the organization.
- Continuously evaluate and implement new tools and technologies to improve the SRE platform.
Qualifications
- Excellent programming skills and experience in Python.
- Experience with data tools such as Elasticsearch is a must. Experience with other technologies such as Prometheus and Grafana is a plus.
- Good knowledge of Linux, networking, and NFS technologies.
- Expertise in cloud computing platforms such as AWS, GCP, or Azure.
- Familiarity with containerization technologies such as Docker and Kubernetes.
- Experience with and knowledge of the SDLC, including source code management tools, CI/CD pipelines, and end-to-end testing.
- Excellent problem-solving skills and attention to detail.
- Ability to work collaboratively with other teams and stakeholders.
- Excellent communication skills, both verbal and written.
- Ability to guide and mentor junior members of the team.
- Experience leading major SRE & Observability initiatives.
- Exceptional knowledge of SRE & Observability best practices and trends.
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
- 10+ years of experience in software development, system administration, or a related field.
If you meet the above qualifications and are passionate about building scalable and reliable systems, we encourage you to apply for this exciting opportunity.
Job Category
Information Technology
Country
India
Job Subcategory
Enterprise Solutions Architect
Hire Type
Employee