Job Summary
We are seeking a Senior Software Engineer to join our Site Reliability Engineering (SRE) team, with a focus on Observability and Reliability. As a key member of the team, you will play a critical role in ensuring the performance, stability, and availability of our applications and systems, with particular emphasis on Application Performance Management, observability, and reliability of the platform.
The Senior Software Engineer will be responsible for the design, implementation, and maintenance of our observability and reliability infrastructure, with a primary focus on the ELK stack (Elasticsearch, Logstash, and Kibana). The role involves configuring, fine-tuning, and automating alerts, integrating Elastic solutions with other tools and applications, generating reports, and optimizing the observability and monitoring systems.
Key Duties & Responsibilities
1. Collaborate with cross-functional teams to define and implement observability and reliability standards and best practices.
2. Design, deploy, and maintain the ELK stack for log aggregation, monitoring, and analysis.
3. Develop and maintain alerts and monitoring systems, ensuring early detection of issues and rapid incident response.
4. Create, customize, and maintain dashboards in Kibana for different stakeholders.
5. Collaborate with software development teams to identify performance bottlenecks and recommend solutions.
6. Automate manual tasks and workflows to streamline observability and reliability processes.
7. Conduct regular system and application performance analysis and optimization, covering automation and tooling, capacity planning, security practices and compliance adherence, documentation and knowledge sharing, and disaster recovery and backup.
8. Generate and deliver detailed reports on system performance and reliability metrics.
9. Stay up to date with industry trends and best practices in observability and reliability engineering.
Qualifications/Skills/Abilities
Minimum Requirements
Formal Education
Bachelor's degree in Computer Science, Information Technology, or a related field (or equivalent experience).
Experience (type & duration)
5+ years of experience in Site Reliability Engineering, observability and reliability, or DevOps.
Skills
- Proficiency in configuring and maintaining the ELK stack (Elasticsearch, Logstash, Kibana) is mandatory.
- Strong scripting and automation skills, with expertise in Python, Bash, or similar languages.
- Experience designing data structures with Elasticsearch indices.
- Experience writing data ingestion pipelines with Logstash.
- Experience with infrastructure as code (IaC) and configuration management tools (e.g., Ansible, Terraform).
- Hands-on experience with cloud platforms (AWS preferred) and containerization technologies (e.g., Docker, Kubernetes).
- Telecom domain expertise is good to have but not mandatory.
- Strong problem-solving skills and the ability to troubleshoot complex issues in a production environment.
- Excellent communication and collaboration skills.
Accreditation/Certifications/Licenses
Relevant certifications (e.g., Elastic Certified Engineer) are a plus.