Job Overview:
The Site Reliability Engineer (SRE) is responsible for maintaining and improving the reliability, availability, and performance of our cloud-based infrastructure and applications. The SRE will work closely with the development, infrastructure management, and security teams to ensure the seamless operation of our services and systems, using automation and best practices to optimize performance and reduce downtime.
Key Responsibilities:
1. Monitoring & Alert Management
- AWS Services Monitoring:
- Continuously monitor the health and performance of AWS services including EC2, Kubernetes (EKS), Amazon Aurora & RDS, Kafka, and other integrated services.
- Utilize tools such as CloudWatch, Prometheus, and Grafana to ensure service uptime and optimal performance.
- Application Monitoring:
- Monitor and track the performance of applications across the stack, ensuring responsiveness and reliability.
- Set up and maintain monitoring dashboards for real-time visibility into application performance.
- Alert Management:
- Configure and manage alerting mechanisms for critical thresholds, errors, and potential downtime.
- Ensure that alerts are actionable and routed to the appropriate on-call engineers for quick incident response.
2. Incident Response & Troubleshooting
- Issue Analysis:
- Perform initial diagnosis of issues related to AWS services, Kubernetes clusters, Amazon Aurora & RDS, Kafka, and the application stack (Java, Golang, Node.js, MySQL, PostgreSQL).
- Investigate issues to identify the root cause and implement fixes or temporary workarounds as needed.
- Log Analysis:
- Collect and analyze logs from AWS CloudWatch, application logs, and Kubernetes logs to identify and resolve issues swiftly.
- Utilize logging tools and services to gain insights into system behavior and preemptively address potential problems.
- Incident Documentation:
- Document all incidents, including the steps taken for troubleshooting and the final resolution, to build a comprehensive knowledge base.
- Share incident reports with relevant stakeholders and contribute to post-mortem analyses.
3. Root Cause Analysis & Escalation
- Root Cause Identification:
- Analyze recurring issues to identify the underlying root causes and suggest preventive measures to avoid future occurrences.
- Collaborate with development teams to address and resolve systemic issues that affect service reliability.
- Escalation Management:
- Escalate unresolved issues to higher-level support or development teams, providing detailed incident reports and context to facilitate resolution.
- Serve as the point of contact for critical incidents, ensuring effective communication across all involved teams.
4. System Maintenance & Updates
- Patch Management:
- Monitor and apply updates, patches, and upgrades to AWS services, Kubernetes clusters, and associated applications.
- Ensure that all systems are up to date and compliant with security and operational standards.
- Backup & Recovery:
- Ensure regular backups of databases and other critical components.
- Periodically verify the integrity of backups and test recovery procedures to ensure data can be restored in the event of an incident.
5. Documentation & Knowledge Management
- Knowledge Base Maintenance:
- Create and maintain a knowledge base that includes troubleshooting guides, standard operating procedures (SOPs), and best practices.
- Ensure documentation is kept up to date and accessible to all relevant teams.
- Reporting:
- Generate regular reports on system performance, incidents, and support team activities.
- Use these reports to provide insights into system reliability and areas for improvement.
6. Collaboration & Communication
- Cross-Functional Collaboration:
- Work closely with the infrastructure, application development, and security teams to ensure smooth operation and continuous improvement of the application environment.
- Participate in cross-functional meetings to provide insights and recommendations for system reliability and performance.
7. Continuous Improvement
- Automation:
- Identify opportunities to automate repetitive tasks such as monitoring, alerting, and incident response to improve efficiency.
- Implement automation scripts and tools to reduce manual intervention and increase system reliability.
- Performance Optimization:
- Suggest and implement optimizations to improve the performance, reliability, and scalability of the application stack.
- Continuously evaluate and refine system architecture to support the growing needs of the business.
Qualifications:
- B.E. in Computers/IT or equivalent.
- Strong troubleshooting skills with experience in log analysis and incident management.
- Excellent communication skills and the ability to work collaboratively across teams.
- Familiarity with scripting and automation tools (e.g., Python, Bash, Terraform).
- Experience with CI/CD pipelines and infrastructure-as-code (IaC) practices.
- Proven experience with AWS cloud services, Kubernetes, and application monitoring tools such as Datadog, the ELK Stack, Prometheus, and Grafana.
- Knowledge of security best practices in cloud environments.
- Experience with relational databases such as MySQL and PostgreSQL, and with NoSQL databases.