Site Reliability Engineer

cloudxchange.io

Navi Mumbai, Mumbai, India

Job Description

Job Overview:

The Site Reliability Engineer (SRE) is responsible for maintaining and improving the reliability, availability, and performance of our cloud-based infrastructure and applications. The SRE will work closely with the development, infrastructure management, and security teams to keep our services and systems running smoothly, using automation and best practices to optimize performance and reduce downtime.

Key Responsibilities:

1. Monitoring & Alert Management

AWS Services Monitoring:
Continuously monitor the health and performance of AWS services including EC2, Kubernetes (EKS), Amazon Aurora & RDS, Kafka, and other integrated services.
Utilize tools such as CloudWatch, Prometheus, and Grafana to ensure service uptime and optimal performance.
Application Monitoring:
Monitor and track the performance of applications across the stack, ensuring responsiveness and reliability.
Set up and maintain monitoring dashboards for real-time visibility into application performance.
Alert Management:
Configure and manage alerting mechanisms for critical thresholds, errors, and potential downtime.
Ensure that alerts are actionable and routed to the appropriate on-call engineers for quick incident response.

2. Incident Response & Troubleshooting

Issue Analysis:
Perform initial diagnosis of issues related to AWS services, Kubernetes clusters, Amazon Aurora & RDS, Kafka, and the application stack (Java, Golang, Node.js, MySQL, PostgreSQL).
Investigate issues to identify the root cause and implement fixes or temporary workarounds as needed.
Log Analysis:
Collect and analyze logs from AWS CloudWatch, application logs, and Kubernetes logs to identify and resolve issues swiftly.
Utilize logging tools and services to gain insights into system behavior and preemptively address potential problems.
Incident Documentation:
Document all incidents, including the steps taken for troubleshooting and the final resolution, to build a comprehensive knowledge base.
Share incident reports with relevant stakeholders and contribute to post-mortem analyses.

3. Root Cause Analysis & Escalation

Root Cause Identification:
Analyze recurring issues to identify the underlying root causes and suggest preventive measures to avoid future occurrences.
Collaborate with development teams to address and resolve systemic issues that affect service reliability.
Escalation Management:
Escalate unresolved issues to higher-level support or development teams, providing detailed incident reports and context to facilitate resolution.
Serve as the point of contact for critical incidents, ensuring effective communication across all involved teams.

4. System Maintenance & Updates

Patch Management:
Monitor and apply updates, patches, and upgrades to AWS services, Kubernetes clusters, and associated applications.
Ensure that all systems are up-to-date and compliant with security and operational standards.
Backup & Recovery:
Ensure regular backups of databases and other critical components.
Periodically verify the integrity of backups and test recovery procedures to ensure data can be restored in the event of an incident.

5. Documentation & Knowledge Management

Knowledge Base Maintenance:
Create and maintain a knowledge base that includes troubleshooting guides, standard operating procedures (SOPs), and best practices.
Ensure documentation is kept up-to-date and accessible to all relevant teams.
Reporting:
Generate regular reports on system performance, incidents, and support team activities.
Use these reports to provide insights into system reliability and areas for improvement.

6. Collaboration & Communication

Cross-Functional Collaboration:
Work closely with the infrastructure, application development, and security teams to ensure smooth operation and continuous improvement of the application environment.
Participate in cross-functional meetings to provide insights and recommendations for system reliability and performance.

7. Continuous Improvement

Automation:
Identify opportunities to automate repetitive tasks such as monitoring, alerting, and incident response to improve efficiency.
Implement automation scripts and tools to reduce manual intervention and increase system reliability.
Performance Optimization:
Suggest and implement optimizations to improve the performance, reliability, and scalability of the application stack.
Continuously evaluate and refine system architecture to support the growing needs of the business.

Qualifications:

B.E. in Computers/IT or equivalent.
Strong troubleshooting skills with experience in log analysis and incident management.
Excellent communication skills and the ability to work collaboratively across teams.
Familiarity with scripting and automation tools (e.g., Python, Bash, Terraform).
Experience with CI/CD pipelines and infrastructure-as-code (IaC) practices.
Proven experience with AWS cloud services, Kubernetes, and application monitoring tools such as Datadog, the ELK Stack, Prometheus, and Grafana.
Knowledge of security best practices in cloud environments.
Experience with relational databases such as MySQL and PostgreSQL, and with NoSQL databases.

More Info

Industry: Other

Job Type: Permanent Job

Date Posted: 09/10/2024

Job ID: 95639737

About Company

cloudxchange.io

Job Source: www.linkedin.com

Last Updated: 19-11-2024 08:05:04 PM
