Job Description
Site Reliability Engineer (SRE)

Job Overview

As a Site Reliability Engineer (SRE) / DevOps Engineer, you will be responsible for the availability, automation, performance, efficiency, scaling, and monitoring of our applications, and for emergency response to any incidents or issues that affect them. You will use your deep understanding of platforms, architecture, people, systems, and processes to establish and continuously improve SLIs and SLOs for uptime, performance, deployment, monitoring, and troubleshooting. You are interested in setting direction and leading the day-to-day processes that shape our vision for reliability.

Responsibilities and Duties

- Understand and analyze business requirements, then design, develop, and implement solutions.
- Maintain and support the product and data systems: proactively monitor events, investigate issues, analyze solutions, and drive problems through to resolution.
- Troubleshoot large-scale distributed systems using a systematic problem-solving approach.
- Define requirements and develop tools and reporting as needed by projects and operations.
- Participate in a 24x7 on-call rotation for after-hours emergencies.
- Use operational tools and monitoring platforms to gain in-depth knowledge and understanding of system availability, performance, and capacity, and to monitor them on an ongoing basis.
- Implement an alerting strategy that makes alerts actionable and unique.
- Run incident post-mortems and follow through to ensure issues are resolved to satisfaction.
- Drive continuous improvement and innovation within the team.
- Bring a sense of ownership, initiative, and drive.

Mandatory Skills and Qualifications

- Bachelor's degree in Computer Science or a related technical field involving software or systems engineering, or equivalent practical experience.
- Hands-on experience with Linux / Windows servers.
- Hands-on experience developing scripts (Shell Scripting / Python).
- Hands-on experience with Docker / Kubernetes.
- Experience working with cloud platforms (Azure / GCP / AWS).
- Experience supporting applications in production environments.
- Excellent communication skills.

Good to have skills

- Experience with Git, Azure DevOps, and Ansible.
- Knowledge of web and application servers, databases (SQL/NoSQL), storage, and networking.
- Knowledge of monitoring tools and strategy.
- Knowledge of ITIL processes: service request, incident, and problem management.