Senior Application SRE Engineer
Experience Level: Senior
Location : Bangalore
Notice Period : 0-15 days
Hybrid Model : 3 days a week
Location : Whitefield
Qualification : Bachelor's degree in Computer Science, Information Technology, or related field preferred.
Mandatory Skills : 8+ years of hands-on experience in Site Reliability Engineering (SRE). Extensive experience with AWS services : EC2, EKS, RDS, S3, Lambda, load balancers, IAM, VPC
Configuration Management and IaC Tools : Ansible/Terraform
Scripting : Java/Python/Shell
Hands-on experience with CI/CD pipelines, incident management, and debugging issues.
About Company:
Position Overview: The Senior SRE will be responsible for leading initiatives to improve system reliability, automate operational processes, and ensure the scalability and security of our systems.
The ideal candidate will have a strong background in Linux systems, cloud technologies, containerization, and automation, along with a proactive approach to problem-solving and a commitment to continuous improvement.
Roles and Responsibilities:
Design and implement automation solutions for infrastructure provisioning and configuration management, promoting consistency and reliability across environments.
Maintain CI/CD pipelines using Jenkins, ensuring efficient deployment processes and integrating quality checks.
Manage applications using Docker and Kubernetes, focusing on scalability, efficiency, and security.
Design and maintain secure, scalable, and resilient cloud infrastructure on AWS, including performance tuning and cost optimization.
Conduct comprehensive Linux system administration, including performance tuning, security hardening, and troubleshooting.
Develop and maintain Java and Python code to automate tasks and integrate systems, enhancing operational efficiency.
Collaborate with development and operations teams to implement SRE principles, fostering a culture of reliability and performance.
Monitor system performance, identify bottlenecks, and implement solutions to ensure high availability and optimal user experience.
Lead incident response efforts, minimizing impact and conducting post-mortem analyses to prevent future occurrences.
Mentor junior team members and contribute to the development of best practices and standards within the SRE team.
Must Have Skills :
8+ years of hands-on experience in Site Reliability Engineering (SRE) or a similar role, with a proven track record of managing large-scale, highly available AWS infrastructure.
Extensive experience with AWS services, including but not limited to EC2, EKS, RDS, S3, Lambda, load balancers, IAM, VPC, and CloudFormation. Proficiency in designing, deploying, and maintaining complex, cloud-native architectures on AWS.
Deep understanding of observability principles and best practices. Hands-on experience with monitoring, logging, tracing, and alerting tools such as New Relic, Grafana, Prometheus, and the ELK stack.
Proficiency in Java, Python and Shell scripting for automation tasks and infrastructure management. Experience with configuration management tools like Ansible and Terraform for infrastructure as code (IaC) deployment.
Hands-on experience with CI/CD pipelines using tools like Jenkins for automated build, test, and deployment processes. Experience with containerization platforms such as Kubernetes.
Proven ability to monitor system performance, identify bottlenecks, and optimize resource utilization. Experience in capacity planning, performance tuning, and scaling AWS resources as needed.
Strong troubleshooting skills and experience in incident management and response. Ability to diagnose and resolve complex issues in a timely manner, ensuring minimal impact on service availability and performance.
Excellent interpersonal and communication skills, with the ability to collaborate effectively across cross-functional teams. Experience in fostering a culture of collaboration, knowledge sharing, and continuous improvement.
Ability to lead and mentor junior team members, guiding them in best practices, methodologies, and tools related to SRE and observability.
Good to Have :
AIOps Knowledge: Familiarity with AIOps (Artificial Intelligence for IT Operations) concepts and tools such as machine learning, anomaly detection, and predictive analytics applied to infrastructure monitoring and management. Experience in leveraging AIOps solutions to enhance observability, automate remediation, and optimize performance in cloud environments.
Telecom Domain Experience: Exposure to the telecommunications industry, including knowledge of networking protocols, telecommunications infrastructure, and service delivery platforms. Experience with telecom-specific technologies such as VoIP, LTE, 5G, IMS, and SDN/NFV.
OTT (Over-the-Top) Domain Experience: Understanding of Over-the-Top services and platforms, including streaming media, content delivery networks (CDNs), and video-on-demand (VOD) services. Experience in managing high-volume, high-availability OTT platforms and addressing unique challenges related to content delivery, user experience, and scalability.
Date Posted: 21/06/2024
Job ID: 82575335