Atlasrtx

Senior Site Reliability Engineer

Job Description

  • Run the production environment by monitoring availability and taking a holistic view of system health
  • Build software and systems to manage platform infrastructure and applications
  • Improve reliability, quality, and time-to-market of our suite of software solutions
  • Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating to continually improve
  • Provide primary operational support and engineering for multiple large distributed software applications
  • Gather and analyze metrics from both operating systems and applications to assist in performance tuning and fault finding
  • Partner with development teams to improve services through rigorous testing and release procedures
  • Participate in system design consulting, platform management, and capacity planning
  • Create sustainable systems and services through automation and uplifts
  • Balance feature development speed and reliability with well-defined service level objectives
Have you got what it takes?
  • Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience).
  • 5-7 years of working experience in a similar role, with a focus on systems engineering, automation, and reliability.
  • Proficiency in at least one programming language (e.g., Python, Go, Java, C#) and experience with scripting languages (e.g., Bash, PowerShell).
  • Deep understanding of cloud computing platforms (e.g., AWS) and of the workings and reliability constraints of their prominent services (e.g., EC2, ECS, Lambda, DynamoDB).
  • Experience with infrastructure-as-code tools such as CloudFormation or Terraform.
  • Deep understanding of CI/CD concepts and experience with CI/CD tools such as Jenkins, GitLab CI/CD, or CircleCI.
  • Strong knowledge of containerization technologies (e.g., Docker, Kubernetes) and microservices architecture.
  • Experience with monitoring and observability tools (e.g., Prometheus, Grafana, ELK stack, CloudWatch).
  • Excellent problem-solving skills and the ability to troubleshoot complex issues in distributed systems.
  • Experience with incident management and blameless postmortems, including driving incident response, resolution, and communication during outages and other critical incidents in a cross-functional team setting.
  • Hands-on experience working with large Kubernetes clusters. Certification is an added plus.
  • Working experience with the Grafana observability suite (Loki, Mimir, Tempo).
  • Administration and/or development experience with standard monitoring and automation tools such as Splunk, Datadog, PagerDuty, and Rundeck.
  • Familiarity with configuration management tools like Ansible, Puppet, or Chef.
  • Certifications such as AWS Certified DevOps Engineer, Google Cloud Professional DevOps Engineer, or equivalent.
You will have an advantage if you also have:
  • Experience with or knowledge of other cloud platforms

More Info

Industry: Other

Function: Technology

Job Type: Permanent Job

Date Posted: 26/06/2024

Job ID: 83070041

Last Updated: 14-07-2024 09:26:28 AM