Senior DevOps Engineer - Logging Metrics and Monitoring (Open)

athenahealth

5 months ago

Exp: 5-7 Years

Full time

Bengaluru / Bangalore, India

Job Description

We are looking for a Senior Site Reliability Engineer to join our Logging Metrics & Monitoring team within our Cloud Infrastructure Engineering division. Your work will focus on improving the performance and efficiency of our teams by building world-class tools and automated workflows that produce better outcomes for our business.

The Team:

Our services are highly visible and used every day by teams across Athena to develop, monitor, troubleshoot, and scale their web services. The team is responsible for collecting and hosting large volumes of metrics and log data, which we do by running large-scale, distributed, fault-tolerant systems. Our team has a big impact on the productivity of hundreds of developers across Athena.

In a typical week, our engineers work on problems ranging from tuning performance and scaling services to debugging hard problems. They introduce new features and partner with development teams to solve their pressing monitoring and logging issues. We work on an agile, sprint-based schedule with daily standups, and we work in both the private and public cloud.

Job Responsibilities

Automate deployment of Logging and Metrics services using configuration management with Puppet
Work on production incidents and resolve them using your Linux administration and engineering skills
Develop metrics dashboards and alert criteria to monitor and scale services
Serve in a week-long on-call rotation, recurring every several weeks, alongside other team members
Support development teams to refine their logging and metrics collection

Typical Qualifications

5-7 years of prior experience in a production environment, with exposure to AWS, Kubernetes, and on-prem infrastructure and their corresponding troubleshooting methodologies
Hands-on experience with configuration management using Puppet, Chef, or Ansible
Sysadmin and DevOps skills for running services in a Linux environment
Experience operating production services in a Linux environment and serving in on-call rotations
Experience with several of: Bash scripting, Ruby, Python, Perl, C++, Java, Golang
Experience developing deployment templates for services in the public cloud using CloudFormation or Terraform
Ability to stay flexible and adapt to changing environment and business demands

Additional Qualifications

Solid understanding of the Linux operating system
Experience managing large server fleets in production
Experience with performance analysis of services
Experience with relevant technologies: Fluentd, Kafka, Elasticsearch, Graphite, ClickHouse, Terraform, Prometheus, Grafana, Graylog, AWS CloudFormation, Docker containers, Jenkins, load balancers, Git
Experience with tcpdump, Wireshark, or other protocol analyzers