We are looking for a Senior Site Reliability Engineer to join our Logging, Metrics & Monitoring team within our Cloud Infrastructure Engineering division. Your work will focus on improving the performance and efficiency of our teams by building world-class tools and automated workflows that produce better outcomes for our business.
The Team:
Our services are highly visible and used every day by teams all across Athena to develop, monitor, troubleshoot and scale their web services. The team is responsible for collecting and hosting large volumes of metrics and log data; we do this by running large-scale, distributed, fault-tolerant systems. Our team has a big impact on the productivity of hundreds of developers all across Athena.
In a typical week, our engineers work on problems ranging from tuning performance and scaling services to debugging hard problems. They introduce new features and partner with development teams to solve their pressing monitoring and logging issues. We work on an agile, sprint-based schedule with daily standups, and we work in both the private and public cloud.
Job Responsibilities
- Automate deployment of Logging and Metrics services using configuration management with Puppet
- Work on production incidents and resolve them using your Linux administration and engineering skills
- Develop metrics dashboards and alert criteria to monitor and scale services
- Serve in a weeklong on-call rotation every several weeks, alongside other team members
- Support development teams to refine their logging and metrics collection
Typical Qualifications
- 5 to 7 years of prior experience in a production environment, with exposure to AWS, Kubernetes, and on-prem infrastructure and their corresponding troubleshooting methodologies
- Hands-on experience with configuration management using Puppet, Chef, or Ansible
- Sysadmin and DevOps skills for running services in a Linux environment
- Experience operating production services in a Linux environment and serving in on-call rotations
- Experience with several of the following: Bash scripting, Ruby, Python, Perl, C++, Java, Golang
- Experience developing deployment templates for services in the public cloud using CloudFormation or Terraform
- Ability to be flexible and adapt to changing environment and business demands
Additional Qualifications
- Solid understanding of the Linux operating system
- Experience managing large server fleets in production
- Experience with performance analysis of services
- Experience with relevant technologies: Fluentd, Kafka, Elasticsearch, Graphite, ClickHouse, Terraform, Prometheus, Grafana, Graylog, AWS CloudFormation, Docker containers, Jenkins, load balancers, Git
- Experience with tcpdump, Wireshark, or other protocol analyzers