Responsibilities:
- Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement
- Support capacity planning, availability, scalability, security and latency considerations for new infrastructure and service provisioning as appropriate
- Partner with other SREs to share best practices and lessons learned from across the organization
- Scale and optimize existing infrastructure and services sustainably through mechanisms such as automation, and evolve them by improving reliability and efficiency
- Manage end-to-end availability and performance of mission-critical services and build automation to prevent problem recurrence
- Maintain infrastructure (infrastructure as code) and services by measuring and monitoring system metrics to proactively identify operational efficiencies, potential outages, and security threats in Development, UAT, Staging, and Production environments
- Practice sustainable incident response and blameless postmortems
- Build infrastructure and drive projects that break things with the aim of improving the robustness of production systems
- Use the core Site Reliability Engineering principles of change management, monitoring, emergency response, capacity planning, and production readiness reviews to run the platform
- Step back to observe patterns and develop innovative tools and automation to eliminate or minimize menial tasks. Use what you learn to drive operational best practices
- Develop and maintain solution and operational documentation and designs for all infrastructure and services within the scope of SRE
- Preserve operational visibility and response capabilities by fixing and improving our dashboards, alerts, and automation
- Maintain operational uptime and reliability by participating in triage and issue support calls for mission-critical systems
- Partner with business and technical product owners to set SLOs / SLIs / error budgets to manage reliability of infrastructure and applications
Required Qualifications:
- Degree in Software Engineering, Computer Science, or an equivalent STEM field (desirable), or commensurate experience
- 6+ years of total software engineering experience using Kubernetes, AWS Native components/Azure/GCP, CloudWatch, Dynatrace
- 3+ years of experience supporting a production system on a DevOps team
- 2+ years of experience architecting on AWS Cloud
- Strong experience setting SLOs / SLIs / error budgets and managing reliability for infrastructure and applications using Kubernetes, AWS Native components, CloudWatch, Dynatrace
- Ability to mentor a team of less experienced full-stack developers who are learning the AWS environment
- Proficient in one or more of the following scripting languages or automation tools: JavaScript, Node.js, Python, Maven, Ansible, Bash, etc.
- Experience handling large numbers of diverse systems with configuration management and automation tools such as Puppet, Chef, Ansible, and GitLab CI
- Understanding of standard networking protocols and components such as HTTP, DNS, ECMP, TCP/IP, ICMP, the OSI model, subnetting, and load-balancing strategies
- Experience in Serverless Application Framework
- Experience in containerized workloads and management platforms such as Docker or Kubernetes
- Familiarity with distributed systems, including microservices, is a plus
- Experience with infrastructure automation tools such as CloudFormation and Terraform
- Understanding of CI/CD processes and experience with deployment automation tools such as CodePipeline, CodeDeploy, Jenkins, and Bamboo
- Strong debugging, troubleshooting, and problem-solving skills
- Effective communication, collaboration, and negotiation skills, with the ability to interface with various business units and third parties
- Ability to listen to customers and colleagues, convey ideas effectively, and prepare written documentation
- Experience liaising with developers, operations staff and third-party resources
- Experience with API integration projects
- Proven history of toil elimination by leveraging automation
- Strong background using tools like PagerDuty for managing incidents
- Strong experience with monitoring and alerting systems such as Prometheus, Grafana, and Datadog
Preferred Qualifications:
- AWS Certified DevOps Engineer or equivalent cloud professional SRE certifications.
- A mindset focused on automation, measurement and efficiency.