Co-own critical production service designs to ensure high reliability is achievable and measurable
Drive reliability and observability improvements in the services within the engineering verticals
Use monitoring and telemetry data to help teams make informed decisions about where reliability challenges may exist, and help design and build solutions to address them
Build and improve internal tools and automation software to make maintaining production services easier and safer
Lead reliability-focused practices such as Failure Analysis, Load and Capacity Planning, Service Reviews, Architecture Designs, Incident Postmortems, and others
Develop Infrastructure as Code
Build SRE dashboards from SLIs to measure SLO adherence
Define, from design through implementation details, the necessary auto-healing and fault-tolerant systems
Serve as the point of contact for production application issues, working closely with engineering leadership
Requirements:
6+ years of experience in an Infrastructure, SRE, DevOps, or CloudOps role
Experience programming in one or more of the following: C#, Java, Python, .NET, Node.js, Go
Experience with Terraform, Ansible, or a similar infrastructure-as-code tool
Experience with Azure cloud technology
Experience with high-performance cloud microservices and event-driven architectures
Experience with Kubernetes administration is an added advantage
Understanding of information security concepts and terminology
Distributed monitoring experience: logging, metrics, tracing, etc.
Strong knowledge of software development methodologies and a passion for creating high-standard tool sets for infrastructure as code
Ability to analyze problems quickly and find suitable solutions based on available resources
A proactive and open-minded individual with a clear client focus and a structured approach