Responsibilities:
Evaluate and ensure the availability of components owned by the team, and identify how to bring all services within SLO (99.XX).
Monitor systems for implemented automation and set SLIs/SLOs together with the respective stakeholders.
Implement the observability platform.
Review all ownership data and ensure it is current and complete.
Review volume and accuracy of bugs assigned to the team and identify opportunities to improve automated triage.
Identify CFBT (Customer Flow Based Testing) eligible flows, develop CFBT tests and train the team on how to write and maintain them.
Lead postmortems for any P1 or higher-severity incidents during the rotation. Train the team on the distributed problem management process.
Operations and Design Consultation for driving high reliability.
Emergency Incident Response with action-oriented postmortem/RCA/Incident debriefs.
Driving continuous improvement through toil reduction and automation.
Application performance and availability analysis.
Technical and Professional Requirements:
Technology/Programming Language
React
Node.js
Java
JavaScript
Python, Angular, TypeScript, HTML5
Event Streaming (Kafka)
Shell scripting / PowerShell
Hosting/Technical Environment
AWS Technologies
Kubernetes
Docker Containers
CI/CD, Jenkins pipelines
Basics of content delivery networks (CDN and caching concepts)
Artifactory / Container Registry
Web Server Gateways
REST API / API Endpoints