- As a member of the 24/7 NOC team, oversee the whole platform, ensuring stability and performance, and monitor production releases based on complexity and risk assessment.
- Make it easy for everyone to create, consume, manage, and scale reliable cloud production services to achieve more
- Work independently or collaboratively on SailPoint SaaS services to design, develop, and improve end-to-end reliability and maintainability for all services
- Coach engineering teams on observability best practices such as setting up well-defined Service Level Objectives (SLOs).
- Lead engineering teams through post-incident reviews to define effective preventive actions
- Collaborate effectively with developers to increase system reliability through short-term embedding programs
- Enable our engineering teams to scale our enterprise operations by providing guidance, best practices, and support as part of an SRE Centre of Excellence
- Manage cross-functional requirements working with Engineering, Product, Services, and other departments
- Develop and implement automation tools and processes to streamline operations and enhance system performance.
- Be a mentor of quality for design reviews, code, test cases, automation, observability, root cause analysis, and self-healing
- Influence architectural design, implementation, consolidation, and simplification for global scale
- Focus on expanding your own skills and on improving your teammates' skills
- Drive operational excellence to deliver frictionless operation, happy on call, and optimal customer experience
Requirements
- 7-10 years of experience in agile software development, infrastructure operations, or application management with SaaS software or cloud service provider organizations.
- 5+ years of experience using NOC or SRE tactics to monitor Engineering production operations supporting a highly available environment for a SaaS software or cloud service provider.
- Experience with cloud infrastructure environments, preferably AWS, and Infrastructure as code.
- Experience with containerization technology and/or Kubernetes
- Experience with metrics, tracing, and logging observability tools such as Prometheus, Grafana, Honeycomb, Jaeger, and Kibana
- Experience with incident management, including conducting incident reviews
- Nice to have: experience with programming languages (Java, Python, Go, etc.).
- Strong understanding of Linux, software development, systems, networking, and cloud concepts
- Experience working with remote teams (US time zones).
- Strong interpersonal and teamwork skills: ability to set and enforce process and influence engineers who are not direct reports.
- Excellent communication skills, including English fluency
Preferred
- Bachelor's degree in Computer Science or other technical discipline