Own and operate our application stack and Azure infrastructure to orchestrate and manage our hosted customer instances of Metabase.
Debug runtime issues across the different levels of our application stack and hosting stack.
Continuously improve our automated deployments and testing.
Support the application and cloud infrastructure that our platform runs on, including but not limited to: monitoring the application, investigating and resolving alerts and outages, configuring the monitoring and alerting tooling, investigating issues reported by external and internal clients, and carrying out BAU maintenance activities.
Deploy application and infrastructure upgrades and enhancements to UAT and Production environments.
Provision new and manage existing UAT and Production environments.
Coordinate and carry out Security Incident Management related to our application and infrastructure in accordance with our Security Incident Management processes.
Maintain our SOC2 compliance and security posture.
Where necessary, be prepared to work in shifts (early/late, weekends) to provide 24x7 Support for our platform.
Service-Level Objectives (SLOs):
Develop and build our internal tooling and automation to manage the lifecycle of a hosted Metabase installation, from purchase to deployment, zero-downtime upgrades, and general operational health.
Automation and Tooling:
Automate EKS and AKS cluster provisioning.
Extend our CRDs and Operators.
Improve the RDS sharding strategy for our multi-tenant platform.
Unify and improve our CI/CD platforms.
Capacity Planning:
Continually seek and implement improvements in environment cost control, automation, estate rationalization, and processes.
Collaboration:
Collaborate with core application developers on changes to improve our application metrics, deployment speeds, and CI integration.
Performance Optimization:
Collaborate with core application developers on changes to improve our application metrics, deployment speeds, and CI integration.
Requirements
Must Haves
2-5 years of experience building and operating production infrastructure, ideally on a public cloud, in particular Microsoft Azure.
Experience supporting business-critical systems (Incident, Change, and Problem management processes) in a large-scale operations team.
Broad knowledge of IT Operations concepts, architecture, and information security (ITIL/Security).
Hands-on commercial experience of supporting cloud-based SaaS systems (Microsoft Azure).
Experience setting up EC2, SNS, and database instances; securing VPCs; implementing Security Groups, Identity and Access Management, backups, restore, and disaster recovery; and working with the equivalent technologies on Azure.
Hands-on commercial experience in both Linux and Windows systems administration and automation scripting.