Site Reliability Engineer (SRE) (Azure Cloud) :T005

Exp: 2-5 Years

Job Description

Reliability and Stability:

Own and operate our application stack and the Azure infrastructure that orchestrates and manages our hosted customer instances of Metabase.
Debug runtime issues across the different levels of our application stack and hosting stack.
Continuously improve our automated deployments and testing.
Carry out all activities needed to support the application and the cloud infrastructure our platform runs on, including but not limited to: monitoring the application, investigating and resolving alerts and outages, configuring the monitoring and alerting tooling, investigating issues reported by external and internal clients, and carrying out business-as-usual (BAU) maintenance.
Deploy application and infrastructure upgrades and enhancements to UAT and Production environments.
Provision new UAT and Production environments and manage existing ones.
Coordinate and carry out Security Incident Management related to our application and infrastructure in accordance with our Security Incident Management processes.
Maintain our SOC2 compliance and security posture.
Where necessary, be prepared to work in shifts (early/late, weekends) to provide 24x7 Support for our platform.

Service-Level Objectives (SLOs), Automation and Tooling:

Develop and build our internal tooling and automation to manage the lifecycle of a hosted Metabase installation, from purchase to deployment, zero-downtime upgrades, and general operational health.

Capacity Planning:

Continually seek and implement improvements in environment cost control, automation, estate rationalization, and processes.

Collaboration and Performance Optimization:

Collaborate with core application developers on changes to improve our application metrics, deployment speeds, and CI integration.

Requirements

Must Haves

2-5 years of experience building and operating production infrastructure on public cloud, ideally Microsoft Azure.
Experience supporting business-critical systems (incident, change, and problem management processes) in a large-scale operations team.
Broad knowledge of IT operations concepts, architecture, and information security (ITIL/Security).
Hands-on commercial experience of supporting cloud-based SaaS systems (Microsoft Azure).
Experience setting up AWS EC2, SNS, and database instances, securing VPCs, and implementing Security Groups, Identity and Access Management, backup, restore, and disaster recovery, along with the equivalent technologies on Azure.
Hands-on commercial experience in both Linux and Windows systems administration and automation scripting.
Hands-on commercial experience managing Kubernetes clusters.
Good understanding of DevOps principles (CI/CD, release automation).
Knowledge of clusters, storage, backups, data export/import, monitoring tools, and disaster recovery.
Hands-on commercial experience using a wiki (ideally Confluence) to document the processes that make up our Knowledge Base.
Experience with TCP/IP networking and fundamental network services such as DNS, DHCP, SMTP, NTP, telnet, and SSH.

Nice to Haves

Experience with AWS.
Ability to read/understand & debug Python and Java.
Working experience with MongoDB, MariaDB, and PostgreSQL.
Working experience with application monitoring tools.
Practical use of scripting (e.g. Python, cron jobs) to automate repetitive tasks.
ITIL Foundation qualification.