Traydstream

Site Reliability Engineer (SRE) (Azure Cloud) :T005

Job Description

Reliability and Stability:

  • Own and operate the Azure infrastructure behind our application stack, which orchestrates and manages our hosted customer instances of Metabase.
  • Debug runtime issues across the different levels of our application stack and hosting stack.
  • Continuously improve our automated deployments and testing.
  • Carry out all activities needed to support the application and the cloud infrastructure our platform runs on, including but not limited to: monitoring the application, investigating and resolving alerts and outages, configuring the monitoring and alerting tooling, investigating issues reported by external and internal clients, and carrying out BAU maintenance.
  • Deploy application and infrastructure upgrades and enhancements to UAT and Production environments.
  • Provision new UAT and Production environments and manage existing ones.
  • Coordinate and carry out Security Incident Management related to our application and infrastructure in accordance with our Security Incident Management processes.
  • Maintain our SOC2 compliance and security posture.
  • Where necessary, be prepared to work in shifts (early/late, weekends) to provide 24x7 Support for our platform.

Service-Level Objectives (SLOs):

  • Develop and build our internal tooling and automation to manage the lifecycle of a hosted Metabase installation, from purchase to deployment, zero-downtime upgrades, and general operational health.

Automation and Tooling:

  • Automate EKS and AKS cluster provisioning.
  • Extend our CRDs and Operators.
  • Improve the RDS sharding strategy for our multi-tenant platform.
  • Unify and improve our CI/CD platforms.

Capacity Planning:

  • Continually seek and implement improvements in environment cost control, automation, estate rationalization, and processes.

Collaboration:

  • Collaborate with core application developers on changes to improve our application metrics, deployment speeds, and CI integration.

Performance Optimization:

  • Collaborate with core application developers on changes to improve our application metrics, deployment speeds, and CI integration.

Requirements

Must Haves

  • 2-5 years of experience building and operating production infrastructure, ideally on public cloud, and Microsoft Azure in particular.
  • Experience supporting business-critical systems (Incident, Change, and Problem Management processes) in a large-scale operations team.
  • Broad knowledge of IT Operations concepts, architecture, and information security (ITIL/Security).
  • Hands-on commercial experience of supporting cloud-based SaaS systems (Microsoft Azure).
  • Experience setting up EC2, SNS, and database instances, securing VPCs, implementing Security Groups and Identity and Access Management, and handling backups, restore, and disaster recovery, along with the equivalent technologies on Azure.
  • Hands-on commercial experience in both Linux and Windows systems administration and automation scripting.
  • Hands-on commercial experience managing Kubernetes clusters.
  • Good understanding of DevOps principles (CI/CD, release automation).
  • Knowledge of Clusters, Storage, Backups, Data Export/Import, Monitoring tools and Disaster Recovery.
  • Hands-on commercial experience using a wiki (ideally Confluence) to document processes that comprise our Knowledge Base.
  • Experience with TCP/IP networking and fundamental network services such as DNS, DHCP, SMTP, NTP, telnet, and SSH.

Nice to Haves

  • Experience with AWS.
  • Ability to read/understand & debug Python and Java.
  • Working experience with MongoDB, MariaDB, and PostgreSQL.
  • Working experience with application monitoring tools.
  • Practical application of scripting (e.g. Python, cron) to automate repeated tasks.
  • ITIL Foundation Qualified.

More Info

Industry: Other

Function: Technology

Job Type: Permanent

Date Posted: 21/06/2024

Job ID: 82588669

Last Updated: 14-11-2024 03:44:56 PM