This is a remote position.
We are seeking a skilled and motivated Site Reliability Engineer (SRE) to join our dynamic team. The ideal candidate will have a minimum of 3 years of experience in site performance, availability, and backups, along with a strong background in cost optimization, resource budget management, and SQL database management. This role requires close collaboration with the Operations Manager and a solid understanding of Azure and SaaS products.
Key Responsibilities:
Site Performance & Availability:
- Ensure the performance, availability, and reliability of our systems and services.
- Monitor, diagnose, and resolve performance issues to maintain optimal service levels.
- Understand how our applications are deployed, which metrics to monitor, and which errors to look for.
Backups & Recovery:
- Manage backup solutions and disaster recovery processes to ensure data integrity and availability.
- Regularly test and validate backup procedures.
- Develop and maintain disaster recovery (DR) execution plans and playbooks.
Cost Optimization:
- Implement and oversee strategies for cost optimization, including resource usage monitoring and cost-effective solutions.
- Work to balance performance with cost efficiency.
Resource Budget Management:
- Collaborate with the Operations Manager to manage and optimize the resource budget.
- Track and report on resource usage and costs to ensure alignment with budgetary constraints.
Azure Expertise:
- Leverage your knowledge of Azure to deploy, manage, and monitor cloud-based applications and services.
- Ensure the efficient use of Azure resources and services.
Database Administration:
- Manage and optimize SQL databases, including performance tuning, backup, and recovery.
- Set up and maintain dashboards/alerts to detect database issues proactively.
Performance Measurement Systems:
- Set up performance measurement systems and track key metrics to ensure system reliability and efficiency.
Collaboration:
- Work closely with the Operations Manager and other teams to align on operational goals and strategies.
- Provide insights and recommendations for improving system reliability and efficiency.