About The Team/Role
Working closely with the Platform Operations Lead, the Site Reliability Engineer is responsible for building out WEX's Travel engineering solutions and resolving operational problems, with a focus on optimizing existing systems, building infrastructure, and eliminating manual work through automation in an Agile environment.
How You'll Make An Impact
- Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation and refinement
- Support capacity planning, availability, scalability, security and latency considerations for new infrastructure and service provisioning as appropriate
- Scale and optimize existing infrastructure and services sustainably through mechanisms such as automation, and evolve them by improving reliability and efficiency
- Manage end-to-end availability and performance of mission-critical services and build automation to prevent problem recurrence
- Maintain infrastructure and services by measuring and monitoring system metrics to proactively identify operational efficiencies, potential outages and security threats across the Development, UAT, Staging and Production environments
- Practice sustainable incident response and blameless postmortems
- Build infrastructure and drive projects that break things with the aim of improving the robustness of production systems
- Use the core Site Reliability Engineering principles of change management, monitoring, emergency response, capacity planning, and production readiness reviews to run the platform
- Step back to observe patterns and develop innovative tools and automation to eliminate or minimize menial tasks. Use what you learn to drive operational best practices
- Develop and maintain solution and operational documentation and designs for all infrastructure and services within the scope of SRE
- Preserve operational visibility and response capabilities by fixing and improving our dashboards, alerts, and automation
- Take part in the on-call rotation as part of the Platform Operations team supporting the WEX Travel Platform
Experience You'll Bring
- Proficiency in one or more scripting languages such as JavaScript, Node.js, Python, PowerShell or Bash
- 2 years' experience working with public cloud platforms, preferably Azure
- Experience handling large numbers of diverse systems with configuration management systems such as Puppet, Chef or Ansible
- Understanding of standard networking protocols and components such as HTTP, DNS, TCP/IP, ICMP, the OSI model, subnetting and load-balancing strategies
- Understanding of serverless application frameworks
- Experience with containerized workloads and management platforms such as Docker or Kubernetes
- Familiarity with distributed systems, including microservices, is a plus
- Experience with infrastructure automation tools such as CloudFormation or Terraform
- Understanding of CI/CD processes and experience with deployment automation tools such as CodePipeline, CodeDeploy, Jenkins or Bamboo
- Strong debugging, troubleshooting, and problem-solving skills
- Effective communication, collaboration and negotiation skills, with the ability to interface with various business units and third parties
- Experience liaising with developers, operations staff and third-party resources
- Understanding of API integration
- Experience with JIRA and Confluence (Desirable)
- Degree in Software Engineering, Computer Science or equivalent (Desirable)