About the Team/Role
Working closely with the Platform Operations Lead, the Site Reliability Engineer is responsible for building out WEX's Travel engineering solutions and solving operational problems, with a focus on optimizing existing systems, building infrastructure, and eliminating work through automation in an Agile environment.
How you'll make an impact
Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation and refinement
Support capacity planning, availability, scalability, security and latency considerations for new infrastructure and service provisioning as appropriate
Scale and optimize existing infrastructure and services sustainably through mechanisms such as automation, and evolve them by improving reliability and efficiency
Manage end-to-end availability and performance of mission-critical services and build automation to prevent problem recurrence
Maintain infrastructure and services by measuring and monitoring system metrics to proactively identify operational efficiencies, potential outages and security threats in Development, UAT, Staging and Production environments
Practice sustainable incident response and blameless postmortems
Build infrastructure and drive projects that break things with the aim of improving the robustness of production systems
Use the core Site Reliability Engineering principles of change management, monitoring, emergency response, capacity planning, and production readiness reviews to run the platform
Step back to observe patterns and develop innovative tools and automation to eliminate or minimize menial tasks. Use those learnings to drive the best operational practices
Develop and maintain solution and operational documentation and designs for all infrastructure and services within the scope of SRE
Preserve operational visibility and response capabilities by fixing and improving our dashboards, alerts, and automation
Take part in the on-call rotation as part of the Platform Operations team supporting the WEX Travel Platform
Experience you'll bring
Proficient in one or more of the following scripting languages: JavaScript, Node.js, Python, PowerShell, Bash, etc.
2 years' experience working with public cloud platforms, Azure preferred
Experience handling large numbers of diverse systems with configuration management systems like Puppet, Chef, Ansible, etc.
Understanding of standard networking protocols and components such as HTTP, DNS, TCP/IP, ICMP, the OSI Model, Subnetting and Load Balancing strategies
Understanding of Serverless Application Framework
Experience in containerised workloads and management platforms such as Docker or Kubernetes
Familiarity with distributed systems, including microservices, is a plus
Experience with infrastructure automation tools such as CloudFormation or Terraform
Understanding of CI/CD processes and experience with deployment automation tools such as CodePipeline, CodeDeploy, Jenkins, Bamboo
Strong debugging, troubleshooting, and problem-solving skills
Effective communication, collaboration & negotiation skills with the ability to interface with various business units and third parties
Experience liaising with developers, operations staff and third-party resources
Understanding of API integration
JIRA & Confluence (Desirable)
Degree in Software Engineering, Computer Science, or equivalent (Desirable)