Job Description
Job Purpose
This is an exciting opportunity for a Site Reliability Engineer to join the Consumer SRE Team at ICE Mortgage Technology (IMT) and provide secure, resilient, scalable, and maintainable services for mortgage borrowers and lenders. IMT is a division of Intercontinental Exchange (ICE), which operates numerous financial and commodity marketplaces and exchanges, including the New York Stock Exchange (NYSE). IMT's Engineering Excellence Center in Pune is a key element of this mission.
Automation is a big part of what we do: we use infrastructure-as-code within our hybrid cloud to bring stability and scalability to Windows, Linux, Docker, and serverless applications in AWS, Azure, and on-prem environments. We reduce toil by scripting and automating repetitive tasks. You will collaborate with developers to deliver robust services, build actionable alerts that detect or prevent incidents and surface performance bottlenecks, and write automation to remediate issues.
Responsibilities
- Employ deep troubleshooting skills to improve the availability, performance, and security of Ellie Mae Services
- Ensure services are designed for 24/7 availability, operational readiness, and rigor
- Implement proactive monitoring, alerting, trend analysis, and self-healing systems
- Define and measure KPIs and SLOs
- Build automated deployments, automated tests, and operational tools
- Participate in the on-call rotation for Production support
- Collaborate with Product and Support teams to plan and deploy product releases
- Partner with other SREs and lead by example
Knowledge and Experience
- 5+ years of Application/Systems engineering in 24/7 Production Services environments
- BS in Computer Science, Computer Engineering, Math, or equivalent professional experience
- Fluency with one or more current-generation scripting languages used by SRE/DevOps professionals (PowerShell, Python, Perl, PHP, Ruby), plus Java or .NET development experience
- Excellent troubleshooting skills, using a systematic problem-solving approach
- Demonstrated ability to lead incident response and root cause analysis (RCA)
- Experience running a SaaS application in a public cloud, on-prem, or hybrid cloud environment
- Additional credit for:
  - Proficiency in Windows and on-prem environments
  - Experience with Continuous Integration and Continuous Delivery concepts
  - Automation in Rundeck or Jenkins
  - Infrastructure-as-code or Configuration Management, using tools like Terraform, CloudFormation, or Chef/SaltStack/Puppet/DSC
  - Containers, Docker, and microservices
Schedule
This role offers work-from-home flexibility of one day per week.