Job Description
In this essential position, you will collaborate closely with diverse teams across the organization to fortify our online systems, ensuring they are robust, scalable, and equipped to efficiently manage the complexities of a global customer base. Your expertise in site reliability will be crucial in driving ongoing enhancements to our technology landscape. This continuous improvement effort is vital to maintaining Ford's leadership in innovation within the automotive industry, helping us set standards in digital commerce and customer satisfaction. Your contributions will directly impact the smooth operation and evolutionary growth of our eCommerce capabilities, aligning with Ford's commitment to excellence and innovation.
Responsibilities
As a Site Reliability Engineer, your responsibilities will include:
- Participating in 24x7 on-call production support rotations and handling incident response to minimize disruptions.
- Continuously monitoring the availability, reliability, and performance of systems, platforms, and applications, maintaining a holistic view of system health.
- Regularly reviewing key site technical metrics such as transaction errors, logging, response times, caching strategies, conversion/bounce rates, and capacity and resource utilization.
- Providing primary operational and engineering support for multiple large, distributed software applications.
- Proactively identifying stability risks and working with engineering leadership to establish appropriate mitigation plans.
- Using automation tools, scripts, and processes to reduce or eliminate repetitive tasks, thereby improving the support provided by Site Reliability Engineering.
- Creating or modifying Terraform files according to Ford formats to develop new monitoring dashboards and alert policies.
- Collaborating with engineering and architecture teams to evaluate and identify optimal cloud solutions, focusing on scalability, high performance, and security.
- Gathering and analyzing metrics from operating systems and applications to assist in performance tuning and fault finding.
- Measuring and optimizing system performance continuously to exceed customer needs and advance capabilities.
- Troubleshooting and resolving issues related to full stack websites, cloud platforms, and infrastructure.
- Working closely with developers, testers, and business stakeholders to ensure the delivery of high-quality solutions, balancing feature development speed and reliability with well-defined service-level objectives.
- Ensuring compliance with security and regulatory standards, and implementing and maintaining disaster recovery processes.
- Providing technical guidance and mentorship to other team members.
These responsibilities ensure the stability, efficiency, and continuous improvement of Ford Motor Company's eCommerce solutions, aligning with the organization's high standards and innovative approach.
Qualifications
- 4+ years of SRE experience.
- Ability to work effectively in a remote/virtual work setting with other global team members.
- Ability to work effectively with cross-functional teams inside and outside of the technology and software organization.
- Ability to dissect problems and explore them from different angles to find the most efficient solutions.
- Staying composed under pressure and bouncing back from setbacks quickly, maintaining focus on achieving system reliability.
- Keen attention to specifics to catch and address small issues before they escalate into larger problems.
- A strong desire to understand how things work and a willingness to explore and implement new technologies and methodologies.
- Flexibility in handling unexpected challenges and changes in technology or project directions.
- Taking initiative to prevent problems before they occur and continuously seeking improvements in system performance.
- Confidence and ability to make quick decisions during critical situations to prevent or minimize disruptions.
- Understanding and considering team members' perspectives and challenges, fostering a supportive and inclusive environment.
- Clear and effective communication skills, capable of conveying complex information in a straightforward manner and engaging with both technical and non-technical stakeholders.
- Taking responsibility for the systems and the team, ensuring reliability, and being accountable for the outcomes.
- Commitment to the development of team members, providing guidance and feedback to help them grow in their professional capacities.
- Encouraging a collaborative team environment where ideas and solutions are shared openly and where each member's contribution is valued.
- Motivating the team to strive for excellence, pushing the boundaries of what is possible, and inspiring innovation through leadership.
- 5-6 years of experience with Java, J2EE, NoSQL/SQL datastores, Spring Boot, GCP/AWS/Azure, and Docker/Kubernetes in the maintenance and development of multi-tier applications.
- Understanding of RESTful APIs and microservices platforms.
- 4-5 years of experience with APM and other monitoring tools such as Dynatrace, New Relic, ELK, Splunk, Prometheus, Sensu, Nagios, Kafka, Datadog, and PagerDuty.
- Strong experience working with product and development teams to establish error budgets by identifying the right SLOs (service level objectives), SLIs (service level indicators), and KPIs (key performance indicators), and effectively driving the use of the budget to ensure maximum domain availability/uptime.
- Experience in solving complex architecture, design, and business problems, including simplifying and optimizing systems and removing bottlenecks.
- Experience architecting, designing, and developing automation to reduce toil and improve the recoverability, availability, latency, and scalability of supported applications, with an understanding of MTTD (Mean Time to Detection) and MTTR (Mean Time to Resolution).
- Ability to quickly diagnose and resolve issues in high-pressure situations.
- Strong verbal and written communication skills to effectively collaborate with cross-functional teams and articulate technical concepts to non-technical stakeholders.
- Experience in leading teams, mentoring junior staff, and promoting a culture of continuous improvement and learning.
- Ability to analyze complex data to improve system performance and predict future challenges.
- Experience in handling outages and the ability to lead incident response efforts, minimizing impact on services.
- Understanding of network architecture, protocols, and security practices to ensure robust and secure systems.
- Skills in and understanding of performance tuning and optimization of systems and applications.
- Knowledge of database administration and management, particularly in configuring, managing, and scaling databases.
- Experience in planning and executing disaster recovery strategies to ensure data integrity and availability.