Responsibilities & Skills
- Experience with Datadog or a similar monitoring platform: setting up monitors and alerting, managing anomalies, and forecasting. A desire to drive a proactive approach to scalability.
- Intermediate to advanced understanding of Postgres, including experience running databases at scale, tuning parameters, and optimizing SQL queries, plus knowledge of AWS RDS in particular.
- Excellent understanding of HA architectures built in AWS.
- At least mid-level knowledge of DNS, SSL, AWS networking, Docker, and ECS.
- Working knowledge of cloud security principles and familiarity with the AWS Well-Architected Framework.
- Cool under pressure: able to manage incidents involving multiple systems, communicate effectively internally and externally using tools like StatusPage and PagerDuty, marshal resources, get things resolved, and write blameless postmortems afterward.
- Comfortable taking (very occasional) pager alerts during working hours and sometimes on weekends. We generally try to avoid nighttime pager alerts, since we have staff in Europe and can split pager duty across time zones. You will not be the only on-call staff, but you will be in charge of primary incident response and will lead and train other developers in response and mitigation.
Skills: Reliability, SSL, Database Management System (DBMS), Docker, Amazon Web Services (AWS)