Key Responsibilities:
Work with stakeholders such as product owners to define service level objectives (SLOs)
for system operations. Track performance against SLOs in partnership with monitoring
teams or other stakeholders, and ensure systems continue to meet SLOs over time.
Create dashboards and reports to communicate key metrics.
Collaborate with development teams to promote the concept of reliability engineering
during all phases of the software development lifecycle to detect and correct
performance issues and meet availability goals.
Design, code, test, and deliver software to automate manual operational work (i.e.,
toil). Conduct blameless post mortems to troubleshoot priority incidents.
Perform analytics on previous incidents to understand root causes and better predict
and prevent future issues. Use automation to reduce the probability and/or impact of
problem recurrence.
Identify, evaluate, and recommend monitoring tools and diagnostic techniques to
improve system observability. Participate in system design consulting, platform
management, capacity planning and launch reviews.
Drive continuous improvement in software quality and infrastructure reliability and
resilience. Oversee, design, implement, and manage DevOps capabilities using
continuous integration/continuous delivery toolsets and automation.
Skills and Experience:
Strong problem-solving and analytical skills. Strong interpersonal, written, and verbal
communication skills.
Highly adaptable to changing circumstances. Interest in continuously learning new skills
and technologies.
Experience with programming and scripting languages (Python, Bash, PowerShell).
Experience with incident and response management. Experience with Agile and DevOps
development methodologies.
Experience with container technologies and supporting tools (e.g. Docker Swarm,
Kubernetes).
Run and maintain our production infrastructure hosted on AWS.
Experience working in AWS cloud ecosystems (Lambda, Glue, PySpark, etc.).
Experience with monitoring and observability tools such as Splunk, CloudWatch.
Experience with configuration management systems (e.g. Puppet, Ansible, Chef,
Terraform).
Experience working with continuous integration/continuous deployment tools (e.g. Git,
Jenkins).
Create meaningful dashboards/reports for application telemetry and infrastructure
health to proactively identify performance constraints and bottlenecks.
Design and build custom tools as needed to support process optimization, challenge
the status quo, and improve operational efficiency.
Monitor, measure, and improve the reliability, availability, and scalability of IT
infrastructure, applications, and services.
Additional Skills/Preferences:
Previous pharmaceutical IT experience would be an added advantage.
Demonstrated ability to coordinate cross-functional work teams toward task completion.
Demonstrated ability to learn new technologies quickly.
Excellent oral and written communication and presentation skills.
Excellent self-management skills. Strong problem-solving and process improvement
skills.
Strong organization skills. Demonstrated ability to evaluate, facilitate, and
drive towards risk-based decision making. Proven team player with the ability to work in
a dynamic environment where needs and priorities can change and deadlines are
compressed.