The Incident and Problem Management Analyst is part of BlackLine's Cloud Operations group and is jointly responsible for ensuring timely incident response and Problem Management. This is a multi-discipline role within the Service Management team. Responsibilities include major and escalated incident management, Problem Management, internal stakeholder and external customer communications, and reporting.
Responsibilities:
Confidently lead the recovery of high-profile, major, and crisis technology incidents within complex environments, focusing fully on restoring service and minimizing disruption using various methodologies and management techniques.
Act as a single source of status information, delivering clear, timely, and accurate business and technical communications across the organization.
Take joint responsibility for governing the end-to-end Incident and Problem Management process with cross-technology teams, ensuring all KPIs are met and a high standard of management and reporting is consistently achieved.
Lead and facilitate post-mortem and RCA tasks for high-priority incidents.
Produce comprehensive incident and problem reports for all required audiences.
Co-own Problem Management activities for all managed incidents.
Identify emerging problems, both individual and at scale, and escalate them into the Problem Management queue.
Monitor incident and problem management queues to ensure OLAs are met.
Conduct Root Cause Analysis for all escalated incidents and Problem Management tickets.
Qualifications:
Years of Experience in Related Field: 3-5 years
Technical/Specialized Knowledge, Skills, and Abilities:
Bachelor's degree in related field
Excellent English language writing and speaking skills.
Experience working within organizations aligned to Service Management methodologies such as ITIL. ITIL v3 or ITIL v4 Foundation certification is beneficial.
Professional experience aligned to ITIL Service Management practices for Incident and Problem Management in a global organization.
Understands the complete incident workflow, from inception through postmortem and action item follow-up.
Skilled at documenting incident artifacts.
Understands how to use the 5 Whys and/or cause-and-effect analysis to get to the real root of the problem.
Experience operating in multi-platform Technology environments.
Understanding of Three-Tier architecture design and implementation.
Understanding of and exposure to SQL, large-scale storage systems, microservices, and API technologies.
Understanding of and exposure to DevOps, Observability, SRE and Agile.
Able to effectively explain technical issues and situations in non-technical terms for business stakeholders.
Understanding of general technology concepts, networking, server management, application development, and operating systems.
Calm under pressure
Participate in overnight and weekend on-call shifts.
Responds promptly when paged, with a mindset of resolving the incident quickly and efficiently.
Assist in continuous improvement projects.
Knowledge of and experience with Jira, PagerDuty, MS Teams, Confluence, and monitoring and log search software is beneficial.