Monitor the application's operational performance and reliability indicators. Perform basic analysis to find the cause of issues, notify the relevant teams, and drive resolution.
Identify and publish the metrics required to ensure visibility into all aspects of our application's performance and stability from both a global and a customer perspective.
Define SLOs specific to services/products/customers.
Build dashboards and alerts in Splunk and other monitoring tools that continuously monitor the identified metrics and SLOs and report violators.
Provide and document root cause analysis (RCA) for production outages and maintenance, and deliver automation solutions that reduce downtime.
Develop automation code for infrastructure needs, testing solutions, failover solutions, failure mitigation, and more.
Respond proactively to alerts and incidents, and keep alerts and dashboards up to date.
Work independently and within a team to triage production outages and maintenance issues and drive their remediation.
Basic Qualifications:
Bachelor's degree or above in Computer Science or a related engineering discipline.
Minimum of 1 year of total industry experience.
Minimum of 6 months of experience building dashboards, reports, and alerts in Splunk or a similar monitoring tool.
Minimum of 3 months of experience with AWS.
Hands-on experience with a scripting language such as Python.
Good oral and written communication skills.
Preferred Qualifications:
Hands-on experience with Linux/Windows fundamentals, shell scripting, hardware performance tuning and scalability, and mitigating networking and security issues.
Experience working with RDBMS (MSSQL, DB2), NoSQL, caching, and queuing systems; knowledge of firewalls and load balancers.