Responsibilities:
A day in the life of an Infoscion
As a Senior Site Reliability Engineer, you will play a critical role in providing expert guidance on application and infrastructure best practices from a reliability perspective.
Improve reliability, quality, and time-to-market of our suite of products/applications.
Define suitable system metrics as SLIs and SLOs, and set up observability mechanisms to track them
Define error budgets based on the SLOs
Define the strategy for, and set up, high-availability and load-balancer-based architectures
Drive a metrics-driven culture and software delivery process using data to measure overall system quality and reliability.
Balance feature development speed and reliability with well-defined service level objectives
Provide primary operational support and engineering for products/applications
Partner with solution architects and development teams to improve service reliability
Participate in system design, infra management and capacity planning
Participate in automating operational tasks and toil reduction
Provide automation solutions for performance management, disaster recovery, monitoring and observability
Work with business users to understand issues, develop root cause analysis and work with the development team for enhancements/fixes
Work with distributed traces to visualize the entire workflow and analyze the cause of problems and incidents
Improve security and performance of infrastructure and applications
Support, improve, and implement infrastructure as code
Define, evangelize, and maintain SRE best practices
Design and implement DevSecOps best practices
Improve automation, including systems' self-healing capabilities
Manage and participate in on-call incidents (priority incidents), if required
If you think you fit right in to help our clients navigate their next in their digital transformation
Additional Responsibilities:
Experience with AIOps and related tools
Experience in CI/CD tooling and best practices
Systems administration and operating system experience on Linux and Windows, including an understanding of networking
Experience working with ITSM tools like Remedy, ServiceNow, Confluence, Jira
Experience with Cloud cost optimization / FinOps
Technical and Professional Requirements:
Strong experience with one or more observability tools such as New Relic, AppDynamics, Prometheus, Dynatrace, Datadog, or Splunk
Reliability practices
Chaos engineering
Experience in event correlation using observability tools like Dynatrace or other tools like BigPanda
Experience in defining and measuring SLIs, SLOs, and error budgets
Experience in automating infrastructure scalability, failover, availability, and performance management
Experience in container orchestration tools and practices, including Kubernetes and Docker Swarm
Good experience in scripting or development languages, with expertise in any one of Python, Ruby, JSON, Java, Node.js, or PHP
Experience with scripting in PowerShell (M) and any one of Bash, Shell, or Perl