Overview
As a Site Reliability Engineer, you will play a critical role in ensuring the reliability, scalability, and performance of our infrastructure and applications. You will work closely with our development and operations teams to build and maintain the necessary tools and systems to support our growing platform.
Key Responsibilities
- Design, build, and maintain the infrastructure and tools used to operate, monitor, and keep our systems reliable.
- Collaborate with development teams to improve and support our continuous integration and delivery processes.
- Develop automation tools for provisioning, configuration, and deployment.
- Monitor system performance and troubleshoot issues as they arise.
- Participate in the on-call rotation and incident response, resolving production issues in a timely manner.
- Implement best practices for security, reliability, and fault tolerance.
- Conduct capacity planning and performance analysis to support growth and scalability.
- Provide technical guidance and support to cross-functional teams.
- Participate in the design and implementation of disaster recovery and backup processes.
- Contribute to the documentation and dissemination of best practices.
Required Qualifications
- Bachelor's degree in Computer Science, Engineering, or a related field.
- Proven experience in infrastructure management and operations.
- Proficiency in one or more scripting languages such as Python, Ruby, or Bash.
- Experience with automation tools like Ansible, Chef, or Puppet.
- Deep understanding of cloud technologies and providers like AWS, Azure, or GCP.
- Strong knowledge of monitoring systems such as Nagios, Zabbix, or Prometheus.
- Demonstrated ability in incident response and on-call support.
- Understanding of networking fundamentals and protocols.
- Excellent collaboration and communication skills.
- Experience implementing and maintaining security best practices.
- Knowledge of container technologies such as Docker and orchestration platforms such as Kubernetes.
- Ability to work in a fast-paced, dynamic environment and prioritize tasks effectively.
- Experience applying infrastructure-as-code principles with tools such as Terraform or CloudFormation.
- Familiarity with version control systems such as Git.
- Strong problem-solving and troubleshooting skills.
Skills: infrastructure management, scripting languages, troubleshooting, automation tools, cloud technologies, incident response, collaboration, DevOps