Bachelor's degree in Computer Science, Engineering, or a related field.
At least 1 year of experience in a relevant role, such as Site Reliability Engineer or DevOps Engineer, is preferred but not mandatory.
Basic understanding of AWS services, including EC2, S3, CloudWatch, Lambda, and RDS.
Interest in and understanding of Platform Engineering concepts and principles.
Familiarity with monitoring and observability tools such as Prometheus, Grafana, or the ELK stack.
Job Description
Assist in designing, implementing, and maintaining scalable monitoring, alerting, and logging solutions to ensure the availability and performance of backend services.
Support the development and implementation of observability tools and practices to derive actionable insights from operational data.
Collaborate with development teams to design and support scalable, reliable, and resilient systems on AWS.
Contribute to platform engineering projects to create automated solutions for deployment, scaling, and operations.
Assist in the development of disaster recovery (DR) plans, and participate in DR and capacity testing under supervision.
Analyze system performance and provide recommendations for optimization and improvement.
Participate in post-incident reviews, identifying root causes and preventive measures.
Maintain documentation for system procedures, operations, and architecture configurations.
Stay up to date with industry best practices, tools, and technologies related to site reliability and platform engineering.
Participate in on-call rotations as needed to ensure 24/7 availability of critical systems and services.