About Simplismart
A bit about our product: Simplismart is an MLOps platform with three major suites:
- Training suite: Assemble and train any model, including LLMs, vision, audio, tabular, and tree models.
- Deployment suite: Most companies fail to make models production-ready. Our proprietary model deployment suite is 6x faster than Hugging Face's enterprise suite and 12x faster than replicate.ai. Users can easily deploy and auto-scale models trained on Simplismart (more optimised), import any model from Hugging Face, or even bring their own model artefact in TensorFlow, PyTorch, ONNX, or JAX format.
- Observability suite: Monitor model health, including load, latency, uptime, data drift, and concept drift.
Position Overview
As a Cloud Engineer, you will contribute to building a highly available, global, multi-cloud PaaS platform using open-source technologies to support Simplismart's rapid growth. This system encompasses diverse environments (Kubernetes, VMs, bare metal compute) and provides a cohesive and reliable abstraction for running AI workloads. You will be able to work with cutting-edge technologies and solve complex problems.
To be successful in this role, you need to be deeply technical, possess strong communication and collaboration skills, and have experience in infrastructure-as-code. Proficiency with tools like Terraform and Ansible and strong software development fundamentals are essential. Additionally, you should have good systems knowledge and troubleshooting abilities.
Requirements
- 5+ years of experience in platform engineering and in writing high-performance, well-tested, production-quality code.
- Proficiency in at least one backend programming language (Python desired; C++ is a plus).
- Demonstrated experience with high-performance or distributed cloud microservices architectures.
- Ideally, experience building and operating global systems across multiple cloud providers such as AWS, Azure, or GCP.
- A good understanding of low-level operating systems concepts, including multi-threading, memory management, networking and storage, performance, and scale.
- Pragmatic, methodical, well-organized, detail-oriented, and self-starting.
- Experience with Kubernetes, containerization, Terraform and Ansible.
- Experience with PyTorch or TensorFlow is a plus (not required).
- Knowledge of GPU programming, NCCL and CUDA is a plus.
Responsibilities
- Designing the high-level architecture of the MLOps platform from the ground up.
- Handling formalisation of diverse GPU-based workloads.
- Developing a robust internal system for continuous deployment of various services and modules in diverse environments.
- Creating frameworks for reliable, fault-tolerant systems for mission-critical workloads.
Skills And Attributes
- Deep technical expertise.
- Strong communication and collaboration skills.
- Experience in infrastructure-as-code (Terraform, Ansible).
- Strong software development fundamentals.
- Good systems knowledge and troubleshooting abilities.
- Ability to work independently and as part of a team.
- Proactive and self-motivated.
Why should you join Simplismart?
Well, let's break away from the conventional perks and instead focus on what you WON'T experience here:
- Legacy System Headaches: You won't have to endlessly grapple with outdated legacy systems that hinder your productivity and creativity.
- Bossy Culture: At Simplismart, we believe in collaboration and empowerment, not hierarchy. Instead of a boss breathing down your neck, you'll have colleagues who support your growth.
- Dark Circles: Late nights and overwork are not the norm here. We prioritize work-life balance, ensuring you won't be sporting those tired, dark circles under your eyes.
- Stagnation: Say goodbye to redundant and stagnant tasks. We thrive on innovation and dynamic challenges that keep you engaged and motivated.
Skills: Infrastructure, Prometheus, Grafana, Terraform, Kubernetes