As a Software Developer in GPU Infrastructure Automation, you will be responsible for designing, developing, and optimizing software solutions that effectively manage and schedule GPU resources. You will work closely with various software teams to ensure seamless integration and optimal performance of our GPU infrastructure.
Key Responsibilities:
Design and implement GPU cluster management and observability tools.
Develop tools and APIs for other computational layers.
Conduct performance profiling and optimization using tools like NVIDIA Nsight.
Participate in code reviews, design discussions, and continuous integration/continuous deployment (CI/CD) processes.
Validate GPU cluster performance with benchmarking tools like MLPerf.
Implement and maintain synchronization mechanisms for managing concurrency and shared resources.
Develop an infrastructure software toolkit for GPU clustering, capacity, and scheduling automation.
Required Skills and Qualifications:
Bachelor's or Master's degree in Computer Science or a related field.
Strong proficiency in Golang, C/C++, and experience with GPU schedulers like SLURM.
Strong proficiency in Kubernetes (K8s) technologies.
Strong proficiency in at least one public cloud infrastructure and PaaS platform (AWS, GCP, Azure).
In-depth understanding of GPU architectures and parallel computing principles.
Excellent understanding of REST APIs and experience with threading, concurrency, and synchronization mechanisms.
Knowledge of Linux operating systems.
Familiarity with scheduling algorithms and load balancing techniques.
Strong understanding of data structures, algorithms, and numerical methods.
Proficient in creating and using well-structured CI/CD pipelines.
Excellent problem-solving skills and attention to detail.
Strong communication and teamwork abilities.
Date Posted: 18/06/2024
Job ID: 82139449