Skills:
Machine Learning, Architecture Design, GPU, Data Centre, Artificial Intelligence (AI), Certified Kubernetes Administrator
Job Summary
We are seeking an experienced AI/GPU Infrastructure Architect with a total of 15 years of experience to design and optimize GPU-based infrastructure for machine learning and artificial intelligence applications. The ideal candidate will have a deep understanding of AI workloads, GPU architecture, and infrastructure design, along with hands-on experience in deploying scalable and efficient solutions.
Key Responsibilities
- Architecture Design: Develop and implement architecture for AI and GPU infrastructure, ensuring scalability, reliability, and performance.
- Infrastructure Optimization: Optimize GPU resource allocation and management to enhance performance for AI workloads.
- Collaboration: Work closely with data scientists, software engineers, and IT teams to understand requirements and translate them into architectural solutions.
- Performance Monitoring: Set up monitoring and benchmarking tools to assess system performance and make recommendations for improvement.
- Research and Development: Stay updated with the latest advancements in AI and GPU technologies and assess their applicability to our infrastructure.
- Documentation: Create and maintain architectural documentation, including design specifications, best practices, and deployment guides.
- Security and Compliance: Ensure that infrastructure designs meet security standards and compliance requirements.
- Training and Support: Provide guidance and training to teams on infrastructure usage and best practices.
Required Skills
- Proven experience as an infrastructure architect in Data Centre infrastructure (rack planning, compute, storage, network), specifically with AI and GPU technologies.
- Strong understanding of GPU architecture and parallel processing concepts.
- Experience designing innovative, high-performance storage solutions for AI workloads.
- Knowledge of designing high-performance network solutions for AI workloads using technologies such as InfiniBand, RoCE, and GPUDirect.
- Experience with distributed computing and microservices architecture.
- Proficiency in cloud platforms (e.g., AWS, Azure, Google Cloud) and containerization technologies (e.g., Docker, Kubernetes).
- Familiarity with AI frameworks (e.g., TensorFlow, PyTorch) and ML libraries.
- Experience with system performance tuning and optimization techniques.
- Knowledge of security best practices related to AI and infrastructure management.
- Excellent problem-solving skills and ability to work in a collaborative environment.
- Strong communication skills, both verbal and written.
Qualifications
- Bachelor's degree in Engineering or MCA.
- Certification in relevant technologies (e.g., Microsoft Certified: Azure Solutions Architect Expert, NVIDIA certifications, Certified Kubernetes Administrator).