Job Description
HPC Architect

The HPC Cloud Architect will administer high performance scientific computing platforms and infrastructure, and support research projects.

Utilize your experience in multiple disciplines, including high performance computing (HPC), cloud, architecture, design, network, security and systems, to implement and provide advanced system engineering services to customers.
Manage, administer and support the daily operation of computing systems both onsite and in the cloud.
Design, implement and maintain scalable High-Availability (HA) and Fault-Tolerant (FT) computing systems.
Follow cloud computing best practices by using Amazon Virtual Private Cloud (VPC), Amazon Elastic Compute Cloud (EC2) and other advanced cloud features.
Investigate and present technical options to managers and researchers so they can select effective computing solutions based on requirements.

Experience:
5-10 years of hands-on systems administration/engineering experience with Linux.
Experience with high performance computing systems in Life Sciences, Engineering, Manufacturing or Financial Services will be an added advantage.
Minimum three years with Amazon Web Services (AWS) cloud computing.
Extensive administration experience with GPU-based platforms.
Excellent written and oral communication skills and the ability to work with people at every level.

Required Skills:
Demonstrated experience in optimizing and measuring computing performance.
Comprehensive knowledge of security compliance and security controls.
Proficiency in shell scripting, Ruby, Perl or Python.
Excellent organization and time management skills and the ability to identify priorities to accomplish a variety of tasks simultaneously.
Comprehensive knowledge of the Configuration Management (CM) process and software development tools such as Git, GitLab, Nexus, Jenkins, Maven or JIRA.
Working knowledge of HPC schedulers and distributed/parallel file systems, underlying IT systems, and the HPC development process, including high-throughput and tightly coupled approaches.
Knowledge of statistics, numerical modeling, data analysis and machine learning.
AWS certification at Professional level.
Experience with cloud CLI and SDK.
An understanding of the cloud computing delivery model as it relates to HPC.
Knowledge of the underlying infrastructure requirements such as Networking, Storage, and Hardware Optimization.
Experience in a customer-facing, sales-aligned role such as consultant, solutions engineer or solutions architect.
Track record of implementing AWS services in a variety of business environments such as large enterprises and start-ups.
AWS Certification, e.g. AWS Solutions Architect Associate.
Understanding of application, server, and network security.
Experience with DevOps tools such as Ansible Tower, Bitbucket, Terraform, and CloudFormation.
Experience administering various Linux distributions such as Red Hat, Amazon Linux and CentOS.
Experience with job schedulers such as Grid Engine, LSF, PBS, SLURM, Torque, Symphony and TIBCO.
Experience with compilers and libraries such as MPI, GCC and CUDA.
Experience with scripting (bash, Python, PowerShell, etc.).
Experience with file systems such as NFS and Lustre/GPFS.
Experience installing and troubleshooting applications on CPU- and GPU-based HPC clusters.

Certifications (Desirable):
AWS Administrator Professional or higher
Linux Administration
AWS SysOps A

Responsibilities:
Architect a framework that is more readily available and easier to use.
When planning new architecture, make build-vs-buy decisions and consider cost.
Work in coordination with other internal teams to ensure the infrastructure fully and effectively supports current and planned application systems.
Troubleshoot OS, Networking, Storage, and Software issues while leveraging internal teams for solutions.
Deliver changes to the HPC production platforms according to the change control process, communicating with business owners and seeking their approval.
Practice network asset management, including maintenance of the network component inventory and related documentation.
Develop tools to deploy, manage, monitor, and troubleshoot HPC systems at scale.
Maintain asset lists of all servers, applications and licensing to ensure compliance.
Maintain security standards according to internal policies.
Execute the day-to-day activities of the Incident Management process.
Manage and respond to tickets/requests in accordance with SLA timeframes.

Optional Skills:
Docker, Singularity, Kubernetes and GCP will be a plus.
Knowledge of distributed computing.
Ansible, Jira, Confluence, ServiceNow, Excel, Presentation Skills.
Experience building clusters from individual machines (not a service like EMR etc