AI Infrastructure Engineer
HCLTech
2 - 5 years
Noida
Posted: 28/02/2026
Job Description
AI Infrastructure Engineer - L3
The Role
The AI Infrastructure Engineer (L3) provides advanced engineering and architectural expertise for high-performance AI and ML infrastructure. This role focuses on building, optimizing, and scaling GPU/accelerator environments and distributed systems for large-scale training and inference workloads.
Competency Focus: High-performance computing (HPC), distributed systems, Kubernetes, GPU orchestration, cloud optimization
Keywords: NVIDIA GPU Infrastructure, Kubernetes, GPU Cluster Administrator, Infrastructure SME, RCA
Responsibilities:
- Deploy, configure, and manage GPU and AI accelerator platforms (NVIDIA A100/H100/L40, AMD Instinct, TPU).
- Troubleshoot GPU hardware and software issues, including failures, thermal throttling, PCIe/NVLink topology, and driver conflicts.
- Install, upgrade, and maintain GPU software stacks, including drivers, CUDA, cuDNN, TensorRT, and firmware.
- Perform capacity planning and resource optimization for AI training, fine-tuning, and inference workloads.
- Optimize Linux systems (Ubuntu, RHEL, Rocky) for AI/HPC workloads through NUMA, kernel, and clock tuning.
- Manage distributed and high-performance storage systems, including BeeGFS, Lustre, Ceph, and high-throughput NFS.
- Operate high-bandwidth, low-latency networks, including InfiniBand, RoCE, RDMA, and NVLink.
- Administer Kubernetes GPU clusters, leveraging NVIDIA GPU Operator, device plugins, MIG, and node feature discovery.
- Support AI and HPC orchestration platforms, including Kubeflow, Ray, MLflow, and Slurm/PBS.
- Configure and manage GPU scheduling and sharing strategies, such as node pools, quotas, job queues, and fairshare policies.
- Optimize distributed training workflows using NCCL, PyTorch Distributed, Horovod, and DeepSpeed.
- Operate and tune LLM and inference runtimes, including vLLM, Triton Inference Server, and TensorRT-LLM.
- Monitor and tune GPU utilization, memory allocation, and container-level performance.
- Automate cluster provisioning and operations using Terraform, Helm, Kustomize, and GitOps (ArgoCD/Flux).
- Build automation for GPU diagnostics, node onboarding, and model deployment workflows.
- Implement observability and telemetry using Prometheus, Grafana, NVIDIA DCGM, and OpenTelemetry.
- Lead deep-dive root cause analysis for GPU, network, storage, and orchestration issues.
- Provide L3 support and work with L2/L1 teams on escalations.
- Drive production readiness, patching, hotfix rollout, and reliability improvements across AI infrastructure.
- Troubleshoot and escalate complex platform failures.
- Perform deep debugging of NCCL hangs and GPU fabric issues; coordinate with OEMs and support vendors on critical issues.
- Review RCAs, architecture documents, and change plans.
- Act as a technical advisor to leadership and customers.
Qualifications & Experience
- Bachelor's degree in Computer Science, Engineering, Information Technology, or a related field
- 8-12 years of overall infrastructure or platform engineering experience
- 4-6 years of specialized experience supporting AI/ML workloads
- Demonstrated experience in large-scale GPU/accelerated computing and distributed systems
- Strong experience in Kubernetes, containerization, and orchestration tools
- Understanding of AI workloads and MLOps
Certifications Required
- NVIDIA Certified Associate: AI Infrastructure
- NVIDIA NPN Certification
- NVIDIA Base Command Manager Certification
- AWS Certified Solutions Architect - Associate
- Certified Kubernetes Administrator (CKA)
- Certified Kubernetes Application Developer (CKAD)