AI Infrastructure Engineer
HCLTech
2 - 5 years
Noida
Posted: 28/02/2026
Job Description
AI Infrastructure Engineer - L3
The Role
The AI Infrastructure Engineer (L3) provides advanced engineering and architectural expertise for high-performance AI and ML infrastructure. This role focuses on building, optimizing, and scaling GPU/accelerator environments and distributed systems for large-scale training and inference workloads.
Competency Focus: High-performance computing (HPC), distributed systems, Kubernetes, GPU orchestration, cloud optimization
Keywords: NVIDIA GPU Infrastructure, Kubernetes, GPU Cluster Administrator, Infrastructure SME, RCA
Responsibilities:
- Deploy, configure, and manage GPU and AI accelerator platforms (NVIDIA A100/H100/L40, AMD Instinct, TPU).
- Troubleshoot GPU hardware and software issues, including failures, thermal throttling, PCIe/NVLink topology, and driver conflicts.
- Install, upgrade, and maintain GPU software stacks, including drivers, CUDA, cuDNN, TensorRT, and firmware.
- Perform capacity planning and resource optimization for AI training, fine-tuning, and inference workloads.
- Optimize Linux systems (Ubuntu, RHEL, Rocky) for AI/HPC workloads through NUMA, kernel, and clock tuning.
- Manage distributed and high-performance storage systems, including BeeGFS, Lustre, Ceph, and high-throughput NFS.
- Operate high-bandwidth, low-latency networks, including InfiniBand, RoCE, RDMA, and NVLink.
- Administer Kubernetes GPU clusters, leveraging NVIDIA GPU Operator, device plugins, MIG, and node feature discovery.
- Support AI and HPC orchestration platforms, including Kubeflow, Ray, MLflow, and Slurm/PBS.
- Configure and manage GPU scheduling and sharing strategies, such as node pools, quotas, job queues, and fairshare policies.
- Optimize distributed training workflows using NCCL, PyTorch Distributed, Horovod, and DeepSpeed.
- Operate and tune LLM and inference runtimes, including vLLM, Triton Inference Server, and TensorRT-LLM.
- Monitor and tune GPU utilization, memory allocation, and container-level performance.
- Automate cluster provisioning and operations using Terraform, Helm, Kustomize, and GitOps (ArgoCD/Flux).
- Build automation for GPU diagnostics, node onboarding, and model deployment workflows.
- Implement observability and telemetry using Prometheus, Grafana, NVIDIA DCGM, and OpenTelemetry.
- Lead deep-dive root cause analysis for GPU, network, storage, and orchestration issues.
- Provide L3 support and work with L2/L1 teams on escalations.
- Drive production readiness, patching, hotfix rollout, and reliability improvements across AI infrastructure.
- Troubleshoot and escalate complex platform failures.
- Perform deep debugging of NCCL hangs and GPU fabric issues; coordinate with OEMs and support vendors on critical issues.
- Review RCAs, architecture documents, and change plans.
- Act as a technical advisor to leadership and customers.
Qualifications & Experience
- Bachelor's degree in Computer Science, Engineering, Information Technology, or a related field
- 8-12 years of overall infrastructure or platform engineering experience
- 4-6 years of specialized experience supporting AI/ML workloads
- Demonstrated experience in large-scale GPU/accelerated computing and distributed systems
- Strong experience in Kubernetes, containerization, and orchestration tools
- Understanding of AI workloads and MLOps
Certifications Required
- NVIDIA Certified Associate: AI Infrastructure
- NVIDIA NPN Certification
- NVIDIA Base Command Manager Certification
- AWS Certified Solutions Architect - Associate
- Certified Kubernetes Administrator (CKA)
- Certified Kubernetes Application Developer (CKAD)