
AI Infrastructure Engineer

HCLTech

2 - 5 years

Noida

Posted: 28/02/2026


Job Description

AI Infrastructure Engineer (L3)


The Role


The AI Infrastructure Engineer (L3) provides advanced engineering and architectural expertise for high-performance AI and ML infrastructure. This role focuses on building, optimizing, and scaling GPU/accelerator environments and distributed systems for large-scale training and inference workloads.

Competency Focus: High-performance computing (HPC), distributed systems, Kubernetes, GPU orchestration, cloud optimization


Keywords: Nvidia GPU Infrastructure, Kubernetes, GPU Cluster Administrator, Infrastructure SME, RCA


Responsibilities:

  • Deploy, configure, and manage GPU and AI accelerator platforms (NVIDIA A100/H100/L40, AMD Instinct, TPU).
  • Troubleshoot GPU hardware and software issues, including failures, thermal throttling, PCIe/NVLink topology, and driver conflicts.
  • Install, upgrade, and maintain GPU software stacks, including drivers, CUDA, cuDNN, TensorRT, and firmware.
  • Perform capacity planning and resource optimization for AI training, fine-tuning, and inference workloads.
  • Optimize Linux systems (Ubuntu, RHEL, Rocky) for AI/HPC workloads through NUMA, kernel, and clock tuning.
  • Manage distributed and high-performance storage systems, including BeeGFS, Lustre, Ceph, and high-throughput NFS.
  • Operate high-bandwidth, low-latency networks, including InfiniBand, RoCE, RDMA, and NVLink.
  • Administer Kubernetes GPU clusters, leveraging NVIDIA GPU Operator, device plugins, MIG, and node feature discovery.
  • Support AI and HPC orchestration platforms, including Kubeflow, Ray, MLflow, and Slurm/PBS.
  • Configure and manage GPU scheduling and sharing strategies, such as node pools, quotas, job queues, and fair-share policies.
  • Optimize distributed training workflows using NCCL, PyTorch Distributed, Horovod, and DeepSpeed.
  • Operate and tune LLM and inference runtimes, including vLLM, Triton Inference Server, and TensorRT-LLM.
  • Monitor and tune GPU utilization, memory allocation, and container-level performance.
  • Automate cluster provisioning and operations using Terraform, Helm, Kustomize, and GitOps (ArgoCD/Flux).
  • Build automation for GPU diagnostics, node onboarding, and model deployment workflows.
  • Implement observability and telemetry using Prometheus, Grafana, NVIDIA DCGM, and OpenTelemetry.
  • Lead deep-dive root cause analysis for GPU, network, storage, and orchestration issues.
  • Provide L3 support and work with L2/L1 teams for escalations.
  • Drive production readiness, patching, hotfix rollout, and reliability improvements across AI infrastructure.
  • Troubleshoot and handle escalations for complex platform failures.
  • Perform deep debugging of NCCL hangs and GPU fabric issues, and coordinate with OEMs and support vendors on critical issues.
  • Review RCAs, architecture documents, and change plans.
  • Act as a technical advisor to leadership and customers.
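As a minimal sketch of the GPU diagnostics automation described above — the query fields correspond to real `nvidia-smi --query-gpu` options, but the temperature threshold, function name, and sample readings are illustrative assumptions, not part of this role's tooling:

```python
import csv
import io

# Assumed alert threshold; real throttle points vary by GPU SKU and cooling.
TEMP_LIMIT_C = 85

def parse_gpu_report(csv_text):
    """Parse the output of:
        nvidia-smi --query-gpu=index,name,temperature.gpu,utilization.gpu \
                   --format=csv,noheader,nounits
    and flag GPUs running hot enough to risk thermal throttling."""
    alerts = []
    for row in csv.reader(io.StringIO(csv_text)):
        index, name, temp, util = [field.strip() for field in row]
        if int(temp) >= TEMP_LIMIT_C:
            alerts.append(f"GPU {index} ({name}): {temp} C at {util}% util")
    return alerts

# Sample readings from a hypothetical 2-GPU node.
sample = """0, NVIDIA H100, 62, 88
1, NVIDIA H100, 91, 97
"""

print(parse_gpu_report(sample))  # flags GPU 1 only
```

In practice a script like this would run periodically on each node (or pull the same metrics from NVIDIA DCGM exporters) and feed alerts into the Prometheus/Grafana stack mentioned in the responsibilities.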


Qualifications & Experience

  • Bachelor's degree in Computer Science, Engineering, Information Technology, or a related field
  • 8–12 years of overall infrastructure or platform engineering experience
  • 4–6 years of specialized experience supporting AI/ML workloads
  • Demonstrated experience in large-scale GPU/accelerated computing and distributed systems
  • Strong experience in Kubernetes, containerization, and orchestration tools
  • Understanding of AI workloads and MLOps


Certifications Required

  • NVIDIA Certified Associate: AI Infrastructure
  • NVIDIA NPN Certification
  • NVIDIA Base Command Manager Certification
  • AWS Solutions Architect – Associate
  • Certified Kubernetes Administrator (CKA)
  • Certified Kubernetes Application Developer (CKAD)
