
AI Infrastructure Engineer - L2

HCLTech

7 - 9 years

Noida

Posted: 06/03/2026


Job Description

Job Title: AI Infrastructure Engineer - L2

Experience: 5 to 8 years

Location: Noida, Bangalore, Chennai, Hyderabad



The Role


The AI Infrastructure Engineer (L2) is responsible for implementing, operating, and supporting AI/ML infrastructure environments under defined architectural guidelines. The role ensures reliable, secure, and scalable platforms for model development, training, and inference.


Responsibilities:


  • Deploy, configure, and maintain GPU/AI accelerator servers (NVIDIA A100/H100/L40, AMD Instinct, TPU) as per defined standards.
  • Perform routine GPU hardware health checks, including assisted replacements, firmware updates, and node bring-up activities.
  • Troubleshoot common GPU issues such as driver failures, thermal throttling, PCIe/NVLink alignment problems, and compatibility conflicts.
  • Install, upgrade, and validate GPU software stacks, including NVIDIA drivers, CUDA, cuDNN, TensorRT, and related libraries.
  • Ensure software stack consistency across GPU nodes in alignment with reference architectures.
  • Assist in compatibility testing for new GPU drivers, CUDA, and library releases.
  • Configure and maintain Linux systems (Ubuntu, RHEL, Rocky Linux) for AI workloads.
  • Apply OS-level tuning for NUMA alignment, kernel parameters, CPU pinning, and clock settings as per documented guidelines.
  • Operate Kubernetes GPU clusters, including NVIDIA GPU Operator, device plugins, MIG configuration, and node feature discovery.
  • Perform GPU node onboarding, node labeling, and routine cluster maintenance tasks.
  • Support Kubernetes upgrades and patching activities under L3 guidance.
  • Configure and maintain GPU scheduling controls, including node pools, resource quotas, and job queues.
  • Monitor GPU allocation and assist in resolving scheduling or contention issues.
  • Enforce fair usage and workload isolation policies defined by platform standards.
  • Support execution of distributed training workloads using NCCL, PyTorch Distributed, Horovod, and DeepSpeed.
  • Operate and maintain inference runtimes such as vLLM, Triton Inference Server, and TensorRT-LLM.
  • Assist ML teams in diagnosing performance or runtime errors.
  • Operate and monitor distributed file systems (BeeGFS, Lustre, Ceph, high throughput NFS).
  • Support high-bandwidth networking components: InfiniBand, RoCE, RDMA, NVLink.
  • Escalate complex performance or connectivity issues to L3 engineers.
  • Monitor GPU, node, and cluster health using Prometheus, Grafana, NVIDIA DCGM, and OpenTelemetry.
  • Use Infrastructure as Code and automation tools such as Terraform, Helm, Kustomize, and GitOps tools (ArgoCD / Flux).
  • Provide L2 operational support for AI infrastructure issues.
  • Perform initial root cause analysis for GPU, Kubernetes, network, and storage incidents.
  • Escalate complex issues to L3 with detailed findings and logs.
  • Support production readiness activities including patching, hotfixes, and routine maintenance.
  • Work under SLA commitments and production pressure.
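
To give a flavor of the monitoring work described above, here is a minimal, illustrative sketch of a GPU node health check. It assumes per-GPU metrics have already been collected (for example via the NVIDIA DCGM exporter); the field names (`temperature_c`, `ecc_uncorrected`, `driver_ok`) and thresholds are placeholder assumptions, not DCGM's actual metric names.

```python
# Illustrative sketch: flag unhealthy GPUs from already-collected metrics.
# Field names and thresholds are assumptions for demonstration only.

THERMAL_LIMIT_C = 85   # assumed thermal-throttling threshold
ECC_ERROR_LIMIT = 0    # any uncorrectable ECC error is a red flag

def check_gpu_health(node_metrics: dict) -> list[str]:
    """Return a list of human-readable issues for one GPU node."""
    issues = []
    for gpu_id, m in node_metrics.items():
        if m.get("temperature_c", 0) > THERMAL_LIMIT_C:
            issues.append(f"{gpu_id}: thermal throttling risk ({m['temperature_c']} C)")
        if m.get("ecc_uncorrected", 0) > ECC_ERROR_LIMIT:
            issues.append(f"{gpu_id}: uncorrectable ECC errors ({m['ecc_uncorrected']})")
        if not m.get("driver_ok", True):
            issues.append(f"{gpu_id}: driver not responding")
    return issues

# Example: one hot GPU and one GPU reporting ECC errors.
sample = {
    "gpu0": {"temperature_c": 91, "ecc_uncorrected": 0, "driver_ok": True},
    "gpu1": {"temperature_c": 72, "ecc_uncorrected": 2, "driver_ok": True},
}
print(check_gpu_health(sample))
```

In practice, checks like this would feed Prometheus alerts rather than print to stdout; the sketch only shows the triage logic an L2 engineer applies before escalating to L3.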


Qualifications & Experience


  • 3-7 years of experience in infrastructure, SRE, or platform engineering roles.
  • Strong Linux administration skills (Ubuntu/RHEL/Rocky).
  • Good understanding of GPU servers, CUDA toolkit, drivers, and monitoring.
  • Hands-on exposure to Kubernetes operations (preferably GPU-enabled clusters).
  • Experience with automation tools (Terraform, Helm, GitOps) and container runtimes.
  • Knowledge of distributed storage and HPC networking (InfiniBand/RDMA) is a plus.
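
As a small illustration of the GPU-enabled Kubernetes experience mentioned above, the following is a minimal, hypothetical pod spec requesting a single GPU via the `nvidia.com/gpu` extended resource exposed by the NVIDIA device plugin. The pod name and image tag are placeholder examples only.

```yaml
# Illustrative only: a minimal pod requesting one GPU.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test          # placeholder name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # example tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1      # schedules the pod onto a GPU node
```

Requesting the GPU under `resources.limits` is what lets the scheduler enforce the quotas and fair-usage policies described in the responsibilities.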


Certifications Required


  • Public Cloud (AWS, Azure, GCP) Certified Practitioner (Foundation)
  • Kubernetes and Cloud Native Associate
  • Linux Foundation Certified System Administrator
