AI Infrastructure Engineer - L2
HCLTech
7 - 9 years
Noida
Posted: 06/03/2026
Job Description
Job Title: AI Infrastructure Engineer - L2
Experience: 5 to 8 years
Location: Noida, Bangalore, Chennai, Hyderabad
The Role
The AI Infrastructure Engineer (L2) is responsible for implementing, operating, and supporting AI/ML infrastructure environments under defined architectural guidelines. The role ensures reliable, secure, and scalable platforms for model development, training, and inference.
Responsibilities:
- Deploy, configure, and maintain GPU/AI accelerator servers (NVIDIA A100/H100/L40, AMD Instinct, TPU) as per defined standards.
- Perform routine GPU hardware health checks, including assisted replacements, firmware updates, and node bring-up activities.
- Troubleshoot common GPU issues such as driver failures, thermal throttling, PCIe/NVLink alignment problems, and compatibility conflicts.
- Install, upgrade, and validate GPU software stacks, including NVIDIA drivers, CUDA, cuDNN, TensorRT, and related libraries.
- Ensure software stack consistency across GPU nodes in alignment with reference architectures.
- Assist in compatibility testing for new GPU drivers, CUDA, and library releases.
- Configure and maintain Linux systems (Ubuntu, RHEL, Rocky Linux) for AI workloads.
- Apply OS-level tuning for NUMA alignment, kernel parameters, CPU pinning, and clock settings as per documented guidelines.
- Operate Kubernetes GPU clusters, including NVIDIA GPU Operator, device plugins, MIG configuration, and node feature discovery.
- Perform GPU node onboarding, node labeling, and routine cluster maintenance tasks.
- Support Kubernetes upgrades and patching activities under L3 guidance.
- Configure and maintain GPU scheduling controls, including node pools, resource quotas, and job queues.
- Monitor GPU allocation and assist in resolving scheduling or contention issues.
- Enforce fair usage and workload isolation policies defined by platform standards.
- Support execution of distributed training workloads using NCCL, PyTorch Distributed, Horovod, and DeepSpeed.
- Operate and maintain inference runtimes such as vLLM, Triton Inference Server, and TensorRT-LLM.
- Assist ML teams in diagnosing performance or runtime errors.
- Operate and monitor distributed file systems (BeeGFS, Lustre, Ceph, high throughput NFS).
- Support high-bandwidth networking components: InfiniBand, RoCE, RDMA, and NVLink.
- Escalate complex performance or connectivity issues to L3 engineers.
- Monitor GPU, node, and cluster health using Prometheus, Grafana, NVIDIA DCGM, and OpenTelemetry.
- Use Infrastructure as Code and automation tools such as Terraform, Helm, Kustomize, and GitOps tools (ArgoCD / Flux).
- Provide L2 operational support for AI infrastructure issues.
- Perform initial root cause analysis for GPU, Kubernetes, network, and storage incidents.
- Escalate complex issues to L3 with detailed findings and logs.
- Support production readiness activities including patching, hotfixes, and routine maintenance.
- Work under SLA commitments and production pressure.
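As a hedged illustration of the GPU health-monitoring duties above, here is a minimal Python sketch that parses text output from the DCGM Prometheus exporter and flags hot GPUs. The metric name `DCGM_FI_DEV_GPU_TEMP` is the standard dcgm-exporter gauge; the temperature threshold and sample data are illustrative assumptions, not vendor specifications.

```python
# Minimal sketch: parse DCGM Prometheus-exporter text output and flag
# GPUs whose temperature exceeds a throttling threshold.
# The 83 C limit is an illustrative assumption, not an NVIDIA spec.
import re

TEMP_METRIC = "DCGM_FI_DEV_GPU_TEMP"
TEMP_LIMIT_C = 83.0

def flag_hot_gpus(exporter_text: str, limit: float = TEMP_LIMIT_C):
    """Return [(gpu_label, temp), ...] for GPUs at or above `limit`."""
    hot = []
    for line in exporter_text.splitlines():
        if not line.startswith(TEMP_METRIC):
            continue
        # e.g.: DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-aaa"} 67
        m = re.match(r'%s\{([^}]*)\}\s+([\d.]+)' % TEMP_METRIC, line)
        if not m:
            continue
        labels, value = m.group(1), float(m.group(2))
        gpu = dict(kv.split("=", 1) for kv in labels.split(","))
        if value >= limit:
            hot.append((gpu.get("gpu", "?").strip('"'), value))
    return hot

sample = """\
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-aaa"} 67
DCGM_FI_DEV_GPU_TEMP{gpu="1",UUID="GPU-bbb"} 91
"""
print(flag_hot_gpus(sample))  # GPU 1 exceeds the assumed limit
```

In practice an alert of this kind would normally be expressed as a Prometheus alerting rule rather than application code; the sketch only shows the shape of the data an L2 engineer works with.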
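The GPU scheduling and fair-usage duties above can be sketched as a tiny admission check in Python. The team names and quota numbers are hypothetical; in a real cluster this policy would live in Kubernetes `ResourceQuota` objects or a scheduler plugin, not in application code.

```python
# Minimal sketch of a fair-usage admission check for GPU job queues.
# Quotas and team names are hypothetical, for illustration only.
from dataclasses import dataclass

@dataclass
class TeamQuota:
    gpu_limit: int       # max GPUs the team may hold at once
    gpu_in_use: int = 0  # currently allocated

def admit(quotas: dict, team: str, requested_gpus: int) -> bool:
    """Admit a job only if it keeps the team within its GPU quota."""
    q = quotas.get(team)
    if q is None or requested_gpus <= 0:
        return False
    if q.gpu_in_use + requested_gpus > q.gpu_limit:
        return False  # would exceed the team's fair-share limit
    q.gpu_in_use += requested_gpus
    return True

quotas = {"ml-research": TeamQuota(gpu_limit=8),
          "inference": TeamQuota(gpu_limit=4)}
print(admit(quotas, "ml-research", 6))  # True: 6 <= 8
print(admit(quotas, "ml-research", 4))  # False: 6 + 4 > 8
```

Resolving contention then reduces to inspecting `gpu_in_use` against `gpu_limit` per team, which mirrors how quota exhaustion surfaces in `kubectl describe resourcequota`.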
Qualifications & Experience
- 3-7 years of experience in infrastructure, SRE, or platform engineering roles.
- Strong Linux administration skills (Ubuntu/RHEL/Rocky).
- Good understanding of GPU servers, CUDA toolkit, drivers, and monitoring.
- Hands-on exposure to Kubernetes operations (preferably GPU-enabled clusters).
- Experience with automation tools (Terraform, Helm, GitOps) and container runtimes.
- Knowledge of distributed storage and HPC networking (InfiniBand/RDMA) is a plus.
Certifications Required
- Public Cloud (AWS, Azure, GCP) Certified Practitioner (Foundation)
- Kubernetes and Cloud Native Associate
- Linux Foundation Certified System Administrator