AI Infrastructure Engineer - L2
HCLTech
7 - 9 years
Noida
Posted: 06/03/2026
Job Description
Job Title: AI Infrastructure Engineer - L2
Experience: 5 to 8 years
Location: Noida, Bangalore, Chennai, Hyderabad
The Role
The AI Infrastructure Engineer (L2) is responsible for implementing, operating, and supporting AI/ML infrastructure environments under defined architectural guidelines. The role ensures reliable, secure, and scalable platforms for model development, training, and inference.
Responsibilities:
- Deploy, configure, and maintain GPU/AI accelerator servers (NVIDIA A100/H100/L40, AMD Instinct, TPU) as per defined standards.
- Perform routine GPU hardware health checks, including assisted replacements, firmware updates, and node bring-up activities.
- Troubleshoot common GPU issues such as driver failures, thermal throttling, PCIe/NVLink alignment problems, and compatibility conflicts.
- Install, upgrade, and validate GPU software stacks, including NVIDIA drivers, CUDA, cuDNN, TensorRT, and related libraries.
- Ensure software stack consistency across GPU nodes in alignment with reference architectures.
- Assist in compatibility testing for new GPU drivers, CUDA, and library releases.
- Configure and maintain Linux systems (Ubuntu, RHEL, Rocky Linux) for AI workloads.
- Apply OS-level tuning for NUMA alignment, kernel parameters, CPU pinning, and clock settings as per documented guidelines.
- Operate Kubernetes GPU clusters, including NVIDIA GPU Operator, device plugins, MIG configuration, and node feature discovery.
- Perform GPU node onboarding, node labeling, and routine cluster maintenance tasks.
- Support Kubernetes upgrades and patching activities under L3 guidance.
- Configure and maintain GPU scheduling controls, including node pools, resource quotas, and job queues.
- Monitor GPU allocation and assist in resolving scheduling or contention issues.
- Enforce fair usage and workload isolation policies defined by platform standards.
- Support execution of distributed training workloads using NCCL, PyTorch Distributed, Horovod, and DeepSpeed.
- Operate and maintain inference runtimes such as vLLM, Triton Inference Server, and TensorRT-LLM.
- Assist ML teams in diagnosing performance or runtime errors.
- Operate and monitor distributed file systems (BeeGFS, Lustre, Ceph, high throughput NFS).
- Support high-bandwidth networking components: InfiniBand, RoCE, RDMA, and NVLink.
- Escalate complex performance or connectivity issues to L3 engineers.
- Monitor GPU, node, and cluster health using Prometheus, Grafana, NVIDIA DCGM, and OpenTelemetry.
- Use Infrastructure as Code and automation tools such as Terraform, Helm, Kustomize, and GitOps tools (ArgoCD / Flux).
- Provide L2 operational support for AI infrastructure issues.
- Perform initial root cause analysis for GPU, Kubernetes, network, and storage incidents.
- Escalate complex issues to L3 with detailed findings and logs.
- Support production readiness activities including patching, hotfixes, and routine maintenance.
- Work under SLA commitments and production pressure.
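As a hedged illustration of the GPU health-monitoring duties above, here is a minimal Python sketch that parses text output from the DCGM Prometheus exporter and flags hot GPUs. The metric name `DCGM_FI_DEV_GPU_TEMP` is the standard dcgm-exporter gauge; the temperature threshold and sample data are illustrative assumptions, not vendor specifications.

```python
# Minimal sketch: parse DCGM Prometheus-exporter text output and flag
# GPUs whose temperature exceeds a throttling threshold.
# The 83 C limit is an illustrative assumption, not an NVIDIA spec.
import re

TEMP_METRIC = "DCGM_FI_DEV_GPU_TEMP"
TEMP_LIMIT_C = 83.0

def flag_hot_gpus(exporter_text: str, limit: float = TEMP_LIMIT_C):
    """Return [(gpu_label, temp), ...] for GPUs at or above `limit`."""
    hot = []
    for line in exporter_text.splitlines():
        if not line.startswith(TEMP_METRIC):
            continue
        # e.g.: DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-aaa"} 67
        m = re.match(r'%s\{([^}]*)\}\s+([\d.]+)' % TEMP_METRIC, line)
        if not m:
            continue
        labels, value = m.group(1), float(m.group(2))
        gpu = dict(kv.split("=", 1) for kv in labels.split(","))
        if value >= limit:
            hot.append((gpu.get("gpu", "?").strip('"'), value))
    return hot

sample = """\
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-aaa"} 67
DCGM_FI_DEV_GPU_TEMP{gpu="1",UUID="GPU-bbb"} 91
"""
print(flag_hot_gpus(sample))  # GPU 1 exceeds the assumed limit
```

In practice an alert of this kind would normally be expressed as a Prometheus alerting rule rather than application code; the sketch only shows the shape of the data an L2 engineer works with.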
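The GPU scheduling and fair-usage duties above can be sketched as a tiny admission check in Python. The team names and quota numbers are hypothetical; in a real cluster this policy would live in Kubernetes `ResourceQuota` objects or a scheduler plugin, not in application code.

```python
# Minimal sketch of a fair-usage admission check for GPU job queues.
# Quotas and team names are hypothetical, for illustration only.
from dataclasses import dataclass

@dataclass
class TeamQuota:
    gpu_limit: int       # max GPUs the team may hold at once
    gpu_in_use: int = 0  # currently allocated

def admit(quotas: dict, team: str, requested_gpus: int) -> bool:
    """Admit a job only if it keeps the team within its GPU quota."""
    q = quotas.get(team)
    if q is None or requested_gpus <= 0:
        return False
    if q.gpu_in_use + requested_gpus > q.gpu_limit:
        return False  # would exceed the team's fair-share limit
    q.gpu_in_use += requested_gpus
    return True

quotas = {"ml-research": TeamQuota(gpu_limit=8),
          "inference": TeamQuota(gpu_limit=4)}
print(admit(quotas, "ml-research", 6))  # True: 6 <= 8
print(admit(quotas, "ml-research", 4))  # False: 6 + 4 > 8
```

Resolving contention then reduces to inspecting `gpu_in_use` against `gpu_limit` per team, which mirrors how quota exhaustion surfaces in `kubectl describe resourcequota`.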
Qualifications & Experience
- 3-7 years of experience in infrastructure, SRE, or platform engineering roles.
- Strong Linux administration skills (Ubuntu/RHEL/Rocky).
- Good understanding of GPU servers, CUDA toolkit, drivers, and monitoring.
- Hands-on exposure to Kubernetes operations (preferably GPU-enabled clusters).
- Experience with automation tools (Terraform, Helm, GitOps) and container runtimes.
- Knowledge of distributed storage and HPC networking (InfiniBand/RDMA) is a plus.
Certifications Required
- Public Cloud (AWS, Azure, GCP) Certified Practitioner (Foundation)
- Kubernetes and Cloud Native Associate
- Linux Foundation Certified System Administrator