Senior AI Systems Engineer – Distributed Training & GPU Optimization
5–10 years
Bengaluru
Posted: 29/01/2026
Job Description
We are hiring an engineer who has personally built and optimized distributed training systems for large AI models and has deep, real-world experience optimizing GPU workloads specifically on Google Cloud.
This is not a research role, not a general ML engineer role, and not cloud-agnostic.
Core Responsibilities
Distributed Training (Foundation-Scale)
- Build and operate multi-node, multi-GPU distributed training systems (16–128+ GPUs).
- Implement and tune:
- PyTorch Distributed (DDP, FSDP, TorchElastic)
- DeepSpeed (ZeRO-2 / ZeRO-3, CPU/NVMe offload)
- Hybrid parallelism (data, tensor, pipeline)
- Create reusable distributed training frameworks and templates for large models.
- Handle checkpoint sharding, failure recovery, and elastic scaling.
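The DDP wrapping at the heart of the responsibilities above can be sketched minimally as follows. This is a single-process, CPU-only illustration using the gloo backend so it runs without GPUs; a real multi-node job would launch one process per GPU via torchrun with the nccl backend. The model, shapes, and hyperparameters are placeholders.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train_step() -> float:
    # Single-process rendezvous for illustration; torchrun sets these
    # variables automatically in real multi-node launches.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    model = torch.nn.Linear(8, 1)          # placeholder model
    ddp_model = DDP(model)                 # DDP syncs gradients via all-reduce
    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

    x = torch.randn(4, 8)
    loss = ddp_model(x).pow(2).mean()
    loss.backward()                        # gradients all-reduced across ranks here
    opt.step()

    dist.destroy_process_group()
    return loss.item()

loss_value = train_step()
```

With more than one rank, the backward pass overlaps gradient all-reduce with computation, which is where the NCCL tuning mentioned later becomes relevant.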
GPU Optimization (Google Cloud Only)
- Optimize GPU utilization and cost on Google Cloud GPUs:
- A100, H100, L4
- Achieve high utilization through:
- Mixed precision (FP16 / BF16)
- Gradient checkpointing
- Memory optimization and recomputation
- Tune NCCL communication (All-Reduce, All-Gather) for multi-node GCP clusters.
- Reduce GPU idle time and cost per training run.
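The precision and memory techniques above are typically expressed together in a DeepSpeed configuration. A hedged sketch, shown as a Python dict; the values are illustrative placeholders, not tuned recommendations for any particular cluster:

```python
# Illustrative DeepSpeed-style config combining ZeRO-3 sharding,
# BF16 mixed precision, CPU offload, and activation recomputation.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},                        # mixed precision (BF16)
    "zero_optimization": {
        "stage": 3,                                   # ZeRO-3: shard params, grads, optimizer state
        "offload_optimizer": {"device": "cpu"},       # CPU offload of optimizer state
        "offload_param": {"device": "cpu"},
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "activation_checkpointing": {                     # recomputation to trade compute for memory
        "partition_activations": True,
        "contiguous_memory_optimization": True,
    },
}
```

In practice the stage, offload targets, and batch sizes are the main levers traded off against GPU memory and interconnect bandwidth.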
Google Cloud Execution
- Run and optimize training jobs using:
- Vertex AI custom training
- GKE with GPU node pools
- Compute Engine GPU VMs
- Optimize GPU scheduling, scaling, and placement.
- Use preemptible GPUs safely for large training jobs.
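A Vertex AI custom training job of the kind described above is defined by a worker-pool spec. The sketch below is a Python dict mirroring the Vertex AI CustomJob API fields; the job name, image URI, replica counts, and NCCL environment values are illustrative placeholders, not a recommended setup:

```python
# Illustrative Vertex AI CustomJob spec (dict mirroring the REST API shape).
custom_job = {
    "display_name": "llm-pretrain-run",               # hypothetical job name
    "job_spec": {
        "worker_pool_specs": [
            {
                "machine_spec": {
                    "machine_type": "a2-highgpu-8g",  # 8x A100 per node
                    "accelerator_type": "NVIDIA_TESLA_A100",
                    "accelerator_count": 8,
                },
                "replica_count": 4,                   # 4 nodes -> 32 GPUs total
                "container_spec": {
                    "image_uri": "gcr.io/my-project/trainer:latest",  # placeholder
                    "env": [
                        # NCCL knobs commonly tuned on GCP multi-node clusters
                        {"name": "NCCL_SOCKET_IFNAME", "value": "eth0"},
                        {"name": "NCCL_DEBUG", "value": "INFO"},
                    ],
                },
            }
        ],
    },
}

spec = custom_job["job_spec"]["worker_pool_specs"][0]
total_gpus = spec["machine_spec"]["accelerator_count"] * spec["replica_count"]
```

When preemptible or Spot capacity is used, the container must checkpoint frequently and tolerate worker restarts, since nodes can be reclaimed mid-run.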
Performance Profiling
- Profile and debug GPU workloads using:
- NVIDIA Nsight Systems / Compute
- DCGM
- Identify compute, memory, and communication bottlenecks.
- Produce performance benchmarks and optimization reports.
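Nsight Systems and DCGM operate outside the training process; an in-process complement is torch.profiler, sketched below. The example profiles a small CPU matmul workload so it runs anywhere; on GPU nodes one would also pass `ProfilerActivity.CUDA` and export a trace for deeper analysis. The workload is illustrative.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Profile a toy matmul loop on CPU; real runs would profile training
# steps and typically export a Chrome/TensorBoard trace.
x = torch.randn(256, 256)
with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(10):
        y = x @ x

# Summary table of the hottest operators by total CPU time.
table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=5)
```

The per-operator breakdown is a quick first pass for spotting whether a job is compute-, memory-, or communication-bound before reaching for Nsight.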
Required Experience (Recruiter Screening Criteria)
Must-Have Experience (Non-Negotiable)
- 8+ years in ML systems, distributed systems, or HPC
- Hands-on experience scaling multi-node GPU training (16+ GPUs)
- Deep expertise in:
- PyTorch Distributed
- DeepSpeed
- NCCL
- Direct production experience on Google Cloud GPUs
- Proven record of GPU performance and cost optimization
Strongly Preferred
- Experience training foundation models / LLM-scale models
- Experience with Vertex AI + GKE
- Experience optimizing GPU workloads at enterprise scale