Senior AI Systems Engineer – Distributed Training & GPU Optimization

Google

5 - 10 years

Bengaluru

Posted: 29/01/2026

Job Description

We are hiring an engineer who has personally built and optimized distributed training systems for large AI models and has deep, real-world experience optimizing GPU workloads specifically on Google Cloud.

This is not a research role, not a general ML engineer role, and not cloud-agnostic.


Core Responsibilities

Distributed Training (Foundation-Scale)

  • Build and operate multi-node, multi-GPU distributed training systems (16-128+ GPUs).
  • Implement and tune:
      • PyTorch Distributed (DDP, FSDP, TorchElastic)
      • DeepSpeed (ZeRO-2 / ZeRO-3, CPU/NVMe offload)
      • Hybrid parallelism (data, tensor, pipeline)
  • Create reusable distributed training frameworks and templates for large models.
  • Handle checkpoint sharding, failure recovery, and elastic scaling.
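The checkpoint-sharding responsibility above can be sketched in a few lines. This is an illustrative, framework-free sketch of ZeRO-style partitioning: each rank owns a roughly equal share of the parameters, so optimizer state and checkpoint size per GPU shrink linearly with world size. Note that DeepSpeed actually partitions a flattened contiguous buffer rather than whole tensors; the greedy whole-tensor assignment and the parameter names here are hypothetical simplifications.

```python
def shard_params(param_numel: dict, world_size: int) -> list:
    """Greedily assign each parameter tensor to the currently lightest rank.

    Simplified stand-in for ZeRO-style partitioning: real implementations
    (e.g. DeepSpeed ZeRO-3) shard a flat contiguous buffer instead.
    """
    shards = [dict() for _ in range(world_size)]
    loads = [0] * world_size
    # Placing the largest tensors first gives a tighter balance.
    for name, numel in sorted(param_numel.items(), key=lambda kv: -kv[1]):
        rank = loads.index(min(loads))
        shards[rank][name] = numel
        loads[rank] += numel
    return shards

# Hypothetical parameter sizes for a toy model.
params = {"embed": 50_000_000, "layer0.w": 12_000_000,
          "layer1.w": 12_000_000, "head": 50_000_000}
shards = shard_params(params, world_size=2)
```

Each rank then saves only its own shard at checkpoint time, and failure recovery reloads per-rank shards in parallel instead of funneling one giant file through rank 0.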

GPU Optimization (Google Cloud Only)

  • Optimize GPU utilization and cost on Google Cloud GPUs:
      • A100, H100, L4
  • Achieve high utilization through:
      • Mixed precision (FP16 / BF16)
      • Gradient checkpointing
      • Memory optimization and recomputation
  • Tune NCCL communication (All-Reduce, All-Gather) for multi-node GCP clusters.
  • Reduce GPU idle time and cost per training run.
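The memory side of these levers reduces to back-of-envelope arithmetic. The sketch below uses the commonly cited figure of 16 bytes of model state per parameter for mixed-precision Adam training (2 B bf16 weight + 2 B bf16 gradient + 4 B fp32 master copy + 8 B Adam moments), divided across ZeRO-3 shards; the 7B-parameter example is illustrative, and activations, buffers, and fragmentation are extra.

```python
def model_state_gib(n_params: int, zero_shards: int = 1) -> float:
    """Approximate per-GPU memory (GiB) for model states under
    mixed-precision Adam: 16 bytes per parameter, sharded ZeRO-3-style.
    Activation memory is not included.
    """
    return 16 * n_params / zero_shards / 2**30

# A hypothetical 7B-parameter model: far beyond one 80 GB A100/H100
# unsharded, but comfortable per-GPU when ZeRO-3-sharded across 16 GPUs.
full = model_state_gib(7_000_000_000)            # ~104 GiB
sharded = model_state_gib(7_000_000_000, 16)     # ~6.5 GiB
```

This is the arithmetic behind "why ZeRO-3 at this scale": the remaining headroom is what mixed precision, gradient checkpointing, and recomputation then free up for larger batches.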

Google Cloud Execution

  • Run and optimize training jobs using:
      • Vertex AI custom training
      • GKE with GPU node pools
      • Compute Engine GPU VMs
  • Optimize GPU scheduling, scaling, and placement.
  • Use preemptible GPUs safely for large training jobs.
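Using preemptible (Spot) GPUs safely hinges on one pattern: checkpoint atomically at a fixed cadence and make the job idempotent on restart. The skeleton below is a minimal, framework-free sketch of that pattern; every name in it is hypothetical, and a real job would write sharded model/optimizer state to GCS rather than JSON to local disk.

```python
import json
import os
import tempfile

def save_ckpt(path: str, state: dict) -> None:
    """Write atomically so a preemption mid-write never corrupts the checkpoint."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename on POSIX

def train(ckpt_path: str, total_steps: int, save_every: int, die_at=None) -> dict:
    """Resume from the last checkpoint if one exists, then run to completion."""
    state = {"step": 0}
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)  # restart is a no-op replay from last save
    for step in range(state["step"], total_steps):
        if die_at is not None and step == die_at:
            raise RuntimeError("preempted")  # simulated Spot VM preemption
        state = {"step": step + 1}
        if (step + 1) % save_every == 0:
            save_ckpt(ckpt_path, state)
    return state
```

On GCP the same loop would typically be wrapped by Vertex AI custom training or a GKE Job with restart policy, so a preempted worker is relaunched and simply resumes from the last saved step.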

Performance Profiling

  • Profile and debug GPU workloads using:
      • NVIDIA Nsight Systems / Compute
      • DCGM
  • Identify compute, memory, and communication bottlenecks.
  • Produce performance benchmarks and optimization reports.
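Typical invocations for the tools above are sketched below; `train.py` and the output names are placeholders, and exact flags and DCGM field IDs should be checked against the installed versions (`dcgmi dmon -l` lists available fields).

```shell
# Timeline profile of a training run with Nsight Systems
# (CUDA kernel and NVTX range tracing), then a summary report.
nsys profile -t cuda,nvtx -o train_profile python train.py
nsys stats train_profile.nsys-rep

# Per-kernel deep dive with Nsight Compute.
ncu --set full -o kernel_profile python train.py

# Live utilization via DCGM; field IDs here are assumed to be the
# profiling metrics for SM activity and DRAM bandwidth -- verify
# with `dcgmi dmon -l` on the target driver/DCGM version.
dcgmi dmon -e 1002,1005 -d 1000
```

Timeline gaps in the Nsight Systems trace typically separate communication bottlenecks (long NCCL kernels) from input-pipeline stalls (idle GPU between steps), which is the split the benchmark reports should call out.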


Required Experience (Recruiter Screening Criteria)

Must-Have Experience (Non-Negotiable)

  • 8+ years in ML systems, distributed systems, or HPC
  • Hands-on experience scaling multi-node GPU training (16+ GPUs)
  • Deep expertise in:
      • PyTorch Distributed
      • DeepSpeed
      • NCCL
  • Direct production experience on Google Cloud GPUs
  • Proven record of GPU performance and cost optimization

Strongly Preferred

  • Experience training foundation models / LLM-scale models
  • Experience with Vertex AI + GKE
  • Experience optimizing GPU workloads at enterprise scale
