Senior AI Systems Engineer – Distributed Training & GPU Optimization

Google

5 - 10 years

Bengaluru

Posted: 29/01/2026

Job Description

We are hiring an engineer who has personally built and optimized distributed training systems for large AI models and has deep, real-world experience optimizing GPU workloads specifically on Google Cloud.

This is not a research role, not a general ML engineer role, and not cloud-agnostic.


Core Responsibilities

Distributed Training (Foundation-Scale)

  • Build and operate multi-node, multi-GPU distributed training systems (16-128+ GPUs).
  • Implement and tune:
      • PyTorch Distributed (DDP, FSDP, TorchElastic)
      • DeepSpeed (ZeRO-2 / ZeRO-3, CPU/NVMe offload)
      • Hybrid parallelism (data, tensor, pipeline)
  • Create reusable distributed training frameworks and templates for large models.
  • Handle checkpoint sharding, failure recovery, and elastic scaling.
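The checkpoint-sharding responsibility above can be sketched in a few lines. This is an illustrative, framework-free sketch of ZeRO-style partitioning: each rank owns a roughly equal share of the parameters, so optimizer state and checkpoint size per GPU shrink linearly with world size. Note that DeepSpeed actually partitions a flattened contiguous buffer rather than whole tensors; the greedy whole-tensor assignment and the parameter names here are hypothetical simplifications.

```python
def shard_params(param_numel: dict, world_size: int) -> list:
    """Greedily assign each parameter tensor to the currently lightest rank.

    Simplified stand-in for ZeRO-style partitioning: real implementations
    (e.g. DeepSpeed ZeRO-3) shard a flat contiguous buffer instead.
    """
    shards = [dict() for _ in range(world_size)]
    loads = [0] * world_size
    # Placing the largest tensors first gives a tighter balance.
    for name, numel in sorted(param_numel.items(), key=lambda kv: -kv[1]):
        rank = loads.index(min(loads))
        shards[rank][name] = numel
        loads[rank] += numel
    return shards

# Hypothetical parameter sizes for a toy model.
params = {"embed": 50_000_000, "layer0.w": 12_000_000,
          "layer1.w": 12_000_000, "head": 50_000_000}
shards = shard_params(params, world_size=2)
```

Each rank then saves only its own shard at checkpoint time, and failure recovery reloads per-rank shards in parallel instead of funneling one giant file through rank 0.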

GPU Optimization (Google Cloud Only)

  • Optimize GPU utilization and cost on Google Cloud GPUs:
      • A100, H100, L4
  • Achieve high utilization through:
      • Mixed precision (FP16 / BF16)
      • Gradient checkpointing
      • Memory optimization and recomputation
  • Tune NCCL communication (All-Reduce, All-Gather) for multi-node GCP clusters.
  • Reduce GPU idle time and cost per training run.
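The memory side of these levers reduces to back-of-envelope arithmetic. The sketch below uses the commonly cited figure of 16 bytes of model state per parameter for mixed-precision Adam training (2 B bf16 weight + 2 B bf16 gradient + 4 B fp32 master copy + 8 B Adam moments), divided across ZeRO-3 shards; the 7B-parameter example is illustrative, and activations, buffers, and fragmentation are extra.

```python
def model_state_gib(n_params: int, zero_shards: int = 1) -> float:
    """Approximate per-GPU memory (GiB) for model states under
    mixed-precision Adam: 16 bytes per parameter, sharded ZeRO-3-style.
    Activation memory is not included.
    """
    return 16 * n_params / zero_shards / 2**30

# A hypothetical 7B-parameter model: far beyond one 80 GB A100/H100
# unsharded, but comfortable per-GPU when ZeRO-3-sharded across 16 GPUs.
full = model_state_gib(7_000_000_000)            # ~104 GiB
sharded = model_state_gib(7_000_000_000, 16)     # ~6.5 GiB
```

This is the arithmetic behind "why ZeRO-3 at this scale": the remaining headroom is what mixed precision, gradient checkpointing, and recomputation then free up for larger batches.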

Google Cloud Execution

  • Run and optimize training jobs using:
      • Vertex AI custom training
      • GKE with GPU node pools
      • Compute Engine GPU VMs
  • Optimize GPU scheduling, scaling, and placement.
  • Use preemptible GPUs safely for large training jobs.
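Using preemptible (Spot) GPUs safely hinges on one pattern: checkpoint atomically at a fixed cadence and make the job idempotent on restart. The skeleton below is a minimal, framework-free sketch of that pattern; every name in it is hypothetical, and a real job would write sharded model/optimizer state to GCS rather than JSON to local disk.

```python
import json
import os
import tempfile

def save_ckpt(path: str, state: dict) -> None:
    """Write atomically so a preemption mid-write never corrupts the checkpoint."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename on POSIX

def train(ckpt_path: str, total_steps: int, save_every: int, die_at=None) -> dict:
    """Resume from the last checkpoint if one exists, then run to completion."""
    state = {"step": 0}
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)  # restart is a no-op replay from last save
    for step in range(state["step"], total_steps):
        if die_at is not None and step == die_at:
            raise RuntimeError("preempted")  # simulated Spot VM preemption
        state = {"step": step + 1}
        if (step + 1) % save_every == 0:
            save_ckpt(ckpt_path, state)
    return state
```

On GCP the same loop would typically be wrapped by Vertex AI custom training or a GKE Job with restart policy, so a preempted worker is relaunched and simply resumes from the last saved step.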

Performance Profiling

  • Profile and debug GPU workloads using:
      • NVIDIA Nsight Systems / Compute
      • DCGM
  • Identify compute, memory, and communication bottlenecks.
  • Produce performance benchmarks and optimization reports.
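Typical invocations for the tools above are sketched below; `train.py` and the output names are placeholders, and exact flags and DCGM field IDs should be checked against the installed versions (`dcgmi dmon -l` lists available fields).

```shell
# Timeline profile of a training run with Nsight Systems
# (CUDA kernel and NVTX range tracing), then a summary report.
nsys profile -t cuda,nvtx -o train_profile python train.py
nsys stats train_profile.nsys-rep

# Per-kernel deep dive with Nsight Compute.
ncu --set full -o kernel_profile python train.py

# Live utilization via DCGM; field IDs here are assumed to be the
# profiling metrics for SM activity and DRAM bandwidth -- verify
# with `dcgmi dmon -l` on the target driver/DCGM version.
dcgmi dmon -e 1002,1005 -d 1000
```

Timeline gaps in the Nsight Systems trace typically separate communication bottlenecks (long NCCL kernels) from input-pipeline stalls (idle GPU between steps), which is the split the benchmark reports should call out.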


Required Experience (Recruiter Screening Criteria)

Must-Have Experience (Non-Negotiable)

  • 8+ years in ML systems, distributed systems, or HPC
  • Hands-on experience scaling multi-node GPU training (16+ GPUs)
  • Deep expertise in:
      • PyTorch Distributed
      • DeepSpeed
      • NCCL
  • Direct production experience on Google Cloud GPUs
  • Proven record of GPU performance and cost optimization

Strongly Preferred

  • Experience training foundation models / LLM-scale models
  • Experience with Vertex AI + GKE
  • Experience optimizing GPU workloads at enterprise scale
