GPU Optimization Engineer

Taglynk

2 - 5 years

Bengaluru

Posted: 08/01/2026

Job Description

Role

We're hiring a GPU Optimization Engineer who understands GPUs at a deep, architectural level: someone who knows exactly how to squeeze every last millisecond out of a model, which GPU constraints matter, and how to restructure models for real-world inference performance. You'll work across CUDA kernels, model graph optimizations, hardware-specific tuning, and porting models across GPU architectures. Your work directly impacts the latency, throughput, and reliability of smallest's real-time speech models.


What You'll Do

  • Optimize model architectures (ASR, TTS, SLMs) for maximum performance on specific GPU hardware
  • Profile models end-to-end to identify GPU bottlenecks: memory bandwidth, kernel launch overhead, fusion opportunities, quantization constraints
  • Design and implement custom kernels (CUDA/Triton/tinygrad) for performance-critical model sections
  • Perform operator fusion, graph optimization, and kernel-level scheduling improvements
  • Tune models to fit GPU memory limits while maintaining quality
  • Benchmark and calibrate inference across NVIDIA, AMD, and potentially emerging accelerators
  • Port models across GPU chipsets (NVIDIA / AMD / edge GPUs / new compute backends)
  • Work with TensorRT, ONNX Runtime, and custom runtimes for deployment
  • Partner with the research and infra teams to ensure the entire stack is optimized for real-time workloads

Requirements

  • Strong understanding of GPU architecture: SMs, warps, memory hierarchy, occupancy tuning
  • Hands-on experience with CUDA, kernel writing, and kernel-level debugging
  • Experience with kernel fusion and model graph optimizations
  • Familiarity with TensorRT, ONNX, Triton, tinygrad, or similar inference engines
  • Strong proficiency in PyTorch and Python
  • Deep understanding of model architectures (transformers, convs, RNNs, attention, diffusion blocks)
  • Experience profiling GPU workloads using Nsight, nvprof, or similar tools
  • Strong problem-solving abilities with a performance-first mindset


Great to Have

  • Experience with quantization (INT8, FP8, hybrid formats)
  • Experience with audio/speech models (ASR, TTS, SSL, vocoders)
  • Contributions to open-source GPU stacks or inference runtimes
  • Published work related to systems-level model optimization


Who Will Succeed in This Role

Someone who:

  • thinks in kernels, not just layers
  • knows which optimizations are merely theoretical and which have practical impact
  • understands GPU boundaries (memory, bandwidth, latency) and how to work around them
  • is excited by the challenge of ultra-low latency and large-scale real-time inference
  • loves debugging at the CUDA + model level
