GPU Optimization Engineer
Taglynk
2 - 5 years
Bengaluru
Posted: 08/01/2026
Job Description
Role
We're hiring a GPU Optimization Engineer who understands GPUs at a deep, architectural level: someone who knows exactly how to squeeze every last millisecond out of a model, which GPU constraints matter, and how to restructure models for real-world inference performance. You'll work across CUDA kernels, model graph optimizations, hardware-specific tuning, and porting models across GPU architectures. Your work directly impacts the latency, throughput, and reliability of smallest's real-time speech models.
What You'll Do
- Optimize model architectures (ASR, TTS, SLMs) for maximum performance on specific GPU hardware
- Profile models end-to-end to identify GPU bottlenecks: memory bandwidth, kernel launch overhead, fusion opportunities, quantization constraints
- Design and implement custom kernels (CUDA/Triton/Tinygrad) for performance-critical model sections
- Perform operator fusion, graph optimization, and kernel-level scheduling improvements
- Tune models to fit GPU memory limits while maintaining quality
- Benchmark and calibrate inference across NVIDIA, AMD, and potentially emerging accelerators
- Port models across GPU chipsets (NVIDIA / AMD / edge GPUs / new compute backends)
- Work with TensorRT, ONNX Runtime, and custom runtimes for deployment
- Partner with the research and infra teams to ensure the entire stack is optimized for real-time workloads
Requirements
- Strong understanding of GPU architecture: SMs, warps, memory hierarchy, occupancy tuning
- Hands-on experience with CUDA, kernel writing, and kernel-level debugging
- Experience with kernel fusion and model graph optimizations
- Familiarity with TensorRT, ONNX, Triton, tinygrad, or similar inference engines
- Strong proficiency in PyTorch and Python
- Deep understanding of model architectures (transformers, convs, RNNs, attention, diffusion blocks)
- Experience profiling GPU workloads using Nsight, nvprof, or similar tools
- Strong problem-solving abilities with a performance-first mindset
Great to Have
- Experience with quantization (INT8, FP8, hybrid formats)
- Experience with audio/speech models (ASR, TTS, SSL, vocoders)
- Contributions to open-source GPU stacks or inference runtimes
- Published work related to systems-level model optimization
Who Will Succeed in This Role
Someone who:
- thinks in kernels, not just layers
- knows which optimizations are merely theoretical and which are impactful in practice
- understands GPU boundaries (memory, bandwidth, latency) and how to work around them
- is excited by the challenge of ultra-low latency and large-scale real-time inference
- loves debugging at the CUDA + model level
