Sr Inference Engineer
Tanla Platforms Limited
2 - 5 years
Hyderabad
Posted: 14/03/2026
Job Description
About The Role:
As a Model Inference Engineer, you will bridge the gap between model training and production deployment. You will take high-performance checkpoints from our Training Engineers and transform them into optimized, production-ready artifacts. Your mission is to architect, build, and rigorously test inference servers that deliver our Voice AI capabilities across both real-time streaming and high-throughput batch scenarios.
You will also play a key role in hardware-software co-optimization, selecting the right compute profiles and implementing scaling strategies to balance high-fidelity audio quality with cost-efficient, reliable production delivery.
What you'll be responsible for:
- Transform trained checkpoints into high-performance artifacts using TensorRT, ONNX, or TVM. Implement quantization strategies (FP16, INT8, FP8) to balance precision and performance.
- Architect and maintain inference servers using Triton Inference Server or vLLM. Implement efficient request handling through dynamic batching and streaming protocols (gRPC, WebSockets).
- Profile and optimize model performance at the kernel level. Select and tune compute profiles across various NVIDIA GPU architectures (T4, L4, A100, H100) to maximize cost-efficiency.
- Design and execute rigorous performance tests to measure latency (TTFC), throughput, and memory usage. Ensure optimized models maintain the required acoustic fidelity and accuracy.
- Partner with Training Engineers to define export-friendly architectures and provide feedback on model performance in production-like environments.
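For illustration only, the sketch below shows the kind of export-and-benchmark workflow these responsibilities describe: exporting a checkpoint to ONNX with a dynamic batch axis and measuring latency percentiles with ONNX Runtime. The model, file names, and shapes are placeholders and assume PyTorch, ONNX Runtime, and NumPy are available; this is not Tanla's actual stack.

```python
import time

import numpy as np
import torch
import onnxruntime as ort


class ToyAcousticHead(torch.nn.Module):
    """Stand-in for a checkpoint handed off by a Training Engineer."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.proj = torch.nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.proj(x))


def export_onnx(path: str = "toy.onnx") -> str:
    """Export the toy model to ONNX with a dynamic batch dimension."""
    model = ToyAcousticHead().eval()
    example = torch.randn(1, 256)
    # A dynamic batch axis lets the serving layer apply dynamic batching.
    torch.onnx.export(
        model, example, path,
        input_names=["features"], output_names=["out"],
        dynamic_axes={"features": {0: "batch"}, "out": {0: "batch"}},
    )
    return path


def benchmark(path: str, batch: int = 8, iters: int = 200) -> None:
    """Run the exported model and report p50/p95 latency in milliseconds."""
    # Falls back to CPU if no CUDA-enabled ONNX Runtime build or GPU is present.
    sess = ort.InferenceSession(
        path, providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
    )
    x = np.random.randn(batch, 256).astype(np.float32)
    latencies = []
    for _ in range(iters):
        t0 = time.perf_counter()
        sess.run(None, {"features": x})
        latencies.append((time.perf_counter() - t0) * 1e3)
    latencies.sort()
    p50 = latencies[len(latencies) // 2]
    p95 = latencies[int(len(latencies) * 0.95)]
    print(f"p50={p50:.2f} ms  p95={p95:.2f} ms")


if __name__ == "__main__":
    benchmark(export_onnx())
```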
What we are looking for in you:
Must have:
- Deep practical experience with model serving frameworks such as vLLM, Triton Inference Server, and Ollama.
- Strong experience with model acceleration and runtime frameworks including TensorRT, TensorRT-LLM, and ONNX Runtime.
- Ability to optimize inference performance through batching, quantization, GPU utilization, and latency tuning for large-scale model serving.
- Ability to profile and identify bottlenecks across the entire stack, from Python/C++ code to GPU kernels and memory bandwidth.
Good to have:
- Experience writing or optimizing kernels using CUDA (C++) or Triton (Python) to accelerate non-standard operators.
- Familiarity with Apache TVM, Kubernetes, Docker, and managing GPU clusters for large-scale inference deployment.
Required:
- 5-7 years of industry experience in machine learning model optimization, inference systems, or ML infrastructure engineering.
- BE/BTech/ME/MTech/PhD in Computer Science, Artificial Intelligence, Machine Learning, or a related field preferred.
- Strong proficiency in C++ and Python for building high-performance machine learning and inference systems.
- Solid understanding of NVIDIA GPU architectures (e.g., Ampere, Hopper) and CUDA programming concepts for accelerated computing.
- Experience working with Linux-based environments, including system-level debugging and performance tuning.
- Familiarity with networking protocols and APIs such as gRPC, WebSockets, and HTTP/2 for real-time inference services.
- Proficiency with version control systems such as Git, and experience with collaborative software development workflows.
Why join us?
- Impactful Work: Play a pivotal role in safeguarding Tanla's assets, data, and reputation in the industry.
- Tremendous Growth Opportunities: Be part of a rapidly growing company in the telecom and CPaaS space, with opportunities for professional development.
- Innovative Environment: Work alongside a world-class team in a challenging and fun environment, where innovation is celebrated.
Tanla is an equal opportunity employer. We champion diversity and are committed to creating an inclusive environment for all employees.
www.tanla.com