
AI Inference Engineer

G82 labs pvt ltd

2 - 5 years

Chennai

Posted: 10/01/2026


Job Description

Inference & LLM Performance Engineer


Experience: 5+ years


About the Role


We are deploying large-scale LLM inference inside confidential environments with strict latency, throughput, and streaming requirements.


As Head of Inference & LLM Performance, you will own everything related to model execution speed and GPU efficiency, while working closely with other teams to ensure security constraints are respected.


What You'll Work With


* LLM inference stack (vLLM, TensorRT-LLM, or equivalent)

* Hugging Face model loading and integrity verification

* Token streaming semantics and batching strategies

* GPU scheduling and throughput optimization

* Latency, memory, and utilization targets


What You'll Build


* High-performance inference pipelines for large models

* Encrypted token streaming with minimal overhead

* Efficient batching across hundreds of concurrent users

* Production-grade inference behavior under TEE (trusted execution environment) constraints


Ideal Background


* Deep experience with LLM inference (vLLM, etc.)

* Strong CUDA and GPU performance understanding

* Experience running large models in production

* Comfortable working under strict security constraints
