
AI Inference Engineer

G82 labs pvt ltd

2 - 5 years

Chennai

Posted: 10/01/2026


Job Description

Inference & LLM Performance Engineer


Experience: 5+ years


About the Role


We are deploying large-scale LLM inference inside confidential environments with strict latency, throughput, and streaming requirements.


As Head of Inference & LLM Performance, you will own everything related to model execution speed and GPU efficiency, while working closely with other teams to ensure security constraints are respected.


What You'll Work With


* LLM inference stack (vLLM, TensorRT-LLM, or equivalent)

* Hugging Face model loading and integrity verification

* Token streaming semantics and batching strategies

* GPU scheduling and throughput optimization

* Latency, memory, and utilization targets


What You'll Build


* High-performance inference pipelines for large models

* Encrypted token streaming with minimal overhead

* Efficient batching across hundreds of concurrent users

* Production-grade inference behavior under TEE (trusted execution environment) constraints


Ideal Background


* Deep experience with LLM inference (vLLM, etc.)

* Strong CUDA and GPU performance understanding

* Experience running large models in production

* Comfortable working under strict security constraints
