Principal Machine Learning Engineer - Multimodal AI & Inference

Bangalore

Founded in 2023 by industry veterans; headquartered in California, US.

We are revolutionizing sustainable AI compute through intuitive software with composable silicon.

Overview:

You will design, optimize, and deploy large multimodal models (language, vision, audio, video) to run efficiently on a compact, high-performance AI appliance capable of supporting 100B+ parameter models at real-time speeds. Your mission is to deliver state-of-the-art multimodal inference locally through advanced model optimization, quantization, and system-level integration.

Key Responsibilities:

1. Model Integration & Porting

Optimize large-scale foundation models (e.g., Llama, gpt-oss, Whisper, HiDream, Qwen, Wan, etc.) for on-device inference.
Adapt pre-trained models for multimodal tasks (text, image, audio, video, or cross-modal reasoning).
Ensure seamless interoperability between modalities, e.g., enabling the system to see, hear, and talk naturally.

2. Model Optimization for Edge Hardware

Quantize and compress large models (4-bit or mixed precision) while maintaining high accuracy and low latency.
Implement and benchmark inference runtimes using frameworks like llama.cpp, Ollama, vLLM, ONNX, etc.
Collaborate with hardware engineers to co-design model architectures optimized for the appliance's compute fabric.

3. Inference Pipeline Development

Build and maintain scalable, high-throughput inference pipelines capable of handling concurrent multimodal requests (text, audio, image, video).
Implement token streaming, caching, and scheduling strategies for real-time responses.
Develop APIs for low-latency local inference accessible via a web interface.

4. Evaluation & Benchmarking

Profile and benchmark performance (throughput, latency, energy efficiency) of deployed models.
Run regression tests to validate numerical accuracy after quantization or pruning.
Define KPIs for multimodal model performance under real-world usage.

5. Research & Prototyping

Investigate emerging multimodal architectures and lightweight model variants for local deployment.
Prototype hybrid models that combine LLMs, diffusion models, and ASR/TTS pipelines for advanced multimodal applications.
Stay current on state-of-the-art inference frameworks, compression techniques, and multimodal learning trends.

Required Qualifications:

Strong background in deep learning and model deployment, with hands-on experience in PyTorch and/or TensorFlow.
Expertise in model optimization: quantization, pruning, distillation, or mixed-precision inference.
Practical knowledge of inference engines (vLLM, llama.cpp, ONNX Runtime or similar).
Experience deploying large models locally or on edge devices under tight memory and compute constraints.
Familiarity with multimodal model architectures, e.g., CLIP, Flamingo, LLaVA, or AudioGPT-style systems.
Strong software engineering skills (Python, C++, CUDA) and experience integrating models into production systems.
Understanding of GPU/accelerator utilization, memory bandwidth optimization, and distributed inference.

Preferred Qualifications:

10+ years of experience.

Experience with model-parallel or tensor-parallel inference at scale.
Contributions to open-source inference frameworks or model serving systems.
Familiarity with hardware-aware training or co-optimization of neural networks and hardware.
Background in speech, vision, or multimodal ML research.
Track record of deploying models that run entirely offline or on embedded/edge systems.

Contact:

Uday

Mulya Technologies

"Mining The Knowledge Community"
