Principal Machine Learning Engineer - Multimodal AI & Inference
Mulya Technologies
2 - 5 years
Bengaluru
Posted: 16/12/2025
Job Description
Founded in 2023 by industry veterans, with headquarters in California, US
- We are revolutionizing sustainable AI compute through intuitive software with composable silicon.
Overview:
You will design, optimize, and deploy large multimodal models (language, vision, audio, video) to run efficiently on a compact, high-performance AI appliance capable of supporting 100B+ parameter models at real-time speeds. Your mission is to deliver state-of-the-art multimodal inference locally through advanced model optimization, quantization, and system-level integration.
Key Responsibilities:
1. Model Integration & Porting
- Optimize large-scale foundation models (e.g., Llama, gpt-oss, Whisper, HiDream, Qwen, Wan) for on-device inference.
- Adapt pre-trained models for multimodal tasks (text, image, audio, video, or cross-modal reasoning).
- Ensure seamless interoperability between modalities, e.g., enabling the system to see, hear, and talk naturally.
2. Model Optimization for Edge Hardware
- Quantize and compress large models (4-bit or mixed precision) while maintaining high accuracy and low latency.
- Implement and benchmark inference runtimes using frameworks like llama.cpp, Ollama, vLLM, and ONNX.
- Collaborate with hardware engineers to co-design model architectures optimized for the appliance's compute fabric.
3. Inference Pipeline Development
- Build and maintain scalable, high-throughput inference pipelines capable of handling concurrent multimodal requests (text, audio, image, video).
- Implement token streaming, caching, and scheduling strategies for real-time responses.
- Develop APIs for low-latency local inference accessible via a web interface.
4. Evaluation & Benchmarking
- Profile and benchmark performance (throughput, latency, energy efficiency) of deployed models.
- Run regression tests to validate numerical accuracy after quantization or pruning.
- Define KPIs for multimodal model performance under real-world usage.
5. Research & Prototyping
- Investigate emerging multimodal architectures and lightweight model variants for local deployment.
- Prototype hybrid models that combine LLMs, diffusion models, and ASR/TTS pipelines for advanced multimodal applications.
- Stay current on state-of-the-art inference frameworks, compression techniques, and multimodal learning trends.
Required Qualifications:
- Strong background in deep learning and model deployment, with hands-on experience in PyTorch and/or TensorFlow.
- Expertise in model optimization: quantization, pruning, distillation, or mixed-precision inference.
- Practical knowledge of inference engines (vLLM, llama.cpp, ONNX Runtime, or similar).
- Experience deploying large models locally or on edge devices with limited memory and compute.
- Familiarity with multimodal model architectures, e.g., CLIP, Flamingo, LLaVA, or AudioGPT-style systems.
- Strong software engineering skills (Python, C++, CUDA) and experience integrating models into production systems.
- Understanding of GPU/accelerator utilization, memory bandwidth optimization, and distributed inference.
Preferred Qualifications:
- 10+ years of experience.
- Experience with model-parallel or tensor-parallel inference at scale.
- Contributions to open-source inference frameworks or model serving systems.
- Familiarity with hardware-aware training or co-optimization of neural networks and hardware.
- Background in speech, vision, or multimodal ML research.
- Track record of deploying models that run entirely offline or on embedded/edge systems.
Contact:
Uday
Mulya Technologies
"Mining The Knowledge Community"
