AI Engineer – Multimodal
Granules India Limited
2 - 5 years
Hyderabad
Posted: 05/02/2026
Job Description
About the Company:
Granules is a fully integrated pharmaceutical manufacturer specializing in Active Pharmaceutical Ingredients (APIs), Pharmaceutical Formulation Intermediates (PFIs), and Finished Dosages (FDs), with operations in over 80 countries and a focus on large-scale pharmaceutical manufacturing. Founded in 1984 and headquartered in Hyderabad, India, the company is publicly traded and employs 5,001 to 10,000 people.
About the Role:
We are hiring an AI Engineer – Multimodal to design and build real-time multimodal/omni AI systems that generate audio, video, and language for conversational, human-like interfaces. The role focuses on developing models that tightly couple speech, visual behavior, and language to enable natural, low-latency interactions.
You will work at the intersection of conversational AI, neural audio, and audio-visual generation, contributing both foundational research and production-ready systems. This is a hands-on role with strong ownership over technical direction.
Responsibilities:
- Research and develop multimodal/omni generation models for conversational systems, including neural avatars, talking-heads, and audio-visual outputs.
- Build and fine-tune expressive neural audio / TTS systems, incorporating prosody, emotion, and non-verbal cues.
- Design and operate real-time, streaming inference pipelines optimized for low latency and natural turn-taking.
- Experiment with and apply diffusion-based models (DDPMs, LDMs) and other generative approaches for audio, image, or video generation.
- Develop models that align conversation flow with verbal and non-verbal behavior across modalities.
- Collaborate with applied ML and engineering teams to transition research into production-grade systems.
- Track, evaluate, and apply emerging research in multimodal and generative modeling.
Qualifications:
- Master's or PhD (or equivalent hands-on experience) in ML, AI, Computer Vision, Speech, or a related field.
- 4–8 years of hands-on experience in applied AI/ML research or engineering, with a strong focus on multimodal and generative systems.
Required Skills:
- Strong experience modeling human behavior and generation, including facial expressions, affect, or speech, preferably in conversational or interactive settings.
- Deep understanding of sequence modeling across video, audio, and language domains.
- Strong foundation in deep learning, including Transformers, diffusion models, and practical training techniques.
- Familiarity with large-scale model training, including LLMs and/or vision-language models (VLMs).
- Excellent programming skills in PyTorch, with hands-on experience in GPU-based training and inference.
- Proven experience deploying and operating real-time or streaming AI systems in production.
- Strong intuition for human-like speech and behavior generation, including diagnosing and improving unnatural outputs.
Nice to Have:
- Experience with long-form audio or video generation.
- Exposure to 3D graphics, Gaussian splatting, or large-scale training pipelines.
- Familiarity with production ML or software engineering best practices.
- Research publications in respected venues (e.g., CVPR, NeurIPS, ICASSP, BMVC).
Equal Opportunity Statement:
We are committed to diversity and inclusivity in our hiring practices.