AI Engineer – Multimodal

Granules India Limited

2 - 5 years

Hyderabad

Posted: 05/02/2026

Job Description

About the Company:

Granules is a fully integrated pharmaceutical manufacturer specializing in Active Pharmaceutical Ingredients (APIs), Pharmaceutical Formulation Intermediates (PFIs), and Finished Dosages (FDs), with operations in over 80 countries and a focus on large-scale pharmaceutical manufacturing. Founded in 1984 and headquartered in Hyderabad, India, the company is publicly traded and employs 5,001 to 10,000 people.


About the Role:

We are hiring an AI Engineer – Multimodal to design and build real-time multimodal/omni AI systems that generate audio, video, and language for conversational, human-like interfaces. The role focuses on developing models that tightly couple speech, visual behavior, and language to enable natural, low-latency interactions.

You will work at the intersection of conversational AI, neural audio, and audio-visual generation, contributing both foundational research and production-ready systems. This is a hands-on role with strong ownership over technical direction.


Responsibilities:

  • Research and develop multimodal/omni generation models for conversational systems, including neural avatars, talking-heads, and audio-visual outputs.
  • Build and fine-tune expressive neural audio / TTS systems, incorporating prosody, emotion, and non-verbal cues.
  • Design and operate real-time, streaming inference pipelines optimized for low latency and natural turn-taking.
  • Experiment with and apply diffusion-based models (DDPMs, LDMs) and other generative approaches for audio, image, or video generation.
  • Develop models that align conversation flow with verbal and non-verbal behavior across modalities.
  • Collaborate with applied ML and engineering teams to transition research into production-grade systems.
  • Track, evaluate, and apply emerging research in multimodal and generative modeling.


Qualifications:

  • Master's or PhD (or equivalent hands-on experience) in ML, AI, Computer Vision, Speech, or a related field.
  • 4-8 years of hands-on experience in applied AI/ML research or engineering, with a strong focus on multimodal and generative systems.


Required Skills:

  • Strong experience modeling human behavior and generation, including facial expressions, affect, or speech, preferably in conversational or interactive settings.
  • Deep understanding of sequence modeling across video, audio, and language domains.
  • Strong foundation in deep learning, including Transformers, diffusion models, and practical training techniques.
  • Familiarity with large-scale model training, including LLMs and/or vision-language models (VLMs).
  • Excellent programming skills in PyTorch, with hands-on experience in GPU-based training and inference.
  • Proven experience deploying and operating real-time or streaming AI systems in production.
  • Strong intuition for human-like speech and behavior generation, including diagnosing and improving unnatural outputs.


Nice to Have:

  • Experience with long-form audio or video generation.
  • Exposure to 3D graphics, Gaussian splatting, or large-scale training pipelines.
  • Familiarity with production ML or software engineering best practices.
  • Research publications in respected venues (e.g., CVPR, NeurIPS, ICASSP, BMVC).


Equal Opportunity Statement:

We are committed to diversity and inclusivity in our hiring practices.
