AI Engineer – Multimodal

Granules India Limited

2 - 5 years

Hyderabad

Posted: 05/02/2026

Job Description

About the Company:

Granules is a fully integrated pharmaceutical manufacturer specializing in Active Pharmaceutical Ingredients (APIs), Pharmaceutical Formulation Intermediates (PFIs), and Finished Dosages (FDs), with operations in over 80 countries and a focus on large-scale pharmaceutical manufacturing. Founded in 1984 and headquartered in Hyderabad, India, the company is publicly traded and employs 5,001 to 10,000 people.


About the Role:

We are hiring an AI Engineer – Multimodal to design and build real-time multimodal/omni AI systems that generate audio, video, and language for conversational, human-like interfaces. The role focuses on developing models that tightly couple speech, visual behavior, and language to enable natural, low-latency interactions.

You will work at the intersection of conversational AI, neural audio, and audio-visual generation, contributing both foundational research and production-ready systems. This is a hands-on role with strong ownership over technical direction.


Responsibilities:

  • Research and develop multimodal/omni generation models for conversational systems, including neural avatars, talking-heads, and audio-visual outputs.
  • Build and fine-tune expressive neural audio / TTS systems, incorporating prosody, emotion, and non-verbal cues.
  • Design and operate real-time, streaming inference pipelines optimized for low latency and natural turn-taking.
  • Experiment with and apply diffusion-based models (DDPMs, LDMs) and other generative approaches for audio, image, or video generation.
  • Develop models that align conversation flow with verbal and non-verbal behavior across modalities.
  • Collaborate with applied ML and engineering teams to transition research into production-grade systems.
  • Track, evaluate, and apply emerging research in multimodal and generative modeling.


Qualifications:

  • Master's or PhD (or equivalent hands-on experience) in ML, AI, Computer Vision, Speech, or a related field.
  • 4-8 years of hands-on experience in applied AI/ML research or engineering, with a strong focus on multimodal and generative systems.


Required Skills:

  • Strong experience modeling human behavior and generation, including facial expressions, affect, or speech, preferably in conversational or interactive settings.
  • Deep understanding of sequence modeling across video, audio, and language domains.
  • Strong foundation in deep learning, including Transformers, diffusion models, and practical training techniques.
  • Familiarity with large-scale model training, including LLMs and/or vision-language models (VLMs).
  • Excellent programming skills in PyTorch, with hands-on experience in GPU-based training and inference.
  • Proven experience deploying and operating real-time or streaming AI systems in production.
  • Strong intuition for human-like speech and behavior generation, including diagnosing and improving unnatural outputs.


Nice to Have:

  • Experience with long-form audio or video generation.
  • Exposure to 3D graphics, Gaussian splatting, or large-scale training pipelines.
  • Familiarity with production ML or software engineering best practices.
  • Research publications in respected venues (e.g., CVPR, NeurIPS, ICASSP, BMVC).


Equal Opportunity Statement:

We are committed to diversity and inclusivity in our hiring practices.
