REQUIREMENTS:

Highly motivated, well-read, and a prodigy.

Deep experience with LLM application development: prompt engineering, RAG pipelines, tool/function calling, and agent architectures

Hands-on experience with at least 2 of: OpenAI, Anthropic, Google Gemini, Groq, Mistral APIs

Strong understanding of embedding models, vector databases, and retrieval evaluation (precision, recall, MRR, NDCG)

Experience building evaluation frameworks for AI systems: not just accuracy metrics but also conversation-level quality assessment

Python proficiency with async programming (asyncio, aiohttp)

Familiarity with real-time audio/voice systems is a strong plus

Experience with LangChain/LangGraph agent patterns is a strong plus

ROLE/RESPONSIBILITIES:

Build and maintain the evaluation framework for voice and chat agent quality: hallucination rate, tool selection accuracy, conversation success metrics, retrieval precision/recall, and end-to-end task completion rates

Upgrade the RAG pipeline from basic FAISS flat index + bge-small-en-v1.5 to a production-grade retrieval system with hybrid search (semantic + BM25), cross-encoder re-ranking, multi-document support, chunk quality scoring, and dynamic index updates

Design and implement LLM routing intelligence: choosing among the 5 configured providers (OpenAI, Groq, Anthropic, Google Gemini, Mistral) based on query complexity, latency requirements, cost constraints, and tool-calling capability

Harden the guardrails system beyond the current regex + Llama Guard 3: add topic boundary enforcement, PII detection/redaction, hallucination detection on RAG responses, and output quality scoring

Optimize voice pipeline latency end-to-end: STT TTFB, LLM TTFB, TTS TTFB, total round-trip. Profile each provider combination and tune VAD parameters (start/stop thresholds, confidence, min volume) per language

Build prompt engineering infrastructure: version-controlled prompt registry, A/B testing framework for system prompts, and systematic optimization based on eval results

Develop conversation analytics: real-time sentiment tracking, intent classification, conversation outcome scoring, topic drift detection, and customer satisfaction prediction

Implement human handoff intelligence: frustration detection, repeated failure patterns, scope-boundary detection, and handoff summary generation

TECH STACK YOU WILL WORK WITH:

Pipecat AI (real-time voice pipeline with frame processors, VAD, barge-in)

LangChain + LangGraph (chat agent executor, tool calling, multi-agent orchestration)

FAISS + FastEmbed (vector search, local embeddings with BAAI/bge-small-en-v1.5)

Deepgram Nova-3, Google Cloud STT, AssemblyAI (speech-to-text)

ElevenLabs, Cartesia Sonic-3, Google TTS, Deepgram TTS (text-to-speech)

OpenAI GPT-4o, Groq Llama 3.3-70B, Anthropic Claude, Google Gemini 2.0 Flash, Mistral Small (LLMs)

Llama Guard 3 via Groq (content safety), confusables library (homoglyph detection)

MCP (Model Context Protocol) via Pipedream for external tool integration

COMPENSATION:

CTC 10L ++

AI & ML Engineer

Blessing Softtech
