Remote Data Engineer - 78243

Turing

10 - 12 years

Mumbai

Posted: 17/05/2026

Job Description

ABOUT THE ROLE:

The TDD's multi-modal expertise capture pipeline and Graph RAG branch of the Hybrid architecture require a dedicated data engineer who builds and operates the document ingestion pipeline, chunking engine, vector database indexing, retrieval/re-ranking pipeline, golden dataset management system, and data pool routing infrastructure. This role executes the data architecture designs produced by the IC6 Data Engineer and operates the data infrastructure throughout the engagement.


Project Context:

You will build and operate the data pipelines that power the Tradecraft Evaluation Platform. You will implement the batch extraction pipeline that ingests 11 years of historical expertise artifacts (transcripts, memos, reports, code), the Graph RAG pipeline that indexes tradecraft in Weaviate for retrieval-augmented evaluation, the golden dataset management system that tracks 300-500 validated rows with full lineage, and the data pool routing infrastructure that enforces physical separation between eval, training, and holdout datasets.


KEY RESPONSIBILITIES:

A. STANDARD RESPONSIBILITIES:

  • Build and maintain data ingestion pipelines that process multiple file formats reliably at scale
  • Operate and optimize database systems (relational, vector, cache) for performance and reliability
  • Implement data quality checks, validation rules, and monitoring for pipeline health
  • Write and maintain data transformation logic in Python with comprehensive test coverage


B. PROJECT-SPECIFIC RESPONSIBILITIES:

  • Build the batch extraction pipeline that ingests historical expertise artifacts in multiple formats (PDF, audio, video, text) using Apache Tika and PyMuPDF for document parsing, stores them in S3 with metadata in PostgreSQL, and routes them through the chunking engine
  • Implement the semantic chunking engine that processes different document types with type-specific strategies (semantic chunking for transcripts, structural chunking for reports, function-level chunking for code) and generates embeddings for vector database indexing
  • Build and operate the Weaviate vector database indexing pipeline, including schema configuration, batch embedding ingestion, metadata tagging, and retrieval/re-ranking pipeline with hybrid search (vector + keyword)
  • Implement the golden dataset management system in PostgreSQL with versioning (dataset version → row versions with immutable snapshots), pool routing (eval/training/holdout), and full lineage tracking (source_artifact → extraction_run → candidate_row → validation_event → golden_row)
  • Build the data pool routing service that enforces physical separation between eval, training, and holdout pools using separate S3 buckets and PostgreSQL schemas, with audit logging of all routing decisions
  • Implement the synthetic augmentation data pipeline that takes validated human examples, generates synthetic variants via LLM, and routes them through the human review queue with "synthetic-pending validation" status
  • Build the Graph Context Retriever that extracts entity subgraphs from the knowledge graph via read-only API for the evaluation scenario context
  • Implement the cost tracking data pipeline using TimescaleDB to record per-evaluation-run, per-model LLM API costs with attribution
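
As a rough illustration of the pool routing responsibility above, the sketch below shows one way a routing service might keep pools apart and log every decision. It is a minimal example under stated assumptions, not the platform's implementation: the bucket names, schema names, and the `PoolRouter` class are hypothetical, and the in-memory dict and list stand in for a PostgreSQL assignment table and a durable audit table.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Pool(Enum):
    EVAL = "eval"
    TRAINING = "training"
    HOLDOUT = "holdout"


# Hypothetical targets: each pool gets its own bucket and schema, never shared.
POOL_TARGETS = {
    Pool.EVAL: ("example-eval-bucket", "pool_eval"),
    Pool.TRAINING: ("example-training-bucket", "pool_training"),
    Pool.HOLDOUT: ("example-holdout-bucket", "pool_holdout"),
}


@dataclass
class PoolRouter:
    # row_id -> Pool; stands in for a PostgreSQL assignment table (assumption).
    assignments: dict = field(default_factory=dict)
    # Append-only list; stands in for a durable audit table (assumption).
    audit_log: list = field(default_factory=list)

    def route(self, row_id: str, pool: Pool, actor: str) -> tuple:
        """Assign a row to one pool and return its (bucket, schema) target."""
        previous = self.assignments.get(row_id)
        allowed = previous is None or previous is pool
        # Log every decision, including rejected ones.
        self.audit_log.append({
            "row_id": row_id,
            "requested_pool": pool.value,
            "actor": actor,
            "allowed": allowed,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        if not allowed:
            raise ValueError(
                f"{row_id} already belongs to {previous.value}; "
                f"refusing to route to {pool.value}"
            )
        self.assignments[row_id] = pool
        return POOL_TARGETS[pool]
```

In this sketch a row that has been routed to one pool can never be routed to another, and the rejected attempt still leaves an audit record. A production service would presumably enforce the same rule with database constraints and per-pool credentials rather than in application memory.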


REQUIRED SKILLS & EXPERIENCE:

  • [STANDARD] 7-10 years of experience in data engineering with production pipeline development
  • [STANDARD] Expert-level Python proficiency (3.11+) for data pipeline development, including async patterns
  • [PROJECT-SPECIFIC] Hands-on experience building and operating Weaviate (or Pinecone/Qdrant) vector database pipelines, including schema design, batch ingestion, and retrieval optimization
  • [PROJECT-SPECIFIC] Experience with document processing pipelines using Apache Tika, PyMuPDF, or equivalent for multi-format ingestion (PDF, DOCX, audio transcription output)
  • [STANDARD] Expert-level PostgreSQL experience, including schema design, indexing, triggers, and operational management
  • [PROJECT-SPECIFIC] Experience implementing data versioning and lineage tracking systems for ML datasets
  • [STANDARD] Experience with S3 (or equivalent object storage) for large-scale document and artifact storage
  • [STANDARD] Experience with Redis for caching and Celery for async task orchestration


Experience Requirements:

  • YEARS OF EXPERIENCE: 7-10 years in data engineering
  • SENIORITY LEVEL: Senior
  • TYPICAL BACKGROUND: Senior data engineer at an AI/ML platform company; data pipeline engineer at a search/retrieval company; backend engineer who transitioned into data engineering for NLP/LLM systems; data engineer at a risk/compliance technology company
  • COMPLEXITY INDICATORS: Has built pipelines processing 10K+ documents in multiple formats; has operated vector databases with 1M+ embeddings; has implemented data versioning systems for ML datasets; has built data separation infrastructure for compliance
  • LEADERSHIP / OWNERSHIP EXPECTATIONS: Owns all data pipeline implementation and operations; makes independent decisions on pipeline design within the architecture defined by IC6 Data Engineer; operates data infrastructure without dedicated DBA support


SUCCESS INDICATORS:

  • Has built and deployed a production RAG data pipeline with vector database indexing and retrieval achieving >70% precision
  • Has implemented a golden dataset or ML evaluation dataset management system with versioning and lineage
  • Has built multi-format document ingestion pipelines processing 10K+ documents reliably
  • Has implemented physical data separation for a compliance-sensitive system


Project-Specific Skills and Domain Knowledge


Must-Have:

  • Experience implementing semantic chunking strategies for different document types (transcripts, reports, code) with measurable retrieval quality impact
  • Experience building data pipelines that integrate with LLM APIs for extraction and augmentation tasks
  • Experience implementing physical data separation (separate storage, separate schemas) for compliance-sensitive ML datasets
  • Experience with TimescaleDB or equivalent time-series databases for metrics and cost tracking


PREFERRED QUALIFICATIONS

  • Experience with knowledge graph data models and entity resolution pipelines
  • Experience operating data infrastructure in FedRAMP-compatible environments
  • AWS Data Analytics Specialty or equivalent certification
  • Experience with OpenAI Whisper for audio transcription pipeline integration
  • Experience with embedding model selection and evaluation for RAG systems
  • Contributions to open-source data engineering tools

Project-Specific Skills and Domain Knowledge


Strongly Preferred:

  • Experience with graph database APIs for subgraph extraction (Neo4j, Neptune, or similar)
  • Experience with FastAPI for building data service APIs
  • Familiarity with NLP preprocessing pipelines (tokenization, NER, text normalization)
  • Experience with PII detection and anonymization in data pipelines


Trade-Craft Experience A Significant Plus

Candidates with backgrounds in intelligence analysis, signals intelligence, law enforcement data fusion, or related trade-craft disciplines are strongly encouraged to apply. Understanding of link analysis, entity disambiguation under adversarial conditions, handling classified or compartmentalised data, and mission-driven product constraints will set you apart.
