ABOUT THE ROLE:

The TDD's multi-modal expertise capture pipeline and Graph RAG branch of the Hybrid architecture require a dedicated data engineer who builds and operates the document ingestion pipeline, chunking engine, vector database indexing, retrieval/re-ranking pipeline, golden dataset management system, and data pool routing infrastructure. This role executes the data architecture designs produced by the IC6 Data Engineer and operates the data infrastructure throughout the engagement.

Project Context:

You will build and operate the data pipelines that power the Tradecraft Evaluation Platform. You will implement the batch extraction pipeline that ingests 11 years of historical expertise artifacts (transcripts, memos, reports, code), the Graph RAG pipeline that indexes tradecraft in Weaviate for retrieval-augmented evaluation, the golden dataset management system that tracks 300-500 validated rows with full lineage, and the data pool routing infrastructure that enforces physical separation between eval, training, and holdout datasets.

KEY RESPONSIBILITIES:

A. STANDARD RESPONSIBILITIES:

Build and maintain data ingestion pipelines that process multiple file formats reliably at scale
Operate and optimize database systems (relational, vector, cache) for performance and reliability
Implement data quality checks, validation rules, and monitoring for pipeline health
Write and maintain data transformation logic in Python with comprehensive test coverage

B. PROJECT-SPECIFIC RESPONSIBILITIES:

Build the batch extraction pipeline that ingests historical expertise artifacts in multiple formats (PDF, audio, video, text) using Apache Tika and PyMuPDF for document parsing, stores them in S3 with metadata in PostgreSQL, and routes them through the chunking engine
Implement the semantic chunking engine that processes different document types with type-specific strategies (semantic chunking for transcripts, structural chunking for reports, function-level chunking for code) and generates embeddings for vector database indexing
Build and operate the Weaviate vector database indexing pipeline, including schema configuration, batch embedding ingestion, metadata tagging, and retrieval/re-ranking pipeline with hybrid search (vector + keyword)
Implement the golden dataset management system in PostgreSQL with versioning (dataset version → row versions with immutable snapshots), pool routing (eval/training/holdout), and full lineage tracking (source_artifact → extraction_run → candidate_row → validation_event → golden_row)
Build the data pool routing service that enforces physical separation between eval, training, and holdout pools using separate S3 buckets and PostgreSQL schemas, with audit logging of all routing decisions
Implement the synthetic augmentation data pipeline that takes validated human examples, generates synthetic variants via LLM, and routes them through the human review queue with "synthetic - pending validation" status
Build the Graph Context Retriever that extracts entity subgraphs from the knowledge graph via read-only API for the evaluation scenario context
Implement the cost tracking data pipeline using TimescaleDB to record per-evaluation-run, per-model LLM API costs with attribution
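
As an illustration of the pool routing responsibility above, here is a minimal in-memory sketch of a router that maps each pool to its own bucket and schema and logs every routing decision. The bucket names, schema names, class names, and the rule that a row may never move between pools are assumptions made for this example, not details from the TDD. A real service would write to S3 and PostgreSQL rather than to Python objects.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Pool(str, Enum):
    EVAL = "eval"
    TRAINING = "training"
    HOLDOUT = "holdout"


@dataclass(frozen=True)
class Destination:
    bucket: str  # hypothetical S3 bucket name
    schema: str  # hypothetical PostgreSQL schema name


# Hypothetical names: one bucket and one schema per pool, never shared.
DESTINATIONS = {
    Pool.EVAL: Destination("tradecraft-eval", "pool_eval"),
    Pool.TRAINING: Destination("tradecraft-training", "pool_training"),
    Pool.HOLDOUT: Destination("tradecraft-holdout", "pool_holdout"),
}


class RoutingError(Exception):
    """Raised when a routing request would break pool separation."""


@dataclass
class PoolRouter:
    assignments: dict[str, Pool] = field(default_factory=dict)
    audit_log: list[dict] = field(default_factory=list)

    def route(self, row_id: str, pool: Pool, actor: str) -> Destination:
        # Assumed policy: a row stays in the first pool it was routed to.
        previous = self.assignments.get(row_id)
        allowed = previous is None or previous == pool
        # Every decision is logged, including refused ones.
        self.audit_log.append({
            "row_id": row_id,
            "requested_pool": pool.value,
            "actor": actor,
            "allowed": allowed,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        if not allowed:
            raise RoutingError(
                f"{row_id} already belongs to {previous.value}; "
                f"refusing to route to {pool.value}"
            )
        self.assignments[row_id] = pool
        return DESTINATIONS[pool]
```

For example, `PoolRouter().route("row-1", Pool.EVAL, actor="reviewer")` returns the eval destination, and a later attempt to route the same row to the training pool raises `RoutingError` while still leaving an audit entry.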

REQUIRED SKILLS & EXPERIENCE:

[STANDARD] 7-10 years of experience in data engineering with production pipeline development
[STANDARD] Expert-level Python proficiency (3.11+) for data pipeline development, including async patterns
[PROJECT-SPECIFIC] Hands-on experience building and operating Weaviate (or Pinecone/Qdrant) vector database pipelines, including schema design, batch ingestion, and retrieval optimization
[PROJECT-SPECIFIC] Experience with document processing pipelines using Apache Tika, PyMuPDF, or equivalent for multi-format ingestion (PDF, DOCX, audio transcription output)
[STANDARD] Expert-level PostgreSQL experience, including schema design, indexing, triggers, and operational management
[PROJECT-SPECIFIC] Experience implementing data versioning and lineage tracking systems for ML datasets
[STANDARD] Experience with S3 (or equivalent object storage) for large-scale document and artifact storage
[STANDARD] Experience with Redis for caching and Celery for async task orchestration

Experience Requirements:

YEARS OF EXPERIENCE: 7-10 years in data engineering
SENIORITY LEVEL: Senior
TYPICAL BACKGROUND: Senior data engineer at an AI/ML platform company; data pipeline engineer at a search/retrieval company; backend engineer who transitioned into data engineering for NLP/LLM systems; data engineer at a risk/compliance technology company
COMPLEXITY INDICATORS: Has built pipelines processing 10K+ documents in multiple formats; has operated vector databases with 1M+ embeddings; has implemented data versioning systems for ML datasets; has built data separation infrastructure for compliance
LEADERSHIP / OWNERSHIP EXPECTATIONS: Owns all data pipeline implementation and operations; makes independent decisions on pipeline design within the architecture defined by IC6 Data Engineer; operates data infrastructure without dedicated DBA support

SUCCESS INDICATORS:

Has built and deployed a production RAG data pipeline with vector database indexing and retrieval achieving >70% precision
Has implemented a golden dataset or ML evaluation dataset management system with versioning and lineage
Has built multi-format document ingestion pipelines processing 10K+ documents reliably
Has implemented physical data separation for a compliance-sensitive system

Project-Specific Skills and Domain Knowledge

Must-Have:

Experience implementing semantic chunking strategies for different document types (transcripts, reports, code) with measurable retrieval quality impact
Experience building data pipelines that integrate with LLM APIs for extraction and augmentation tasks
Experience implementing physical data separation (separate storage, separate schemas) for compliance-sensitive ML datasets
Experience with TimescaleDB or equivalent time-series databases for metrics and cost tracking

PREFERRED QUALIFICATIONS

Experience with knowledge graph data models and entity resolution pipelines
Experience operating data infrastructure in FedRAMP-compatible environments
AWS Data Analytics Specialty or equivalent certification
Experience with OpenAI Whisper for audio transcription pipeline integration
Experience with embedding model selection and evaluation for RAG systems
Contributions to open-source data engineering tools

Project-Specific Skills and Domain Knowledge

Strongly Preferred:

Experience with graph database APIs for subgraph extraction (Neo4j, Neptune, or similar)
Experience with FastAPI for building data service APIs
Familiarity with NLP preprocessing pipelines (tokenization, NER, text normalization)
Experience with PII detection and anonymization in data pipelines

Trade-Craft Experience A Significant Plus

Candidates with backgrounds in intelligence analysis, signals intelligence, law enforcement data fusion, or related trade-craft disciplines are strongly encouraged to apply. Understanding of link analysis, entity disambiguation under adversarial conditions, handling classified or compartmentalised data, and mission-driven product constraints will set you apart.

Remote Data Engineer - 78243

Turing
