NLP Engineer
Molecular Connections
1 - 4 years
Bengaluru
Posted: 31/01/2026
Job Description
Position Overview
We are seeking a Document Processing Engineer to develop and maintain systems for extracting, parsing, and structuring data from scholarly documents, including PDFs, Word files, and LaTeX documents. You will focus on bibliographic data extraction, reference parsing, and building robust document processing pipelines.
Location: (Remote/Hybrid/On-site)
Employment Type: Full-time
Experience Level: Junior to Mid-level (1-4 years)
Key Responsibilities:
Document Processing & Parsing
- Parse and extract text, tables, and metadata from PDF documents (especially scholarly/academic PDFs with complex layouts)
- Process Microsoft Word (.docx) documents and extract structured content
- Handle LaTeX files and convert them to processable formats
- Preserve document structure, layout, and formatting during extraction
Bibliographic Data Extraction
- Build and maintain systems for extracting references and citations from academic documents
- Develop regex patterns for detecting and parsing bibliographic information
- Extract metadata including authors, titles, journals, DOIs, publication dates, ISBNs
- Normalize and structure extracted bibliographic data
- Handle multiple citation formats (APA, MLA, Chicago, IEEE, etc.)
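As a rough illustration of the regex-based extraction and normalization work listed above, here is a minimal Python sketch that pulls a DOI and an APA-style publication year out of a single reference string. The function names (`normalize_text`, `extract_reference_fields`) and both patterns are assumptions made for this example, not part of any existing codebase; the DOI pattern follows the common `10.<registrant>/<suffix>` shape and is not a complete grammar.

```python
import re
import unicodedata

# Assumed pattern: matches the common "10.<registrant>/<suffix>" DOI shape.
# Real DOIs can be messier, so treat this as a starting point only.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")

# Assumed pattern: APA-style year in parentheses, e.g. "(2019)" or "(2019a)".
YEAR_RE = re.compile(r"\((\d{4})[a-z]?\)")


def normalize_text(raw: str) -> str:
    """Apply Unicode NFKC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", raw)
    return re.sub(r"\s+", " ", text).strip()


def extract_reference_fields(reference: str) -> dict:
    """Return the DOI and year found in one reference string, or None for each."""
    text = normalize_text(reference)
    doi_match = DOI_RE.search(text)
    year_match = YEAR_RE.search(text)

    doi = None
    if doi_match:
        # Strip trailing sentence punctuation; this can also remove a
        # legitimate closing parenthesis in rare DOIs.
        doi = doi_match.group(0).rstrip(".,;)").lower()

    year = int(year_match.group(1)) if year_match else None
    return {"doi": doi, "year": year}


if __name__ == "__main__":
    sample = (
        "Smith, J. (2019). A study of parsing. Journal of Examples, "
        "12(3), 45-67. https://doi.org/10.1234/ABC.5678."
    )
    print(extract_reference_fields(sample))
```

A production system would likely combine patterns like these with a dedicated parser such as Grobid, since regexes alone tend to break on the format variations between citation styles.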
System Development
- Develop automated document processing pipelines
- Create reusable parsing modules and libraries
- Integrate document parsing tools and libraries (Grobid, pdfplumber, python-docx, etc.)
- Build APIs for document processing services
- Implement data validation and quality checks
Continuous Improvement
- Test and improve accuracy of extraction algorithms
- Handle edge cases and document format variations
- Optimize performance for large-scale document processing
- Stay updated on new document processing tools and techniques
Required Qualifications:
Education
- Bachelor's degree in Computer Science, Software Engineering, Data Science, or related field
- OR equivalent practical experience with demonstrable portfolio
Technical Skills - Must Have
- Strong Python programming (2+ years of experience)
- PDF Processing Libraries: Experience with at least 2 of the following: pdfplumber, PyMuPDF (fitz), PyPDF2, pdfminer.six, camelot-py
- Word Document Processing: python-docx or similar libraries
- Regular Expressions (Regex): Advanced pattern matching for text extraction
- Text Processing: String manipulation, normalization, Unicode handling
- Version Control: Git/GitHub
Technical Skills - Good to Have
- LaTeX processing libraries (pylatexenc, plasTeX)
- Exposure to Grobid or similar scholarly document parsing tools
- Basic familiarity with Hugging Face models for document understanding
- Experience with LayoutLM, Donut, or other document AI models
- OCR tools (Tesseract)
- SQL for data storage and retrieval
- Docker for containerization
- REST API development
Domain Knowledge
- Understanding of academic/scholarly document structure
- Familiarity with bibliographic formats and citation styles
- Knowledge of document layout concepts (headers, columns, tables, figures)
- Understanding of PDF structure and complexities
Soft Skills
- Strong problem-solving abilities
- Attention to detail when handling complex documents
- Ability to work independently and in a team
- Good communication skills for technical discussions
- Analytical mindset for debugging extraction issues
Preferred Qualifications
- Experience working with academic/research documents
- Contributions to open-source document processing projects
- Knowledge of Natural Language Processing (NLP) basics
- Experience with document classification or semantic analysis
- Familiarity with reference management tools (Zotero, Mendeley, etc.)
- Understanding of XML/TEI formats
- Experience with data pipelines and ETL processes
- Prior work in academic publishing, digital libraries, or research institutions
Day-to-Day Activities
A typical day might include:
- Writing Python scripts to parse a new document format
- Debugging regex patterns for reference extraction
- Testing the document processing pipeline on sample academic papers
- Integrating new parsing libraries into the existing codebase
- Collaborating with the team to improve extraction accuracy
- Reviewing and cleaning extracted bibliographic data
- Handling edge cases from complex PDF layouts
- Optimizing code performance for batch processing
Technical Assessment
Candidates will be asked to complete a practical coding assessment involving:
- PDF Parsing Task: Extract references from a scholarly PDF
- Regex Challenge: Write patterns to detect citations in various formats
- Layout Problem: Handle a multi-column academic paper correctly
- Code Quality: Clean, maintainable, well-documented code
