Job Opening: MLOps Engineer

Location: Bangalore

Employment Type: Full-time (Hybrid)

Role Overview

We are looking for an experienced MLOps Engineer to manage and optimize the complete machine learning lifecycle. The role involves supporting large-scale distributed training on on-premise GPU clusters and deploying optimized models into production-grade C++ inference systems.

Key Responsibilities

High-Performance Inference & Deployment

Build and maintain pipelines to move models from Python research environments to production C++ systems
Optimize deep learning models (PyTorch/JAX/TensorFlow) for low-latency inference using quantization, operator fusion, TensorRT, or custom CUDA kernels
Ensure deterministic behavior and thread safety in real-time environments

Research Infrastructure & Compute

Manage and scale on-premise HPC/GPU clusters
Optimize scheduling and resource utilization using Slurm or Kubernetes
Troubleshoot distributed training issues (NCCL, InfiniBand, GPUDirect)

Data & Feature Engineering Pipelines

Design high-throughput data loaders for large-scale historical datasets
Implement feature stores ensuring consistency between offline training and online inference

Reliability & Monitoring

Develop real-time monitoring, drift detection, and automated safety mechanisms
Build CI/CD workflows for code, data, and model artifacts with strict reproducibility

Required Qualifications

3+ years of experience in MLOps, Systems Engineering, or SRE
Strong proficiency in Python and working knowledge of C++
Hands-on experience with Docker, and with Kubernetes or Slurm
Strong understanding of Linux internals (memory, CPU pinning, NUMA, networking)
Experience with PyTorch or TensorFlow and model serialization (ONNX, TorchScript)

Preferred Skills

Experience managing large-scale on-premise GPU clusters (A100/H100)
CUDA kernel optimization or experience with Triton/TVM
Exposure to time

Company: Meril
