MLOps Engineer
TRDFIN Support Services Pvt Ltd
2 - 5 years
Bengaluru
Posted: 12/01/2026
Job Description
About the Role
We are looking for an experienced MLOps Engineer to build, automate, and maintain end-to-end machine learning pipelines and production environments. The ideal candidate has strong experience with ML model deployment, workflow orchestration, CI/CD automation, cloud platforms, and scalable architecture for real-time or batch ML systems.
You will work closely with data scientists, ML engineers, and DevOps teams to ensure models are efficiently deployed, monitored, optimized, and continuously improved.
Key Responsibilities
1. ML Pipeline Development & Automation
- Build and manage scalable ML pipelines for data preparation, training, validation, and deployment.
- Create automated workflows using tools like Kubeflow, MLflow, Airflow, Vertex AI Pipelines, or SageMaker Pipelines.
- Implement versioning of datasets, models, and experiments.
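To illustrate the kind of versioning work involved, here is a minimal sketch of deriving a deterministic version tag from a dataset's contents (the helper name is illustrative; registries like MLflow or DVC provide this in practice):

```python
import hashlib
import json

def dataset_version(records: list[dict]) -> str:
    """Derive a short, deterministic version tag from dataset contents.

    Serializing with sorted keys makes the hash independent of dict key
    ordering, so identical data always yields the same tag.
    """
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]

v1 = dataset_version([{"x": 1, "y": 2}])
v2 = dataset_version([{"y": 2, "x": 1}])  # same content, different key order
v3 = dataset_version([{"x": 1, "y": 3}])  # changed content -> new version
```

Content-addressed tags like this let pipelines detect when training data has actually changed, rather than relying on file timestamps.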
2. Model Deployment & Serving
- Deploy ML models on cloud environments (AWS/GCP/Azure) or on-prem.
- Implement real-time model serving using Docker, Kubernetes, KServe, TorchServe, TensorFlow Serving, or FastAPI.
- Develop APIs for inference and integrate models into production systems.
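As a sketch of the inference-API work above, the core of such an endpoint can be kept framework-agnostic (the `predict` stand-in below is hypothetical; in production it would be a loaded PyTorch or TensorFlow artifact, wired to a FastAPI or Flask route):

```python
import json

def predict(features: list[float]) -> float:
    """Hypothetical stand-in for a real trained model."""
    return sum(features) / len(features)

def handle_inference(body: bytes) -> bytes:
    """Parse a JSON request body, run the model, return a JSON response.

    Keeping this logic out of the web framework makes it easy to unit-test
    and to reuse behind FastAPI, Flask, or a batch scoring job.
    """
    payload = json.loads(body)
    score = predict(payload["features"])
    return json.dumps({"score": score}).encode("utf-8")
```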
3. CI/CD for ML (Continuous Integration & Delivery)
- Build automated CI/CD pipelines for model training, packaging, and deployment.
- Ensure safe rollouts with canary deployments, A/B tests, and rollback strategies.
- Maintain Git-based workflows for code, model, and pipeline updates.
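One illustrative piece of the canary-rollout work above is deterministic traffic splitting between model versions (names and the 0-1 bucketing scheme are hypothetical; service meshes and serving platforms offer this natively):

```python
import hashlib

def route_request(request_id: str, canary_fraction: float) -> str:
    """Route a request to the 'canary' or 'stable' model version.

    Hashing the request id (rather than sampling randomly) keeps routing
    sticky: the same id always lands on the same variant, which simplifies
    debugging and A/B analysis.
    """
    digest = hashlib.md5(request_id.encode("utf-8")).digest()
    bucket = digest[0] / 255  # map first digest byte into [0, 1]
    return "canary" if bucket < canary_fraction else "stable"
```

Raising `canary_fraction` gradually (e.g. 0.05 → 0.25 → 1.0) while watching error metrics is the essence of a safe rollout; rollback is just setting it back to 0.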
4. Monitoring, Observability & Maintenance
- Implement end-to-end monitoring for model performance, drift detection, data quality, and serving metrics (latency, error rates).
- Set up logging and alerting using Prometheus, Grafana, ELK/EFK, or CloudWatch.
- Automate model retraining triggers based on performance thresholds or data drift.
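A minimal sketch of such a retraining trigger, assuming a simple mean-shift check (the function name and threshold are hypothetical; production systems typically use richer tests such as PSI or Kolmogorov-Smirnov):

```python
import statistics

def should_retrain(baseline: list[float], live: list[float],
                   threshold: float = 0.2) -> bool:
    """Flag retraining when the live feature mean drifts from the
    training baseline by more than `threshold` baseline standard deviations.
    """
    base_mean = statistics.mean(baseline)
    base_std = statistics.stdev(baseline) or 1.0  # guard against zero spread
    shift = abs(statistics.mean(live) - base_mean) / base_std
    return shift > threshold
```

A scheduler (e.g. Airflow) would run this check over recent inference logs and kick off the training pipeline when it returns True.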
5. Infrastructure Management
- Build and maintain cloud-based ML infrastructure (compute, storage, networking).
- Work with IaC tools like Terraform, CloudFormation, or Pulumi.
- Optimize resource usage, GPU allocation, and cost efficiency.
6. Collaboration & Documentation
- Work closely with data scientists to productionize notebooks and prototype models.
- Convert experimental code into scalable, maintainable components.
- Document workflows, architecture, pipeline steps, and best practices.
7. ML Governance, Versioning & Security
- Implement model registries (MLflow, SageMaker Model Registry, Vertex AI Model Registry).
- Ensure compliance with security, PII handling, privacy, and governance policies.
- Manage secrets, credentials, and secure access for ML systems.
Required Skills & Qualifications
Technical Skills
- Strong understanding of the ML lifecycle, model deployment, and production ML.
- Proficiency in Python and ML frameworks (PyTorch, TensorFlow, scikit-learn).
- Hands-on experience with Docker, Kubernetes, and Helm charts.
- Experience with MLflow, Kubeflow, Airflow, Jenkins, GitHub Actions, or Azure DevOps.
- Cloud experience with AWS (SageMaker, ECS/EKS), GCP (Vertex AI), or Azure ML.
- Knowledge of monitoring tools, APIs, REST/GraphQL, and microservices.
- Familiarity with feature stores (Feast, Tecton) is a plus.
Soft Skills
- Strong problem-solving and analytical mindset.
- Excellent collaboration with DS/DE/DevOps teams.
- Clear communication and documentation abilities.
- Ability to work independently and handle fast-paced environments.
Preferred Qualifications
- Experience with GPU-based training and model optimization.
- Exposure to data engineering tools (Spark, Kafka, Databricks).
- Familiarity with distributed training frameworks (Horovod, DeepSpeed).
- Prior experience deploying LLMs or other deep learning models.
