Role Overview

We are seeking a DevOps Engineering Manager to lead Cloud and Platform engineering for AI-first teams, operating at the intersection of large-scale containerized production systems and next-generation Agentic AI and LLM deployments.

This role is responsible for building and operating highly reliable, secure, and scalable platforms that support mature microservices-based workloads while enabling rapid experimentation and production rollout of Agentic AI systems. You will work closely with AI/ML, platform, and product teams across India and Europe to operationalize AI solutions at scale.

Key Responsibilities

Define and own the cloud and platform architecture for large-scale containerized microservices and Agentic AI / LLM workloads, ensuring scalability, reliability, and cost efficiency.

Lead CI/CD platform engineering, enabling automated build, test, security scanning, and deployment for backend services, React-based web applications, and mobile app backends.

Enable production-grade AI platforms, supporting agent frameworks, vector databases, prompt pipelines, and inference.

Define Infrastructure as Code standards, cloud account structures, networking, and environment provisioning across AWS and secondary clouds.

Implement and enforce SRE practices: define SLIs/SLOs, error budgets, and capacity and reliability targets, and lead incident response and post-incident reviews.

Ensure end-to-end observability across services and AI workloads, including logs, metrics, traces, model performance, and cost visibility.

Embed security, compliance, and governance by design, including IAM, secrets management, network security, vulnerability management, and AI-specific controls.

Make informed build vs. buy decisions, evaluate emerging cloud and AI infrastructure technologies, and drive continuous platform modernization.

Must Have

10+ years of experience in DevOps / Cloud / Platform Engineering, including people management and technical leadership

Deep hands-on expertise with AWS, with working exposure to GCP and Azure in multi-cloud or hybrid environments

Proven experience operating large-scale, production-grade containerized workloads, with a strong understanding of high availability, fault tolerance, and capacity planning in global teams

Practical experience supporting AI/ML or LLM workloads in production environments

Strong expertise in Kubernetes and Docker, including cluster operations, workload isolation, ingress, service meshes, and deployment strategies

Advanced experience with Infrastructure as Code for cloud provisioning, networking, security controls, and environment standardization across multiple stages

Solid understanding of observability and reliability engineering, including metrics, logging, tracing, alerting, and defining SLIs/SLOs for distributed systems and AI services

Hands-on exposure to cloud security and compliance practices, including IAM design, secrets management, vulnerability scanning, and secure deployment patterns, especially for AI platforms

Knowledge of cloud cost optimization (FinOps), especially for AI workloads

Background in strong product-based organizations solving real customer-facing problems

Leadership and Mindset

Strong AI-first mindset with curiosity and adaptability to turn rapid AI innovation into stable production systems

Strategic thinker with hands-on technical depth

Excellent communication and collaboration skills in global, distributed teams

Ownership-driven leader who builds accountable teams and fosters a culture of reliability, automation, and continuous improvement

DevOps Manager

Deutsche Telekom Digital Labs
