DevOps Manager
Deutsche Telekom Digital Labs
5 - 10 years
Gurugram
Posted: 01/03/2026
Job Description
Role Overview
We are seeking a DevOps Engineering Manager to lead Cloud and Platform engineering
for AI-first teams, operating at the intersection of large-scale containerized production
systems and next-generation Agentic AI and LLM deployments.
This role is responsible for building and operating highly reliable, secure, and scalable
platforms that support mature microservices-based workloads while enabling rapid
experimentation and production rollout of Agentic AI systems. You will work closely with
AI/ML, platform, and product teams across India and Europe to operationalize AI
solutions at scale.
Key Responsibilities
Define and own the cloud and platform architecture for large-scale containerized
microservices and Agentic AI / LLM workloads, ensuring scalability, reliability, and cost
efficiency.
Lead CI/CD platform engineering, enabling automated build, test, security scanning, and
deployment for backend services, React-based web applications, and mobile app backends.
Enable production-grade AI platforms, supporting agent frameworks, vector databases,
prompt pipelines, and inference.
Define Infrastructure as Code standards, cloud account structures, networking, and
environment provisioning across AWS and secondary clouds.
Implement and enforce SRE practices: define SLIs/SLOs, error budgets, capacity and reliability
targets, and lead incident response and post-incident reviews.
Ensure end-to-end observability across services and AI workloads, including logs, metrics,
traces, model performance, and cost visibility.
Embed security, compliance, and governance by design, including IAM, secrets management,
network security, vulnerability management, and AI-specific controls.
Make informed build vs. buy decisions, evaluate emerging cloud and AI infrastructure
technologies, and drive continuous platform modernization.
Must Have
10+ years of experience in DevOps / Cloud / Platform Engineering, including
people management and technical leadership
Deep hands-on expertise with AWS, with working exposure to GCP and Azure in
multi-cloud or hybrid environments
Proven experience operating large-scale, production-grade containerized
workloads in global teams, with a strong understanding of high availability, fault tolerance,
and capacity planning
Practical experience supporting AI/ML or LLM workloads in production environments
Strong expertise in Kubernetes and Docker, including cluster operations, workload isolation,
ingress, service meshes, and deployment strategies
Advanced experience with Infrastructure as Code for cloud provisioning, networking, security
controls, and environment standardization across multiple stages
Solid understanding of observability and reliability engineering, including metrics, logging,
tracing, alerting, and defining SLIs/SLOs for distributed systems and AI services
Hands-on exposure to cloud security and compliance practices, including IAM design,
secrets management, vulnerability scanning, and secure deployment patterns, especially for
AI platforms
Knowledge of cloud cost optimization (FinOps), especially for AI workloads
Background in strong product-based organizations solving real customer-facing problems
Leadership and Mindset
Strong AI-first mindset with curiosity and adaptability to turn rapid AI innovation into stable
production systems.
Strategic thinker with hands-on technical depth
Excellent communication and collaboration skills in global, distributed teams
Ownership-driven leader who builds accountable teams and fosters a culture of reliability,
automation, and continuous improvement
