Principal Engineer, Site Reliability
TMUS Global Solutions
7 - 12 years
Hyderabad
Posted: 31/01/2026
Job Description
About the Role
As a Principal SRE, you will be a key member of the CFL Platform Engineering and Operations team. You will lead reliability engineering for AI-powered platforms supporting LLM applications, AI gateways, and enterprise-scale services across finance, credit, collections, and document systems. You will design and implement observability and incident response frameworks, scale high-performance infrastructure, and champion SRE best practices to support secure, automated, and resilient systems.
What You'll Do
- Architect observability and incident response pipelines for LLM, API, and backend systems
- Define SLAs, SLIs, alerts, and dashboards for latency, throughput, and availability
- Lead high-severity incident response, root cause analysis, and system recovery
- Collaborate with AI, Platform, and Security teams to enforce operational guardrails
- Implement automation-first strategies using GitLab CI/CD, Terraform, and deployment tooling
- Guide infrastructure tuning, capacity planning, and cost optimization
- Drive monitoring across hybrid clouds using Prometheus, Grafana, Splunk, and OpenTelemetry
- Support AIOps, model observability, policy enforcement, and audit readiness
- Mentor senior SREs and foster a high-ownership, technical excellence culture
What You'll Bring
- Bachelor's or Master's in Computer Science, Engineering, or related field
- 7-12 years in SRE, infrastructure, or platform roles in distributed systems
- Strong experience in incident management, AI/ML observability, and performance engineering
- Hands-on expertise with OpenAI APIs, inference systems, AI gateways, and secure APIs
- Proficiency in Python, Java, Bash/PowerShell, and YAML
- Deep knowledge of CI/CD workflows, GitLab pipelines, and SDLC processes
- Experience with Kafka, HAProxy, RabbitMQ, Oracle DB, and MongoDB
- Proven success in scaling cloud-native platforms on Azure, AWS, GCP, or OCI
- Familiarity with AIOps, latency scoring, policy validation, and secure AI operations
- Background in compliance, governance, and enterprise risk management for AI systems
- Advanced debugging skills across data, infrastructure, networking, and app layers
- Leadership in chaos engineering, SLO-based operations, and system resilience
Must Have Skills
- Application & Microservice: Java, Spring Boot, API & Service Design
- Any CI/CD Tools: GitLab Pipeline/Test Automation/GitHub Actions/Jenkins/CircleCI
- App Platform: Docker & Containers (Kubernetes)
- Any Databases: SQL & NoSQL (Cassandra/Oracle/Snowflake/MongoDB)
- Any Messaging: Kafka, RabbitMQ
- Any Observability/Monitoring: Splunk/Grafana/OpenTelemetry/ELK Stack/Datadog/New Relic/Prometheus
- Incident/Change/Problem Management
Nice To Have
- Compliance-aligned continuity planning (PCI, SOX)
- Error-budget pacts with product/org leadership
- Executive Incident/Change/Problem/Risk reporting
- Observability cost vs coverage trade-offs
- Org-wide reliability governance strategy
