About the Role

As a Principal SRE, you will be a key member of the CFL Platform Engineering and Operations team. You will lead reliability engineering for AI-powered platforms supporting LLM applications, AI gateways, and enterprise-scale services across finance, credit, collections, and document systems. You will design and implement observability and incident response frameworks, scale high-performance infrastructure, and champion SRE best practices to support secure, automated, and resilient systems.

What You'll Do

Architect observability and incident response pipelines for LLM, API, and backend systems
Define SLAs, SLIs, alerts, and dashboards for latency, throughput, and availability
Lead high-severity incident response, root cause analysis, and system recovery
Collaborate with AI, Platform, and Security teams to enforce operational guardrails
Implement automation-first strategies using GitLab CI/CD, Terraform, and deployment tooling
Guide infrastructure tuning, capacity planning, and cost optimization
Drive monitoring across hybrid clouds using Prometheus, Grafana, Splunk, and OpenTelemetry
Support AIOps, model observability, policy enforcement, and audit readiness
Mentor senior SREs and foster a high-ownership, technical excellence culture

What You'll Bring

Bachelor's or Master's degree in Computer Science, Engineering, or a related field
7-12 years of experience in SRE, infrastructure, or platform roles in distributed systems
Strong experience in incident management, AI/ML observability, and performance engineering
Hands-on expertise with OpenAI APIs, inference systems, AI gateways, and secure APIs
Proficiency in Python, Java, Bash/PowerShell, and YAML
Deep knowledge of CI/CD workflows, GitLab pipelines, and SDLC processes
Experience with Kafka, HAProxy, RabbitMQ, Oracle DB, and MongoDB
Proven success in scaling cloud-native platforms on Azure, AWS, GCP, or OCI
Familiarity with AIOps, latency scoring, policy validation, and secure AI operations
Background in compliance, governance, and enterprise risk management for AI systems
Advanced debugging skills across data, infrastructure, networking, and app layers
Leadership in chaos engineering, SLO-based operations, and system resilience

Must Have Skills

Application & Microservice: Java, Spring Boot, API & Service Design
Any CI/CD Tools: GitLab Pipeline/Test Automation/GitHub Actions/Jenkins/CircleCI
App Platform: Docker & Containers (Kubernetes)
Any Databases: SQL & NoSQL (Cassandra/Oracle/Snowflake/MongoDB)
Any Messaging: Kafka, RabbitMQ
Any Observability/Monitoring: Splunk/Grafana/OpenTelemetry/ELK Stack/Datadog/New Relic/Prometheus
Incident/Change/Problem Management

Nice To Have

Compliance-aligned continuity planning (PCI, SOX)
Error-budget pacts with product/org leadership
Executive incident/change/problem/risk reporting
Observability cost vs coverage trade-offs
Org-wide reliability governance strategy

Principal Engineer, Site Reliability

TMUS Global Solutions
