Opportunity

We are looking for SREs who want to define what reliability means for the next generation of industrial software. The role involves defining SLIs/SLOs, building observability platforms, and establishing incident management processes.

Responsibilities

Define and implement SLI/SLO frameworks for complex engineering systems across manufacturing and industrial clients
Design and deploy observability platforms using Prometheus, Grafana, and Datadog
Establish incident management processes and lead blameless post-mortems
Implement chaos engineering practices to proactively identify system weaknesses
Drive toil elimination through automation and platform improvements
Build reliability engineering capabilities within the practice and client organisations

Essential Skills

SLI/SLO definition and implementation at enterprise scale
Observability: Prometheus, Grafana, Datadog, New Relic
Incident management and post-mortem facilitation
Chaos engineering: Gremlin, Chaos Monkey, Litmus
Python testing for reliability validation and automated runbooks
Automation and scripting: Python, Go, Bash
Cloud platforms: AWS, Azure, GCP

Experience

5-10 years in SRE or Production Engineering roles, with experience in enterprise or industrial environments

Site Reliability Engineer

PwC India
