Role Overview

We are looking for a Lead Site Reliability Engineer with 6-7 years of experience to drive reliability, observability, and incident management practices. The ideal candidate will have strong expertise in the Grafana stack, production monitoring, and handling critical incidents in high-availability systems.

Key Responsibilities

Act as the Incident Commander during production outages, ensuring timely resolution and stakeholder communication
Lead incident response, triage, RCA (Root Cause Analysis), and postmortems
Build and enhance observability systems using the Grafana stack (Prometheus, Loki, Tempo)
Define and manage SLIs, SLOs, and SLAs for critical services
Develop and maintain monitoring, alerting, and dashboards for proactive issue detection
Collaborate with Dev, Infra, and DB teams to improve system reliability and performance
Drive automation and runbook creation to reduce manual intervention
Improve on-call processes and incident management workflows
Ensure high availability, scalability, and fault tolerance of systems

Required Skills

5-6 years of experience in Site Reliability Engineering / Production Support
Strong hands-on experience with the Grafana stack (Prometheus, Loki, Tempo)
Solid understanding of monitoring, alerting, and observability principles
Experience in incident management and handling P1/P2 incidents
Knowledge of cloud platforms (AWS)
Experience with Linux systems and troubleshooting
Familiarity with Kubernetes / containerized environments
Strong scripting skills (Python / Bash)

Lead - Site Reliability Engineer

FundsIndia
