Lead - Site Reliability Engineer

FundsIndia

6 - 7 years

Chennai

Posted: 29/06/2026

Job Description

Role Overview


We are looking for a Lead Site Reliability Engineer with 6-7 years of experience to drive reliability, observability, and incident management practices. The ideal candidate will have strong expertise in the Grafana stack and production monitoring, and experience handling critical incidents in high-availability systems.


Key Responsibilities

  • Act as the Incident Commander during production outages, ensuring timely resolution and stakeholder communication
  • Lead incident response, triage, RCA (Root Cause Analysis), and postmortems
  • Build and enhance observability systems using the Grafana stack (Prometheus, Loki, Tempo)
  • Define and manage SLIs, SLOs, and SLAs for critical services
  • Develop and maintain monitoring, alerting, and dashboards for proactive issue detection
  • Collaborate with Dev, Infra, and DB teams to improve system reliability and performance
  • Drive automation and runbook creation to reduce manual intervention
  • Improve on-call processes and incident management workflows
  • Ensure high availability, scalability, and fault tolerance of systems


Required Skills

  • 5-6 years of experience in Site Reliability Engineering / Production Support
  • Strong hands-on experience with the Grafana stack (Prometheus, Loki, Tempo)
  • Solid understanding of monitoring, alerting, and observability principles
  • Experience in incident management and handling P1/P2 incidents
  • Knowledge of cloud platforms (AWS)
  • Experience with Linux systems and troubleshooting
  • Familiarity with Kubernetes / containerized environments
  • Strong scripting skills (Python / Bash)
