Lead - Site Reliability Engineer
FundsIndia
6 - 7 years
Chennai
Posted: 29/06/2026
Job Description
Role Overview
We are looking for a Lead Site Reliability Engineer with 6-7 years of experience to drive reliability, observability, and incident management practices. The ideal candidate will have strong expertise in the Grafana stack and production monitoring, along with experience handling critical incidents in high-availability systems.
Key Responsibilities
- Act as the Incident Commander during production outages, ensuring timely resolution and stakeholder communication
- Lead incident response, triage, RCA (Root Cause Analysis), and postmortems
- Build and enhance observability systems using the Grafana stack (Prometheus, Loki, Tempo)
- Define and manage SLIs, SLOs, and SLAs for critical services
- Develop and maintain monitoring, alerting, and dashboards for proactive issue detection
- Collaborate with Dev, Infra, and DB teams to improve system reliability and performance
- Drive automation and runbook creation to reduce manual intervention
- Improve on-call processes and incident management workflows
- Ensure high availability, scalability, and fault tolerance of systems
Required Skills
- 5-6 years of experience in Site Reliability Engineering / Production Support
- Strong hands-on experience with the Grafana stack (Prometheus, Loki, Tempo)
- Solid understanding of monitoring, alerting, and observability principles
- Experience in incident management and handling P1/P2 incidents
- Knowledge of cloud platforms (AWS)
- Experience with Linux systems and troubleshooting
- Familiarity with Kubernetes / containerized environments
- Strong scripting skills (Python / Bash)
