Senior Site Reliability Engineer
Brillio
5 - 10 years
Bengaluru
Posted: 16/05/2026
Job Description
We are seeking a Senior Observability / Monitoring Engineer to drive end-to-end observability and monitoring for enterprise platforms. This role will focus on enabling proactive issue detection, faster incident resolution, and improved system reliability through effective use of observability tools and practices.
The ideal candidate will bring strong experience in logs, metrics, traces, alerting strategies, and monitoring tools, along with hands-on exposure to production environments and SRE practices.
Key Responsibilities
Observability Engineering
Design and implement end-to-end observability solutions across applications and infrastructure
Establish unified visibility across logs, metrics, and distributed tracing
Define and standardize monitoring frameworks, dashboards, and alerting strategies
Enable proactive detection of issues through intelligent alerting and anomaly detection
Monitoring & Tooling
Implement and manage tools such as Splunk, Datadog, Prometheus, Grafana, New Relic, or similar
Build actionable dashboards for SRE, operations, and business stakeholders
Optimize alert configurations to reduce noise and improve signal quality
Continuously enhance monitoring coverage across systems and services
Incident Support & Reliability
Support a late-night / US-overlap shift for production monitoring and incident response
Analyze logs, metrics, and traces to support incident triage and root cause analysis (RCA)
Collaborate with SRE and engineering teams to improve system reliability and performance
Participate in post-incident reviews and continuous improvement initiatives
Automation & Integration
Automate monitoring setup and configuration using Infrastructure as Code (IaC)
Integrate observability tools with CI/CD pipelines and DevOps workflows
Develop scripts/tools to improve data collection, alerting, and reporting
Platform & Integration Support
Monitor enterprise applications, APIs, and integration layers (e.g., middleware, cloud services)
Ensure end-to-end visibility across distributed systems and microservices architectures
Work closely with platform teams (cloud, Salesforce, etc.) to enhance observability
Governance & Compliance
Ensure monitoring practices align with security and compliance requirements (e.g., SOX)
Maintain runbooks, documentation, and monitoring standards
Support audit and governance requirements as needed
Required Skills & Qualifications
Technical Skills
Strong experience in observability, monitoring, or SRE roles
Hands-on experience with tools like Splunk, Datadog, Prometheus, Grafana, New Relic
Strong understanding of logs, metrics, traces, and distributed systems
Experience with APM tools and performance monitoring
Scripting skills (Python, Bash, PowerShell, or similar)
Familiarity with CI/CD tools (Jenkins, GitHub Actions, Azure DevOps)
Knowledge of Infrastructure as Code (Terraform or similar)
Operational Excellence
Experience supporting production environments in a 24x7 operating model
Strong incident management and RCA capabilities
Ability to analyze performance issues and recommend improvements
Soft Skills
Ability to work effectively in a late-night / US-overlap shift
Strong communication and collaboration skills
Proactive mindset with a focus on continuous improvement
