We are seeking a Senior Observability / Monitoring Engineer to drive end-to-end observability and monitoring for enterprise platforms. This role will focus on enabling proactive issue detection, faster incident resolution, and improved system reliability through effective use of observability tools and practices.

The ideal candidate will bring strong experience in logs, metrics, traces, alerting strategies, and monitoring tools, along with hands-on exposure to production environments and SRE practices.

Key Responsibilities

Observability Engineering

Design and implement end-to-end observability solutions across applications and infrastructure

Establish unified visibility across logs, metrics, and distributed tracing

Define and standardize monitoring frameworks, dashboards, and alerting strategies

Enable proactive detection of issues through intelligent alerting and anomaly detection

Monitoring & Tooling

Implement and manage tools such as Splunk, Datadog, Prometheus, Grafana, New Relic, or similar

Build actionable dashboards for SRE, operations, and business stakeholders

Optimize alert configurations to reduce noise and improve signal quality

Continuously enhance monitoring coverage across systems and services

Incident Support & Reliability

Support a late-night / US-overlap shift for production monitoring and incident response

Analyze logs, metrics, and traces to support incident triage and root cause analysis (RCA)

Collaborate with SRE and engineering teams to improve system reliability and performance

Participate in post-incident reviews and continuous improvement initiatives

Automation & Integration

Automate monitoring setup and configuration using Infrastructure as Code (IaC)

Integrate observability tools with CI/CD pipelines and DevOps workflows

Develop scripts/tools to improve data collection, alerting, and reporting

Platform & Integration Support

Monitor enterprise applications, APIs, and integration layers (e.g., middleware, cloud services)

Ensure end-to-end visibility across distributed systems and microservices architectures

Work closely with platform teams (cloud, Salesforce, etc.) to enhance observability

Governance & Compliance

Ensure monitoring practices align with security and compliance requirements (e.g., SOX)

Maintain runbooks, documentation, and monitoring standards

Support audit and governance requirements as needed

Required Skills & Qualifications

Technical Skills

Strong experience in observability, monitoring, or SRE roles

Hands-on experience with tools like Splunk, Datadog, Prometheus, Grafana, New Relic

Strong understanding of logs, metrics, traces, and distributed systems

Experience with APM tools and performance monitoring

Scripting skills (Python, Bash, PowerShell, or similar)

Familiarity with CI/CD tools (Jenkins, GitHub Actions, Azure DevOps)

Knowledge of Infrastructure as Code (Terraform or similar)

Operational Excellence

Experience supporting production environments in 24x7 support models

Strong incident management and RCA capabilities

Ability to analyze performance issues and recommend improvements

Soft Skills

Ability to work effectively in a late-night / US-overlap shift

Strong communication and collaboration skills

Proactive mindset with a focus on continuous improvement

Senior Site Reliability Engineer

Brillio
