5+ years of experience in observability, monitoring, or reliability engineering roles.
Hands-on experience with observability tools such as Prometheus, Grafana, Splunk, and Coralogix, as well as external monitoring tools (e.g., Catchpoint, ThousandEyes).
Strong scripting skills in Python, plus Bash or PowerShell for automation.
Experience with Terraform and Ansible for infrastructure automation.
Solid understanding of SLIs, SLOs, error budgets, and reliability engineering principles.
Familiarity with Linux environments and distributed systems.
Design and implement a Universal Dashboard in Grafana for leadership and engineering visibility.
Ensure a consistent look and feel across all observability views.
Define and implement SLIs, SLOs, and error budgets for critical services.
Establish alerting thresholds and escalation workflows aligned with reliability goals.
Integrate anomaly detection and AI-assisted insights into the observability platform.
Contribute to self-healing workflows and automated remediation strategies.
Partner with engineering teams to instrument services with metrics, logs, and traces.
Provide documentation and best practices for observability adoption across teams.

Site Reliability Engineer