Reliability Engineer (Observability & SRE)

Provate, Inc.

2 - 5 years

Hyderabad

Posted: 09/01/2026

Job Description

Company Description

Provate is a global IT services and engineering company delivering cloud, platform engineering, observability, and reliability solutions for enterprise and high-growth digital platforms. We help enterprises and high-scale digital platforms build resilient, scalable, and high-performance systems by combining deep technical expertise with strong operational practices.


Role Description

As a Reliability Engineer at Provate, you will be part of a Site Reliability & Observability team responsible for ensuring the reliability, performance, and scalability of a global, high-traffic digital platform operating at significant scale. This role is part of a long-term client engagement supporting business-critical services that process large volumes of real-time data, requiring strong observability and reliability foundations.

In this hands-on role, you will help design, build, and scale observability systems for metrics, logs, and distributed tracing, enabling engineering teams to gain deep visibility into system behavior, troubleshoot issues quickly, and continuously improve service reliability. You will collaborate closely with application, platform, and operations teams to ensure actionable alerting, avoid alert fatigue, and optimize systems for performance, cost, and uptime.

You will also play a key role in shaping how systems are observed, measured, and improved, driving the adoption of SRE best practices and fostering a culture of ownership, accountability, and operational excellence. A core focus of this role is partnering with application teams to engineer improvements that deliver high availability and consistent performance for critical services.


Key Responsibilities

  • Build and scale observability systems: Design and maintain infrastructure for collecting, aggregating, and analyzing telemetry data (metrics, logs, and traces).
  • Enable actionable insights: Develop dashboards, alerts, and visualizations that turn raw data into clear, meaningful information for engineers, SREs, and business stakeholders.
  • Collaborate across teams: Partner with engineering, operations, and SRE teams to define SLIs/SLOs and improve visibility into system performance and health.
  • Drive best practices: Advocate for and support consistent instrumentation, effective alerting, and strong observability practices across engineering teams.
  • Optimize systems and tools: Continuously assess performance, usage, and cost of observability tools, identifying opportunities for improvement and efficiency.
  • Automate: Engineer capabilities that drive the adoption of SRE principles and best practices in what is deployed within the environment.
  • Improve: In collaboration with engineering teams, develop plans to improve the reliability of applications and infrastructure, and assist these teams in engineering those improvements.
  • Support incident response: Participate in and help improve the incident response process, reducing MTTR and contributing to post-incident reviews and root cause analysis.


What We're Looking For

Technical Skills

  • 3+ years of experience in Observability, SRE, DevOps, or Platform Engineering roles.
  • Programming experience in Go, Python, Java, or Node.js, with the ability to build tools and advise on application-level instrumentation improvements.
  • Strong hands-on experience with observability tools, including:
      • LGTM stack (Loki, Grafana, Tempo, Mimir)
      • Datadog
      • AWS CloudWatch
      • Prometheus
      • PagerDuty
      • ClickStack
      • VictoriaMetrics
      • Groundcover
      • Libre
      • Zabbix
  • Cloud: Experience with AWS and services such as EC2, EKS, ECS, and VPC networking.
  • Containers & orchestration: Familiarity with Docker and Kubernetes.
  • Infrastructure as Code & automation: Experience with tools like Terraform, Ansible, Chef, or SCCM to manage observability infrastructure at scale.
  • Linux systems knowledge: Strong understanding of Linux, shell scripting, and the storage/networking stack.
  • Tracing: Deep understanding of tracing technology and OpenTelemetry.
  • SRE practices: SLIs, SLOs, error budgets, and failure domains.

Soft Skills

  • Strong analytical skills for interpreting telemetry data and identifying trends or anomalies.
  • Clear and effective written and verbal communication skills.
  • Ability to influence engineering teams on monitoring standards and reliability improvements.

