Senior Site Reliability Engineer (GCP | Terraform | Ansible | SRE | On-Call)

We are looking for a high-impact Site Reliability Engineer (SRE) who will play a key role in ensuring the reliability, availability, and scalability of our production systems on Google Cloud Platform (GCP).

If you thrive in fast-paced environments, excel in incident management, and love building automated, scalable infrastructure, this role is for you.

Responsibilities

Production Reliability & On-Call Excellence

Act as a primary responder in a 24/7 rotational on-call schedule.
Rapidly identify, mitigate, and resolve high-severity production incidents impacting GCP services.
Conduct detailed Root Cause Analysis (RCA) and implement long-term corrective actions.

Infrastructure-as-Code (IaC)

Design, build, and maintain large-scale, multi-environment infrastructure using Terraform.
Develop reusable modules, follow best practices, and maintain version-controlled infrastructure deployments.

Configuration Management

Build and optimize Ansible playbooks and roles for configuration consistency, patching, and environment provisioning.

Automation & Tooling

Develop automation using Python, Go, or Bash to eliminate operational toil and accelerate engineering productivity.
Drive an automation-first culture across the SRE team.

Monitoring & Observability

Enhance monitoring, logging, and alerting using tools like Prometheus, Grafana, Stackdriver, or similar.
Improve observability for proactive detection of service health degradation.

Containers & Orchestration

Manage and troubleshoot Kubernetes (GKE) clusters for deployment, scaling, and reliability of containerized applications.

SRE Best Practices

Define and measure SLIs/SLOs, engineer reliability, and reduce toil through automation.
Collaborate closely with DevOps, Cloud, and Engineering teams for continuous improvement.

Requirements

Must Have

3+ years of hands-on experience with GCP, including GKE, GCE, VPC networking, IAM, load balancers, security, and networking fundamentals.
Advanced expertise in Terraform for production-grade infrastructure deployments.
Strong Ansible experience for configuration management.
Proven experience in on-call rotations, incident response, and handling critical production issues.
Proficiency in Python, Go, or Bash for automation.
Strong understanding of SRE principles: SLIs/SLOs, error budgets, incident management, RCA.
Experience with Kubernetes, containerization, and troubleshooting distributed systems.

Nice to Have

Exposure to service mesh (Istio/Linkerd).
Experience with CI/CD pipelines (Jenkins, GitLab CI, Cloud Build).
GCP certifications (Associate Cloud Engineer / Professional Cloud DevOps Engineer).

What We Offer

Opportunity to work on high-scale, mission-critical systems.
A culture of ownership, innovation, and automation.
Competitive compensation + on-call benefits.
Growth opportunities in SRE, Cloud, and Platform Engineering tracks.

How to Apply

Share your updated resume at:

Enterprise Minds, Inc
