We are seeking a highly experienced Site Reliability Engineer to join our enterprise infrastructure team. This role bridges software engineering and systems operations: you will own the availability, performance, and scalability of mission-critical platforms that serve millions of users globally.

JOB DETAILS

Job Title: Site Reliability Engineer (SRE)

Experience: 5+ Years (Enterprise)

Employment Type: Full-Time

Work Mode: Hybrid / On-site

KEY RESPONSIBILITIES

Platform Reliability & Availability
Define, monitor, and enforce SLOs/SLAs/SLIs across critical production services
Lead incident response, root cause analysis (RCA), and blameless post-mortem processes
Achieve and maintain 99.99%+ uptime targets for enterprise-grade systems
Infrastructure Engineering
Design, build, and maintain highly available, scalable cloud infrastructure (AWS / GCP / Azure)
Manage Kubernetes clusters, containerized workloads, and service mesh (Istio/Linkerd)
Implement and own infrastructure-as-code using Terraform, Pulumi, or Ansible
Observability & Monitoring
Build and maintain comprehensive observability stacks (Prometheus, Grafana, Datadog, ELK)
Design proactive alerting strategies and dashboards for business-critical metrics
Champion distributed tracing using Jaeger, Zipkin, or OpenTelemetry
Automation & Toil Reduction
Identify and eliminate operational toil through automation and tooling improvements
Develop self-healing systems and automated remediation runbooks
Contribute to internal developer platforms and CI/CD pipeline reliability
Capacity Planning & Performance
Perform load forecasting, capacity planning, and performance tuning at scale
Conduct chaos engineering and game-day exercises to test system resilience
Cross-functional Collaboration
Partner with engineering, product, and security teams on production readiness reviews
Define and enforce reliability standards in the SDLC and change management processes

REQUIRED QUALIFICATIONS

5+ years of hands-on SRE, DevOps, or production engineering in enterprise environments
Deep expertise in Linux systems administration and networking fundamentals
Strong programming skills in Python, Go, Bash, or Java for automation and tooling
Experience managing large-scale Kubernetes deployments in production
Proficiency with cloud platforms: AWS (preferred), GCP, or Azure at enterprise scale
Expertise with IaC tools: Terraform and/or Ansible in multi-environment setups
Proven track record of incident management and leading critical production incidents
Strong understanding of SLOs, error budgets, and reliability engineering principles

PREFERRED QUALIFICATIONS

Experience with service mesh technologies (Istio, Consul Connect)
Background in financial services, healthcare, or other regulated enterprise industries
Familiarity with FinOps and cloud cost optimization strategies
Certifications: CKA/CKAD, AWS Solutions Architect, Google Cloud Professional
Experience with GitOps workflows (ArgoCD, Flux)
Published runbooks, SRE playbooks, or internal engineering blog contributions

TECHNICAL SKILLS MATRIX

Cloud & Infra: AWS / GCP / Azure, Terraform (Pulumi, CDK, FinOps)

Containers: Kubernetes, Docker, Helm (Service Mesh, OPA)

Observability: Prometheus, Grafana, ELK (Datadog, New Relic, OTEL)

Languages: Python, Bash, Go (Java, Rust)

CI/CD: Jenkins, GitHub Actions (ArgoCD, Flux, Spinnaker)

Databases: PostgreSQL, Redis, MySQL (Cassandra, MongoDB)

Senior Site Reliability Engineer

Arokee
