Senior Site Reliability Engineer

Arokee

5 - 10 years

Bengaluru

Posted: 15/04/2026

Job Description

We are seeking a highly experienced Site Reliability Engineer to join our enterprise infrastructure team. This role bridges software engineering and systems operations: you will own the availability, performance, and scalability of mission-critical platforms that serve millions of users globally.


JOB DETAILS

Job Title: Site Reliability Engineer (SRE)

Experience: 5+ Years (Enterprise)

Employment Type: Full-Time

Work Mode: Hybrid / On-site


KEY RESPONSIBILITIES

  • Platform Reliability & Availability
      • Define, monitor, and enforce SLOs/SLAs/SLIs across critical production services
      • Lead incident response, root cause analysis (RCA), and blameless post-mortem processes
      • Achieve and maintain 99.99%+ uptime targets for enterprise-grade systems
  • Infrastructure Engineering
      • Design, build, and maintain highly available, scalable cloud infrastructure (AWS / GCP / Azure)
      • Manage Kubernetes clusters, containerized workloads, and service mesh (Istio/Linkerd)
      • Implement and own infrastructure-as-code using Terraform, Pulumi, or Ansible
  • Observability & Monitoring
      • Build and maintain comprehensive observability stacks (Prometheus, Grafana, Datadog, ELK)
      • Design proactive alerting strategies and dashboards for business-critical metrics
      • Champion distributed tracing using Jaeger, Zipkin, or OpenTelemetry
  • Automation & Toil Reduction
      • Identify and eliminate operational toil through automation and tooling improvements
      • Develop self-healing systems and automated remediation runbooks
      • Contribute to internal developer platforms and CI/CD pipeline reliability
  • Capacity Planning & Performance
      • Perform load forecasting, capacity planning, and performance tuning at scale
      • Conduct chaos engineering and game-day exercises to test system resilience
  • Cross-functional Collaboration
      • Partner with engineering, product, and security teams on production readiness reviews
      • Define and enforce reliability standards in the SDLC and change management processes


REQUIRED QUALIFICATIONS

  • 5+ years of hands-on SRE, DevOps, or production engineering experience in enterprise environments
  • Deep expertise in Linux systems administration and networking fundamentals
  • Strong programming skills in Python, Go, Bash, or Java for automation and tooling
  • Experience managing large-scale Kubernetes deployments in production
  • Proficiency with cloud platforms: AWS (preferred), GCP, or Azure at enterprise scale
  • Expertise with IaC tools: Terraform and/or Ansible in multi-environment setups
  • Proven track record of incident management, including leading the response to critical production incidents
  • Strong understanding of SLOs, error budgets, and reliability engineering principles


PREFERRED QUALIFICATIONS

  • Experience with service mesh technologies (Istio, Consul Connect)
  • Background in financial services, healthcare, or other regulated enterprise industries
  • Familiarity with FinOps and cloud cost optimization strategies
  • Certifications: CKA/CKAD, AWS Solutions Architect, Google Cloud Professional
  • Experience with GitOps workflows (ArgoCD, Flux)
  • Published runbooks, SRE playbooks, or internal engineering blog contributions


TECHNICAL SKILLS MATRIX

Cloud & Infra: AWS / GCP / Azure, Terraform (Pulumi, CDK, FinOps)

Containers: Kubernetes, Docker, Helm (Service Mesh, OPA)

Observability: Prometheus, Grafana, ELK (Datadog, New Relic, OTEL)

Languages: Python, Bash, Go (Java, Rust)

CI/CD: Jenkins, GitHub Actions (ArgoCD, Flux, Spinnaker)

Databases: PostgreSQL, Redis, MySQL (Cassandra, MongoDB)

