Senior Site Reliability Engineer
Arokee
5 - 10 years
Bengaluru
Posted: 15/04/2026
Job Description
We are seeking a highly experienced Site Reliability Engineer to join our enterprise infrastructure team. This role bridges software engineering and systems operations: you will own the availability, performance, and scalability of mission-critical platforms that serve millions of users globally.
JOB DETAILS
Job Title: Site Reliability Engineer (SRE)
Experience: 5+ Years (Enterprise)
Employment Type: Full-Time
Work Mode: Hybrid / On-site
KEY RESPONSIBILITIES
- Platform Reliability & Availability
  - Define, monitor, and enforce SLOs/SLAs/SLIs across critical production services
  - Lead incident response, root cause analysis (RCA), and blameless post-mortem processes
  - Achieve and maintain 99.99%+ uptime targets for enterprise-grade systems
- Infrastructure Engineering
  - Design, build, and maintain highly available, scalable cloud infrastructure (AWS / GCP / Azure)
  - Manage Kubernetes clusters, containerized workloads, and service mesh (Istio/Linkerd)
  - Implement and own infrastructure-as-code using Terraform, Pulumi, or Ansible
- Observability & Monitoring
  - Build and maintain comprehensive observability stacks (Prometheus, Grafana, Datadog, ELK)
  - Design proactive alerting strategies and dashboards for business-critical metrics
  - Champion distributed tracing using Jaeger, Zipkin, or OpenTelemetry
- Automation & Toil Reduction
  - Identify and eliminate operational toil through automation and tooling improvements
  - Develop self-healing systems and automated remediation runbooks
  - Contribute to internal developer platforms and CI/CD pipeline reliability
- Capacity Planning & Performance
  - Perform load forecasting, capacity planning, and performance tuning at scale
  - Conduct chaos engineering and game-day exercises to test system resilience
- Cross-functional Collaboration
  - Partner with engineering, product, and security teams on production readiness reviews
  - Define and enforce reliability standards in the SDLC and change management processes
REQUIRED QUALIFICATIONS
- 5+ years of hands-on SRE, DevOps, or production engineering in enterprise environments
- Deep expertise in Linux systems administration and networking fundamentals
- Strong programming skills in Python, Go, Bash, or Java for automation and tooling
- Experience managing large-scale Kubernetes deployments in production
- Proficiency with cloud platforms: AWS (preferred), GCP, or Azure at enterprise scale
- Expertise with IaC tools: Terraform and/or Ansible in multi-environment setups
- Proven track record of incident management and leading critical production incidents
- Strong understanding of SLOs, error budgets, and reliability engineering principles
PREFERRED QUALIFICATIONS
- Experience with service mesh technologies (Istio, Consul Connect)
- Background in financial services, healthcare, or other regulated enterprise industries
- Familiarity with FinOps and cloud cost optimization strategies
- Certifications: CKA/CKAD, AWS Solutions Architect, Google Cloud Professional
- Experience with GitOps workflows (ArgoCD, Flux)
- Published runbooks, SRE playbooks, or internal engineering blog contributions
TECHNICAL SKILLS MATRIX
Cloud & Infra: AWS / GCP / Azure, Terraform (Pulumi, CDK, FinOps)
Containers: Kubernetes, Docker, Helm (Service Mesh, OPA)
Observability: Prometheus, Grafana, ELK (Datadog, New Relic, OTEL)
Languages: Python, Bash, Go (Java, Rust)
CI/CD: Jenkins, GitHub Actions (ArgoCD, Flux, Spinnaker)
Databases: PostgreSQL, Redis, MySQL (Cassandra, MongoDB)
