RESPONSIBILITIES

Operate and optimize Kubernetes-based infrastructure using Helm/Kustomize for deployment and configuration management.
Build and maintain CI/CD pipelines for infrastructure and application deployments.
Manage and monitor cloud infrastructure on AWS (EKS, EC2, S3, IAM, VPC, etc.) and on-premises infrastructure.
Ensure observability through logging, monitoring, and alerting systems (e.g., Prometheus, Grafana, CloudWatch, Datadog).
Implement and enforce security best practices across infrastructure components.
Participate in on-call rotations, incident response, and root cause analysis.
Support scaling of systems to meet demand while maintaining reliability.
Collaborate with engineering and security teams on architecture and deployment strategies.
Ensure the implementation of security standards and compliance requirements across all operational aspects of the cloud platforms.

MUST HAVE SKILLS

3-6+ years of hands-on experience in SRE roles
2-4+ years of managing production Kubernetes environments
Currently operating production EKS clusters (hands-on, not observational)
Deep expertise in Kubernetes (EKS or self-managed) and Helm
Strong understanding of networking fundamentals: TCP/IP, DNS, VPNs, firewalls, load balancing
Practical experience with AWS services: EKS, EC2, IAM, S3, CloudWatch, VPC
Solid exposure to containerization (Docker) and CI/CD pipelines (e.g., Bitbucket Pipelines, GitHub Actions, ArgoCD, Flux CD)
Proven experience handling production systems, on-call rotations, and real-time incident response
Proficiency in at least one programming language (Python or Go preferred)
Clear understanding of the Software Development Life Cycle (SDLC)
Strong automation mindset with a bias toward eliminating manual toil
Ability to build and maintain Grafana dashboards using PromQL (or equivalent)
Strong grasp of SRE principles: SLIs, SLOs, error budgets, incident and post-incident management

NICE TO HAVE

QUALIFICATIONS/EXPERIENCE

WHAT DAY TO DAY LOOKS LIKE

Site Reliability Engineer