Site Reliability Engineer
Tricog Health
2 - 5 years
Bengaluru
Posted: 05/02/2026
Job Description
RESPONSIBILITIES
- Operate and optimize Kubernetes-based infrastructure using Helm/Kustomize for deployment and configuration management.
- Build and maintain CI/CD pipelines for infrastructure and application deployments.
- Manage and monitor cloud infrastructure on AWS (EKS, EC2, S3, IAM, VPC, etc.) and on-premises infrastructure.
- Ensure observability through logging, monitoring, and alerting systems (e.g., Prometheus, Grafana, CloudWatch, Datadog).
- Implement and enforce security best practices across infrastructure components.
- Participate in on-call rotations, incident response, and root cause analysis.
- Support scaling of systems to meet demand while maintaining reliability.
- Collaborate with engineering and security teams on architecture and deployment strategies.
- Ensure the implementation of security standards and compliance requirements across all operational aspects of the cloud platforms.
MUST HAVE SKILLS
- 3 - 6+ years of hands-on experience in SRE roles
- 2 - 4+ years of managing production Kubernetes environments
- Currently operating production EKS clusters (hands-on, not observational)
- Deep expertise in Kubernetes (EKS or self-managed) and Helm
- Strong understanding of networking fundamentals: TCP/IP, DNS, VPNs, firewalls, load balancing
- Practical experience with AWS services: EKS, EC2, IAM, S3, CloudWatch, VPC
- Solid exposure to containerization (Docker) and CI/CD pipelines (e.g., Bitbucket Pipelines, GitHub Actions, ArgoCD, Flux CD)
- Proven experience handling production systems, on-call rotations, and real-time incident response
- Proficiency in at least one programming language (Python or Go preferred)
- Clear understanding of the Software Development Life Cycle (SDLC)
- Strong automation mindset with a bias toward eliminating manual toil
- Ability to build and maintain Grafana dashboards using PromQL (or equivalent)
- Strong grasp of SRE principles: SLIs, SLOs, error budgets, incident and post-incident management
NICE TO HAVE
- Experience in regulated industries (healthcare, fintech).
- Experience with incident management and disaster recovery.
QUALIFICATIONS/EXPERIENCE
- Minimum of 3 years of experience, including 2+ years in SRE.
- BTech/BE/BS or MTech/MCA/ME/MS
- 2+ years of work experience with Amazon Web Services (AWS)
- 2+ years of work experience with Kubernetes
- 2+ years of work experience with Site Reliability Engineering
- Working in a hybrid setting
WHAT DAY TO DAY LOOKS LIKE
- Monitoring Service-Level Indicators (SLIs)
- Setting Service-Level Objectives (SLOs) and Service-Level Agreements (SLAs)
- Responding to Incidents
- Writing Postmortems
- Automating System Tasks
- Cross-Department Collaboration
- Building Software for DevOps, SRE, and Support Teams
- Fixing Support Escalation Issues
- Optimizing On-Call Rotations and Processes
- Documenting "Tribal" Knowledge
- Conducting Post-Incident Reviews
