Site Reliability Engineer

Tricog Health

2 - 5 years

Bengaluru

Posted: 05/02/2026

Job Description

RESPONSIBILITIES

  • Operate and optimize Kubernetes-based infrastructure using Helm/Kustomize for deployment and configuration management.
  • Build and maintain CI/CD pipelines for infrastructure and application deployments.
  • Manage and monitor cloud infrastructure on AWS (EKS, EC2, S3, IAM, VPC, etc.) and on-premises infrastructure.
  • Ensure observability through logging, monitoring, and alerting systems (e.g., Prometheus, Grafana, CloudWatch, Datadog).
  • Implement and enforce security best practices across infrastructure components.
  • Participate in on-call rotations, incident response, and root cause analysis.
  • Support scaling of systems to meet demand while maintaining reliability.
  • Collaborate with engineering and security teams on architecture and deployment strategies.
  • Ensure the implementation of security standards and compliance requirements across all operational aspects of the cloud platforms.


MUST HAVE SKILLS

  • 3 - 6+ years of hands-on experience in SRE roles
  • 2 - 4+ years of managing production Kubernetes environments
  • Currently operating production EKS clusters (hands-on, not observational)
  • Deep expertise in Kubernetes (EKS or self-managed) and Helm
  • Strong understanding of networking fundamentals: TCP/IP, DNS, VPNs, firewalls, load balancing
  • Practical experience with AWS services: EKS, EC2, IAM, S3, CloudWatch, VPC
  • Solid exposure to containerization (Docker) and CI/CD pipelines (e.g., Bitbucket Pipelines, GitHub Actions, ArgoCD, Flux CD)
  • Proven experience handling production systems, on-call rotations, and real-time incident response
  • Proficiency in at least one programming language (Python or Go preferred)
  • Clear understanding of the Software Development Life Cycle (SDLC)
  • Strong automation mindset with a bias toward eliminating manual toil
  • Ability to build and maintain Grafana dashboards using PromQL (or equivalent)
  • Strong grasp of SRE principles: SLIs, SLOs, error budgets, incident and post-incident management


NICE TO HAVE

  • Experience in regulated industries (healthcare, fintech).
  • Experience with incident management and disaster recovery.


QUALIFICATIONS/EXPERIENCE

  • Minimum of 3 years of total experience, including 2+ years of SRE experience.
  • BTech/BE/BS or MTech/MCA/ME/MS
  • 2+ years of work experience with Amazon Web Services (AWS)
  • 2+ years of work experience with Kubernetes
  • 2+ years of work experience with Site Reliability Engineering
  • Comfortable working in a hybrid setting


WHAT DAY TO DAY LOOKS LIKE

  • Monitoring Service-Level Indicators (SLIs)
  • Setting Service-Level Objectives (SLOs) and Service-Level Agreements (SLAs)
  • Responding to Incidents
  • Writing Postmortems
  • Automating System Tasks
  • Cross-Department Collaboration
  • Building Software for DevOps, SRE, and Support Teams
  • Fixing Support Escalation Issues
  • Optimizing On-Call Rotations and Processes
  • Documenting "Tribal" Knowledge
  • Conducting Post-Incident Reviews
