Role Summary

We are looking for a highly skilled Site Reliability Engineer (SRE) to ensure the reliability, scalability, and performance of our cloud-native infrastructure. The ideal candidate will bring strong hands-on experience in AWS, Kubernetes, Docker, CI/CD pipelines, monitoring, and automation using Python, and will work closely with development and operations teams to build resilient, highly available systems.

Key Responsibilities

Design, deploy, and maintain highly available and scalable systems on AWS
Manage and operate containerized applications using Docker and Kubernetes (EKS)
Build, maintain, and optimize CI/CD pipelines using Jenkins
Automate operational workflows and routine tasks using Python scripting
Implement and manage monitoring, alerting, and observability using Grafana and Prometheus
Ensure system reliability, performance, uptime, and scalability
Participate in incident response, root cause analysis (RCA), and post-incident reviews
Implement Infrastructure as Code (IaC) and automation best practices
Collaborate with development teams to improve system architecture and deployment strategies
Enforce security, compliance, and operational best practices in cloud environments
Continuously improve system efficiency through automation, tooling, and process optimization

Required Skills & Qualifications

Strong hands-on experience with AWS services (EC2, S3, IAM, VPC, RDS, EKS, etc.)
Solid experience with Kubernetes (EKS) and Docker
Proficiency in Python scripting for automation and monitoring
Experience designing and managing CI/CD pipelines using Jenkins
Strong understanding of DevOps principles and CI/CD best practices
Hands-on experience with Grafana and Prometheus for monitoring and alerting
Strong knowledge of Linux systems and networking fundamentals
Experience with Git or other version control systems
Understanding of microservices architecture

Good to Have

Experience with Terraform or CloudFormation
Knowledge of Helm, ArgoCD, or similar deployment tools
Familiarity with log management tools (ELK / EFK stack)
Understanding of SRE practices such as SLIs, SLOs, SLAs, and error budgets
AWS and/or Kubernetes certifications (CKA / CKAD)

Site Reliability Engineer

TRDFIN Support Services Pvt Ltd
