About the Role

We are looking for a passionate and detail-oriented Site Reliability Engineer (SRE) to join our engineering team. As an SRE, you will play a critical role in ensuring the reliability, scalability, and performance of our infrastructure and services. You'll work closely with development and QA teams to build, maintain, and scale production systems while implementing best practices for monitoring, automation, and incident management.

This position is ideal for engineers who thrive in complex distributed environments, have strong Kubernetes skills, and enjoy improving system reliability through automation and modern tooling.

Key Responsibilities

Infrastructure Reliability & Performance
Maintain, monitor, and improve uptime and performance of production systems.
Design and implement scalable, reliable, and secure infrastructure on cloud platforms (AWS / GCP).
Kubernetes & Containerization
Deploy, manage, and optimize containerized workloads using Kubernetes and Helm.
Troubleshoot Kubernetes clusters, pods, and networking issues.
Manage CI/CD pipelines integrated with Kubernetes-based deployments.
Monitoring & Incident Response
Participate in on-call rotations for production support and incident response.
Conduct post-incident reviews and drive preventive improvements.
Security & Compliance
Implement and enforce security best practices in infrastructure and application deployments.
Manage access controls, secrets, and network policies in production environments.
Collaboration & Continuous Improvement
Work with development teams to design systems with reliability and scalability in mind.
Drive automation and self-healing capabilities for common operational tasks.
Contribute to SRE playbooks, runbooks, and documentation.

Required Skills & Qualifications

Education: Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience).
Experience: 1-2 years of experience in an SRE or DevOps role.
Core Skills:
Strong experience with Kubernetes, Docker, and container orchestration.
Proficiency in Linux system administration and shell scripting.
Good knowledge of cloud platforms (AWS / GCP / Azure) and related services.
Basic understanding of networking concepts (DNS, Load Balancing, Firewalls, etc.).
Programming experience in Python, Go, or Bash for automation.

Good to Have

Experience in multi-cloud or hybrid cloud environments.
Certified Kubernetes Administrator (CKA) certification.
Experience in cost optimization and capacity planning.
Understanding of SLOs, SLIs, and SLAs within an SRE framework.
Contribution to open-source projects or active participation in the SRE community.

Site Reliability Engineer

greytHR
