Site Reliability Engineer
TecQubes Technologies
2 - 5 years
Bengaluru
Posted: 12/02/2026
Job Description
Company Description
TecQubes Technologies is a global company dedicated to streamlining business operations and delivering fast, output-driven results. With a dynamic research team, we offer advanced solutions designed to move businesses closer to lasting success. We bring innovative technology products and services to our clients, drawing on strong research and market insights to deliver consistent results.
Role Description
We are seeking a highly experienced Site Reliability Engineer (SRE) with 10+ years of experience in designing, implementing, and maintaining highly available, scalable, and resilient systems. The ideal candidate will have deep expertise in AWS, Kubernetes, Elasticsearch, Grafana, and modern SRE practices, with a strong focus on automation, observability, and operational excellence.
Qualifications
- 10+ years of experience in Site Reliability Engineering, DevOps, or Platform Engineering.
- Strong hands-on experience with AWS services (EC2, EKS, S3, RDS, IAM, VPC, CloudWatch, Auto Scaling).
- Advanced expertise in Kubernetes (EKS preferred), Helm, and container orchestration.
- Deep knowledge of Elasticsearch (cluster management, indexing, search optimization, performance tuning).
- Strong experience with Grafana and observability stacks (Prometheus, Loki, ELK).
- Proficiency in Linux system administration and networking fundamentals.
- Experience with Infrastructure as Code tools (Terraform, CloudFormation).
- Strong scripting skills in Python, Bash, or Go.
Key Responsibilities
- Design, build, and operate highly reliable, scalable, and fault-tolerant systems in AWS cloud environments.
- Implement and manage Kubernetes (EKS) clusters, including deployment strategies, scaling, upgrades, and security hardening.
- Own and improve SLIs, SLOs, and SLAs, driving reliability through data-driven decisions.
- Architect and maintain observability platforms using Grafana, Prometheus, and Elasticsearch.
- Manage and optimize Elasticsearch clusters, including indexing strategies, performance tuning, scaling, and backup/restore.
- Develop and maintain monitoring, alerting, and logging solutions to ensure proactive incident detection and response.
- Lead incident management, root cause analysis (RCA), postmortems, and continuous improvement initiatives.
- Automate infrastructure and operations using Infrastructure as Code (IaC) and scripting.
- Collaborate with development teams to improve system reliability, deployment pipelines, and release processes.
- Implement CI/CD best practices and reduce deployment risk through canary, blue-green, and rolling deployments.
- Ensure security, compliance, and cost optimization across cloud infrastructure.
- Mentor junior SREs and drive adoption of SRE best practices across teams.
