Role Overview

We are looking for an experienced Senior Site Reliability Engineer (SRE) to ensure the reliability, scalability, and performance of our production systems. The ideal candidate will have strong troubleshooting skills and hands-on experience with messaging queues, in-memory queues, Kubernetes, and deployment automation, along with expertise in Infrastructure as Code and microservices architecture.

Key Responsibilities

Application Troubleshooting: Diagnose and resolve complex application issues in production environments.
Queue Management: Work with messaging queues (Kafka, RabbitMQ) and in-memory queues (Redis) to maintain system performance.
Deployment & Automation: Manage deployments using CI/CD pipelines and automation tools.
Kubernetes Administration: Maintain and optimize Kubernetes clusters for high availability and scalability.
Production Support: Provide support for critical production systems, ensuring uptime and reliability.
Monitoring & Alerting: Implement and maintain monitoring solutions (Prometheus, Grafana, ELK stack).
Incident Management: Lead root cause analysis and post-mortem reviews for production incidents.

Must-Have Skills

Strong experience in troubleshooting application issues in distributed systems.
Hands-on experience with messaging queues (Kafka, RabbitMQ) and in-memory queues (Redis).
Proficiency in Kubernetes and container orchestration.
Experience with CI/CD pipelines and deployment automation.
Solid understanding of Linux systems, networking, and cloud platforms (AWS, Azure, or GCP).
Infrastructure as Code experience (Terraform, Ansible).
Knowledge of microservices architecture.
Strong scripting and automation skills (Python, Bash, or similar).
Database expertise: Working experience with MySQL, Oracle, or MongoDB.

Nice-to-Have

Experience integrating with WhatsApp Business Messaging APIs.
Experience with security best practices in production environments.
Familiarity with observability tools and performance tuning.

Key Performance Indicators (KPIs)

System Uptime: Maintain production uptime of 99.9% or higher.
Incident Response Time: Respond to critical incidents within 15 minutes and resolve within SLA.
Deployment Success Rate: Achieve 98%+ successful deployments.
Mean Time to Recovery (MTTR): Reduce MTTR for production issues to under 60 minutes.
Automation Coverage: Automate 80%+ of repetitive operational tasks.
Monitoring & Alerting: Ensure 100% coverage of critical services with proactive alerting.
Infrastructure as Code Adoption: Maintain 100% IaC compliance for infrastructure changes.

Why join us?

Impactful Work: Solve meaningful real-life business problems by building cutting-edge products.
Tremendous Growth Opportunities: Work in a fast-growing CPaaS business with a product-driven culture and scope for continuous professional development.
Innovative Environment: Be part of a world-class team that loves solving tough problems and values innovation.

Tanla is an equal opportunity employer. We champion diversity and are committed to creating an inclusive environment for all employees.

Site Reliability Engineer (CPaaS)

Karix
