About the Job

The Site Reliability Engineering (SRE) team is responsible for ensuring the reliability, scalability, and performance of large-scale telecom and CPaaS platforms. This role combines software engineering and systems operations to build resilient, observable, and automated infrastructure that supports high-throughput messaging services. The team operates in a 24/7 environment and works closely with Engineering, CX and Products to maintain carrier-grade service reliability.

What you'll be responsible for

Ensure high availability, performance, and reliability of CPaaS production systems spread across multiple locations, hosted in the cloud and in data centers.
Own and improve SLIs, SLOs, and SLAs for messaging platforms and supporting services.
Monitor system health, latency, TPS, error rates, and delivery metrics using observability tools.
Participate in on-call rotations and handle production incidents with a focus on fast recovery and root cause analysis.
Deploy, configure, and optimize systems for high-throughput messaging across multiple channels.
Troubleshoot telecom-specific issues including DLR failures, encoding problems, TPS drops and routing issues.
Work directly with multiple teams for integrations, testing, and incident resolution.
Perform packet-level analysis using tcpdump and Wireshark to diagnose network and protocol-level issues.
Write and maintain shell scripts and automation to eliminate repetitive operational tasks and reduce human intervention.
Contribute to infrastructure automation using tools like Ansible and CI/CD pipelines where applicable.
Improve deployment, configuration, and rollback processes for messaging services.
Design and enhance monitoring, alerting, and dashboards using tools such as Datadog, Site24x7, ELK and Grafana.
Administer and troubleshoot Linux-based servers in production environments.
Manage and optimize MySQL and MongoDB databases including performance tuning, backups, and recovery.
Work on APIs and webhooks across products and services, including enhancements and troubleshooting.
Maintain web and application servers such as Apache, Nginx, and JBoss (WildFly).
Support cloud-based and virtualized environments with exposure to auto-scaling and containerization concepts.
Collaborate with engineering teams on release planning, production deployments, and post-release validation.
Lead or contribute to incident response and RCA, focusing on long-term reliability improvements.
Track issues, changes, and reliability work using Jira and related tools.

What you'd have

B.Tech / B.E in Computer Science or related field with 2-3 years of experience in SRE, DevOps, telecom, or CPaaS operations.
Hands-on experience with SMS gateways and messaging workflows.
Solid understanding of Linux systems, networking fundamentals, and production troubleshooting.
Strong experience with MySQL & MongoDB administration, queries, and performance optimization.
Proficiency in shell scripting and a mindset toward automation and reliability engineering.
Hands-on experience with tcpdump, Wireshark, and protocol-level troubleshooting.
Experience with monitoring, logging, and alerting systems (Datadog, ELK, Grafana, Site24x7, etc.).
Familiarity with configuration management tools like Ansible and version control systems (Git).
Working knowledge of cloud platforms, virtualization, auto-scaling, and containerization.
Strong incident management, analytical thinking, and communication skills.
Certifications such as RHCE, AWS, or SRE-related credentials are a plus.

Site Reliability Engineer

ValueFirst
