Site Reliability Engineer
ValueFirst
3 - 5 years
Gurugram
Posted: 17/02/2026
Job Description
About the Job
The Site Reliability Engineering (SRE) team is responsible for ensuring the reliability, scalability, and performance of large-scale telecom and CPaaS platforms. This role combines software engineering and systems operations to build resilient, observable, and automated infrastructure that supports high-throughput messaging services. The team operates in a 24/7 environment and works closely with Engineering, CX and Products to maintain carrier-grade service reliability.
What you'll be responsible for
- Ensure high availability, performance, and reliability of CPaaS production systems spread across multiple locations and hosted in cloud and data center environments.
- Own and improve SLIs, SLOs, and SLAs for messaging platforms and supporting services.
- Monitor system health, latency, TPS, error rates, and delivery metrics using observability tools.
- Participate in on-call rotations and handle production incidents with a focus on fast recovery and root cause analysis.
- Deploy, configure, and optimize systems for high-throughput messaging across multiple channels.
- Troubleshoot telecom-specific issues including DLR failures, encoding problems, TPS drops and routing issues.
- Work directly with multiple teams for integrations, testing, and incident resolution.
- Perform packet-level analysis using tcpdump and Wireshark to diagnose network and protocol-level issues.
- Write and maintain shell scripts and automation to eliminate repetitive operational tasks and reduce human intervention.
- Contribute to infrastructure automation using tools like Ansible and CI/CD pipelines where applicable.
- Improve deployment, configuration, and rollback processes for messaging services.
- Design and enhance monitoring, alerting, and dashboards using tools such as Datadog, Site24x7, ELK and Grafana.
- Administer and troubleshoot Linux-based servers in production environments.
- Manage and optimize MySQL and MongoDB databases including performance tuning, backups, and recovery.
- Work on APIs and webhooks across products and services, including their enhancement and troubleshooting.
- Maintain web and application servers such as Apache, Nginx, and JBoss (WildFly).
- Support cloud-based and virtualized environments with exposure to auto-scaling and containerization concepts.
- Collaborate with engineering teams on release planning, production deployments, and post-release validation.
- Lead or contribute to incident response and RCA, focusing on long-term reliability improvements.
- Track issues, changes, and reliability work using Jira and related tools.
What you'd have
- B.Tech / B.E in Computer Science or a related field with 2-3 years of experience in SRE, DevOps, telecom, or CPaaS operations.
- Hands-on experience with SMS gateways and messaging workflows.
- Solid understanding of Linux systems, networking fundamentals, and production troubleshooting.
- Strong experience with MySQL & MongoDB administration, queries, and performance optimization.
- Proficiency in shell scripting and a mindset toward automation and reliability engineering.
- Hands-on experience with tcpdump, Wireshark, and protocol-level troubleshooting.
- Experience with monitoring, logging, and alerting systems (Datadog, ELK, Grafana, Site24x7, etc.).
- Familiarity with configuration management tools like Ansible and version control systems (Git).
- Working knowledge of cloud platforms, virtualization, auto-scaling, and containerization.
- Strong incident management, analytical thinking, and communication skills.
- Certifications such as RHCE, AWS, or SRE-related credentials are a plus.
