Site Reliability Engineer

ValueFirst

3 - 5 years

Gurugram

Posted: 17/02/2026

Job Description

About the Job

The Site Reliability Engineering (SRE) team is responsible for ensuring the reliability, scalability, and performance of large-scale telecom and CPaaS platforms. This role combines software engineering and systems operations to build resilient, observable, and automated infrastructure that supports high-throughput messaging services. The team operates in a 24/7 environment and works closely with Engineering, CX and Products to maintain carrier-grade service reliability.


What you'll be responsible for

  • Ensure high availability, performance, and reliability of CPaaS production systems spread across multiple locations, hosted in both cloud and data center environments.
  • Own and improve SLIs, SLOs, and SLAs for messaging platforms and supporting services.
  • Monitor system health, latency, TPS, error rates, and delivery metrics using observability tools.
  • Participate in on-call rotations and handle production incidents with a focus on fast recovery and root cause analysis.
  • Deploy, configure, and optimize services for high-throughput messaging across multiple channels.
  • Troubleshoot telecom-specific issues including DLR failures, encoding problems, TPS drops and routing issues.
  • Work directly with multiple teams for integrations, testing, and incident resolution.
  • Perform packet-level analysis using tcpdump and Wireshark to diagnose network and protocol-level issues.
  • Write and maintain shell scripts and automation to eliminate repetitive operational tasks and reduce human intervention.
  • Contribute to infrastructure automation using tools like Ansible and CI/CD pipelines where applicable.
  • Improve deployment, configuration, and rollback processes for messaging services.
  • Design and enhance monitoring, alerting, and dashboards using tools such as Datadog, Site24x7, ELK and Grafana.
  • Administer and troubleshoot Linux-based servers in production environments.
  • Manage and optimize MySQL and MongoDB databases including performance tuning, backups, and recovery.
  • Work on APIs and webhooks across products and services, including enhancements and troubleshooting.
  • Maintain web and application servers such as Apache, Nginx, and JBoss (WildFly).
  • Support cloud-based and virtualized environments with exposure to auto-scaling and containerization concepts.
  • Collaborate with engineering teams on release planning, production deployments, and post-release validation.
  • Lead or contribute to incident response and RCA, focusing on long-term reliability improvements.
  • Track issues, changes, and reliability work using Jira and related tools.


What you'd have

  • B.Tech / B.E in Computer Science or a related field with 2 to 3 years of experience in SRE, DevOps, telecom, or CPaaS operations.
  • Hands-on experience with SMS gateways and messaging workflows.
  • Solid understanding of Linux systems, networking fundamentals, and production troubleshooting.
  • Strong experience with MySQL & MongoDB administration, queries, and performance optimization.
  • Proficiency in shell scripting and a mindset toward automation and reliability engineering.
  • Hands-on experience with tcpdump, Wireshark, and protocol-level troubleshooting.
  • Experience with monitoring, logging, and alerting systems (Datadog, ELK, Grafana, Site24x7, etc.).
  • Familiarity with configuration management tools like Ansible and version control systems (Git).
  • Working knowledge of cloud platforms, virtualization, auto-scaling, and containerization.
  • Strong incident management, analytical thinking, and communication skills.
  • Certifications such as RHCE, AWS, or SRE-related credentials are a plus.
