Staff Site Reliability Engineer

Pocket FM

2 - 5 years

Bengaluru

Posted: 26/02/2026

Job Description


Pocket FM is a leading audio entertainment platform that brings engaging, serialized fiction to millions of listeners across genres like romance, thriller, fantasy, and more. With over 130 million users globally and strong traction in markets like the US and Europe, we're revolutionizing storytelling through audio.

Our unique model combines free listening with micropayments for premium content, powering strong business growth. In FY25, we reached an ARR of INR 2,000 crore, with over 100,000 hours of content on the platform. We're also at the forefront of innovation, leveraging AI-generated content to scale efficiently.


Role Overview

We are looking for a Staff SRE to lead reliability engineering efforts while driving AI-native solutioning and platform strategy. This role requires a blend of deep SRE expertise, distributed systems knowledge, applied AI/ML understanding, and strong security fundamentals to build resilient, scalable, intelligent, and secure infrastructure.

You will play a key role in shaping how AI-powered systems are designed, deployed, monitored, optimized and secured across the organization.


The Role: What You Build and Own


SRE & Platform Engineering

  • Design, build, and operate highly reliable, scalable distributed systems
  • Define and implement SLIs, SLOs, SLAs, and error budgets
  • Lead incident management, root cause analysis (RCA), and postmortems
  • Drive an automation-first approach for operations, deployment, and recovery
  • Improve observability (logs, metrics, tracing) across systems


AI-Native Solutioning

  • Architect and implement AI-driven operational workflows (AIOps)
  • Build systems leveraging LLMs, intelligent automation, and predictive analytics
  • Integrate AI into monitoring, alerting, anomaly detection, and remediation
  • Evaluate and adopt AI-powered developer and SRE tooling (e.g., LLM-based copilots, auto-debugging tools)


Information Security & Resilience

  • Embed security-by-design principles into infrastructure and platform architecture
  • Partner with Security teams to implement cloud security best practices (IAM, RBAC, network segmentation, encryption)
  • Lead secure configuration and hardening of Kubernetes clusters and cloud environments
  • Implement and maintain DevSecOps practices across CI/CD pipelines
  • Drive vulnerability management, patching strategy, and secure dependency management
  • Define and monitor security-related SLIs/SLOs (e.g., patch latency, vulnerability remediation time)
  • Implement runtime security, anomaly detection, and threat monitoring for AI and distributed systems
  • Ensure compliance with relevant frameworks (SOC 2, ISO 27001, GDPR, etc.)
  • Conduct security reviews and threat modeling, and participate in incident response for security events
  • Secure AI/ML systems, including model security, prompt injection mitigation, data protection, and access controls


Strategy & Leadership

  • Define and drive AI-native SRE strategy and roadmap
  • Partner with engineering, platform, product, and security teams to embed reliability and security by design
  • Mentor engineers and establish best practices for SRE + AI + Security integration
  • Lead initiatives for cost optimization, performance tuning, system resilience, and risk reduction


The Ideal Candidate: Who You Are


Experience

  • 8-12+ years in SRE / DevOps / Platform Engineering
  • Proven experience operating production-grade distributed systems at scale


Strong Experience With

  • Cloud platforms (AWS / GCP)
  • Kubernetes & container orchestration
  • Infrastructure as Code (Terraform, etc.)
  • CI/CD systems and automation frameworks


Deep Understanding Of

  • Distributed systems, scalability, and fault tolerance
  • Observability tools (Prometheus, Grafana, Datadog, OpenTelemetry)
  • Incident management frameworks and reliability engineering best practices
  • Cloud security architecture and DevSecOps principles


Programming

  • Strong programming experience in Python / Go


Your AI/ML Toolkit


Hands-On Experience With

  • LLMs (OpenAI, open-source models, etc.)
  • AI/ML pipelines or inference systems


Understanding Of

  • Prompt engineering, embeddings, vector databases
  • AI-driven automation or AIOps platforms
  • Secure AI system design and model lifecycle governance


Experience Integrating AI Into

  • Monitoring / alerting
  • Incident response
  • Developer productivity workflows
  • Security monitoring and anomaly detection
