Sr. Manager, Site Reliability Engineering (SRE)

Calix

5 - 10 years

Bengaluru

Posted: 07/06/2025

Job Description

Your Mission as SRE Manager
As an SRE manager, you are responsible for the availability and reliability of Calix’s cloud. At Calix, Site Reliability Engineering combines software and systems engineering to build and run large-scale, distributed, fault-tolerant systems. You will lead a team of Site Reliability Engineers and oversee the reliability, scalability, and maintainability of Calix's critical infrastructure. This includes building and maintaining automation tools, managing on-call rotations, collaborating with development teams, and ensuring systems meet service level objectives (SLOs). Throughout, you will prioritize continuous improvement and the health and stability of the Calix platform, leveraging tools like Terraform, observability frameworks from the Grafana Labs ecosystem, and Google Cloud Platform.

Key Responsibilities

SRE Leadership:

Manage and mentor a team of SREs: run weekly sprints, provide technical guidance, foster a collaborative environment to achieve team goals, and build a culture of high performance. Collaborate with your peers in Platform Engineering and Application Development to ensure the reliability of what gets deployed to production. This is a hands-on role that requires coding, code reviews, and strong technical guidance.

Monitoring and Alerting:

Use monitoring systems to proactively identify potential issues and act on them before they become disruptive. Eliminate red blindness and ensure high-fidelity, actionable alerts by adhering to best practices for alert implementations and thresholds. Continually optimize for better observability and actionable alerting.

Reliability Engineering:

Build a culture of reliability by collaborating with Platform Engineering and Application Development teams from design time through implementation and testing. Enforce reliability and resilience by ensuring systems are built to be highly available (HA) through proper design, code reviews, and rigorous testing of failure modes.

Performance and Scalability Optimization:

Identify bottlenecks using profilers and distributed tracing frameworks. Implement performance improvements across Calix's infrastructure. Work cross-functionally with development teams to guide them toward better performance, scalability, and cost efficiency.

Capacity Planning:

Proactively monitor system performance and capacity, identifying potential bottlenecks and scaling systems as needed. Calix is constantly growing, so making sure that we scale appropriately is an area of constant focus.

Automation Development:

Drive the development and implementation of automation tools to streamline operations, including deployment pipelines, monitoring, and self-healing mechanisms.

Incident Management:

Participate in an on-call incident manager rotation. Lead incident response, drive root cause analysis, hold blameless post-mortem reviews, and work cross-functionally to implement corrective and preventative actions. Ensure that incidents never repeat.

Implement and Enforce SLIs/SLOs:

Ensure that all service endpoints and critical user journeys are monitored, visualized, have alerts, and have associated SLOs. Work closely with development teams, product owners, and other stakeholders to ensure alignment and enforcement of SLOs and error budgets.

On-Call Management:

Establish and manage on-call rotations for the SRE team, ensuring timely response to and resolution of system alerts and incidents. You will blend skills and experience levels to ensure a well-rounded team of responders capable of handling a diverse range of production issues. Clearly define the duties of on-call staff, including their responsibilities for monitoring alerts, maintaining playbooks, eliminating toil, following handover protocols, troubleshooting incidents, escalating issues, and collaborating with other teams.

Qualifications:

Strong experience as an SRE manager with a proven track record of managing large-scale, highly available systems.
Expertise in cloud computing platforms (preferably Google Cloud Platform).
Knowledge of core operating system principles, networking fundamentals, and systems management.
Programming skills in languages like Python and Go.
Proven experience building and leading SRE teams, including hiring, coaching, and performance management.
Deep understanding of, and expertise in, building and maintaining scalable open-source monitoring tools and backend storage.
Experience with incident management processes and best practices.
Excellent communication and collaboration skills to work with cross-functional teams.
Knowledge of SRE principles, including error budgets, fault analysis, and reliability engineering concepts.

Education:

B.S. or M.S. in Computer Science or equivalent field.

About Company

Calix, Inc. is a cloud and software platform company headquartered in San Jose, California. It specializes in providing cloud-based software, systems, and services that enable broadband service providers to simplify operations, deliver exceptional subscriber experiences, and grow their businesses. Calix’s solutions focus on empowering communication service providers to optimize their networks, leverage advanced analytics, and create personalized customer experiences. Known for its innovation in broadband technology, Calix helps its clients transition to next-generation networks, ensuring scalability, efficiency, and improved customer satisfaction.
