Looking for a Manager, Site Reliability Engineering to help us scale our systems and ensure the stability, reliability, performance, and rapid deployment of our platform. We build teams that are inclusive, collaborative, and have a strong sense of ownership for the things they build. If you have a passion and a track record for solving problems, along with strong leadership skills, this is a great fit for you.

As Manager, SRE, you will demonstrate both emerging and current technologies, methods, and processes that contribute to the evolution of software deployment processes, enhancing security, reducing risk, and improving the overall end-user experience. As part of the Technology R&D Team, you will play an integral part in advancing DevOps maturity and help build a new culture of quality and site reliability. You will continually improve our CI/CD tools, processes, and procedures. You will also be responsible for regular reporting to Senior Technology Leaders, providing updates on organizational risk exposure and risk-related issues.

What You Will Be Doing:

Set the direction and strategy for your team, and help shape the overall SRE program for the company

Support the company's growth by ensuring a robust, scalable, cloud-first infrastructure

Own site stability, performance and capacity planning

Participate early in the SDLC to ensure reliability is built in from the beginning, and create plans for successful implementations/launches

Foster a learning and ownership culture within the team and the larger organization

Ensure best engineering practices through automation, infrastructure as code, robust system monitoring, alerting, auto scaling, self-healing, etc.

Manage complex technical projects and a team of SREs

Recruit and develop staff; build a culture of excellence in site reliability and automation

Lead by example: roll up your sleeves by debugging and coding; participate in the on-call rotation and occasional travel

Represent the technology perspective and priorities to leadership and other stakeholders by continuously communicating timelines, scope, risks, and the technical roadmap

What You Will Need for this Position:

10+ years of hands-on technical leadership and people management experience

3+ years of demonstrable experience leading site reliability and performance in large-scale, high-traffic environments

Strong leadership, communication, and interpersonal skills geared to getting things done

Commitment to developing yourself and the talent within your charge, fostering and creating opportunity for the team

Architect-level understanding of one or more of the major public cloud services (AWS, GCP, or Azure), and experience using them to design secure and scalable services

Strong understanding of SRE concepts and the DevOps culture, with a focus on leveraging software engineering tools, methodologies, and concepts

In-depth understanding of automation and CI/CD processes, along with excellent reasoning and problem-solving skills

Experience with Unix/Linux environments, with a deep grasp of system internals

Experience working on large-scale distributed systems, including multi-tiered architectures

Strong knowledge of modern platforms such as Fargate, Docker, and Kubernetes

Experience working with monitoring tools (Datadog, New Relic, ELK stack, etc.) and database technologies (SQL Server, Postgres, and Couchbase preferred)

Proven breadth of understanding and experience developing solutions across multiple technologies, including networking, cloud, database, and scripting languages

Experience in prompt engineering, building AI agents, or MCP is a plus

Site Reliability Engineering Manager

People Hire Consulting
