Site Reliability Architect

Pattern

2 - 5 years

Pune

Posted: 3/27/2025

Job Description

Title: Site Reliability Architect 

Job Information 

The Site Reliability Architect (SRA) is responsible for designing and implementing scalable, reliable, and efficient systems that support the organization's software applications and services. As a key technical leader, you will work closely with development, operations, and product teams to ensure that systems are designed with reliability, performance, and scalability in mind. You will also play a crucial role in establishing best practices for site reliability engineering (SRE) and fostering a culture of operational excellence. 

Essential Duties and Responsibilities 

Design and implement robust, scalable, and high-availability systems that meet business and technical requirements. 

Collaborate with software engineering teams to integrate reliability into the software development lifecycle, ensuring that applications are built with operational excellence in mind. 

Define and maintain service level indicators (SLIs), service level objectives (SLOs), and service level agreements (SLAs) to measure system performance and reliability and to set targets for them. 

Lead incident response efforts, including post-mortem analysis and root cause investigations, to improve system reliability and prevent future incidents. 

Automate operational processes to improve efficiency and reduce manual intervention, leveraging tools and practices such as Infrastructure as Code (IaC). 

Monitor system performance and reliability using appropriate metrics and monitoring tools, proactively identifying and addressing potential issues. 

Advocate for and implement best practices in site reliability engineering, including capacity planning, disaster recovery, and incident management. 

Train and mentor engineering and operations teams on SRE principles and practices, fostering a culture of continuous improvement. 

Qualifications 

Bachelor's or Master's degree in Computer Science, Engineering, or a related field. 

8+ years of experience in software engineering, systems engineering, or site reliability engineering. 

Strong understanding of cloud computing platforms (e.g., AWS, Azure, Google Cloud) and of containerization and container orchestration technologies (e.g., Docker, Kubernetes).

Experience with infrastructure-as-code, configuration management, and automation tools (e.g., Terraform, Ansible, Puppet). 

Proficient in programming and scripting languages (e.g., Python, Go, Bash) for automation and tool development. 

Extensive knowledge of monitoring and logging tools (e.g., Prometheus, Grafana, ELK Stack) and practices. 

Solid understanding of networking concepts, distributed systems, and microservices architecture. 

Excellent problem-solving skills and the ability to work effectively under pressure. 

Required Skills and Abilities 

● Leadership Skills: Ability to lead cross-functional teams and drive initiatives that enhance system reliability and performance. 

● Interpersonal Skills: Self-motivated team player who builds trust; action- and results-oriented; open and collaborative style; comfortable working in a dynamic environment. 

● Communication Skills: Strong written, oral, and presentation skills, with the ability to effectively communicate technical concepts to non-technical stakeholders. 

● Attention to Detail: Thoroughness in accomplishing tasks, ensuring accuracy and quality in all aspects of work. 

● Analytical Skills: Strong analytical and troubleshooting skills, with the ability to think critically and make data-driven decisions. 

Our Core Values 

● Data Fanatics: Our edge is always found in the data 

● Partner Obsessed: We are obsessed with partner success 

● Team of Doers: We have a bias for action 

● Gamechangers: We encourage innovation

Pattern is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.

About Company

Pattern is a data-driven e-commerce growth platform that helps brands optimize their online sales and performance across marketplaces and e-commerce channels such as Amazon, Walmart, and Shopify. The company uses advanced analytics, artificial intelligence (AI), and machine learning to provide actionable insights and strategies that drive growth, improve product visibility, and increase profitability. Pattern's services include inventory management, pricing optimization, advertising campaigns, and market intelligence, allowing brands to make data-backed decisions. By leveraging real-time data, Pattern helps brands scale their operations and navigate the complexities of digital commerce so that they stay competitive in the rapidly evolving e-commerce landscape.