Job Summary
We seek a highly skilled, talented, and experienced Site Reliability Engineer. The ideal candidate will work closely with software engineers, operations teams, and product managers to automate operations, monitor performance, and proactively resolve issues. As a Site Reliability Engineer (SRE), the individual will play a critical role in maintaining the stability, scalability, and reliability of our systems and services. The primary focus will be improving service availability, reducing the time to detect issues, and creating efficient workflows to handle production challenges.
Responsibilities
Ensure high availability and performance of production systems by monitoring and responding to issues.
Take ownership of major incidents, manage communication, and perform post-incident root cause analysis to prevent future occurrences.
Analyze and optimise system performance by fine-tuning services, hardware, and software configurations.
Implement monitoring with tools such as Azure Monitor and Log Analytics, using KQL queries to detect system performance issues and capacity bottlenecks.
Collaborate with DevOps and software teams to maintain and improve Continuous Integration (CI) and Continuous Delivery (CD) pipelines.
Work closely with software development teams to design scalable and maintainable services and implement infrastructure-as-code (IaC) solutions.
Implement security best practices for cloud infrastructure and production environments, ensuring that applications are secure, reliable, and compliant with industry standards.
Create and maintain comprehensive system documentation covering troubleshooting, runbooks, and procedures.
Collaborate with internal stakeholders to align solutions with business needs.
Work as part of an existing team of DevOps engineers, leading and mentoring where necessary.
Requirements
Excellent problem-solving skills and the ability to work effectively in a collaborative team environment.
Strong knowledge of Azure and experience working with it.
Experience with configuration management tools like Ansible, Chef, Puppet, or SaltStack.
Familiarity with containerisation and orchestration technologies such as Docker and Kubernetes.
Strong scripting skills in PowerShell, Azure CLI, or equivalent for automation and system management.
Hands-on experience with monitoring and alerting tools like Prometheus, ELK Stack, or Nagios.
Experience with version control systems, preferably Git.