Site Reliability Engineer - Job Description
Location - Chennai and Hyderabad
Experience - 9 to 14 Years
Key Responsibilities:
· Design, implement, and/or refine Service Management processes (Monitoring, Incident, Problem, Capacity, Change and Release, and Service Level Management).
· Track system health, performance, and reliability via monitoring and observability platforms; implement proactive alerting mechanisms to detect anomalies and respond swiftly to incidents.
· Act as a point of escalation for complex incidents, collaborating with senior engineers and management to ensure effective resolution.
· Establish and enforce change control and release management processes to ensure smooth and controlled deployment of system changes.
· Conduct post-incident analyses to identify root causes and implement actions to prevent recurrence and improve system resilience.
· Perform regular system testing to identify vulnerabilities and validate disaster recovery plans.
· Partner with development teams to improve services through rigorous testing and release procedures.
· Participate in system design consulting, platform management, and capacity planning.
· Integrate reliability practices into CI/CD pipelines to automate testing, quality assurance, and deployment processes.
· Foster a culture of collaboration between development and operations teams, promoting shared ownership and accountability for system reliability.
· Create sustainable systems and services through automation and uplifts.
· Balance feature development speed and reliability with well-defined service-level objectives.
· Continuously evaluate and enhance system reliability, scalability and performance. Identify areas for improvement and implement solutions to optimize processes and reduce manual toil.
· Define, track, and monitor SLAs/SLOs to measure and improve system reliability.
· Collaborate with cross-functional teams to ensure scalable and adequate resource allocations and optimize cost efficiency.
Required Skills and Qualifications:
· Bachelor’s degree (or equivalent) in computer science or a related discipline.
· Proven process definition and implementation experience, leveraging ITIL best practices.
· ITIL v3 Intermediate or Expert certification at minimum (mandatory).
· Implementation experience with ITSM/ESM tools (e.g., ServiceNow, Remedy, Jira).
· Strong DevSecOps skills with implementation experience; Foundation or Practitioner certification will be an advantage.
· Coding experience beyond simple scripts (Python, Java, C/C++, and JavaScript).
· Knowledge of Linux/Unix systems administration, with strong troubleshooting skills.
· Knowledge of relational and NoSQL databases and distributed storage systems. Proficiency in database administration, query optimization, and data replication.
· Familiarity with incident management and collaboration tools such as Jira, PagerDuty, Slack, or ServiceNow.
· Expertise in performance monitoring and analysis tools such as New Relic, AppDynamics, or Datadog.
· Familiarity with configuration management tools like Ansible, Puppet, or Chef.
· Knowledge of observability platforms (e.g., Dynatrace, SolarWinds), monitoring systems (e.g., Prometheus, Nagios), and log management tools (e.g., ELK stack, Splunk).
· Strong analytical thinking and problem-solving abilities to identify patterns, troubleshoot issues, and propose effective solutions.
· Proactive approach to identifying problems, performance bottlenecks, and areas for improvement.
· Previous success in technical engineering.