Job Summary
Incident Management: Lead and manage high-priority incident responses with urgency and efficiency. Analyze, troubleshoot, and resolve complex system issues spanning multiple technology stacks.
FMEA: Perform hot spot analysis and service map analysis to identify potential risks and vulnerabilities to faults.
NFRs and Quality Gates: Refine and drive NFRs and quality gates, including safe release patterns and reliability and resiliency requirements in releases.
Responsibilities
Full-Stack Expertise: Use extensive knowledge of both front-end and back-end technologies to understand and debug system issues quickly. Implement solutions that encompass all layers of the application and infrastructure stack.
Enterprise Systems Knowledge: Use a deep understanding of enterprise-level network and middleware technologies to find root causes of incidents and provide sustainable solutions.
Problem Management: Drive continuous improvement initiatives by analyzing incident trends, finding recurring issues, and implementing proactive measures to enhance system reliability and performance. Review and refine SRE standards and processes, focusing on incident response and reducing toil.
Collaboration: Work closely with development, operations, and other IT teams to ensure cohesive and effective incident management. Facilitate post-incident reviews and share learnings across the organization.
Automation and Tooling: Develop and implement automation tools and scripts to streamline diagnostic packages, incident response, and resolution processes. Provide feedback on enhanced monitoring and alerting systems to detect issues proactively.
Documentation: Keep detailed documentation of incidents, resolutions, and system changes to ensure knowledge sharing and compliance with IT governance standards.
Observability & Self Heal: Provide leading indicators and drive observability maturity. Drive development of self-healing capabilities with various teams. Assess and measure the performance, resiliency, and reliability of applications, with a focus on observability and monitoring practices such as SLAs and SLOs.
Experience with Dynatrace
Configure monitoring and logging of systems to gain better visibility
Help design processes that automatically evaluate system SLAs
Be proactive: identify and remediate issues before SLAs are violated
Tool-agnostic and approach-centric
Required Knowledge/Skills, Education, and Experience
8-12 years in the software industry with 4+ years in an SRE or DevOps role
Profound knowledge of full-stack technologies, legacy servers, middleware, cloud platforms (AWS), containerization technologies (e.g., Docker, Kubernetes), and databases (SQL, NoSQL)
Experience with container management and infrastructure monitoring tools
Expertise in enterprise network architectures, protocols, middleware technologies, and API management tools and processes
Programming skills in high-level languages such as Python, Java, Ruby, or JavaScript
Automation experience with scripting and API development (e.g., Ansible, Terraform, Shell, Python)
2+ years with observability tools and containerization
Preferred Knowledge/Skills, Education, and Experience
Experience with AWS, Terraform, CloudFormation, and incident tracking tools
Certifications in AWS and in observability and monitoring tools
Experience with log management tools
Ensure system reliability by returning systems to a steady state as quickly as possible