Job Summary
Site Reliability Engineer with proficiency in cloud DevOps and application observability.
Responsibilities
Apply technical knowledge and problem-solving methodologies to projects of moderate scope, with a focus on improving data and systems running at scale, and ensure end-to-end monitoring of applications.
Resolve most nuances and determine the appropriate escalation path.
Build, support, monitor, and automate web products on private cloud infrastructure.
Demonstrate and champion site reliability culture and practices, and exert technical influence throughout your team.
Drive initiatives to improve the reliability and stability of web hosting platforms, using data-driven analytics to improve service levels.
Collaborate with team members to identify comprehensive service level indicators, and with stakeholders to establish reasonable service level objectives and error budgets with customers.
Demonstrate a high level of technical expertise within one or more technical domains, and proactively identify and solve technology-related bottlenecks in your areas of expertise.
Collaborate with technical experts, key stakeholders, and team members to resolve complex problems.
Provide comprehensive and ongoing guidance, tools, and solutions to support the firm's growth.
Work toward becoming an expert on the applications and platforms under your influence while understanding their interdependencies and limitations.
Document and share knowledge within your organization via internal forums and communities of practice.
Strong knowledge of one or more infrastructure disciplines, such as hardware, networking terminology, databases, storage engineering, deployment practices, integration, automation, scaling, resilience, and performance assessments.
Experience with multiple cloud technologies, with the ability to operate in and migrate across public and private clouds.
Drive to develop infrastructure engineering knowledge of additional domains, data fluency, and automation knowledge.
Cloud exposure: understanding of, and working experience with, resiliency, scalability, observability, monitoring, etc.
Understanding of data objects and structures, and the ability to write SQL queries based on tickets as needed.
Experience as an SRE on complex, mission-critical applications involving a multitude of components of varying technical generations.
Deep proficiency in reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices, with the ability to implement these practices within an application or platform.
Strong knowledge of site reliability culture and principles, with demonstrated ability to implement site reliability within an application or platform.
Strong knowledge of and experience in observability, monitoring, alerting, and telemetry collection using tools such as CloudWatch, Grafana, Dynatrace, Prometheus, Splunk, etc.
Fluency in at least one programming language or related technology, such as Python, Terraform, Ansible, Java, Spring Boot, shell scripting, .NET, etc.
Certifications Required
SRE-related certifications are preferred but not mandatory.