Job Summary
Assess and measure the performance, resiliency, and reliability of applications, with a focus on observability and monitoring practices such as SLAs and SLOs
Assess the current state and identify opportunities for automation and toil reduction
Work with the team to define SRE metrics such as SLA, SLO, SLI, RTO, RPO, MTTD, MTTR, and error budget
Responsibilities
Assess and measure the performance, resiliency, and reliability of applications, with a focus on observability and monitoring practices such as SLAs and SLOs
Assess the current state and identify opportunities for automation and toil reduction
Work with the team to define SRE metrics such as SLA, SLO, SLI, RTO, RPO, MTTD, MTTR, and error budget
Improving the resiliency of the applications through robust cloud deployment methods
Enhancements to improve the operational efficiency of handling production incidents and problem tickets
Database hygiene activities, including retention policy implementation, patching, upgrades, disk space management, and performance tuning
Certificate management: new certificate creation, renewal, and revocation
Production and non-production environment infrastructure-level support: CPU, memory, and disk usage
Infrastructure upgrade activities such as EKS and Dynatrace upgrades
Planning and supporting DR and system resiliency events
Identifying and implementing FinOps (infrastructure cost-saving) opportunities
Messaging platform upgrades, such as MQ version and cipher upgrades
Documentation
Java, Spring Boot, microservices
Postgres, Oracle
Experience with monitoring tools such as Splunk, Dynatrace, AppDynamics, and Grafana
Containers: Kubernetes, Docker
Kafka, MQ
AWS (CloudWatch, EC2, EKS, Lambda), Terraform
GitHub, Jules, Spinnaker
Familiarity with agile methodologies and experience working in an Agile/Scrum environment.