Job Summary
We are seeking an experienced Infra Dev Specialist with 6 to 10 years of experience in SRE, Grafana, ELK, Dynatrace AppMon, and Splunk. The ideal candidate will have a strong background in the Cards and Payments domain. This hybrid role requires a proactive individual who can work effectively in a dynamic environment, ensuring the reliability and performance of our infrastructure.
Responsibilities
Participate in sprint planning and create a backlog of SRE user stories
Work with the team to define SRE metrics such as SLA, SLO, SLI, RTO, RPO, MTTD, MTTR, and error budget
Track outstanding defects
Create dashboards and alerts required for SRE Metrics compliance analysis
CTBSREQA Tune app monitoring
Assess current state and identify opportunities for automation and toil reduction
Design and implement a framework to categorize and prioritize tasks that can be fully automated, semi-automated, or avoided
Work with SRE leads and Ops leads to create a backlog of automation user stories and operate in sprint mode
Maintain self-healing scripts and validate them in QA before production
Linux and Windows server administration and maintenance, patching in sprint, and applying and managing licenses and certificates and their renewals in the QA environment
Implement network infrastructure consisting of servers, networking, backup and disaster recovery software, and network devices
Advanced knowledge of commonly used concepts, practices, and procedures within technical support and Windows administration
Advanced knowledge of app servers, web servers, and databases (starting and stopping servers, reviewing logs, etc.)
Intermediate scripting skills and PowerShell scripting and automation experience
Experience with monitoring tools: Dynatrace, Glassbox, and Grafana
Intermediate knowledge of SCM tools such as ServiceNow and Jira
Prioritize workflow, document processes, organize diverse material, and handle multiple competing and changing priorities
Conduct review calls with the PNC Infra and Application teams to understand the current state
Install Certificates
Perform validation, following procedures, after Agile patching of infrastructure
Manage incidents, acting as a critical incident manager: troubleshooting, escalating, and resolving issues following a time-based incident management process
Perform manual and automated health checks and communicate overall environment health in communication channels, including sending a turnover email at the end of shift
Perform change validation following procedures
Perform OpenShift workload management during bulk load operations
Support OpenShift issues: restart services, troubleshoot errors, and resolve them
Create and update monitoring dashboards
Create and update alerting and automation solutions
Take additional responsibilities as needed
Work Jira tickets for assigned tasks
Understand the current infrastructure: VMs, OpenShift, and Kafka streaming
Use and understand monitoring and logging tools: Dynatrace, Logscale, Grafana, OpenShift Console, etc.