Title : Observability Platforms and SRE Engineering

The Company : The World of Kotak product suite spans cross-banking assets, one-stop banking services, securities, and investment banking, with insights across a wide spectrum of the major financial and banking markets.

The Team : You will work with a team of highly seasoned Observability Platform and Site Reliability Engineers who are part of the Run-The-Bank initiative to deliver engineering and technology operations excellence for the Kotak Banking Product Suite and its associated delivery platform.

The Observability Platforms and SRE team is a group of experts who develop, maintain, and scale Observability Platform solutions and drive engineering and automation within the Banking Solutions platform and its operations, both on-premises and in the cloud.

We are looking for a highly motivated individual to take on the role of an Observability Platform Engineer and SRE, helping implement our platforms using open-source and enterprise solutions through IaC, automated operations, and configuration management, and bringing together observability and engineering for architectural and operational excellence.

The role involves developing, testing, and validating the software and hardware systems that enable our Observability Platform, and coordinating the processes and tools that support the stability, resilience, and performance of a banking system capable of supporting multiple business requirements across an array of technologies. The Engineer will work across architecture, development, infrastructure, and vendor teams to deliver and support the Observability Platform and the SRE-guided processes and tools that support the banking systems.

Impact : The team has an opportunity to advocate for and participate in building engineering services that are resilient, optimally monitored and alerted, and capable of self-healing through reliability engineering practices, using software and runbook automation tools to deliver world-class banking and related content globally.

• Observability Platform engineers will implement site-wide Observability solutions for metrics, logs, traces, alerting, and monitoring, to be used by development and business teams across the org to monitor their systems and applications. Site Reliability Engineers (SREs) are responsible for keeping all user-facing services, user journeys, and other Kotak production systems running smoothly.

• These engineers should be a blend of software engineer and pragmatic systems engineer, bringing operational discipline, engineering principles, mature automation, and documentation to our operating environments and the associated Kotak code base.

• These engineers will have expertise in systems (networking, operating systems, storage, etc.) and will implement best-practice guidelines for stability, availability, reliability, and scalability while keeping compute and cost optimal.

• Kotak Platforms are critical applications with unique use cases and associated challenges that will need to be optimized over time through re-engineering and revised tools and practices.

What’s in it for you / Role : An Observability Platform Engineer and SRE is ultimately accountable for building, maintaining, and scaling an Observability Platform that can be used by various systems across the org. They are also accountable for system reliability, resiliency, and scalability, and for reducing time to market by improving end-to-end service and reducing technical debt. We seek leaders who are passionate about observability and system reliability to influence and drive the strategic platform mission and maturity.

Your mission will be to ensure our services are fast, highly available, and efficient, scaling optimally during peak business traffic and load. Your focus will be to solve production problems across the stack, all the way up to the edge. You will gain the critical domain knowledge needed to troubleshoot symptoms that impair system health and lead to performance degradation or service outages. The position requires the flexibility to take a holistic approach to troubleshooting and the ability to deep dive into core technical details while working with various development, infrastructure, and vendor teams. You will build automation tools and processes for system health checks and acceptance tests that validate changes in lower environments before they reach production. The Site Reliability Engineer will ensure the system is well instrumented and highly fault tolerant, with proper metrics to report on.

Key Leadership Responsibilities:

• Influence and drive engagement on Observability and SRE practices with development, engineering and product groups to align solution delivery with technology services.

• Build quality engineering practices around automation through well-defined processes and monitoring metrics that exhibit process quality.

• Conduct transparent, effective, blameless postmortems and ensure Post Incident Reviews have a clear root cause and actions, tracked through Problem tickets to closure.

• Deliver on the availability, latency, performance, and scalability of Kotak applications by evangelizing engineering principles in the development lifecycle, with a template for fault tolerance at each level.

• Drive non-functional requirement reviews, including capacity planning, cost analysis, and instrumentation integration, to cover the complete delivery cycle.

• Define Observability and SRE initiatives and tasks, report on them to all stakeholders and the business, and build an onboarding template for new and future applications.

• Implement a metrics-driven approach to service quality targets.

Basic Qualifications : 7+ years of background in systems and solutions engineering, software development, or technology operations, with 3+ years of work experience in Systems Engineer, DevOps, and/or SRE roles.

• Experience automating infrastructure, testing, and deployments using tools such as Terraform and CFT with Jenkins, Ansible, Chef, and other industry-recognized tools to deliver Infrastructure as Code.

• Relevant work experience or familiarity with languages and web technologies (Python, Java, C, C++, ASP.NET, JavaScript, Go, etc.).

• Experience with two or more scripting languages such as Python, Perl, Unix shell, PowerShell, Groovy, etc.

• Experience with AWS technologies: VPC, EC2, EKS, ELB, RDS, Lambda, SES, SNS, Containers, etc.

• Experience with identity management systems and standards such as SAML, OAuth, MFA, etc.

• CI/CD delivery using code and configuration management automation tools such as GitHub, VSTS, Ansible, DSC, Puppet, Ambari, Chef, Salt, Jenkins, Maven, etc.

• Delivery using modern methodologies, especially SAFe Agile, Lean, etc.

• Experience with networking protocols, CDN, App acceleration, Load Balancers, DNS, VPN, PaaS, IaaS, etc.

• Experience troubleshooting networking protocols such as TCP/IP, HTTPS, TLS, and WebSockets, as well as multicast and broadcast messaging.

• Experience with cloud infrastructure, storage, platforms, and data, and with containers and virtualization (Kubernetes, Docker).

• Experience with monitoring and observability tools such as Grafana, Prometheus, Datadog, Splunk, AppDynamics, New Relic, Nagios, etc.

Preferred Qualifications:

• Bachelor's/Master’s Degree in Computer Science, Information Systems, or equivalent

• AWS Certified Solutions Architect – Professional/Associate

• Good leadership skills, with the ability to lead a team.

• Good communication skills and a sense of ownership and drive.

• Have a software-centric mindset and can understand the full software stack – and beyond.

• Embrace automation over manual effort, enjoy debugging complex problems, and view problems as opportunities to improve.

• Experience designing, building, and operating large-scale production systems

• Experience working in enterprise-scale internal or customer-centric projects.

• Experience working closely with development & engineering teams.

• Good understanding of software development lifecycle (SDLC) and Software Testing in an Agile/Scrum framework.

• Strong analytical thinking, problem solving, oral and written communication skills.

• Experience working with multiple stakeholders and vendors at various levels.

• Understanding of SQL and databases; should be comfortable writing SQL queries.

• Hands-on experience with operational automation using any automation framework.

• Good knowledge of working with SOAP and REST services and SOA.

• Knowledge of testing in continuous integration/DevOps models is a plus.

• Understanding of cloud technologies such as AWS and Azure, microservices, and containers.

• Experience in DevOps, Big Data testing, IoT, and cloud will be an added advantage.

• Experience automating infrastructure, testing, and deployments using Terraform and CFT with Ansible, Rundeck, Autosys, and Jenkins to deliver Infrastructure as Code.

• Experience working with the Rundeck tool (Design, Setup, Deployment, Automation & Integration)

• Terraform / Kubernetes / Ansible expertise a plus

Responsibilities:

• Maintain a 99.99% SLA for the Banking Platform and applications.

• Troubleshoot and resolve incidents, and use problem management to bring about service improvement, with automation driving resiliency and stability.

• Restore service through standard automated tools and engineering processes to reduce downtime and improve our SLA/SLI/SLO metrics.

• Create production and migration schedules for large projects, with timelines and milestones.

• Develop and leverage AWS tools and services to manage and automate key operations capabilities.

• Proactively ensure the highest levels of systems and infrastructure availability

• Monitor and test application performance for potential bottlenecks, identify possible solutions and work with developers to implement those fixes.

• Write and maintain custom scripts to increase system efficiency and reduce human intervention time on tasks.

• Improve alerting and monitoring quality, reduce alarm noise, and close observability gaps.

• Optimize cloud costs and carry out capacity planning analysis.

• Reduce operational exposure, speed up incident recovery, and implement resiliency and remediation plans.

• Identify and correct problems arising from audit and compliance findings.

• Liaise with vendors and other IT personnel for problem resolution
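
To give a sense of what the 99.99% SLA target in the list above implies, here is a minimal sketch that converts an availability percentage into an allowed-downtime budget. The function name and the 30-day and 365-day windows are assumptions chosen for illustration; they are not figures from this posting.

```python
def allowed_downtime_minutes(sla_percent: float, window_days: float) -> float:
    """Return the downtime budget, in minutes, for an availability target.

    sla_percent is the target availability (for example 99.99) and
    window_days is the length of the measurement window (an assumption here).
    """
    if not 0.0 <= sla_percent <= 100.0:
        raise ValueError("sla_percent must be between 0 and 100")
    window_minutes = window_days * 24 * 60
    # The fraction of the window that may be unavailable is 1 - availability.
    return window_minutes * (1.0 - sla_percent / 100.0)


if __name__ == "__main__":
    for days in (30, 365):
        budget = allowed_downtime_minutes(99.99, days)
        print(f"99.99% over {days} days allows about {budget:.1f} minutes of downtime")
```

Under these assumptions the budget works out to roughly 4.3 minutes per 30 days, or about 52.6 minutes per year, which is why automated detection and restoration matter at this target.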

Performance Indicators : Observability Platform and Site Reliability Engineers have the following performance indicators:

• Platform adoptability, availability, scalability and performance

• Tech Dashboard

• Site Availability, Performance

• Mean Time to Detection

• Mean Time to Resolution

• Mean Time Between Failures

• Mean Time to Production

• Disaster Recovery Time to Recovery

• Change Success / Failure Metrics
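
As an illustration of how the mean-time indicators above might be computed, the sketch below derives Mean Time to Detection, Mean Time to Resolution, and Mean Time Between Failures from a list of incident records. The `Incident` fields, the `summarize` function, and the sample dates are all hypothetical, and definitions vary between teams: here resolution time is measured from the start of the fault, and MTBF is taken as total uptime in the window divided by the number of incidents.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    started: datetime    # when the fault began
    detected: datetime   # when monitoring or a person noticed it
    resolved: datetime   # when service was restored


def summarize(incidents: list[Incident], window: timedelta) -> dict[str, float]:
    """Return MTTD, MTTR, and MTBF in minutes for one observation window."""
    if not incidents:
        raise ValueError("at least one incident is required")
    mttd = mean((i.detected - i.started).total_seconds() for i in incidents) / 60
    mttr = mean((i.resolved - i.started).total_seconds() for i in incidents) / 60
    downtime = sum((i.resolved - i.started).total_seconds() for i in incidents) / 60
    # Assumed definition: uptime in the window divided by the failure count.
    uptime = window.total_seconds() / 60 - downtime
    return {"mttd_min": mttd, "mttr_min": mttr, "mtbf_min": uptime / len(incidents)}


if __name__ == "__main__":
    # Hypothetical incidents within a 30-day window.
    sample = [
        Incident(datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 5),
                 datetime(2024, 1, 5, 10, 35)),
        Incident(datetime(2024, 1, 20, 14, 0), datetime(2024, 1, 20, 14, 15),
                 datetime(2024, 1, 20, 15, 5)),
    ]
    print(summarize(sample, timedelta(days=30)))
```

Keeping the three timestamps per incident makes it possible to see whether detection or remediation dominates the recovery time.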

Soft Skills : Communication is core to the success of this role

• Evangelize adoption and use of tools, processes, and technologies

• Lead engagements to encourage collaboration within and across teams

• Showcase the roadmap and engagement model to relevant stakeholders through write-ups, teams groups, and webinars

• Documentation is core to maintaining up-to-date information on the use of tools, processes, and methodologies (e.g., wiki posts, Confluence write-ups)

• Create internal training programs for new staff and for upskilling the existing team

• Demonstrate humility, trust, and transparency in the way we interact with individuals

Dev Ops Engineering III-SUPPORT SERVICES-Applications-CTB

Kotak Mahindra Bank
