Site Reliability Engineer

Guardian

2 - 5 years

Gurugram

Posted: 01/12/2024

Job Description

The Guardian Technology Operations Team is seeking qualified individuals who will work closely with Development and DevOps teams to ensure that systems are reliable, scalable, and performant. The role combines software engineering and IT operations to manage infrastructure and build scalable, highly reliable software systems.

Qualifications:

POSITION QUALIFICATIONS

The ideal candidate is a self-starter who is highly collaborative, open, transparent, and bottom-line focused. To be successful, this individual must bring both the energy and experience required to drive business-aligned change in a complex environment, while building a strong service relationship with their peers in IT. They will work closely with their peers, as well as the broader technology organization, to support the highest levels of quality, availability, and reliability.

The Tech Ops team operates 24x7x365 and requires onsite coverage. Shifts can fall anywhere on the 24-hour clock and may change periodically, varying the days worked.

POSITION RESPONSIBILITIES

Review AWS Billing and make sure the services being charged accurately reflect company compute usage.
Define, implement, and support a well-governed infrastructure capacity and performance process, supported by forecasting and demand management activities, which ensures consistent service performance, avoids urgent and unplanned investments, and provides consistent information for proactive decision-making.
Transform capacity and performance activities from reactive to proactive, enabling improved visibility for service providers and consumers.
Promote the use of current tools, and drive the evaluation of new tools, to capitalize on predictive capabilities and anomaly detection.
Collaborate across technical domains to harmonize the collection, analysis, and reporting of performance and capacity data.
Drive the continuous review of capacity and performance metrics, mining the data for optimization opportunities that lower unit costs while limiting incremental risk.
Avoid unplanned and urgent upgrades that could undermine budgets, the planning process, and service owners' credibility.
Minimize unacceptable performance that may impact the business so severely that business processing stops.
Support transformation programs, such as data center consolidation, which can be planned with greater precision and future-proofing when mature capacity planning is practiced.
Provide service providers with timely capacity, performance and fault analysis.
Alert on anomalies in performance and capacity before they escalate into service-impacting outages.
Ensure that key metrics are measured and that data collection is consistent across all platforms.
Work closely with the observability team to ensure end-to-end visibility and rationalize enterprise monitoring tools.
Responsible for Configuration Items (CI), CI relationships, and CMDB integration to ensure normalization of data across technical domains and service management.
Ensure monitoring is in line with SLAs/SLTs.
Provide engineering analysis of pending environment changes to ensure SLAs can be maintained.
Support the definition of SLAs, KPIs, and forecasting (demand management) measures for delivering Cloud services.
Partner with adjacent infrastructure and application domain engineers to ensure cohesive, end-to-end solutions meet business objectives in a cost-effective way.
Lead by example and take advantage of peer coaching opportunities to share knowledge and experience with others.
Provide third-level support to operational teams to ensure Incident and Problem Management is a mechanism that feeds continuous improvement.
Support a culture of strong accountability by reinforcing the need for leveraging consistent inputs and outputs between plan, build, and run functions.
Establish and maintain excellent partnerships with key internal providers of server, storage, network, platform services (middleware & database), systems management, incident, problem, and change management, and other key IT services.
Work with application support, enterprise testing and quality assurance to leverage and incorporate their shared services into the infrastructure capacity and performance process.
Support the attraction, development, and retention of talent through mentoring and leading by example.

REPORTING RELATIONSHIPS

This position reports to the TechOps Team Manager.

CANDIDATE QUALIFICATIONS

Functional Skills

Solid understanding of the infrastructure service offerings of AWS or another cloud provider.
Bachelor’s degree in computer science or a related field, or equivalent technical experience.
13 years of professional experience, preferably with a depth of IT experience that includes operations, engineering, and architecture services in large-scale enterprises, along with 8+ years of experience providing infrastructure engineering services.
Demonstrated knowledge of key trends and disruptors
Strong product and solution knowledge with a proven ability to drive technology and culture change.
Excellent knowledge of ITIL and service-based delivery models with 24x7x365 operations.
Prior experience identifying the need, building the business case, gaining executive support, designing, implementing, and supporting a modern infrastructure capacity and performance management process is highly desirable.
Demonstrated ability to understand and decompose enterprise systems, methodically analyze complex problems, and provide insightful and actionable recommendations.
Strong experience with VMware, VCE (vBlock), EMC, NetApp, Cisco, Microsoft, Red Hat
Excellent track record of structured, logical, and methodical approach to problem solving, data gathering, and analysis.
Good working experience with the RHEL (Red Hat Enterprise Linux) operating system.
Good working experience with the Windows operating system.
Strong experience with cloud platforms, especially AWS services (VPC, EC2, S3, IAM, CloudTrail, CloudWatch) and the troubleshooting and operational procedures common in the AWS ecosystem.
Proficiency in programming and scripting languages (e.g., Python, Groovy, Bash).
Experience with configuration management tools (e.g., Ansible, Puppet).
Familiarity with containerization and orchestration tools (e.g., Docker, Kubernetes) and related application deployment patterns.
Knowledge of CI/CD pipelines and related tools (e.g., Jenkins, Argo CD, GitLab CI) and Git workflows (pull requests, merge conflict resolution, etc.).
Understanding of networking concepts and protocols.
Strong problem-solving skills and ability to troubleshoot complex issues.
Excellent communication and collaboration skills.

Location:

This position can be based in any of the following locations:

Gurgaon

Current Guardian Colleagues: Please apply through the internal Jobs Hub in Workday

About Company

Guardian Life is a U.S.-based mutual life insurance company offering life, disability, and dental insurance products. The company focuses on helping individuals and businesses secure financial protection through its range of insurance solutions, with a strong commitment to customer-centric services.