Role: Site Reliability Engineer (SRE) – Core IT Infrastructure

Location: Hyderabad

Work mode: On-site (full-time)

Experience: 6+ years

Key Responsibilities

Infrastructure Reliability & Operations

Design, implement, and maintain highly available and fault-tolerant infrastructure

Ensure reliability, performance, scalability, and security of core IT systems

Monitor system health, capacity, and performance using proactive observability practices

Lead incident response, root cause analysis (RCA), and post-incident reviews

Automation & SRE Development

Develop and maintain automation tools, scripts, and frameworks to reduce manual operations

Apply Infrastructure as Code (IaC) principles using tools such as Terraform, Ansible, or CloudFormation

Build self-healing systems and automate repetitive operational tasks

Improve deployment pipelines and operational workflows through engineering solutions

DevOps & Platform Engineering

Collaborate with DevOps, development, and security teams to support CI/CD pipelines

Enable seamless application deployments with minimal downtime

Support container and orchestration platforms (Docker, Kubernetes, OpenShift)

Implement best practices for configuration management and environment consistency

Monitoring, Observability & Performance

Design and maintain monitoring, logging, and alerting systems

Define and track SLIs, SLOs, and SLAs

Optimize system performance, capacity planning, and cost efficiency

Enhance observability using tools such as Prometheus, Grafana, ELK, Datadog, or similar

Security & Compliance

Implement infrastructure security best practices

Collaborate with security teams on vulnerability management and compliance requirements

Ensure secure access, identity management, and audit readiness

Required Skills & Qualifications

Technical Skills

Strong experience in Linux/Unix system administration

Proficiency in programming/scripting (Python, Go, Bash/shell, or similar)

Experience with cloud platforms (AWS, Azure, or GCP)

Hands-on experience with containerization and orchestration

Knowledge of networking concepts (DNS, TCP/IP, load balancing, firewalls)

Experience with monitoring, logging, and alerting tools
