Looking for a Site Reliability Engineer with the skills below.

Skills Required:

SQL, NoSQL, Nagios, CloudWatch, Zabbix, Datadog, New Relic, Prometheus, Grafana, AppDynamics, Site24x7, Telemetry, Splunk, CI/CD, DevOps, Kentico, SRE, AIOps, Agentic AI, Gen AI, AI, ML

Experience Range:

10 - 16 years

Key Responsibilities:

Design, develop, and maintain observability, monitoring, and alerting systems for AI platforms and mission-critical backend services.

Design telemetry pipelines, logging infrastructure, and metrics dashboards using tools such as Splunk, Prometheus, Grafana, and OpenTelemetry.

Define and maintain SLOs, SLIs, and real-time health indicators across platform services and APIs.

Participate in on-call rotations and lead the resolution of high-impact incidents, including root cause analysis and postmortem reporting.

Collaborate with platform engineering teams to enforce governance, compliance, and security standards in production environments.

Enhance deployment pipelines, CI/CD workflows, and infrastructure automation (e.g., GitLab).

Optimize and scale infrastructure components such as Kafka, HAProxy, RMQ, databases, and distributed APIs.

Support capacity planning, cost analysis, and system tuning to improve platform performance.

Advocate for automation-first operations, reducing manual toil through scripting and reliability tooling.

Create and maintain documentation, runbooks, and knowledge-sharing resources across SRE and engineering teams.

Mentor junior engineers and foster a culture of technical rigor and continuous improvement.

Qualifications:

Bachelor's degree in Computer Science, Engineering, or a related field (Master's preferred).

10+ years of experience in SRE, DevOps, or operations engineering in cloud-based environments.

Hands-on experience with monitoring, alerting, and incident response in distributed systems.

Strong coding and scripting skills in Python, Java, or shell scripting languages such as Bash or PowerShell.

Solid understanding of database principles and experience with distributed storage solutions such as Oracle, Cassandra, Solr, and Kafka.

Proficiency in CI/CD pipelines and GitLab workflows.

Strong working knowledge of SQL and NoSQL databases, including Oracle and Cassandra.

Expertise in Linux, networking concepts (TLS/SSL, DNS, load balancers), and troubleshooting large-scale environments.

Familiarity with AI/ML systems, APIs, and modern LLM tooling is a strong plus.

Expertise in observability tools such as Splunk, Grafana, and Prometheus.

Experience with Kubernetes, container orchestration, and hybrid/multi-cloud deployments (Azure preferred; AWS/GCP/OCI acceptable).

Deep understanding of security concepts and protocols, including authentication, authorization, encryption, SSL/TLS, SSH/SFTP, PKI, X.509 certificates, and PGP.

Excellent knowledge of ITIL/ServiceNow terminology for incident and problem management.

Proven ability to work in fast-paced, incident-driven environments with high uptime requirements.

Preferred Qualifications:

Experience supporting AI workloads, model inference systems, or LLM-enabled platforms.

Exposure to AIOps or related ML platform observability and reliability practices.

Familiarity with LangChain, OpenAI, Spring AI, and MCP Server is a strong plus.

Experience in highly regulated telecom environments with compliance and audit controls.

Understanding of AI Gateway patterns and secure API orchestration.

Background in building secure, zero-downtime platforms with enterprise-scale SLAs.

Knowledge, Skills, and Abilities:

Strong grasp of SRE best practices, including SLOs, SLIs, postmortems, and chaos engineering.

Ability to diagnose system bottlenecks across infrastructure, application, and network layers.

Expertise in driving automation across observability, configuration, and deployment domains.

Excellent communication and collaboration skills in cross-functional technical teams.

Curiosity-driven mindset with a passion for learning emerging AI technologies and improving system reliability.

Strong commitment to automating processes for proactive monitoring, anomaly detection, and alerting.

Site Reliability Engineer

Brace Infotech Private Ltd
