Site Reliability Engineer
Brace Infotech Private Ltd
10 - 16 years
Bengaluru
Posted: 22/02/2026
Job Description
Looking for a Site Reliability Engineer with the skills below.
Skills Required:
SQL, NoSQL, Nagios, CloudWatch, Zabbix, Datadog, New Relic, Prometheus, Grafana,
AppDynamics, Site24x7, Telemetry, Splunk, CI/CD, DevOps, Kentico,
SRE, Site Reliability, AIOps, Agentic, GenAI, AI, ML
Experience Range:
10 - 16 years
Key Responsibilities:
Design, develop, and maintain observability, monitoring, and alerting systems for AI
platforms and mission-critical backend services.
Design telemetry pipelines, logging infrastructure, and metrics dashboards using tools
such as Splunk, Prometheus, Grafana, and OpenTelemetry.
Define and maintain SLOs, SLIs, and real-time health indicators across platform
services and APIs.
Participate in on-call rotations and lead the resolution of high-impact incidents,
including root cause analysis and postmortem reporting.
Collaborate with platform engineering teams to enforce governance, compliance, and
security standards in production environments.
Enhance deployment pipelines, CI/CD workflows, and infrastructure automation (e.g.,
GitLab).
Optimize and scale infrastructure components such as Kafka, HAProxy, RabbitMQ (RMQ),
databases, and distributed APIs.
Support capacity planning, cost analysis, and system tuning to improve platform
performance.
Advocate for automation-first operations, reducing manual toil through scripting and
reliability tooling.
Create and maintain documentation, runbooks, and knowledge-sharing resources
across SRE and engineering teams.
Mentor junior engineers and foster a culture of technical rigor and continuous
improvement.
Qualifications:
Bachelor's degree in Computer Science, Engineering, or a related field (Master's
preferred).
10+ years of experience in SRE, DevOps, or operations engineering in cloud-based
environments.
Hands-on experience with monitoring, alerting, and incident response in distributed
systems.
Strong coding and scripting skills in Python, Java, or shell scripting languages such as
Bash or PowerShell.
Solid understanding of database principles and experience with distributed storage
solutions such as Oracle, Cassandra, Solr, and Kafka.
Proficiency in CI/CD pipelines and GitLab workflows.
Strong working knowledge of SQL and NoSQL databases, including Oracle and
Cassandra.
Expertise in Linux, networking concepts (TLS/SSL, DNS, load balancers), and
troubleshooting large-scale environments.
Familiarity with AI/ML systems, APIs, and modern LLM tooling is a strong plus.
Expertise in observability tools such as Splunk, Grafana, and Prometheus.
Experience with Kubernetes, container orchestration, and hybrid/multi-cloud
deployments (Azure preferred; AWS/GCP/OCI acceptable).
Deep understanding of security concepts and protocols, including authentication,
authorization, encryption, SSL/TLS, SSH/SFTP, PKI, X.509 certificates, and PGP.
Excellent knowledge of ITIL/ServiceNow terminology for incident and problem
management.
Proven ability to work in fast-paced, incident-driven environments with high uptime
requirements.
Preferred Qualifications:
Experience supporting AI workloads, model inference systems, or LLM-enabled
platforms.
Exposure to AIOps or related ML platform observability and reliability practices.
Familiarity with LangChain, OpenAI, Spring AI, and MCP Server is a strong plus.
Experience in highly regulated telecom environments with compliance and audit
controls.
Understanding of AI Gateway patterns and secure API orchestration.
Background in building secure, zero-downtime platforms with enterprise-scale SLAs.
Knowledge, Skills, and Abilities:
Strong grasp of SRE best practices, including SLOs, SLIs, postmortems, and chaos
engineering.
Ability to diagnose system bottlenecks across infrastructure, application, and network
layers.
Expertise in driving automation across observability, configuration, and deployment
domains.
Excellent communication and collaboration skills in cross-functional technical teams.
Curiosity-driven mindset with a passion for learning emerging AI technologies and
improving system reliability.
Strong commitment to automating processes for proactive monitoring, anomaly
detection, and alerting.
