Principal Site Reliability Engineer

nexocean

2 - 5 years

Hyderabad

Posted: 27/04/2026

Job Description

Principal Site Reliability Engineer


As a Principal Site Reliability Engineer, you will join a high-performing engineering organization responsible for building, operating, and scaling large-scale distributed platforms and enterprise backend systems. You will drive reliability, observability, automation, and operational excellence initiatives to ensure highly available, secure, and cost-efficient infrastructure. You will play a strategic role in defining platform reliability standards, improving system resilience, and partnering closely with engineering and architecture teams to support scalable production environments.


Key Responsibilities

  • Design, implement, and maintain observability, monitoring, and alerting solutions for mission-critical platforms and backend services.
  • Build and manage telemetry pipelines, centralized logging platforms, and operational dashboards using tools such as Splunk, Prometheus, Grafana, and OpenTelemetry.
  • Define and maintain Service Level Objectives (SLOs), Service Level Indicators (SLIs), and availability metrics across services and APIs.
  • Participate in on-call rotations and lead resolution of critical production incidents, including root cause analysis and post-incident reviews.
  • Collaborate with platform and infrastructure teams to enforce governance, compliance, and security standards in production environments.
  • Enhance deployment automation, CI/CD pipelines, and infrastructure provisioning workflows (e.g., GitLab).
  • Optimize and scale distributed infrastructure components including Kafka, HAProxy, RabbitMQ, databases, and API platforms.
  • Perform capacity planning, performance tuning, and cost optimization for large-scale environments.
  • Champion automation-first operations by eliminating manual processes through scripting and reliability tooling.
  • Develop and maintain operational documentation, runbooks, and knowledge repositories.
  • Mentor engineers and promote a culture of reliability engineering, operational maturity, and continuous improvement.

Qualifications

  • Bachelor's degree in Computer Science, Engineering, or a related discipline (Master's preferred).
  • 15+ years of overall technology experience with 10+ years in SRE, DevOps, or Production Operations within cloud environments.
  • Proven experience managing monitoring, alerting, and incident response for distributed systems.
  • Strong programming and scripting skills in Python, Java, Bash, or PowerShell.
  • Solid understanding of database architecture and distributed data technologies such as Oracle, Cassandra, Solr, and Kafka.
  • Hands-on expertise with CI/CD pipelines and GitLab workflows.
  • Strong experience with SQL and NoSQL databases.
  • Advanced knowledge of Linux systems administration, networking fundamentals (DNS, TLS/SSL, load balancing), and large-scale troubleshooting.
  • Experience with Kubernetes, container orchestration, and hybrid or multi-cloud deployments (Azure preferred; AWS/GCP/OCI acceptable).
  • Deep understanding of enterprise security practices including authentication, authorization, encryption, SSH/SFTP, PKI, X.509 certificates, and PGP.
  • Familiarity with ITIL practices and ServiceNow incident/problem management workflows.
  • Demonstrated ability to operate effectively in high-availability, incident-driven production environments.

Preferred Qualifications

  • Experience supporting large-scale distributed platforms with strict uptime requirements.
  • Exposure to advanced monitoring analytics and operational intelligence practices.
  • Experience working in regulated enterprise or telecom environments with strong compliance and audit controls.
  • Understanding of secure API architectures and enterprise integration patterns.
  • Experience designing zero-downtime deployment strategies and high-availability platforms.

Knowledge, Skills & Abilities

  • Deep understanding of Site Reliability Engineering practices including SLOs, SLIs, incident management, postmortems, and resilience engineering.
  • Strong ability to diagnose performance and reliability issues across infrastructure, application, and network layers.
  • Expertise in automation across observability, configuration management, and deployment workflows.
  • Excellent collaboration and communication skills across engineering, platform, and operations teams.
  • Continuous improvement mindset with strong ownership of platform stability and operational excellence.
  • Passion for proactive monitoring, anomaly detection, and reliability automation.
