Principal Site Reliability Engineer

As a Principal Site Reliability Engineer, you will join a high-performing engineering organization responsible for building, operating, and scaling large-scale distributed platforms and enterprise backend systems. You will drive reliability, observability, automation, and operational excellence initiatives to ensure highly available, secure, and cost-efficient infrastructure. You will play a strategic role in defining platform reliability standards, improving system resilience, and partnering closely with engineering and architecture teams to support scalable production environments.

Key Responsibilities

Design, implement, and maintain observability, monitoring, and alerting solutions for mission-critical platforms and backend services.
Build and manage telemetry pipelines, centralized logging platforms, and operational dashboards using tools such as Splunk, Prometheus, Grafana, and OpenTelemetry.
Define and maintain Service Level Objectives (SLOs), Service Level Indicators (SLIs), and availability metrics across services and APIs.
Participate in on-call rotations and lead resolution of critical production incidents, including root cause analysis and post-incident reviews.
Collaborate with platform and infrastructure teams to enforce governance, compliance, and security standards in production environments.
Enhance deployment automation, CI/CD pipelines, and infrastructure provisioning workflows (e.g., GitLab).
Optimize and scale distributed infrastructure components including Kafka, HAProxy, RabbitMQ, databases, and API platforms.
Perform capacity planning, performance tuning, and cost optimization for large-scale environments.
Champion automation-first operations by eliminating manual processes through scripting and reliability tooling.
Develop and maintain operational documentation, runbooks, and knowledge repositories.
Mentor engineers and promote a culture of reliability engineering, operational maturity, and continuous improvement.

Qualifications

Bachelor's degree in Computer Science, Engineering, or a related discipline (Master's preferred).
15+ years of overall technology experience, including 10+ years in SRE, DevOps, or Production Operations within cloud environments.
Proven experience managing monitoring, alerting, and incident response for distributed systems.
Strong programming and scripting skills in Python, Java, Bash, or PowerShell.
Solid understanding of database architecture and distributed storage technologies such as Oracle, Cassandra, Solr, and Kafka.
Hands-on expertise with CI/CD pipelines and GitLab workflows.
Strong experience with SQL and NoSQL databases.
Advanced knowledge of Linux systems administration, networking fundamentals (DNS, TLS/SSL, load balancing), and large-scale troubleshooting.
Experience with Kubernetes, container orchestration, and hybrid or multi-cloud deployments (Azure preferred; AWS/GCP/OCI acceptable).
Deep understanding of enterprise security practices including authentication, authorization, encryption, SSH/SFTP, PKI, X.509 certificates, and PGP.
Familiarity with ITIL practices and ServiceNow incident/problem management workflows.
Demonstrated ability to operate effectively in high-availability, incident-driven production environments.

Preferred Qualifications

Experience supporting large-scale distributed platforms with strict uptime requirements.
Exposure to advanced monitoring analytics and operational intelligence practices.
Experience working in regulated enterprise or telecom environments with strong compliance and audit controls.
Understanding of secure API architectures and enterprise integration patterns.
Experience designing zero-downtime deployment strategies and high-availability platforms.

Knowledge, Skills & Abilities

Deep understanding of Site Reliability Engineering practices including SLOs, SLIs, incident management, postmortems, and resilience engineering.
Strong ability to diagnose performance and reliability issues across infrastructure, application, and network layers.
Expertise in automation across observability, configuration management, and deployment workflows.
Excellent collaboration and communication skills across engineering, platform, and operations teams.
Continuous improvement mindset with strong ownership of platform stability and operational excellence.
Passion for proactive monitoring, anomaly detection, and reliability automation.

nexocean
