Role: Principal Site Reliability Engineer (SRE), Data Platforms

Role Summary

Own reliability, support, and operations of enterprise data platforms (Trust3 AI, Snowflake, Databricks) with a primary focus on Google Cloud Platform (GCP). This is a deeply hands-on Principal SRE role combining managed services ownership, advanced production engineering, and reliability at scale.

What You'll Do

Own end-to-end platform lifecycle and managed services delivery: installation, operations, upgrades, optimization, and continuous platform health

Take full ownership of critical production incidents with deep debugging, RCA, and permanent fixes

Troubleshoot complex, cross-system issues across GCP (GKE, IAM, networking), data platforms, and connectors

Lead performance tuning, scalability optimization, and system hardening for high-throughput systems

Design and implement automation across deployments, monitoring, and operations

Manage secrets and secure integrations using Vault (or similar) within platform and CI/CD workflows

Install, upgrade, and operate Trust3 AI on GCP (GKE) across multi-region environments

Ensure accurate and reliable enforcement of data access policies

Build and enhance observability (metrics, logs, alerts) for proactive issue detection

Eliminate operational toil through continuous reliability improvements

Own issues end-to-end with strong stakeholder communication and SLA adherence

Collaborate with Engineering and Product to resolve issues and influence platform improvements

Lead managed services operations including monitoring, incident prevention, capacity planning, DR readiness, and service-level outcomes (SLA, uptime, upgrade timelines)

Skills Required

Cloud: Strong expertise in GCP (GKE, IAM, BigQuery, GCS, VPC, Cloud Monitoring/Logging); AWS/Azure exposure is a plus

Data Platforms: Snowflake, Databricks, BigQuery

Infra & CI/CD: Kubernetes, Helm, CI/CD (GitHub Actions, GitLab CI, or similar), Terraform (preferred)

Scripting: Python / Bash

Observability: Prometheus, Grafana, ELK

Security: IAM, RBAC/ABAC, data governance (Trust3 AI/Ranger preferred), secrets management (Vault or similar)

Experience

10+ years in SRE / DevOps / Production Engineering

Strong expertise in debugging distributed systems and complex production environments

Proven ownership of high-severity incidents and large-scale production systems

Demonstrated ability to independently solve ambiguous, high-impact technical problems

Track record of driving reliability, automation, and operational excellence at scale

Experience running high-throughput, always-on (24x7) systems with large data volumes and strict uptime SLAs

Why This Role

Principal-level, deeply hands-on IC role (no people management)

End-to-end ownership of mission-critical data platforms

Work on complex production challenges across cloud, data, and security layers

High impact on enterprise data access, governance, and reliability

Important Note

This is a production-first role involving end-to-end incident ownership, deep technical problem solving, and managed services operations. It is not a pure DevOps/build-only role or a people management role.
