Principal Site Reliability Engineer (SRE)
Privacera
2 - 5 years
Pune City
Posted: 15/04/2026
Job Description
Role: Principal Site Reliability Engineer (SRE), Data Platforms
Role Summary
Own reliability, support, and operations of enterprise data platforms (Trust3 AI, Snowflake, Databricks)
with a primary focus on Google Cloud Platform (GCP). This is a deeply hands-on Principal SRE role
combining managed services ownership, advanced production engineering, and reliability at scale.
What You'll Do
Own end-to-end platform lifecycle and managed services delivery: installation, operations,
upgrades, optimization, and continuous platform health
Take full ownership of critical production incidents with deep debugging, RCA, and permanent fixes
Troubleshoot complex, cross-system issues across GCP (GKE, IAM, networking), data platforms, and connectors
Lead performance tuning, scalability optimization, and system hardening for high-throughput systems
Design and implement automation across deployments, monitoring, and operations
Manage secrets and secure integrations using Vault (or similar) within platform and CI/CD workflows
Install, upgrade, and operate Trust3 AI on GCP (GKE) across multi-region environments
Ensure accurate and reliable enforcement of data access policies
Build and enhance observability (metrics, logs, alerts) for proactive issue detection
Eliminate operational toil through continuous reliability improvements
Own issues end-to-end with strong stakeholder communication and SLA adherence
Collaborate with Engineering and Product to resolve issues and influence platform improvements
Lead managed services operations including monitoring, incident prevention, capacity planning,
DR readiness, and service-level outcomes (SLA, uptime, upgrade timelines)
Skills Required
Cloud: Strong expertise in GCP (GKE, IAM, BigQuery, GCS, VPC, Cloud Monitoring/Logging); AWS/Azure exposure is a plus
Data Platforms: Snowflake, Databricks, BigQuery
Infra & CI/CD: Kubernetes, Helm, CI/CD (GitHub Actions, GitLab CI, or similar), Terraform (preferred)
Scripting: Python / Bash
Observability: Prometheus, Grafana, ELK
Security: IAM, RBAC/ABAC, data governance (Trust3 AI/Ranger preferred), secrets management (Vault or similar)
Experience
10+ years in SRE / DevOps / Production Engineering
Strong expertise in debugging distributed systems and complex production environments
Proven ownership of high-severity incidents and large-scale production systems
Demonstrated ability to independently solve ambiguous, high-impact technical problems
Track record of driving reliability, automation, and operational excellence at scale
Experience running high-throughput, always-on (24x7) systems with large data volumes and strict uptime SLAs
Why This Role
Principal-level, deeply hands-on IC role (no people management)
End-to-end ownership of mission-critical data platforms
Work on complex production challenges across cloud, data, and security layers
High impact on enterprise data access, governance, and reliability
Important Note
This is a production-first role involving end-to-end incident ownership, deep technical problem solving,
and managed services operations, not a pure DevOps/build-only or people management role.
