Senior AI Data Platform Reliability & Validation Engineer 3

Oracle

5 - 10 years

Bengaluru

Posted: 10/12/2025

Job Description

Responsibilities

Key Responsibilities:

  • Design, develop, and execute end-to-end (E2E) scenario validations that simulate real-world usage of complex AI data platform workflows (data ingestion, transformation, ML pipeline orchestration, etc.).
  • Collaborate closely with product, engineering, and field teams to identify gaps in coverage and propose test automation strategies.
  • Develop and maintain automated test frameworks supporting E2E, integration, performance, and regression testing for distributed data/AI services.
  • Monitor system health across the stack (infrastructure, data pipelines, AI/ML workloads) and proactively detect failures or SLA breaches.
  • Champion SRE best practices including observability, incident management, blameless postmortems, and runbook automation.
  • Analyze logs, traces, and metrics to identify reliability, latency, and scalability issues; drive root cause analysis and corrective actions.
  • Partner with engineering to drive high availability, fault tolerance, and continuous delivery (CI/CD) improvements.
  • Participate in the on-call rotation to support critical services, ensuring rapid resolution and minimizing customer impact.

Desired Qualifications:

  • Bachelor's or master's degree in Computer Science, Engineering, or a related field (or demonstrated equivalent experience).
  • 5+ years of experience in software QA/validation, SRE, or DevOps roles, ideally in data platforms, cloud, or AI/ML environments.
  • Proficient with DevOps automation and tools for continuous integration, deployment, and monitoring (e.g., Terraform, Jenkins, GitLab CI/CD, Prometheus).
  • Working knowledge of distributed systems, data engineering pipelines, and cloud-native architectures (OCI, AWS, Azure, GCP, etc.).
  • Strong proficiency in Java, Python, and related technologies.
  • Hands-on experience with test automation frameworks (e.g., Selenium, pytest, JUnit) and scripting (Python, Bash, etc.).
  • Familiarity with SRE practices: service-level objectives and agreements (SLOs/SLAs), incident response, and observability (Prometheus, Grafana, ELK, etc.).
  • Strong troubleshooting and analytical skills with a passion for reliability engineering and process automation.
  • Excellent communication and cross-team collaboration abilities.
