Senior AI Data Platform Reliability & Validation Engineer 3
Oracle
5 - 10 years
Bengaluru
Posted: 10/12/2025
Job Description
Key Responsibilities:
- Design, develop, and execute end-to-end (E2E) scenario validations that simulate real-world usage of complex AI data platform workflows (data ingestion, transformation, ML pipeline orchestration, etc.).
- Collaborate closely with product, engineering, and field teams to identify gaps in coverage and propose test automation strategies.
- Develop and maintain automated test frameworks supporting E2E, integration, performance, and regression testing for distributed data/AI services.
- Monitor system health across the stack (infrastructure, data pipelines, AI/ML workloads) and proactively detect failures or SLA breaches.
- Champion SRE best practices, including observability, incident management, blameless postmortems, and runbook automation.
- Analyze logs, traces, and metrics to identify reliability, latency, and scalability issues; drive root cause analysis and corrective actions.
- Partner with engineering to drive high-availability, fault tolerance, and continuous delivery (CI/CD) improvements.
- Participate in the on-call rotation to support critical services, ensuring rapid resolution and minimizing customer impact.
Desired Qualifications:
- Bachelor's or master's degree in Computer Science, Engineering, or a related field (or demonstrated equivalent experience).
- 5+ years of experience in software QA/validation, SRE, or DevOps roles, ideally in data platforms, cloud, or AI/ML environments.
- Proficient with DevOps automation and tools for continuous integration, deployment, and monitoring (e.g., Terraform, Jenkins, GitLab CI/CD, Prometheus).
- Working knowledge of distributed systems, data engineering pipelines, and cloud-native architectures (OCI, AWS, Azure, GCP, etc.).
- Strong proficiency in Java, Python, and related technologies.
- Hands-on experience with test automation frameworks (e.g., Selenium, pytest, JUnit) and scripting (Python, Bash, etc.).
- Familiarity with SRE practices: service-level objectives and agreements (SLOs/SLAs), incident response, and observability (Prometheus, Grafana, ELK, etc.).
- Strong troubleshooting and analytical skills with a passion for reliability engineering and process automation.
- Excellent communication and cross-team collaboration abilities.
