AI PRE Engineer
HCLTech
10 - 16 years
Noida
Posted: 26/02/2026
Job Description
Job Title: AI PRE Engineer
Location: Noida
Experience: 10-16 Years
AI PRE Engineer (Platform Reliability / Production Readiness Engineer)
The Role
An AI PRE Engineer ensures AI/ML platforms are production-ready, highly reliable, observable, secure, and cost-efficient, bridging AI engineering, SRE, DevOps, and MLOps disciplines.
Responsibilities:
- Define and maintain production readiness standards across platform, data, model, application, and security layers.
- Establish SLO/SLI frameworks for latency, availability, quality, safety, and drift; implement error budget policies.
- Publish reference architectures for LLM apps, RAG, vector stores, agent frameworks, and batch/stream inference.
- Curate deployment blueprints (canary/shadow, blue-green, A/B) for models and prompts with rollback guidance.
- Standardize observability patterns for prompts, embeddings, latency, cost, quality, and safety telemetry.
- Own capacity engineering (token/concurrency budgets, GPU/CPU sizing, vector scaling, cache hierarchies).
- Define resilience patterns (timeouts, circuit breakers, fallbacks, idempotent retries, semantic/prompt caching).
- Set AI security baselines (secrets, private networking, egress controls) and mandate red-team and safety evaluations.
- Maintain compliance mappings (e.g., ISO 27001, SOC 2, GDPR/DPDP, HIPAA where applicable).
- Provide CI/CD pipelines, SDKs, Helm/Terraform templates, and policy-as-code for consistent delivery.
- Author PRR checklists, runbooks/playbooks, and DR/BCP blueprints (RTO/RPO, multi-region/site failover).
- Drive enablement (trainings, brown-bags) and maintain knowledge repositories and decision records.
- Partner with solution teams to validate architecture and non-functional requirements (scale, latency, cost, safety).
- Conduct Production Readiness Reviews (PRRs) and certify releases across performance, security, privacy, and compliance.
- Implement observability (tracing, metrics, logs), dashboards, and alerting on SLO burn rate and cost anomalies.
- Experience with development environments such as Jupyter Notebook, Visual Studio Code, and PyCharm.
- Familiarity with AI-related libraries such as LangChain, PandasAI, and OpenAI.
- Execute safe releases (canary/shadow/blue-green), prompt/model versioning, feature flags, and rollback plans.
- Lead incident response for AI workloads; perform post-incident reviews and drive systemic fixes.
- Govern token/cost budgets, autoscaling thresholds, and vector store performance for FinOps efficiency.
Qualifications & Experience
- Bachelor's degree in Computer Science, Engineering, or Information Technology
- Master's degree in Systems Architecture, Cloud Computing, or AI-related disciplines is preferred
- 9-14 years of overall IT or platform engineering experience
- 5-7 years designing or managing enterprise platforms (AI, data, or cloud platforms)
- 3-5 years in architecture or platform strategy roles supporting multiple teams or business units
- Production readiness reviews, SLO/SLI/SLA design, incident management, RCA/postmortems, on-call support, and capacity planning for AI/ML platforms
- Hands-on experience with AWS/GCP/Azure, GPU-aware infrastructure, Infrastructure as Code (Terraform), Docker, Kubernetes (EKS/GKE/AKS), and managing large-scale, multi-tenant clusters
- Deploying ML/LLM workloads to production, model lifecycle management, RAG pipelines, safe rollouts (canary/shadow), rollback strategies, and managing inference scalability and latency
- Metrics, logging, tracing, and alerting using Prometheus/Grafana/OpenTelemetry or cloud-native tools; monitoring AI-specific signals such as model drift, latency, token usage, and GPU utilization
- Strong coding (Python/Go/Java), CI/CD pipelines (GitHub Actions, Jenkins), GitOps, automated reliability tooling, security best practices (secrets management, access control, AI guardrails)
Certifications Required:
- NVIDIA Certified Professional: AI Infrastructure & Operations
- NVIDIA DLI: Deploying AI with Kubernetes & GPUs
- NVIDIA DLI: Building AI Infrastructure with NVIDIA Technologies
- Certified Kubernetes Administrator
- Docker Certified Associate
- Red Hat Certified System Administrator (RHCSA)
- Linux Foundation Certified System Administrator (LFCS)
