Job Title: AI PRE Engineer

Location: Noida

Experience: 10-16 Years

AI PRE Engineer (Platform Reliability / Production Readiness Engineer)

The Role

An AI PRE Engineer ensures AI/ML platforms are production-ready, highly reliable, observable, secure, and cost-efficient, bridging AI engineering, SRE, DevOps, and MLOps disciplines.

Responsibilities:

Define and maintain production readiness standards across platform, data, model, application, and security layers.
Establish SLO/SLI frameworks for latency, availability, quality, safety, and drift; implement error budget policies.
Publish reference architectures for LLM apps, RAG, vector stores, agent frameworks, and batch/stream inference.
Curate deployment blueprints (canary/shadow, blue-green, A/B) for models and prompts with rollback guidance.
Standardize observability patterns for prompts, embeddings, latency, cost, quality, and safety telemetry.
Own capacity engineering (token/concurrency budgets, GPU/CPU sizing, vector scaling, cache hierarchies).
Define resilience patterns (timeouts, circuit breakers, fallbacks, idempotent retries, semantic/prompt caching).
Set AI security baselines (secrets, private networking, egress controls) and mandate red-team and safety evaluations.
Maintain compliance mappings (e.g., ISO 27001, SOC 2, GDPR/DPDP, HIPAA where applicable).
Provide CI/CD pipelines, SDKs, Helm/Terraform templates, and policy-as-code for consistent delivery.
Author PRR checklists, runbooks/playbooks, and DR/BCP blueprints (RTO/RPO, multi-region/site failover).
Drive enablement (trainings, brown-bags) and maintain knowledge repositories and decision records.
Partner with solution teams to validate architecture and non-functional requirements (scale, latency, cost, safety).
Conduct Production Readiness Reviews (PRRs) and certify releases across performance, security, privacy, and compliance.
Implement observability (tracing, metrics, logs), dashboards, and alerting for SLO burn and cost anomalies.
Experience with IDEs and notebook environments such as Jupyter Notebook, Visual Studio Code, and PyCharm.
Familiarity with AI-related libraries such as LangChain, PandasAI, and OpenAI.
Execute safe releases (canary/shadow/blue-green), prompt/model versioning, feature flags, and rollback plans.
Lead incident response for AI workloads; perform post-incident reviews and drive systemic fixes.
Govern token/cost budgets, autoscaling thresholds, and vector store performance for FinOps efficiency.

Qualifications & Experience

Bachelor's degree in Computer Science, Engineering, or Information Technology
Master's degree in Systems Architecture, Cloud Computing, or AI-related disciplines is preferred
9-14 years of overall IT or platform engineering experience
5-7 years designing or managing enterprise platforms (AI, data, or cloud platforms)
3-5 years in architecture or platform strategy roles supporting multiple teams or business units
Production readiness reviews, SLO/SLI/SLA design, incident management, RCA/postmortems, on-call support, and capacity planning for AI/ML platforms
Hands-on experience with AWS/GCP/Azure, GPU-aware infrastructure, Infrastructure as Code (Terraform), Docker, Kubernetes (EKS/GKE/AKS), and managing large-scale, multi-tenant clusters
Deploying ML/LLM workloads to production, model lifecycle management, RAG pipelines, safe rollouts (canary/shadow), rollback strategies, and managing inference scalability and latency
Metrics, logging, tracing, and alerting using Prometheus/Grafana/OpenTelemetry or cloud-native tools; monitoring AI-specific signals such as model drift, latency, token usage, and GPU utilization
Strong coding (Python/Go/Java), CI/CD pipelines (GitHub Actions, Jenkins), GitOps, automated reliability tooling, security best practices (secrets management, access control, AI guardrails)

Certifications Required:

NVIDIA Certified Professional: AI Infrastructure & Operations
NVIDIA DLI Deploying AI with Kubernetes & GPUs
NVIDIA DLI Building AI Infrastructure with NVIDIA Technologies
Certified Kubernetes Administrator
Docker Certified Associate
Red Hat Certified System Administrator (RHCSA)
Linux Foundation Certified System Administrator (LFCS)

HCLTech