AI PRE Engineer

HCLTech

10 - 16 years

Noida

Posted: 26/02/2026

Job Description

Job Title: AI PRE Engineer

Location: Noida

Experience: 10-16 Years


AI PRE Engineer (Platform Reliability / Production Readiness Engineer)

The Role

An AI PRE Engineer ensures AI/ML platforms are production-ready, highly reliable, observable, secure, and cost-efficient, bridging AI engineering, SRE, DevOps, and MLOps disciplines.

Responsibilities:


  • Define and maintain production readiness standards across platform, data, model, application, and security layers.
  • Establish SLO/SLI frameworks for latency, availability, quality, safety, and drift; implement error budget policies.
  • Publish reference architectures for LLM apps, RAG, vector stores, agent frameworks, and batch/stream inference.
  • Curate deployment blueprints (canary/shadow, blue-green, A/B) for models and prompts with rollback guidance.
  • Standardize observability patterns for prompts, embeddings, latency, cost, quality, and safety telemetry.
  • Own capacity engineering (token/concurrency budgets, GPU/CPU sizing, vector scaling, cache hierarchies).
  • Define resilience patterns (timeouts, circuit breakers, fallbacks, idempotent retries, semantic/prompt caching).
  • Set AI security baselines (secrets, private networking, egress controls) and mandate red-team & safety evaluations.
  • Maintain compliance mappings (e.g., ISO 27001, SOC 2, GDPR/DPDP, HIPAA where applicable).
  • Provide CI/CD pipelines, SDKs, Helm/Terraform templates, and policy-as-code for consistent delivery.
  • Author PRR checklists, runbooks/playbooks, and DR/BCP blueprints (RTO/RPO, multi-region/site failover).
  • Drive enablement (trainings, brown-bags) and maintain knowledge repositories and decision records.
  • Partner with solution teams to validate architecture and non-functional requirements (scale, latency, cost, safety).
  • Conduct Production Readiness Reviews (PRRs) and certify releases across performance, security, privacy, and compliance.
  • Implement observability (tracing, metrics, logs), dashboards, and alerting on SLO burn rate and cost anomalies.
  • Experience with IDEs and development environments such as Jupyter Notebook, Visual Studio Code, PyCharm, etc.
  • Familiarity with AI-related libraries such as LangChain, PandasAI, and OpenAI.
  • Execute safe releases (canary/shadow/blue-green), prompt/model versioning, feature flags, and rollback plans.
  • Lead incident response for AI workloads; perform post-incident reviews and drive systemic fixes.
  • Govern token/cost budgets, autoscaling thresholds, and vector store performance for FinOps efficiency.


Qualifications & Experience


  • Bachelor's degree in Computer Science, Engineering, or Information Technology
  • Master's degree in Systems Architecture, Cloud Computing, or AI-related disciplines is preferred
  • 9-14 years of overall IT or platform engineering experience
  • 5-7 years designing or managing enterprise platforms (AI, data, or cloud platforms)
  • 3-5 years in architecture or platform strategy roles supporting multiple teams or business units
  • Production readiness reviews, SLO/SLI/SLA design, incident management, RCA/postmortems, on-call support, and capacity planning for AI/ML platforms
  • Hands-on experience with AWS/GCP/Azure, GPU-aware infrastructure, Infrastructure as Code (Terraform), Docker, Kubernetes (EKS/GKE/AKS), and managing large-scale, multi-tenant clusters
  • Deploying ML/LLM workloads to production, model lifecycle management, RAG pipelines, safe rollouts (canary/shadow), rollback strategies, and managing inference scalability and latency
  • Metrics, logging, tracing, and alerting using Prometheus/Grafana/OpenTelemetry or cloud-native tools; monitoring AI-specific signals such as model drift, latency, token usage, and GPU utilization
  • Strong coding (Python/Go/Java), CI/CD pipelines (GitHub Actions, Jenkins), GitOps, automated reliability tooling, security best practices (secrets management, access control, AI guardrails)

Certifications Required:

  • NVIDIA Certified Professional: AI Infrastructure & Operations
  • NVIDIA DLI Deploying AI with Kubernetes & GPUs
  • NVIDIA DLI Building AI Infrastructure with NVIDIA Technologies
  • Certified Kubernetes Administrator
  • Docker Certified Associate
  • Red Hat Certified System Administrator (RHCSA)
  • Linux Foundation Certified System Administrator (LFCS)
