AI/ML Observability Engineer
Dexian India
2 - 5 years
Hyderabad
Posted: 15/04/2026
Job Description
Overview
We are seeking a passionate and hands-on AI/ML Engineer to accelerate our Enterprise Observability strategy. This role will design, build, and operationalize AI/ML capabilities that enhance end-to-end telemetry pipelines, anomaly detection, intelligent alerting, and proactive system resiliency.
You will work at the intersection of AI/ML engineering, Observability platforms, and automation, developing solutions that improve detection, diagnosis, and prevention of operational issues across distributed systems.
________________________________________
Key Responsibilities
Design and deploy AI/ML models supporting anomaly detection, baselining, event correlation, and predictive operational analytics.
Build and integrate AI-enabled capabilities into enterprise Observability platforms, including Grafana, APM/RUM tools, network telemetry systems, and data observability tools.
Develop AI Agents that can autonomously triage issues, recommend corrective actions, and initiate automated remediation workflows to reduce recovery time and improve system resilience.
Implement self-healing automation using AI-driven decisioning, integrating with orchestration frameworks, service APIs, and infrastructure automation pipelines.
Engineer and maintain real-time and batch data pipelines using Snowflake ML Jobs, Snowflake Cortex, streams, tasks, and UDFs.
Implement and manage OpenTelemetry-based telemetry ingestion for logs, metrics, traces, and spans across distributed systems.
Build asynchronous Python APIs and services for model inferencing and operational integration.
Enhance observability intelligence with AI-powered capabilities such as root-cause acceleration, chatbot/search enablement, and automated insights.
Contribute to SLO/SLI modeling, Golden Signals instrumentation, and Observability NFR adoption.
Collaborate across engineering, SRE, platform, and business teams to embed proactive intelligence and Observability standards throughout the ecosystem.
Required Skills & Qualifications
Core Technical Skills
Strong proficiency in Python and data science/ML libraries:
NumPy, Pandas, scikit-learn, TensorFlow, PyTorch, Matplotlib, Seaborn.
Experience with Generative AI, LLM fine-tuning, prompt engineering, RAG pipelines, and LLM evaluation frameworks.
Expertise in developing and deploying ML models in production (batch & streaming).
Strong understanding of statistics, time-series modeling, and anomaly detection.
Observability & Telemetry
Experience with OpenTelemetry for logs, metrics, traces, and spans.
Familiarity with Observability concepts:
Golden Signals, SLO/SLI design, APM, RUM, Synthetics, event correlation, baselining.
Experience with Observability tools such as:
Grafana (Alloy agents, dashboards, ML capabilities), Dynatrace, Monte Carlo (Data Observability), Netscout, ThousandEyes, SolarWinds, NetBrain.
Cloud, Data & Platform
Hands-on experience with AWS (SageMaker, Bedrock), Snowflake ML, Snowflake Openflow, and Snowflake AI Observability tooling.
Experience building Snowflake data pipelines (streams, tasks, UDFs) and working with Cortex features.
Strong understanding of distributed systems and microservices telemetry requirements.
Automation & Engineering Quality
Experience with automation pipelines, CI/CD, and infrastructure-as-code patterns supporting Observability adoption.
Ability to build asynchronous Python APIs or services for model inference and operational integration.
________________________________________
Preferred Qualifications
Experience developing agentic AI systems that analyze telemetry, generate action recommendations, or execute automated operational responses.
Experience building self-healing patterns, including automated rollback, service restarts, configuration corrections, and predictive maintenance.
Experience in Snowflake ML workflows, Snowflake Cortex Agents, and data pipeline automation.
Exposure to AI-enabled alerting, RCA automation, and operational self-healing concepts.
Experience with large-scale operational telemetry and multi-cloud ecosystems.
Soft Skills
Strong analytical thinking and problem-solving.
Excellent communication skills for cross-functional collaboration with infrastructure, SRE, engineering, business, and leadership teams.
Curiosity, continuous learning mindset, and passion for applied AI and Observability.
