Role: Prometheus/Grafana to Datadog Migration

Experience: 5+ Years

Location: Bangalore (Work from Office)

We estimate an initial period of 3 months, which can be extended based on performance and project conditions.

Most of the requirements are mandatory, but I'll focus again on the most critical ones:

- Must have very good/exceptional communication skills. This person will need to interact with engineers and debate possible solutions with them.

- Must have strong experience with Datadog, and specifically with migrations to Datadog from Prometheus/Grafana. Ideally, several migrations should have been completed using automation rather than manually.

- Knowledge of Ansible and Terraform is a plus.

- Migration projects from our current technologies to Datadog (please focus on the number of dashboards and alerts migrated, which should be in the thousands, and note the high level of interaction with engineering teams; the migrations should be automated, not done manually).

- Expertise in or high-level familiarity with Terraform and Ansible, specifically the ability to install, roll out, and troubleshoot Datadog agents, and to read, understand, and debug issues related to Datadog and/or the migration process.

Interview Rounds: Technical Screening + Technical Assessment (live coding with proper screen sharing)

A laptop and good connectivity are mandatory for the interview.

We need to start a new project to migrate our observability infrastructure stack from a self-hosted setup on AWS (Prometheus/Grafana, Loki, Mimir) to Datadog (SaaS). The project will focus on the engineering deliverables set by the SRE team for the migration.

Here are the skills required to work on this project. Each candidate will go through a technical screening and assessment before being finalized:

Role: SRE / DevOps Engineer

SKILLS:

1. Working Knowledge of Prometheus and PromQL

- Ability to read, understand, and modify existing PromQL queries, dashboards, and alerting rules, including common aggregations and label usage.

2. Grafana and Alertmanager Familiarity

- Experience navigating Grafana dashboards and Alertmanager configurations to understand intent, thresholds, and alert routing.

3. Datadog Dashboarding and Monitors

- Hands-on experience creating Datadog dashboards and monitors based on defined requirements, using existing patterns and guidance.

4. Query and Alert Semantics Translation

- Ability to accurately map PromQL queries and Alertmanager rules to Datadog equivalents, recognising non-1:1 translations, validating statistical correctness, and documenting functional differences where exact parity is not possible.

5. Observability Concepts

- Understanding of metrics vs logs vs traces, alert thresholds, and standard monitoring practices in production environments.

6. Team Collaboration

- Ability to work with engineering teams to validate migrated dashboards and alerts, following structured validation checklists.

7. Clear Execution and Documentation

- Documenting migrated assets, assumptions, and validation outcomes in a consistent, predefined format.

8. Automation Skills

- Proficiency in building tooling in Python to reduce engineering toil in these migration activities.

Experience Required:

Must Have:

- At least 5 years of relevant experience working on the observability stack defined above.

- Has managed and operated the Datadog platform.

- Strong communication skills to interact with global teams.

- Fundamental knowledge of working with and operating AWS using IaC practices.

Nice to Have:

- AWS Administrator Certifications.

- Located in Bangalore.

Site Reliability Engineer
