Role Overview

As a Level 3 AWS Infrastructure Support Engineer, you will own overnight monitoring and incident response for the AWS-based production environment of Electronikmedia's clients. You will:

Monitor system health using Datadog and AWS-native tools
Investigate alerts and anomalies using established runbooks
Resolve production incidents when possible
Escalate complex issues quickly and accurately
Maintain clean, auditable incident documentation

This role is ideal for someone who thrives in high-trust, high-impact operational environments.

Key Responsibilities

On-Call & Incident Response

Provide initial response within 15 minutes for all high-priority production alerts
Investigate, mitigate, and resolve production outages when feasible
Escalate unresolved or complex issues using the defined escalation matrix
Act as the owner of production system stability

Monitoring, Alerting & Observability

Analyze and respond to Datadog monitor alerts across infrastructure and application layers
Identify abnormal patterns, trend-line deviations, and early indicators of systemic risk
Proactively notify stakeholders of significant performance or stability concerns
Contribute insights for preventive and corrective actions

Root Cause & Trend Analysis

Track recurring alerts and incidents
Provide analysis and recommendations to reduce alert noise and improve system resilience
Participate in weekly validation of Datadog alert configurations and thresholds

Communication & Documentation

Maintain clear, concise, and timely communication during incidents
Document all incidents, alarms, and observations in Jira during each shift
Ensure handoff notes are complete and actionable for daytime engineering teams

Technical Environment

Core AWS Services

ECS (Fargate)
RDS
ElastiCache
EC2
Lambda
API Gateway
S3

Tooling

Datadog (monitoring, alerts, dashboards)
Jira (incident tracking and documentation)

Qualifications

Experience

5+ years of hands-on AWS infrastructure administration and support
Proven experience supporting production-grade, high-availability systems
Strong background in incident response within enterprise or scale-up environments

Skills

Deep operational knowledge of AWS services and distributed systems
Strong troubleshooting and root-cause analysis skills under tight SLAs
Ability to follow runbooks while also knowing when to think beyond them
Calm, structured decision-making during production incidents

Certifications (Preferred)

AWS Certified Solutions Architect (Associate or Professional)
AWS Certified DevOps Engineer - Professional (nice to have)

Service Level Expectations

Alert Escalation SLA: 15 minutes for high-priority alarms
Availability: Consistent overnight coverage (IST Day Shift)
Reliability: Zero missed critical alerts during assigned coverage windows

Deliverables

Monthly Service Performance Report, including alerts monitored, incidents resolved, escalations, and SLA adherence metrics
Weekly Datadog Validation, ensuring alert accuracy and functionality

Level 3 AWS Infrastructure Support Engineer

Electronikmedia (EM)

Job Description
