Overview

Are you passionate about cloud computing, obsessed with customer experience, and driven to resolve complex issues under pressure? Do you thrive in high-stakes, live environments and want to play a pivotal role in ensuring the reliability of Microsoft’s cloud platform? If so, the Azure Customer Experience (CXP) team has the opportunity for you.

Microsoft Azure is one of the most exciting and strategic products at Microsoft—powering mission-critical workloads for enterprises, governments, and startups around the world. Azure delivers on-demand, hyper-scale infrastructure and platforms via Microsoft's global data centers, enabling customers to build, host, and scale their applications with confidence.

The Customer Reliability Engineering (CRE) team within Azure CXP is a top-level pillar of Azure Engineering responsible for world-class live-site management, customer reliability engagements, and modern customer-first experiences for scale, and for driving deep customer insights and empathy into the broader Azure Engineering organization. Our “no dead ends” philosophy ensures that every customer, regardless of size or scale, can realize their full potential through the Microsoft Cloud.

We are seeking decisive and experienced Service Engineers to work on Live Site Issues and Problem Management and to drive the customer reliability space. This role is accountable for enhancing the customer experience across Azure, including First Party Services. The ideal candidate will demonstrate strong breadth in managing complex, highly available services, paired with deep technical expertise in Azure Core Services and their interdependencies. You will work closely with Customers, First Parties, Customer Support, Livesite, and Engineering teams to deliver critical, customer-facing features. Success in this role requires the ability to influence and collaborate across many Azure servicing teams to ensure customer needs are met.

In addition, this role includes on-call responsibilities for managing and resolving complex multi-service outages. It requires the ability to remain effective under pressure, apply broad technical and analytical skills, and coordinate seamlessly with internal service teams and stakeholders. Strong communication skills—both written and verbal—are essential. You will also lead the evolution of Azure's Incident Management practice through Post-Incident Reviews, process development, and system automation. By leveraging telemetry and metrics, you will identify and drive platform-wide improvements with global impact. You’ll be the single point of command and control during high-severity incidents, orchestrating cross-functional engineering, operations, and communications to minimize impact, restore services quickly, and protect the trust of our global customer base.

This role offers a unique opportunity to make an immediate impact and improve systems at scale.

Qualifications

Required Qualifications:

- 10+ years of experience in cloud operations, incident response, SRE, or large-scale systems engineering roles, preferably on platforms like Azure, AWS, or GCP.
- Extensive service engineering experience in always-on, zero-downtime enterprise environments, operating at global scale 24x7x365
- Exceptional command presence and executive-grade communication skills—able to impose clarity, direction, and alignment across customers, senior stakeholders, and third-party vendors in high-stakes, high-ambiguity situations
- Deep mastery of modern cloud architecture patterns, microservices design, and enterprise-grade container orchestration at scale
- Demonstrated ability to make critical, time-bound decisions under pressure and with limited data, without compromising long-term reliability.
- Advanced proficiency with enterprise observability and monitoring ecosystems (Grafana, Prometheus, Datadog, Splunk, New Relic).
- Experience leading or significantly contributing to building AI-augmented observability frameworks that proactively predict, detect, and eliminate performance bottlenecks
- Expert-level knowledge of CI/CD automation pipelines, large-scale container orchestration (Kubernetes, Docker), and infrastructure as code solutions (Terraform, ARM, Bicep) for hyperscale deployments.
- Hands-on experience with AI/ML frameworks and production-grade cloud AI services, applying them to operational intelligence and automation
- Proven success deploying AI-driven monitoring, predictive alerting, and automated remediation systems in mission-critical environments
- Fluency in one or more automation languages (PowerShell, Python, CLI, etc.)
- Deep understanding of ITIL and modern incident management frameworks, with a track record of evolving processes for agility and scale.
- Mastery of high availability architectures, disaster recovery strategies, business continuity planning, and advanced performance tuning for distributed systems.
- Demonstrated strategic thinking, quantitative and analytical skills, team leadership, and collaboration
- Excellent problem resolution, judgment, negotiating and decision-making skills
- Desired: strong knowledge of Windows or Linux platforms and developer tools, and the ability to diagnose and debug user code
- Proven ability to triage, prioritize, and execute multiple critical workstreams in alignment with strategic objectives under time constraints.
- Excellent written and verbal communication skills in English, especially in high-pressure scenarios.
- Ability to communicate with a variety of audiences, including high-profile customers, executive management, and engineering teams.
- Deep, hands-on expertise with Azure, AWS, or GCP core services, including the ability to architect and troubleshoot complex interdependent systems.
- Bachelor’s or master’s degree in Computer Science, Information Technology, or equivalent experience

Preferred Qualifications:

- 10+ years of demonstrated experience as an Incident Commander or Crisis Manager for critical, high-severity incidents in high-availability, distributed environments.
- Experience with SRE (Site Reliability Engineering) principles and practices.
- Advanced exposure to chaos engineering, systemic fault injection, and designing for failure-resilient, self-healing architectures

About Company

Microsoft Corporation is a leading American multinational technology company founded in 1975 by Bill Gates and Paul Allen. Headquartered in Redmond, Washington, Microsoft is best known for its software products, including the Windows operating system, Microsoft Office Suite, and Azure cloud services. The company also produces hardware like the Surface devices and owns LinkedIn, GitHub, and the Xbox gaming brand. Microsoft is one of the world's most valuable companies, playing a key role in personal computing, enterprise software, AI, and cloud computing.

Senior Service Engineer

Microsoft
