Service Engineer
Microsoft
5 - 10 years
Hyderabad
Posted: 23/08/2025
Job Description
Overview
Are you passionate about cloud computing, obsessed with customer experience, and driven to resolve complex issues under pressure? Do you thrive in high-stakes, live environments and want to play a pivotal role in ensuring the reliability of Microsoft’s cloud platform? If so, the Azure Customer Experience (CXP) team has the opportunity for you.
Microsoft Azure is one of the most exciting and strategic products at Microsoft—powering mission-critical workloads for enterprises, governments, and startups around the world. Azure delivers on-demand, hyper-scale infrastructure and platforms via Microsoft's global data centers, enabling customers to build, host, and scale their applications with confidence.
The Customer Reliability Engineering (CRE) team within Azure CXP is a top-level pillar of Azure Engineering, responsible for world-class live-site management, customer reliability engagements, and modern customer-first experiences at scale, and for driving deep customer insights and empathy into the broader Azure Engineering organization. Our “no dead-ends” philosophy ensures that every customer, regardless of size or scale, can realize their full potential through the Microsoft Cloud.
We are seeking decisive and experienced Service Engineers with proven incident and crisis management experience. These engineers will manage Live Site issues, drive Problem Management, and enhance customer reliability.
The ideal candidate will possess deep technical expertise in Azure Core Services and their intricate interdependencies, coupled with a proven ability to manage complex, highly available services at scale. As the single point of command and control during high-severity incidents, you will orchestrate cross-functional engineering, operations, and communications to swiftly restore services, minimize impact, and safeguard the trust of our global customer base.
You will work closely with Customers, First Parties, Customer Support, Livesite, and Engineering teams to deliver critical, customer-facing features. Success in this role requires the ability to influence and collaborate across many Azure servicing teams to ensure customer needs are met. You’ll be surrounded by elite developers, data scientists, and customer-obsessed engineers who care deeply about continuous improvement and resilient cloud operations.
In addition, this role includes on-call responsibilities for managing and resolving complex multi-service outages. It requires the ability to remain effective under pressure, apply broad technical and analytical skills, and coordinate seamlessly with internal service teams and stakeholders. Strong communication skills, both written and verbal, are essential. You will also lead the evolution of Azure's Incident Management practice through Post-Incident Reviews, process development, and system automation. By leveraging telemetry and metrics, you will identify and drive platform-wide improvements with global impact.
This role offers a unique opportunity to make an immediate impact and improve systems at scale.
Qualifications
Required Qualifications
- 5+ years of proven expertise in mission-critical cloud operations, high-severity incident response, SRE, or large-scale systems engineering on hyperscale platforms like Azure, AWS, or GCP.
- Must have Service Engineering experience in a 24x7x365 enterprise environment.
- Exceptional command-and-control communication skills: able to drive clarity and direction with customers, internal Microsoft stakeholders, and third-party vendors during ambiguity and chaos.
- Deep understanding of cloud architecture patterns, microservices, and containerization.
- Demonstrated ability to make decisions quickly, under pressure, and with limited data—without compromising long-term reliability.
- Familiarity with monitoring and observability tools (e.g., Grafana, Prometheus, Datadog, Splunk, New Relic).
- Ability to contribute to implementing observability frameworks that proactively detect performance bottlenecks.
- Strong knowledge of CI/CD pipelines, container orchestration (Kubernetes, Docker), and infrastructure as code (Terraform, ARM, Bicep).
- Familiarity with AI/ML frameworks and cloud AI services.
- Experience implementing AI-driven monitoring, alerting, and remediation systems.
- Fluency in one or more automation languages (PowerShell, Python, CLI, etc.).
- Understanding of ITIL or other incident management frameworks is a must.
- Understanding of high availability, disaster recovery, business continuity, and performance tuning.
- Demonstrates strategic thinking, quantitative and analytical skills, team leadership, and collaboration.
- Excellent problem resolution, judgment, negotiating, and decision-making skills.
- Desired: strong knowledge of the Windows platform or Linux and of developer tools, and the ability to diagnose and debug user code.
- Effectively manage and prioritize multiple tasks in accordance with high level objectives/projects.
- Excellent communication skills (written and verbal) in English, especially in high-pressure scenarios.
- Ability to communicate with a variety of audiences, including high-profile customers, executive management, and engineering teams.
- Experience with Azure, AWS, or GCP core services and their interdependencies.
- Bachelor’s or master’s degree in Computer Science, Information Technology, or equivalent experience.
About Company
Microsoft Corporation is a leading American multinational technology company founded in 1975 by Bill Gates and Paul Allen. Headquartered in Redmond, Washington, Microsoft is best known for its software products, including the Windows operating system, Microsoft Office Suite, and Azure cloud services. The company also produces hardware like the Surface devices and owns LinkedIn, GitHub, and the Xbox gaming brand. Microsoft is one of the world's most valuable companies, playing a key role in personal computing, enterprise software, AI, and cloud computing.