We are seeking a Site Reliability Engineer (SRE) to support and maintain a 24x7 Azure cloud environment, ensuring high availability, reliability, and performance of infrastructure and hosted services. This role requires the engineer to operate across L1 and L2 support responsibilities, combining proactive monitoring with advanced troubleshooting and root cause analysis. The ideal candidate is an IT generalist with strong networking, system administration, and customer service skills, capable of owning customer issues end-to-end in dynamic and evolving environments. The role demands the ability to troubleshoot complex technical problems and communicate solutions clearly to both technical and non-technical stakeholders.

Azure Cloud Infrastructure Support (L1 & L2)

Provide 24x7 monitoring, support, and maintenance of Azure cloud infrastructure to ensure high availability, performance, security, and reliability.

Perform real-time monitoring and alert response using Azure Monitor, Log Analytics, Application Insights, and third-party monitoring tools.

Manage and support Azure Virtual Machines (Windows and Linux) including provisioning, scaling, start/stop, patching, backup, restore, and performance troubleshooting.

Support Azure networking components including Virtual Networks (VNets), Subnets, Network Security Groups (NSGs), User Defined Routes (UDRs), Load Balancers, Application Gateways, Azure Firewall, VPN Gateways, and ExpressRoute connectivity.

Administer and support Azure Active Directory / Entra ID, including user and group management, role-based access control (RBAC), conditional access, and identity troubleshooting.

Support Azure Storage services (Blob, File, Disk, Queue, Table) including access control, performance tuning, capacity management, and issue resolution.

Provide L1/L2 support for Azure PaaS services such as App Services, Azure SQL, Managed Instances, and Azure Kubernetes Service (AKS), focusing on availability, connectivity, and configuration-related issues.

Perform capacity planning and performance analysis, proactively identifying resource constraints and recommending scaling or optimization actions.

Manage cost monitoring and optimization activities by identifying underutilized resources, supporting right-sizing efforts, and providing usage insights.

System Administration & Remote Support

Manage users in Windows Terminal Server / Remote Desktop Services (RDS) environments, both on-premises and hosted.

Act as a remote administrator for customer Windows servers, including user account management, shared file access, print services, and print queue configuration.

Support Windows print services, server backup, restore, and recovery operations.

Perform routine OS patching, system maintenance, and health checks across Windows and Linux environments.

Networking & Firewall Administration

Support and administer Fortinet / FortiGate firewalls, including policy management, site-to-site and dial-up VPN configuration, VLAN setup, and basic network troubleshooting.

Troubleshoot customer WAN, LAN, and VPN connectivity issues, including routing and firewall policy problems.

Diagnose and resolve local LAN and network-related issues impacting application and infrastructure availability.

Collaborate with internal and customer network teams to ensure secure and reliable connectivity.

Customer Support & Incident Ownership

Remotely troubleshoot and resolve technical issues by phone and through remote tools, taking full ownership from initial contact to resolution.

Communicate complex technical issues and solutions clearly to non-technical or less tech-savvy users.

Provide timely updates to customers during incidents, escalations, and service restoration activities.

Handle after-hours and emergency support requests on a rotational on-call basis.

Perform incident, problem, and change management activities in line with ITIL processes, including triage, escalation, root cause analysis (RCA), and post-incident reviews.

Conduct root cause analysis (RCA) for recurring or critical incidents and implement preventive measures to reduce future outages.

Documentation & Collaboration

Document incidents, resolutions, and troubleshooting steps in the call tracking / ticketing system in accordance with defined documentation standards.

Develop and maintain runbooks, SOPs, and knowledge articles for recurring issues.

Build strong working relationships with customers and internal teams to improve service quality and operational efficiency.

Site Reliability Engineer

Velodata Global Pvt Ltd

Job Description
