Job Summary:

We're looking for a Site Reliability Engineer (SRE) / Application Support Engineer with 2-5 years of experience and strong technical and analytical skills to ensure the reliability, scalability, and performance of our core applications.

This role focuses on improving the stability and efficiency of distributed systems built on Java and microservices architecture, driving operational excellence through monitoring, automation, and incident management.

You'll be part of the team that keeps our business-critical systems healthy by investigating production issues, implementing preventive measures, and collaborating with engineering teams to improve observability and resiliency.

Key Responsibilities:

Application Reliability & Performance

Monitor and maintain the health, performance, and reliability of production applications.
Define, measure, and track SLIs/SLOs for key services, driving improvements proactively.
Identify performance bottlenecks, memory leaks, and slow transactions in Java-based microservices.
Partner with development teams to design and deploy resilient, fault-tolerant systems.
Mentor developers and operations engineers on observability and debugging techniques.

Incident Management & Troubleshooting

Actively participate in incident response by triaging application issues and restoring services quickly.
Perform deep root-cause analysis for recurring incidents and ensure permanent fixes are implemented.
Own the incident lifecycle from detection to resolution and post-incident review.
Ensure observability tools and alert thresholds are tuned to reduce false positives and improve signal quality.

Monitoring & Automation

Enhance visibility across systems through better metrics, logs, and traces using Prometheus, Grafana, and Loki (or similar).
Automate repetitive tasks such as deployments, rollbacks, scaling, and diagnostics.
Build or improve runbooks and self-healing mechanisms to reduce operational toil.
Integrate AIOps capabilities for smarter alert correlation, anomaly detection, and incident prediction.

Operational Ownership

Ensure production systems meet availability and performance targets.
Track open issues, follow up on root-cause actions, and drive closure with responsible teams.
Collaborate with developers, infrastructure, and QA to maintain a consistent and stable release cycle.
Contribute to continuous improvement of deployment, monitoring, and rollback processes.

Collaboration & Communication

Work closely with product and platform engineering to integrate reliability into system design.
Communicate incident status, RCA findings, and reliability metrics to stakeholders.
Foster a reliability-first culture and advocate for operational excellence across teams.

Required Skills:

2-6 years of experience in Site Reliability Engineering or Application Operations.
Solid understanding of Java, Spring Boot, and microservices architecture.
Proficiency in monitoring and observability tools (Prometheus, Grafana, Loki, New Relic, or equivalent).
Familiarity with Kubernetes, containers, and CI/CD pipelines.
Familiarity with incident management, RCA, and performance debugging.
Experience with cloud platforms (AWS, Azure, or GCP).
Strong scripting skills (Bash, Python, or Go) for automation and diagnostics.

Good communication and stakeholder collaboration skills.

Application Support Specialist

Landmark Group
