Application Support Specialist
Landmark Group
2 - 5 years
Bengaluru
Posted: 17/02/2026
Job Description
Job Summary:
We're looking for a Site Reliability Engineer (SRE) / Application Support Engineer with 2-5 years of experience and strong technical and analytical skills to ensure the reliability, scalability, and performance of our core applications.
This role focuses on improving the stability and efficiency of distributed systems built on Java and microservices architecture, driving operational excellence through monitoring, automation, and incident management.
You'll be part of the team that keeps our business-critical systems healthy by investigating production issues, implementing preventive measures, and collaborating with engineering teams to improve observability and resiliency.
Key Responsibilities:
Application Reliability & Performance
- Monitor and maintain the health, performance, and reliability of production applications.
- Define, measure, and track SLIs/SLOs for key services, driving improvements proactively.
- Identify performance bottlenecks, memory leaks, and slow transactions in Java-based microservices.
- Partner with development teams to design and deploy resilient, fault-tolerant systems.
- Mentor developers and operations engineers on observability and debugging techniques.
Incident Management & Troubleshooting
- Actively participate in incident response, triaging application issues and restoring services quickly.
- Perform deep root-cause analysis for recurring incidents and ensure permanent fixes are implemented.
- Own the incident lifecycle from detection to resolution and post-incident review.
- Ensure observability tools and alert thresholds are tuned to reduce false positives and improve signal quality.
Monitoring & Automation
- Enhance visibility across systems through better metrics, logs, and traces using Prometheus, Grafana, and Loki (or similar).
- Automate repetitive tasks such as deployments, rollbacks, scaling, and diagnostics.
- Build or improve runbooks and self-healing mechanisms to reduce operational toil.
- Integrate AIOps capabilities for smarter alert correlation, anomaly detection, and incident prediction.
Operational Ownership
- Ensure production systems meet availability and performance targets.
- Track open issues, follow up on root cause actions, and drive closure with responsible teams.
- Collaborate with developers, infrastructure, and QA to maintain a consistent and stable release cycle.
- Contribute to continuous improvement of deployment, monitoring, and rollback processes.
Collaboration & Communication
- Work closely with product and platform engineering to integrate reliability into system design.
- Communicate incident status, RCA findings, and reliability metrics to stakeholders.
- Foster a reliability-first culture and advocate for operational excellence across teams.
Required Skills:
- 2-6 years of experience in Site Reliability Engineering or Application Operations.
- Solid understanding of Java, Spring Boot, and microservices architecture.
- Proficiency in monitoring and observability tools (Prometheus, Grafana, Loki, New Relic, or equivalent).
- Familiarity with Kubernetes, containers, and CI/CD pipelines.
- Familiarity with incident management, RCA, and performance debugging.
- Experience with cloud platforms (AWS, Azure, or GCP).
- Strong scripting skills (Bash, Python, or Go) for automation and diagnostics.
- Good communication and stakeholder collaboration skills.
