What You'll Do:

Ensure reliability and high availability of Java and microservices-based applications through proactive monitoring and automation.

Define and track SLIs/SLOs to maintain service performance and stability.

Troubleshoot and resolve production issues, performing detailed root cause analysis to prevent recurrence.

Build and enhance observability using Prometheus, Grafana, Loki, or New Relic.

Automate operational tasks: deployments, scaling, rollbacks, diagnostics, and alerting.

Collaborate with engineering and DevOps teams to integrate reliability practices into the CI/CD pipeline.

Drive AIOps initiatives for intelligent alert correlation and predictive incident management.

Mentor teams on best practices in monitoring, performance optimization, and operational efficiency.

What We're Looking For:

3-6 years of experience in Site Reliability Engineering, Application Operations, or DevOps.

Strong hands-on experience with Java, Spring Boot, and microservices architecture.

Proficiency in monitoring tools (Prometheus, Grafana, Loki, New Relic, or similar).

Experience with Kubernetes, containers, and cloud platforms (AWS, Azure, or GCP).

Strong scripting skills in Bash, Python, or Go for automation and diagnostics.

Familiarity with incident management, RCA, and performance debugging.

Exposure to AIOps tools or AI/LLM-based observability platforms is a plus.

Excellent problem-solving and communication skills.

Site Reliability Engineer