Site Reliability Engineer

Veca Consulting Pvt Ltd

2 - 5 years

Bengaluru

Posted: 10/12/2025

Job Description

Role Name: SRE & DevOps Engineer (Ray.io)

Experience: 3+ years

Location: Bangalore (no relocation)

Notice Period: 20-30 days (candidates currently serving their notice period)

Mode: Hybrid

Type: Full-time

Job Functions:

You will be a member of our AI Platform Team, supporting the next generation AI architecture for various research and engineering teams within the organization.

You'll partner with vendors and the infrastructure engineering team on security and service availability.

You'll work with engineering teams, researchers, and data scientists to fix production issues, including performance and functional issues.

Diagnose and solve customers' technical problems.

Participate in training customers and prepare reports on customer issues.

Be responsible for customer service improvements and recommend product improvements.

Write support documentation.

You'll design and implement zero-downtime operations and monitoring to deliver a highly available service (99.999% availability).

As a support engineer, find opportunities to automate as part of the problem management process, and build automation that prevents issues.

Define engineering excellence for operational maturity

You'll work with AI platform developers to provide a CI/CD model that deploys and configures the production system automatically.

Develop and follow standard operating processes for tools and automation development, including style guides, versioning practices, source control, branching and merging patterns, and advising other engineers on development standards.

Deliver solutions that accelerate the activities that phenomenal engineers would perform, through automation, deep domain expertise, and knowledge sharing.

Required Skills:

Demonstrated ability in designing, building, refactoring, and releasing software written in Python and C++.

Hands-on experience with Ray.io, including workload management, cluster deployment, distributed task scheduling, and troubleshooting.

Ability to use Ray Dashboard and CLI tools for monitoring, resource tracking, debugging distributed jobs, and resolving production issues.

Knowledge of Ray ecosystem libraries such as Ray Train, Ray Tune, Ray Serve, and Ray Data is a big plus.

Experience integrating Ray with tools such as Airflow, MLflow, Dask, and DeepSpeed is a big plus.

Debugging and triaging skills.

Cloud technologies such as Kubernetes and Docker, plus Linux fundamentals.

Familiarity with DevOps practices and continuous testing.

DevOps pipelines and automation: application deployment, configuration, and performance monitoring.

Test automation and Jenkins CI/CD.

Excellent communication, presentation, and leadership skills for collaborating with partners, customers, and engineering teams.

Well organized and able to manage multiple projects in a fast-paced and demanding environment.

Good English speaking, reading, and writing skills.
