L3 Storage Specialist
Stealth AI Startup
2 - 5 years
Chennai
Posted: 23/12/2025
Job Description
L3 Storage (VAST) Specialist
Location: Chennai
Employment Type: Full-Time
Experience: Relevant expertise in VAST Data storage platforms
About Us
We are a well-funded stealth AI startup building next-generation AI infrastructure and high-performance data systems. To support our fast-scaling environment, we are looking for experienced L3 Storage (VAST) Specialists to join our core infrastructure team.
Role Overview
As an L3 Storage (VAST) Specialist, you will manage, operate, troubleshoot, and optimize our VAST clusters (CBox/DBox). You will serve as the SME responsible for ensuring reliability, continuity, and performance of mission-critical storage systems powering large-scale AI workloads.
This role requires hands-on experience with VAST Data's architecture, along with strong incident handling, RCA delivery, upgrade execution, and coordination with VAST Support.
Key Responsibilities
- Manage and maintain VAST clusters (CBox/DBox) across multi-node deployments.
- Handle L3-level storage incidents end-to-end, including diagnosis, resolution, and preventive actions.
- Lead storage upgrade planning, scheduling, execution, and rollback readiness.
- Work closely with VAST Support for escalations, bug reviews, and advanced troubleshooting.
- Conduct and publish Root Cause Analyses (RCAs) for critical incidents.
- Perform proactive performance analysis and tuning to support demanding AI/ML workloads.
- Collaborate with internal SRE, Platform, and Infrastructure teams to ensure system stability.
- Maintain detailed documentation of configurations, runbooks, and upgrade paths.
- Contribute to building a dedicated VAST SME pool that ensures long-term continuity in upgrades, RCAs, and performance investigations.
Required Skills & Experience
- Hands-on experience with the VAST Data storage platform, including CBox/DBox.
- Strong understanding of distributed storage systems, NVMe, RDMA, and NFS/SMB/iSCSI protocols.
- Proven expertise in troubleshooting large-scale storage environments.
- Experience coordinating with storage OEM/vendor support teams.
- Ability to work during business hours and, if required, extended hours (no 24/7 shift model).
- Solid understanding of monitoring, capacity planning, and performance engineering.
- Strong analytical and documentation skills.
Nice to Have
- Experience supporting HPC, GPU clusters, or AI workloads.
- Background in SRE, distributed systems, or data center operations.
- Automation/scripting skills (Python, Bash, Ansible, etc.).
