Lead Solution Architect

Yotta Data Services Private Limited

5 - 10 years

Mumbai

Posted: 05/01/2026

Job Description

About Yotta:

Yotta Data Services is powering Digital Transformation with Scalable Cloud, Colocation, and Managed Services.

Yotta Data Services offers a comprehensive suite of cloud, data center, and managed services designed to accelerate digital transformation for businesses of all sizes. With state-of-the-art infrastructure, cutting-edge AI capabilities, and a commitment to data sovereignty, we empower organisations to innovate securely and efficiently.


Total Experience:

10+ years in systems engineering, network engineering, cloud infrastructure, or datacenter design.


Key Responsibilities:

1. AI Systems Architecture (Compute, GPU, OS)

Design and deploy large-scale GPU clusters (H100, H200, GB200 and GB300) for distributed training and inference.

Architect multi-node GPU systems using:

NVLink/NVSwitch

PCIe Gen5

Define OS, kernel, driver, and runtime configurations optimized for AI workloads (CUDA/ROCm, NCCL, UCX, OFED).

Develop high-performance compute blueprints for diverse use cases: training, fine-tuning, retrieval, and batch inference.


2. High-Performance Networking

Architect AI fabric networks including:

InfiniBand HDR/NDR/XDR/SPX

RoCEv2 / RDMA

100/200/400/800 Gbps Ethernet fabrics

Design low-latency, high-bandwidth topologies (fat-tree, dragonfly+, multiplane architectures).

Plan and tune inter-node communication for distributed AI training (NCCL, MPI, UCX).

Implement network segmentation, isolation, and multi-tenant security for AI compute clusters.


3. Storage & Data Pipeline Infrastructure

Architect high-throughput storage solutions for AI:

Parallel file systems (Lustre, BeeGFS, IBM Spectrum Scale)

Cloud-native high-performance storage (Amazon FSx for Lustre, Azure NetApp Files, Google Cloud Filestore High Scale)

NVMe, NVMe-over-Fabrics, object storage.

Optimize data pipelines for large-scale dataset ingestion, feature extraction, checkpointing, and streaming.


4. Platform Integration & Orchestration

Integrate systems with Kubernetes GPU environments (EKS/AKS/GKE, on-premises Kubernetes, Kueue, Volcano).

Design infrastructure to support distributed training frameworks:

PyTorch DDP

DeepSpeed

Ray Train

JAX / TPU alternatives

Enable robust scheduling, multi-tenancy, and job orchestration.


5. Reliability, Monitoring & Performance Optimization

Implement monitoring for GPU utilization, network telemetry, I/O performance, and cluster health (Prometheus, Grafana, DCGM, NetQ).

Conduct performance tuning across:

NIC/driver stack

GPU topology

Storage throughput

Network congestion management (ECN, PFC, QoS)

Design systems for high availability, resilience, and disaster recovery.


6. Security & Compliance (Infra-Level)

Implement hardware-level and network-level security controls: IAM, RBAC, ACLs, segmentation, and encryption in transit.

Architect secure multi-tenant GPU environments, including confidential computing where supported.

Ensure system compliance with SOC 2, ISO 27001, or industry-specific security frameworks.


Good to have skills:

Experience building AI training clusters at a scale of more than 100 GPUs.

Familiarity with AI data engineering systems (Kafka, Spark, Ray Data).

Experience with bare-metal provisioning tools (MAAS, iPXE, Metal).

Knowledge of GPU virtualization, MIG/partitioning, or multi-tenant GPU scheduling.


Qualification Criteria:

10+ years in systems engineering, network engineering, cloud infrastructure, or datacenter design.

Deep hands-on experience with:

GPU systems (NVIDIA)

InfiniBand / RDMA / RoCEv2

High-performance storage solutions

Linux systems tuning

HPC/AI cluster design

Strong networking background (L2/L3 switching, routing, QoS, congestion control, BGP/EVPN).

Familiarity with AI frameworks and distributed training, even if not a data scientist.

Expertise with infrastructure automation:

Terraform

Ansible

Kubernetes manifests / Helm


Interested candidates can share their updated resume at
