Lead Solution Architect
Yotta Data Services Private Limited
5 - 10 years
Mumbai
Posted: 05/01/2026
Job Description
About Yotta:
Yotta Data Services is powering Digital Transformation with Scalable Cloud, Colocation, and Managed Services.
Yotta Data Services offers a comprehensive suite of cloud, data center, and managed services designed to accelerate digital transformation for businesses of all sizes. With state-of-the-art infrastructure, cutting-edge AI capabilities, and a commitment to data sovereignty, we empower organisations to innovate securely and efficiently.
Total Experience:
10+ years in systems engineering, network engineering, cloud infrastructure, or datacenter design.
Key Responsibilities:
1. AI Systems Architecture (Compute, GPU, OS)
Design and deploy large-scale GPU clusters (H100, H200, GB200 and GB300) for distributed training and inference.
Architect multi-node GPU systems using:
NVLink/NVSwitch
PCIe Gen5
Define OS, kernel, driver, and runtime configurations optimized for AI workloads (CUDA/ROCm, NCCL, UCX, OFED).
Develop high-performance compute blueprints for diverse use cases: training, fine-tuning, retrieval, and batch inference.
2. High-Performance Networking
Architect AI fabric networks including:
InfiniBand HDR/NDR/XDR/SPX
RoCEv2 / RDMA
100/200/400/800 Gbps Ethernet fabrics
Design low-latency, high-bandwidth topologies (fat-tree, dragonfly+, multiplane architectures).
Plan and tune inter-node communication for distributed AI training (NCCL, MPI, UCX).
Implement network segmentation, isolation, and multi-tenant security for AI compute clusters.
3. Storage & Data Pipeline Infrastructure
Architect high-throughput storage solutions for AI:
Parallel file systems (Lustre, BeeGFS, IBM Spectrum Scale)
Cloud-native high-performance storage (FSx for Lustre, Azure NetApp Files, Google Cloud Filestore High Scale)
NVMe, NVMe-over-Fabrics, object storage.
Optimize data pipelines for large-scale dataset ingestion, feature extraction, checkpointing, and streaming.
4. Platform Integration & Orchestration
Integrate systems with Kubernetes GPU environments (EKS/AKS/GKE, on-premises Kubernetes, Kueue, Volcano).
Design infrastructure to support distributed training frameworks:
PyTorch DDP
DeepSpeed
Ray Train
JAX / TPU alternatives
Enable robust scheduling, multi-tenancy, and job orchestration.
5. Reliability, Monitoring & Performance Optimization
Implement monitoring for GPU utilization, network telemetry, I/O performance, and cluster health (Prometheus, Grafana, DCGM, NetQ).
Conduct performance tuning across:
NIC/driver stack
GPU topology
Storage throughput
Network congestion management (ECN, PFC, QoS)
Design systems for high availability, resilience, and disaster recovery.
6. Security & Compliance (Infra-Level)
Implement hardware-level and network-level security controls: IAM, RBAC, ACLs, segmentation, encryption in transit.
Architect secure multi-tenant GPU environments, including confidential computing where supported.
Ensure system compliance with SOC2, ISO 27001, or industry-specific security frameworks.
Good to have skills:
Experience building AI training clusters at a scale of more than 100 GPUs.
Familiarity with AI data engineering systems (Kafka, Spark, Ray Data).
Experience with bare-metal provisioning tools (MAAS, iPXE, Metal).
Knowledge of GPU virtualization, MIG/partitioning, or multi-tenant GPU scheduling.
Qualification Criteria:
10+ years in systems engineering, network engineering, cloud infrastructure, or datacenter design.
Deep hands-on experience with:
GPU systems (NVIDIA)
InfiniBand / RDMA / RoCEv2
High-performance storage solutions
Linux systems tuning
HPC/AI cluster design
Strong networking background (L2/L3 switching, routing, QoS, congestion control, BGP/EVPN).
Familiarity with AI frameworks and distributed training; a data science background is not required.
Expertise with infrastructure automation:
Terraform
Ansible
Kubernetes manifests / Helm
Interested candidates can share their updated resume at
