Dear all,

We are looking for a GPU Infrastructure Specialist to manage and optimize GPU-based environments for model hosting and high-performance computing workloads. The ideal candidate will have hands-on experience with the NVIDIA, AMD, and SambaNova GPU ecosystems, and a strong background in resource management, performance tuning, and observability within large-scale AI/ML environments.

Responsibilities

Manage, configure, and maintain GPU infrastructure across on-premise and cloud environments.
Handle GPU resource allocation, scheduling, and orchestration for AI/ML workloads.
Oversee driver updates, operator management, and compatibility across multiple GPU vendors (NVIDIA, AMD, SambaNova).
Implement GPU tuning and optimization strategies to ensure efficient model inference and training.
Monitor GPU utilization, latency, and system health using observability and alerting tools (e.g., Prometheus, Grafana, NVIDIA DCGM).
Collaborate with AI engineers, DevOps, and MLOps teams to ensure seamless model deployment and hosting across GPU clusters.
Develop automation scripts and workflows for GPU provisioning, scaling, and lifecycle management.
Troubleshoot GPU performance issues, memory bottlenecks, and hardware-level anomalies.

Qualifications

Strong experience managing GPU infrastructure (NVIDIA, AMD, SambaNova).
Proficiency in resource scheduling and orchestration (Kubernetes, Slurm, Ray, or similar).
Knowledge of driver and operator management in multi-vendor environments.
Experience with GPU tuning, profiling, and performance benchmarking.
Familiarity with observability and alerting tools (Prometheus, Grafana, ELK Stack, etc.).
Hands-on experience with model hosting platforms (Triton Inference Server, TensorRT, ONNX Runtime, etc.) is a plus.
Working knowledge of Linux systems, Docker/Kubernetes, and CI/CD pipelines.
Strong scripting skills in Python, Bash, or Go.

Preferred Skills

Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
Certifications in GPU computing (e.g., NVIDIA Certified Administrator or CUDA).
Experience with AI/ML model lifecycle management in production environments.

GPU Infrastructure Specialist

Watsonite
