HPC Engineer
HCLTech
2 - 5 years
Noida
Posted: 05/02/2026
Job Description
Position Overview (Job Summary):
- The role is for an HPC Engineer responsible for designing, deploying, managing, and optimizing an on-premises High Performance Computing (HPC) environment.
- The environment includes SLURM-managed CPU and GPU clusters.
- Strong emphasis on HPC architecture, Linux administration, job scheduling, and cluster operations.
- Experience with parallel/distributed storage (WekaFS, Scality) is preferred but not required.
Primary Skills:
- HPC Operations & Cluster Management (CPU & GPU)
- SLURM Workload Manager (Mandatory): install, configure, and manage SLURM across multiple clusters
- Partitions/queues, fairshare, job priority, scheduling policies
- Upgrades, migrations, automation via API/integrations
- Linux System Administration (RHEL focus): OS patching, hardening, tuning, package management
- Troubleshooting & Performance Optimization: cluster health, node/job failures, bottlenecks, utilization optimization
- Parallel Computing Knowledge: MPI, OpenMP, distributed execution fundamentals
Secondary Skills (Preferred / Optional):
- Storage / Parallel File Systems: WekaFS (preferred, optional)
- Scality RING / ARTESCA (preferred, optional)
- GPU Computing Exposure: NVIDIA drivers, CUDA familiarity, GPU scheduling concepts
- Monitoring Tools: Grafana, Prometheus
- Automation / Scripting: Bash/Python for workflows, tooling, ops automation
- HPC Ecosystem Components: InfiniBand/100G networking, monitoring tools, storage tiering concepts
- SLURM-based HPC clusters
- Linux (RHEL) administration
- Multi-node distributed systems
- (Optional) Storage platforms like WekaFS / Scality
Role and Responsibilities:
A. Key Responsibilities
1) HPC Infrastructure & Operations
- Manage day-to-day operations of on-prem CPU & GPU clusters
- Monitor health, performance, and utilization; ensure availability & efficiency
- Implement best practices for:
- HPC operations
- user management
- resource administration
- Troubleshoot:
- networking issues
- node failures
- job failures
- performance bottlenecks
- User support:
- job submissions
- resource usage
- HPC workflows
2) SLURM Workload Manager (Mandatory)
- Configure/install/manage SLURM across multiple clusters
- Manage:
- queues
- partitions
- node allocation policies
- fair share policies
- job prioritization
- Handle:
- SLURM upgrades
- migrations
- maintenance activities
- Work with SLURM APIs/integrations for:
- automation
- custom workflows
- Optimize scheduling for mixed CPU/GPU workloads
3) Linux System Administration
- Administer:
- compute nodes
- head nodes
- admin servers
- Perform:
- OS updates
- package installs
- security patching
- system tuning
- Automate via:
- scripting (Bash/Python)
4) Parallel Computing & Cluster Architecture
- Understand and support workloads using:
- MPI
- OpenMP
- distributed execution
- Work with HPC building blocks:
- high-speed interconnects (InfiniBand/100G)
- storage tiers
- resource managers
- monitoring tools
- Diagnose and resolve:
- parallel workload performance issues
B. Additional Responsibilities (Optional / Preferred Area)
5) Storage (Optional but Preferred)
A. WEKA (WekaFS)
- Manage/tune parallel file system performance
- Troubleshoot WekaFS issues with minimal downtime
- Provide internal guidance and usage best practices
- Track ecosystem improvements & recommend enhancements
B. Scality
- Maintain and troubleshoot:
- Scality RING
- ARTESCA environments
- Monitor/tune for high availability & reliability
- Create documentation (configuration + SOPs)
- Recommend performance improvements based on product enhancements
