HPC Engineer

HCLTech

2 - 5 years

Noida

Posted: 05/02/2026

Job Description

Position Overview (Job Summary):

  • The role is for an HPC Engineer responsible for designing, deploying, managing, and optimizing an on-premises High Performance Computing (HPC) environment.
  • The environment includes SLURM-managed CPU and GPU clusters.
  • The role places strong emphasis on HPC architecture, Linux administration, job scheduling, and cluster operations.
  • Experience with parallel/distributed storage (WekaFS, Scality) is preferred but not required.

Primary Skills:

  1. HPC Operations & Cluster Management (CPU & GPU)
  • SLURM Workload Manager (Mandatory): install, configure, and manage SLURM across multiple clusters
  • Partitions/queues, fairshare, job priority, scheduling policies
  • Upgrades, migrations, automation via API/integrations
  • Linux System Administration (RHEL focus): OS patching, hardening, tuning, package management
  • Troubleshooting & Performance Optimization: cluster health, node/job failures, bottlenecks, utilization optimization
  • Parallel Computing Knowledge: MPI, OpenMP, distributed execution fundamentals

Secondary Skills (Preferred / Optional):

  • Storage / Parallel File Systems: WekaFS (preferred, optional)
  • Scality RING / ARTESCA (preferred, optional)
  • GPU Computing Exposure: NVIDIA drivers, CUDA familiarity, GPU scheduling concepts
  • Monitoring Tools: Grafana, Prometheus
  • Automation / Scripting: Bash/Python for workflows, tooling, ops automation
  • HPC Ecosystem Components: InfiniBand/100G networking, monitoring tools, storage tiering concepts
  • SLURM-based HPC clusters
  • Linux (RHEL) administration
  • Multi-node distributed systems
  • (Optional) Storage platforms like WekaFS / Scality

Role and Responsibilities:

A. Key Responsibilities

1) HPC Infrastructure & Operations

  • Manage day-to-day operations of on-prem CPU & GPU clusters
  • Monitor health, performance, and utilization; ensure availability & efficiency
  • Implement best practices for:
      • HPC operations
      • user management
      • resource administration
  • Troubleshoot:
      • networking issues
      • node failures
      • job failures
      • performance bottlenecks
  • User support:
      • job submissions
      • resource usage
      • HPC workflows

2) SLURM Workload Manager (Mandatory)

  • Configure/install/manage SLURM across multiple clusters
  • Manage:
      • queues
      • partitions
      • node allocation policies
      • fair share policies
      • job prioritization
  • Handle:
      • SLURM upgrades
      • migrations
      • maintenance activities
  • Work with SLURM APIs/integrations for:
      • automation
      • custom workflows
  • Optimize scheduling for mixed CPU/GPU workloads

3) Linux System Administration

  • Administer:
      • compute nodes
      • head nodes
      • admin servers
  • Perform:
      • OS updates
      • package installs
      • security patching
      • system tuning
  • Automate via:
      • scripting (Bash/Python)

4) Parallel Computing & Cluster Architecture

  • Understand and support workloads using:
      • MPI
      • OpenMP
      • distributed execution
  • Work with HPC building blocks:
      • high-speed interconnects (InfiniBand/100G)
      • storage tiers
      • resource managers
      • monitoring tools
  • Diagnose and resolve:
      • parallel workload performance issues

B. Additional Responsibilities (Optional / Preferred Area)

5) Storage (Optional but Preferred)

A. WEKA (WekaFS)

  • Manage/tune parallel file system performance
  • Troubleshoot WekaFS issues with minimal downtime
  • Provide internal guidance and usage best practices
  • Track ecosystem improvements & recommend enhancements

B. Scality

  • Maintain and troubleshoot:
      • Scality RING
      • ARTESCA environments
  • Monitor/tune for high availability & reliability
  • Create documentation (configuration + SOPs)
  • Recommend performance improvements based on product enhancements
