Login Sign Up

HPC Admin

Tata Consultancy Services

2 - 5 years

Hyderabad

Posted: 23/12/2025

Getting a referral is 5x more effective than applying directly

Job Description

Role - Sr. HPC Administrator

Years of Experience - 7 to 12 years

Location - Hyderabad


  • Strong experience in providing support for Linux HPC clusters.
  • Strong working knowledge on Following: o IBM Platform LSF 9 and 10 administration. o Redhat Enterprise Linux Administration. o Lustre Parallel File system. o Mellanox Infiniband Connectivity. o Cluster Manager Administration (HPCM or xCAT) o SSSD & NIS Authentication mechanisms. o Bash & Python scripting. o Ansible playbooks.
  • Experience of Abaqus, and CFD application (Fluent and StarCCM..etc.,)
  • Strong knowledge of application installations and version management on shared file systems.
  • IT infrastructure Technical Operation Management under ITIL framework
  • Security compliance and remediation management.
  • DevOps, ITIL, Agile, Safe (certifications are desirable)
  • Installation, configuration, troubleshooting and administration of Linux HPC clusters (compute, storage, and network) and applications in support of CAE environments.
  • Monitor and analyze LSF job queues and resource utilization to optimize workload management.
  • Troubleshoot and resolve any issues with LSF and its components, including master servers, compute nodes, and resource managers.
  • Collaborate with users to understand their HPC requirements and design LSF job workflows to meet their needs.
  • Develop and maintain LSF documentation, including standard operating procedures, installation guides, and troubleshooting procedures.
  • Develop and maintain LSF scripts for automation and task scheduling.
  • Diagnose and troubleshoot complex RHEL OS, application and HPC cluster technical problems.
  • Interact with hardware and software vendors for external support.
  • Develop and maintain technical solution documents (TSD) and standard operating procedures(SOP).
  • Keep all HPC infrastructure systems/servers/devices up to date and working condition to enhance business continuity.
  • Design and implement HPC network topology, including Mellanox connectivity.
  • Create and maintain HPC capacity planning and periodical cluster utilization reports.
  • Troubleshoot Abaqus, StarCCM+ and Fluent applications, and resolve any issues in a timely manner.
  • Develop and maintain scripts for automation and task scheduling using Python and Bash scripting.

Services you might be interested in

We Search & Apply Jobs for You!

Our team scans through 1000s of opportunities and applies to roles best suited to your profile

Save 100+ hours and focus on what matters - cracking interviews and landing offers.