School of Health: HPC Compute Cluster - Quick Start Guide
Contents
- Introduction
- How to use the cluster
- SGE / Batch-queuing system
- Running MATLAB scripts
- Job Monitoring
Introduction
The SOH High Performance Computing (HPC) Cluster consists of 12 dedicated servers, each housing dual Intel Xeon processors with 128 GB of RAM. The cluster uses an expandable grid computing architecture that allows future workstations to be added, so researchers can tackle difficult, computationally intensive tasks.
With over 192 CPU cores and 1.5 TB of total RAM, the HPC cluster is an excellent tool for neuroimaging, signal processing, statistics, and big data analytics. The cluster has access to over 100 TB of storage with dedicated backups. Popular scientific packages installed on it include MATLAB, FSL, minc-toolkit, SPM, and FreeSurfer. The HPC cluster is the perfect complement to SOH's platforms, such as the imaging suite and sleep lab, and helps researchers process their data.
How to use the cluster
Connection Steps:
- Ensure that you have an internet connection to the Concordia network (via Ethernet / Wi-Fi), or that you are on Concordia's VPN: https://www.concordia.ca/it/services/vpn.html
- To login to the cluster:
ssh <concordia_netname>@perform-hpc
- Please note that if this is your first time trying to log in, you might receive a permission denied error; if so, wait about 5 minutes and try again.
- After you log in, you might be asked for your password a second time to create a keytab that allows you to access the storage server - please enter it.
- Once that is complete, you can start using the cluster. The login node (perform-hpc) should only be used for uploading and downloading data (see the example below).
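For example, a minimal sketch of uploading data to the login node with scp, run from your own computer (the local and remote paths here are placeholders - substitute your own):
scp -r /path/to/local/data <concordia_netname>@perform-hpc:/path/to/your/project
To download results, reverse the source and destination arguments.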
In order to process data or use scientific packages, you'll need to ssh to a server. You can do this by typing:
goto_execution_node
Usage Steps:
Before you begin working or submit a job, you'll need to load the appropriate paths ("modules") of each program you wish to use. To see what modules are available you can type:
module avail
---------------------------------------- /util/modules/repo -------------------------------------
AFNI/16.2.18 | matlab/R2017a |
AFNI/19.0.21 | matlab/R2017b |
Anaconda/4.2.0_using_python2.7.12 | matlab/R2018b |
Anaconda/4.2.0_using_python3.5.2 | matlab/R2018b_local |
Anaconda/5.0.1_using_python3.6.3 | minc-stuffs/0.1.16_built_with_minc-1.0.08 |
Anaconda/5.2.0_using_python2.7.15 | minc-stuffs/0.1.16_built_with_minc-1.9.11 |
ANTs/2016-09-19 | minc-toolkit/1.0.08 |
ANTs/2.2.0 | minc-toolkit/1.9.11 |
ANTs/2.3.1 | minc-toolkit/1.9.15(default) |
ARPACK/96 | minc-toolkit/1.9.16 |
ARPACK-NG/3.4.0 | mipav/7.3.0(default) |
brainstorm/3 | mipav/7.4.0 |
brain-view2/0.1.1_built_with_minc-1.9.11 | mni_cortical_statistics/0.9.4 |
BrainVISA/4.2.1 | MRIcroGL/2016-10-10 |
BrainVISA/4.4.0 | MRIcron/2016-10-12 |
BrainVISA/4.5.0 | MRtrix3/2018-05-16 |
dcm2niix/1.0.20171017 | NIAK/1.1.4 |
dcm2niix/1.0.20171215 | NIAK/1.1.4_dep |
dcm2niix/1.0.20181125 | octave/4.2.0 |
dicom-toolkit/3.6.1 | OpenCV/3.1.0 |
FID-A/1.0 | oxford_asl/3.9.15 |
FID-A/2017-06-07 | ParaView/5.2.0 |
FID-A/2017-08-02 | pyminc/0.48 |
FreeSurfer/5.3.0 | R/3.1.3 |
FreeSurfer/6.0.0 | R/3.3.1 |
FSL/5.0.10(default) | R/3.5.1 |
FSL/5.0.11 | REST/1.8 |
FSL/5.0.9 | RMINC/1.4.2.1_built_with_R-3.3.1_and_minc_1.0.08 |
Gannet/2.0 | RMINC/1.4.2.1_built_with_R-3.3.1_and_minc_1.9.11 |
ITK/4.9.1 | SCT/3.2.7 |
Mango/4.0.1 | SPM/12_r6685 |
matlab/R2015a | surfstat/2008-09-26 |
matlab/R2016a | VTK/6.2.0 |
matlab/R2016b(default) | VTK/7.1.0 |
matlab/R2016b2 | xjView/9 |
matlab/R2016b_local |
- Load your desired module (note that the command below defaults to the latest version of the package. If you want a specific version, you need to specify it, e.g. module load minc-toolkit/1.0.08 . You should also take note of which version you use, so you can use the same one in a future analysis. Also, don't type the (default) suffix):
module load minc-toolkit
- You can unload modules with:
module unload <module_name>
- You can remove all loaded modules with:
module purge
- If you forget which ones are loaded, you can list the ones you have loaded with:
module list
- If your module requires other modules to function, it will inform you to load them (see the sketch after these examples):
user@perf-hpc01:/util/modules$ module load spm
spm/12_r6685(3):ERROR:151: Module 'spm/12_r6685' depends on one of the module(s) 'matlab/R2016b matlab/R2016a'
spm/12_r6685(3):ERROR:102: Tcl command execution failed: prereq matlab
- Similarly, if a module conflicts with another module, they can't both be loaded at the same time:
user@perf-hpc01:/util/modules$ module load minc-toolkit
user@perf-hpc01:/util/modules$ module load freesurfer
freesurfer/5.3.0(5):ERROR:150: Module 'freesurfer/5.3.0' conflicts with the currently loaded module(s) 'minc-toolkit/1.9.11'
freesurfer/5.3.0(5):ERROR:102: Tcl command execution failed: conflict minc-toolkit
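In the dependency case, load the prerequisite first and then the module that needs it. A minimal sketch, using one of the versions named in the dependency error above:
module load matlab/R2016b
module load spm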
SGE / Batch-queuing system
Utilizing the SOH queue is the best way to quickly process large amounts of data. When you submit a job to the queue, it gets assigned to one of the cluster computers that has the available resources (RAM & CPU) to process it. The compute nodes (perf-hpc01 through perf-hpc12) will process jobs first, followed by workstation nodes.
Additionally, graphics-intensive programs (such as MATLAB's desktop environment / GUI) should rarely be used on the cluster, since you will encounter latency (lag). These programs should really only be used for testing, creating scripts, or getting instant feedback. For maximum performance you should write scripts that do the calculations in the background without having to load all of the graphical libraries. Alternatively, you could use these programs on your own computer to develop a script and then copy it over to the cluster (e.g. with scp, as shown in the connection steps above) to process your data.
Submitting Jobs to the queue
A job can be submitted to the cluster in either of two ways:
- From within a script:
#!/bin/bash
# example of a script: resampleimage.sh
qsub -j y -o logs/mnc_out.txt -V -cwd -q all.q -N mncresample <<END
mincresample /path/to/image/file.mnc -like /path/to/image/template.mnc out.mnc
END
Where:
- -o logs/mnc_out.txt is the output log of the job (make sure that the logs folder exists)
- -q all.q is the queue that you are submitting the job to
- -N mncresample is the name displayed in the queue
- mincresample … is the command(s) you want executed
- -V exports the environment variables (loaded paths & modules)
- -j y merges the error stream and the output stream into the logfile
- -cwd runs the job from the current working directory
Note that you would have to make the script executable afterwards (i.e. chmod 755 resampleimage.sh) and load any modules that the commands rely on before executing the script.
- From the command line:
qsub -j y -o logs/mnc_out.txt -V -cwd -q all.q -N mncresample ./resampleimage2.sh
Where ./resampleimage2.sh is the name of any script that you want to execute. In this example, the resampleimage2.sh script would contain only the mincresample line, and not the qsub and END lines (see the sketch below).
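A minimal sketch of what resampleimage2.sh could look like, reusing the placeholder paths from the script example above:
#!/bin/bash
# resampleimage2.sh - contains only the command(s) to run; qsub is invoked separately on the command line
mincresample /path/to/image/file.mnc -like /path/to/image/template.mnc out.mnc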
Advanced Flags
If your jobs need a certain number of CPU cores or amount of RAM, you can request those resources. Note that the job won't be processed until a machine that can satisfy your request entirely becomes available.
Add the following flags to your qsub command:
| Flag | Description |
| --- | --- |
| -pe smp <num_cores> | reserves <num_cores> CPU cores, where <num_cores> is a number from 1 to 32 |
| -l h_vmem=<num>G | reserves <num> GB of RAM, e.g. -l h_vmem=12G reserves 12 GB |
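For example, a hypothetical variation of the command-line submission above that also reserves 4 cores and 12 GB of RAM (the numbers are illustrative):
qsub -j y -o logs/mnc_out.txt -V -cwd -q all.q -N mncresample -pe smp 4 -l h_vmem=12G ./resampleimage2.sh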
Running MATLAB scripts
If you are going to run a MATLAB script (.m file) with qsub, you need to make sure that you launch MATLAB first so it can read the .m file; otherwise, qsub won't know what to do with your script. Here is a sample command (the log file name and job name are illustrative, and the qsub flags follow the same pattern as option #2 above):
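echo matlab -nodisplay -nosplash -nodesktop -r \"run ./m.m\" | qsub -j y -o logs/matlab_out.txt -V -cwd -q all.q -N matlabjob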
where m.m is the MATLAB script that you are running. Notice that this command is similar to option #2 in "Submitting Jobs to the queue", with the difference that you are piping in the command you want to run (i.e. echo matlab -nodisplay -nosplash -nodesktop -r \"run ./m.m\" | ). It is important that you specify the -nodesktop flag for MATLAB, since the cluster isn't run using MATLAB's graphical desktop. Also, this command is run from the same directory as the m.m file, so if you want it to work from any directory, replace ./ with the path to your .m file.
Monitoring a job on the queue
If you want to see the progress of your job in the queue you can use the following commands:
| Command | Description |
| --- | --- |
| qstat | gives a quick summary of your job status: pending ( qw ), running on a machine ( r ), or in the error state ( eqw ) |
| qstat -f | shows your jobs in relation to each machine on the cluster (a more detailed overview than qstat); you can also see the status of the cluster with it |
| qstat -f -u \* | shows every job running on the cluster (helpful for seeing whether the queue is being used by others and whether you should expect to wait a while) |
Deleting a job on the queue
| Command | Description |
| --- | --- |
| qdel <job-ID> | deletes a specific job |
| qdel -u <username> | deletes all of your jobs in the queue |
Job status / Error checking:
| Command | Description |
| --- | --- |
| qstat -j <job-ID> | checks the status of a job |
This can be useful when the job is in the error queue status ( eqw ).
Reserving a CPU core
If you are working on the cluster but don't want to submit a job to the queue, you can reserve a CPU core with the following command:
qrsh
The qrsh command assigns you a machine at random.
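For example, a minimal sketch of an interactive session: reserve a machine with qrsh, load the default MATLAB module, and start MATLAB without its graphical desktop (the module and flags are the ones described earlier in this guide):
qrsh
module load matlab
matlab -nodisplay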