School of Health: HPC Compute Cluster - Quick Start Guide
Contents
- Introduction
- How to use the cluster
- SGE / Batch-queuing system
- Running MATLAB scripts
- Job Monitoring
Introduction
The SOH High Performance Computing (HPC) Cluster consists of 12 dedicated servers, each housing dual Intel Xeon processors with 128 GB of RAM. The cluster uses an expandable grid computing architecture that allows future workstations to be added, so researchers can tackle difficult, computationally intensive tasks.
With over 192 CPU cores and 1.5 TB of total RAM, the HPC cluster is an excellent tool for neuroimaging, signal processing, statistics, and big data analytics. The cluster has access to over 100 TB of storage with dedicated backups. Popular scientific packages installed on it include MATLAB, FSL, minc-toolkit, SPM, and FreeSurfer. The HPC cluster is the perfect complement to SOH's platforms, such as the imaging suite and sleep lab, and helps researchers process their data.
How to use the cluster
Connection Steps:
- Ensure that you have an internet connection to the Concordia network (via Ethernet / Wi-Fi), or that you are on Concordia's VPN: https://www.concordia.ca/it/services/vpn.html
- To login to the cluster:
ssh <concordia_netname>@perform-hpc
- Please note that if this is your first time trying to log in, you might receive a permission denied error; if so, wait about 5 minutes and try again.
- After you log in, you might be asked for your password a second time to create a keytab that allows you to access the storage server - please enter it.
- Once that is complete, you can start using the cluster. The login node (perform-hpc) should only be used for uploading and downloading data (see the example below).
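For example, a minimal sketch of uploading data to the login node with scp, run from your own computer (the local and remote paths here are placeholders - substitute your own):
scp -r /path/to/local/data <concordia_netname>@perform-hpc:/path/to/your/project
To download results, reverse the source and destination arguments.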
In order to process data or use scientific packages, you'll need to ssh to a server. You can do this by typing:
goto_execution_node
Usage Steps:
Before you begin working or submit a job, you'll need to load the appropriate paths ("modules") of each program you wish to use. To see what modules are available you can type:
module avail
---------------------------------------- /util/modules/repo -------------------------------------
AFNI/16.2.18 | matlab/R2017a |
AFNI/19.0.21 | matlab/R2017b |
Anaconda/4.2.0_using_python2.7.12 | matlab/R2018b |
Anaconda/4.2.0_using_python3.5.2 | matlab/R2018b_local |
Anaconda/5.0.1_using_python3.6.3 | minc-stuffs/0.1.16_built_with_minc-1.0.08 |
Anaconda/5.2.0_using_python2.7.15 | minc-stuffs/0.1.16_built_with_minc-1.9.11 |
ANTs/2016-09-19 | minc-toolkit/1.0.08 |
ANTs/2.2.0 | minc-toolkit/1.9.11 |
ANTs/2.3.1 | minc-toolkit/1.9.15(default) |
ARPACK/96 | minc-toolkit/1.9.16 |
ARPACK-NG/3.4.0 | mipav/7.3.0(default) |
brainstorm/3 | mipav/7.4.0 |
brain-view2/0.1.1_built_with_minc-1.9.11 | mni_cortical_statistics/0.9.4 |
BrainVISA/4.2.1 | MRIcroGL/2016-10-10 |
BrainVISA/4.4.0 | MRIcron/2016-10-12 |
BrainVISA/4.5.0 | MRtrix3/2018-05-16 |
dcm2niix/1.0.20171017 | NIAK/1.1.4 |
dcm2niix/1.0.20171215 | NIAK/1.1.4_dep |
dcm2niix/1.0.20181125 | octave/4.2.0 |
dicom-toolkit/3.6.1 | OpenCV/3.1.0 |
FID-A/1.0 | oxford_asl/3.9.15 |
FID-A/2017-06-07 | ParaView/5.2.0 |
FID-A/2017-08-02 | pyminc/0.48 |
FreeSurfer/5.3.0 | R/3.1.3 |
FreeSurfer/6.0.0 | R/3.3.1 |
FSL/5.0.10(default) | R/3.5.1 |
FSL/5.0.11 | REST/1.8 |
FSL/5.0.9 | RMINC/1.4.2.1_built_with_R-3.3.1_and_minc_1.0.08 |
Gannet/2.0 | RMINC/1.4.2.1_built_with_R-3.3.1_and_minc_1.9.11 |
ITK/4.9.1 | SCT/3.2.7 |
Mango/4.0.1 | SPM/12_r6685 |
matlab/R2015a | surfstat/2008-09-26 |
matlab/R2016a | VTK/6.2.0 |
matlab/R2016b(default) | VTK/7.1.0 |
matlab/R2016b2 | xjView/9 |
matlab/R2016b_local |
- Load your desired module (note that the command below defaults to the latest version of the package. If you want a specific version, you need to specify it, e.g. module load minc-toolkit/1.0.08 . You should also take note of which version you use, so you can use the same one in a future analysis. Also, don't type the (default) suffix):
module load minc-toolkit
- You can unload modules with:
module unload <module_name>
- You can remove all loaded modules with:
module purge
- If you forget which ones are loaded, you can list the ones you have loaded with:
module list
- If your module requires other modules to function, it will inform you to load them (see the sketch after these examples):
user@perf-hpc01:/util/modules$ module load spm
spm/12_r6685(3):ERROR:151: Module 'spm/12_r6685' depends on one of the module(s) 'matlab/R2016b matlab/R2016a'
spm/12_r6685(3):ERROR:102: Tcl command execution failed: prereq matlab
- Similarly, if a module conflicts with another module, they can't both be loaded at the same time:
user@perf-hpc01:/util/modules$ module load minc-toolkit
user@perf-hpc01:/util/modules$ module load freesurfer
freesurfer/5.3.0(5):ERROR:150: Module 'freesurfer/5.3.0' conflicts with the currently loaded module(s) 'minc-toolkit/1.9.11'
freesurfer/5.3.0(5):ERROR:102: Tcl command execution failed: conflict minc-toolkit
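In the dependency case, load the prerequisite first and then the module that needs it. A minimal sketch, using one of the versions named in the dependency error above:
module load matlab/R2016b
module load spm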
SGE / Batch-queuing system
Utilizing the SOH queue is the best way to quickly process large amounts of data. When you submit a job to the queue, it gets assigned to one of the cluster computers that has the available resources (RAM & CPU) to process it. The compute nodes (perf-hpc01 through perf-hpc12) will process jobs first, followed by workstation nodes.
Additionally, graphics-intensive programs (such as MATLAB's desktop environment / GUI) should rarely be used on the cluster, since you will encounter latency (lag). These programs should really only be used for testing, creating scripts, or getting instant feedback. For maximum performance you should write scripts that do the calculations in the background without having to load all of the graphical libraries. Alternatively, you could use these programs on your own computer to develop a script and then copy it over to the cluster (e.g. with scp, as shown in the connection steps above) to process your data.
Submitting Jobs to the queue
A job can be submitted to the cluster in either of two ways:
- From within a script:
#!/bin/bash
# example of a script: resampleimage.sh
qsub -j y -o logs/mnc_out.txt -V -cwd -q all.q -N mncresample <<END
mincresample /path/to/image/file.mnc -like /path/to/image/template.mnc out.mnc
END
Where:
- -o logs/mnc_out.txt is the output log of the job (make sure that the logs folder exists)
- -q all.q is the queue that you are submitting the job to
- -N mncresample is the name displayed in the queue
- mincresample … is the command(s) you want executed
- -V exports the environment variables (loaded paths & modules)
- -j y merges the error stream and the output stream into the logfile
- -cwd runs the job from the current working directory
Note that you would have to make the script executable afterwards (i.e. chmod 755 resampleimage.sh) and load any modules that the commands rely on before executing the script.
- From the command line:
qsub -j y -o logs/mnc_out.txt -V -cwd -q all.q -N mncresample ./resampleimage2.sh
Where ./resampleimage2.sh is the name of any script that you want to execute. In this example, the resampleimage2.sh script would contain only the mincresample line, and not the qsub and END lines (see the sketch below).
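A minimal sketch of what resampleimage2.sh could look like, reusing the placeholder paths from the script example above:
#!/bin/bash
# resampleimage2.sh - contains only the command(s) to run; qsub is invoked separately on the command line
mincresample /path/to/image/file.mnc -like /path/to/image/template.mnc out.mnc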
Advanced Flags
If your jobs need a certain number of CPU cores or amount of RAM, you can request those resources. Note that the job won't be processed until a machine that can satisfy your request entirely becomes available.
Add the following flags to your qsub command:
| Flag | Description |
| --- | --- |
| -pe smp <num_cores> | reserves <num_cores> CPU cores, where <num_cores> is a number from 1 to 32 |
| -l h_vmem=<num>G | reserves <num> GB of RAM, e.g. -l h_vmem=12G reserves 12 GB |
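For example, a hypothetical variation of the command-line submission above that also reserves 4 cores and 12 GB of RAM (the numbers are illustrative):
qsub -j y -o logs/mnc_out.txt -V -cwd -q all.q -N mncresample -pe smp 4 -l h_vmem=12G ./resampleimage2.sh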
Running MATLAB scripts
If you are going to run a MATLAB script (.m file) with qsub, you need to make sure that you launch MATLAB first so it can read the .m file; otherwise, qsub won't know what to do with your script. Here is a sample command (the log file name and job name are illustrative, and the qsub flags follow the same pattern as option #2 above):
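echo matlab -nodisplay -nosplash -nodesktop -r \"run ./m.m\" | qsub -j y -o logs/matlab_out.txt -V -cwd -q all.q -N matlabjob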
where m.m is the MATLAB script that you are running. Notice that this command is similar to option #2 in "Submitting Jobs to the queue", with the difference that you are piping in the command you want to run (i.e. echo matlab -nodisplay -nosplash -nodesktop -r \"run ./m.m\" | ). It is important that you specify the -nodesktop flag for MATLAB, since the cluster isn't run using MATLAB's graphical desktop. Also, this command is run from the same directory as the m.m file, so if you want it to work from any directory, replace ./ with the path to your .m file.
Monitoring a job on the queue
If you want to see the progress of your job in the queue you can use the following commands:
| Command | Description |
| --- | --- |
| qstat | gives a quick summary of your job status: pending ( qw ), running on a machine ( r ), or in the error state ( eqw ) |
| qstat -f | shows your jobs in relation to each machine on the cluster (a more detailed overview than qstat); you can also see the status of the cluster with it |
| qstat -f -u \* | shows every job running on the cluster (helpful for seeing whether the queue is being used by others and whether you should expect to wait a while) |
Deleting a job on the queue
| Command | Description |
| --- | --- |
| qdel <job-ID> | deletes a specific job |
| qdel -u <username> | deletes all of your jobs in the queue |
Job status / Error checking:
| Command | Description |
| --- | --- |
| qstat -j <job-ID> | checks the status of a job |
This can be useful when the job is in the error queue status ( eqw ).
Reserving a CPU core
If you are working on the cluster but don't want to submit a job to the queue, you can reserve a CPU core with the following command:
qrsh
The qrsh command assigns you a machine at random.
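For example, a minimal sketch of an interactive session: reserve a machine with qrsh, load the default MATLAB module, and start MATLAB without its graphical desktop (the module and flags are the ones described earlier in this guide):
qrsh
module load matlab
matlab -nodisplay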