Cluster usage¶
Software stack overview¶
HAICGU uses a software stack based on EasyBuild. The recipes for building software (EasyConfigs) can be found here.
The built software is made accessible to users as modules with LMod.
Access to compute nodes is provided using the SLURM workload manager.
Using modules¶
When on the dev node (guoehi-dev), list the available modules by typing:
module avail
You will see the following output:
$ module avail
[...]
--------------------- Compilers in Stage 2022a ----------------------
BiSheng-compiler/2.3.0 GCC/9.5.0 GCC/12.1.0 (D) armlinux/22.0.1
-------------------- Core modules in Stage 2022a --------------------
EasyBuild/4.5.5 alplompi/22.0.1 gompi/2022a.12 (D) tmux/3.3a
Java/8.292.10 armlinux-install/22.0.1 goolf/2022a.9 zsh/5.8.1
Java/11.0.15 (D) flex/2.6.4 goolf/2022a.12 (D)
alompi/22.0.1 gompi/2022a.9 help2man/1.49.2
--------------------------- Architectures ---------------------------
Architecture/Kunpeng920 (S) Architecture/somearch (S,D)
-------------------------- Custom modules ---------------------------
arm-optimized-routines/21.02 (L)
Where:
D: Default Module
L: Module is loaded
S: Module is Sticky, requires --force to unload or purge
[...]
You can load modules with module load ModuleName.
The modules are organized hierarchically - after loading a compiler, more modules will become available:
$ module load GCC/12.1.0
$ module avail
[...]
--------------- MPI runtimes available for GCC 12.1.0 ---------------
OpenMPI/4.1.3
----------------- Modules compiled with GCC 12.1.0 ------------------
Autotools/20220509 absl-py/1.0.0-Python-3.10.4
Bazel/4.2.2 c-ares/1.18.1
Bazel/5.1.1 (D) cURL/7.83.0
BazelWIT/0.26.1 dm-tree/0.1.7-Python-3.10.4
CMake/3.23.1 double-conversion/3.2.0
Eigen/3.4.0 flatbuffers-python/2.0-Python-3.10.4
GMP/6.2.1 flatbuffers/2.0.0
JsonCpp/1.9.5 flex/2.6.4 (D)
Meson/0.62.1-Python-3.10.4 giflib/5.2.1
Ninja/1.10.2 git/2.36.1
OpenBLAS/0.3.20 help2man/1.49.2 (D)
Perl/5.34.1 hwloc/2.7.1
Pillow/9.1.1-Python-3.10.4 libffi/3.4.2
PostgreSQL/14.2 libyaml/0.2.5
PyYAML/6.0-Python-3.10.4 lz4/1.9.3
Python/3.10.4 nghttp2/1.47.0
Rust/1.60.0 nsync/1.24.0
Tcl/8.6.12 numactl/2.0.14
UCX/1.12.1 protobuf-python/3.20.1-Python-3.10.4
X11/20220509 ray-deps/1.12.0-Python-3.10.4
Zip/3.0 unzip/6.0
abseil-cpp/20210324.1
[...]
After loading an MPI runtime (currently only OpenMPI is available), the rest of the modules become visible:
[...]
---------- Modules built with GCC 12.1.0 and OpenMPI 4.1.3 ----------
Arrow/7.0.0-Python-3.10.4 SciPy-Stack/2022a-Python-3.10.4
Boost/1.79.0-Python-3.10.4 bokeh/2.4.2-Python-3.10.4
FFTW/3.3.10 dask/2022.5.0-Python-3.10.4
HDF5/1.12.2 h5py/3.6.0-Python-3.10.4
ScaLAPACK/2.2.0-OpenBLAS-0.3.20 ray-project/1.12.0-Python-3.10.4
[...]
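To inspect or clean up your module environment, the standard Lmod commands apply, for example:
$ module list                 # show currently loaded modules
$ module unload OpenMPI       # unload a single module
$ module purge                # unload all non-sticky modules (sticky ones require --force)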
Note
There are multiple software stages available (2021a, 2022a), but only the current stage is supported (currently 2022a). You can load a different stage with
. /software/switch_stage.sh -s <stage>
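For example, to switch to the 2021a stage and refresh the module list (a sketch; sourcing the script modifies your current shell environment):
. /software/switch_stage.sh -s 2021a
module avail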
AI software stack¶
The AI software stack has been partly integrated into the EasyBuild software stack and is available with GCC 9.5.0. Load:
module load GCC/9.5.0 OpenMPI CANN-Toolkit
This will set the necessary environment variables to use the CANN toolkit (AscendCL, …).
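To check that the toolkit environment is in place, you can inspect the variables the module exports (a quick sketch; the exact variable names depend on the CANN installation):
$ module load GCC/9.5.0 OpenMPI CANN-Toolkit
$ env | grep -i ascend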
You can then load NPU-accelerated AI frameworks.
For TensorFlow 1.15.0 please load:
module load TensorFlow-CANN/1.15.0
For TensorFlow 2.4.1 please load:
module load TensorFlow-CANN/2.4.1
For PyTorch 1.5.0 please load:
module load PyTorch-CANN/1.5.0
Warning
Loading multiple frameworks or framework versions at the same time can lead to issues. Please make sure to unload one framework with module unload <framework module> before loading another.
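For example, to switch from TensorFlow 2.4.1 to PyTorch:
module unload TensorFlow-CANN/2.4.1
module load PyTorch-CANN/1.5.0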
Using SLURM¶
In order to run your application on the actual compute nodes, you will need to submit jobs using SLURM.
List information about the available partitions and nodes with sinfo:
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
cn-ib* up infinite 10 idle cn[09-18]
cn-eth up infinite 10 idle cn[19-28]
cn-kube up infinite 10 idle cn[1-8]
a800-9000 up infinite 1 idle ml01
a800-3000 up infinite 1 idle ml02
As you can see, there are currently 5 partitions available:
cn-ib, currently consisting of 10 standard compute nodes cn[09-18], which use InfiniBand for networking
cn-eth, currently consisting of 10 standard compute nodes cn[19-28], which use Ethernet (RoCE) for networking
cn-kube, currently consisting of 8 standard compute nodes cn[1-8] reserved for Kubernetes; do not use it for batch jobs
a800-9000, currently consisting of 1 Atlas 800 Training Server (Model: 9000) node, ml01
a800-3000, currently consisting of 1 Atlas 800 Inference Server (Model: 3000) node, ml02
You can submit jobs using either the srun or sbatch command.
srun is used to run commands directly:
$ srun -p arm-kunpeng920 hostname
cn01.guoehi.cluster
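srun accepts the usual SLURM resource options, so an interactive multi-node run could look like this (a sketch assuming the cn-eth partition has free nodes):
$ srun -p cn-eth -N 2 --ntasks-per-node=1 hostname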
sbatch is used to run batch scripts:
$ cat <<EOF > batchscript.sh
> #!/bin/bash
> #SBATCH --partition=a800-9000
> #SBATCH --time=00:01:00
> #SBATCH --ntasks=1
> #SBATCH --nodes=1
> npu-smi info
> EOF
$ sbatch batchscript.sh
Submitted batch job 595
$ cat slurm-595.out
+------------------------------------------------------------------------------------+
| npu-smi 1.8.21 Version: 20.2.2.spc001 |
+----------------------+---------------+---------------------------------------------+
| NPU Name | Health | Power(W) Temp(C) |
| Chip | Bus-Id | AICore(%) Memory-Usage(MB) HBM-Usage(MB) |
+======================+===============+=============================================+
| 0 910A | OK | 68.6 36 |
| 0 | 0000:C1:00.0 | 0 591 / 14795 0 / 32768 |
+======================+===============+=============================================+
| 1 910A | OK | 63.7 31 |
| 0 | 0000:81:00.0 | 0 303 / 15177 0 / 32768 |
+======================+===============+=============================================+
| 2 910A | OK | 66.1 31 |
| 0 | 0000:41:00.0 | 0 1821 / 15177 0 / 32768 |
+======================+===============+=============================================+
| 3 910A | OK | 65.7 37 |
| 0 | 0000:01:00.0 | 0 3168 / 15088 0 / 32768 |
+======================+===============+=============================================+
| 4 910A | OK | 66.7 35 |
| 0 | 0000:C2:00.0 | 0 295 / 14795 0 / 32768 |
+======================+===============+=============================================+
| 5 910A | OK | 63.7 29 |
| 0 | 0000:82:00.0 | 0 455 / 15177 0 / 32768 |
+======================+===============+=============================================+
| 6 910A | OK | 66.1 29 |
| 0 | 0000:42:00.0 | 0 1517 / 15177 0 / 32768 |
+======================+===============+=============================================+
| 7 910A | OK | 65.1 36 |
| 0 | 0000:02:00.0 | 0 3319 / 15088 0 / 32768 |
+======================+===============+=============================================+
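By default, the output of a batch job is written to slurm-<jobid>.out in the submission directory, as shown above; it can be redirected with the --output (and --error) options in the batch script, for example:
#SBATCH --output=myjob-%j.out
#SBATCH --error=myjob-%j.err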
You can view the queued jobs by calling squeue:
$ cat <<EOF > batchscript.sh
> #!/bin/bash
> #SBATCH --partition=a800-9000
> #SBATCH --time=00:01:00
> #SBATCH --ntasks=1
> #SBATCH --nodes=1
> echo waiting
> sleep 5
> echo finished waiting
> EOF
$ sbatch batchscript.sh
Submitted batch job 597
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
597 a800-9000 batchscr snassyr R 0:01 1 ml01
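A queued or running job can be cancelled with scancel and its job ID, e.g.:
$ scancel 597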
For more information on how to use SLURM, please read the documentation.
Other software¶
ArmIE¶
To make the ArmIE module available, please use:
$ module use /software/tools/armie-22.0/modulefiles
You can then load the module with:
$ module load armie
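ArmIE is typically invoked as a wrapper around the application to be emulated; a minimal sketch (the SVE vector length and the binary name ./my_app are example values, see armie --help for the options available in this version):
$ armie -msve-vector-bits=512 -- ./my_app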