HPC Challenge Benchmark
1 - Linpack
The Linpack benchmark (HPL) measures a system's floating-point rate of execution by solving a dense system of linear equations, and it is the basis for the TOP500 supercomputer rankings.
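HPL derives the reported Gflops figure from the matrix order N and the wall-clock time t of the solve; the operation count is dominated by the (2/3)·N^3 term of the LU factorisation. A quick sanity-check sketch (the values of N and t below are illustrative placeholders, not taken from the runs in this section):

# Dominant-term estimate of the HPL rate; HPL itself also counts an O(N^2) term for the triangular solves
N=50000    # illustrative problem size
t=120.0    # illustrative solve time in seconds
awk -v n="$N" -v t="$t" 'BEGIN { printf "%.4e Gflops\n", (2.0/3.0) * n*n*n / t / 1e9 }'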
1.1 - Usage
Example Repo: Linpack
Example Script:
cat <<EOF > batchScript.sh
#!/bin/bash
#SBATCH --partition=arm-kunpeng920
#SBATCH --time=00:25:00
#SBATCH --ntasks=128
#SBATCH --nodes=1

# Load the toolchain the xhpl binary was built with
module load GCC/12.1.0 OpenBLAS/0.3.21 OpenMPI/4.1.3

# 8 MPI ranks per node x 16 OpenMP threads each = 128 cores
mpirun --allow-run-as-root -npernode 8 -x OMP_NUM_THREADS=16 ./xhpl
EOF
sbatch batchScript.sh
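xhpl reads its run configuration (problem size N, block size NB, and the P x Q process grid, whose product must match the number of MPI ranks) from an HPL.dat file in the working directory. Assuming the result table goes to standard output and therefore to the Slurm job log, a minimal sketch for pulling it out afterwards:

# Each HPL result is printed under a "T/V  N  NB  P  Q  Time  Gflops" header line
grep -A 2 "^T/V" slurm-*.out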
1.2 - Results
CPU | Compiler Combination | Number of Nodes | Number of Cores | Test Result
---|---|---|---|---
arm-kunpeng920 | GCC/12.1.0 | 1 | 128 (16 processes, 8 threads per process) | 3.0346e+02 Gflops
arm-kunpeng920 | GCC/12.1.0 | 1 | 128 (128 processes, 1 thread per process) | 3.9471e+01 Gflops
arm-kunpeng920 | GCC/12.1.0 | 1 | 128 (8 processes, 16 threads per process) | 7.2177e+02 Gflops
arm-kunpeng920 | GCC/12.1.0 | 1 | 128 (4 processes, 32 threads per process) | 6.0959e+02 Gflops
arm-kunpeng920 | GCC/12.1.0 | 1 | 128 (64 processes, 2 threads per process) | 7.8394e+01 Gflops
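All five rows use the full 128 cores of the node and differ only in how those cores are split between MPI ranks and OpenMP threads; the 8 x 16 split gives the best result here. A sketch of how such a sweep could be scripted inside the batch job, assuming the same modules and xhpl binary as in the usage example above (log file names are illustrative):

# Sweep MPI ranks x OpenMP threads while keeping ranks * threads = 128
for ranks in 4 8 16 64 128; do
  threads=$((128 / ranks))
  mpirun --allow-run-as-root -npernode "$ranks" -x OMP_NUM_THREADS="$threads" ./xhpl > "hpl_${ranks}x${threads}.log"
done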
2 - STREAM: Sustainable Memory Bandwidth in High Performance Computers
2.1 - Usage
Example Repo: STREAM
Test Script:
cat <<EOF > batchscript.sh
#!/bin/bash
#SBATCH --partition=arm-kunpeng920
#SBATCH --time=00:10:00
#SBATCH --ntasks=128
#SBATCH --nodes=1

# Load the GCC toolchain
module load GCC/12.1.0 OpenMPI

# Build STREAM with OpenMP: 80M elements per array, 20 repetitions per kernel
gcc -fopenmp -O3 -DSTREAM_ARRAY_SIZE=80000000 -DNTIMES=20 -mcmodel=large stream.c -o stream_c
./stream_c
EOF
sbatch batchscript.sh
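The script above leaves the OpenMP thread count and thread placement to the runtime defaults, which here resulted in the 128 threads shown in the output below. A sketch of making this explicit with the standard OpenMP environment variables, in case the defaults differ elsewhere:

# Spread one thread per core across the allocated node before running STREAM
export OMP_NUM_THREADS=128
export OMP_PROC_BIND=spread
export OMP_PLACES=cores
./stream_c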
2.2 - Results
Output:
*Not optimised for performance*
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 80000000 (elements), Offset = 0 (elements)
Memory per array = 610.4 MiB (= 0.6 GiB).
Total memory required = 1831.1 MiB (= 1.8 GiB).
Each kernel will be executed 20 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 128
Number of Threads counted = 128
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 10418 microseconds.
(= 10418 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 141140.7 0.011449 0.009069 0.013668
Scale: 150675.2 0.010752 0.008495 0.013169
Add: 128894.4 0.016744 0.014896 0.020257
Triad: 143679.0 0.016380 0.013363 0.023785
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
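The reported rates follow from the array size and the number of arrays each kernel touches per iteration: Copy and Scale touch two arrays (16 bytes per element), Add and Triad touch three (24 bytes per element), and the best rate is that byte count divided by the minimum time (STREAM counts 1 MB as 10^6 bytes). For example, for the Copy kernel:

# Copy moves 2 arrays x 8 bytes x 80,000,000 elements; divide by the best (minimum) time
awk 'BEGIN { printf "%.1f MB/s\n", 2 * 8 * 80000000 / 0.009069 / 1e6 }'
# prints 141140.1 MB/s, matching the Copy row above to within rounding of the printed time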