PyTorch Environment Setup & User Guide

1 - Training

Training a model simply means learning (determining) good values for all the weights and the bias from labeled examples. Huawei Ascend 910 chips are developed mainly for high-performance training, but they also support inference.
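
As a minimal illustration of what "learning the weights and the bias" looks like in PyTorch, the sketch below runs one training step on an NPU. This is a hypothetical toy example, not code from the cluster; the "npu" device type assumes the Ascend PyTorch adapter is available.

import torch
import torch.nn as nn

# Toy model: a single linear layer whose weights and bias are learned.
device = torch.device("npu:0")
model = nn.Linear(10, 2).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# One batch of labeled examples (random stand-ins here).
inputs = torch.randn(64, 10).to(device)
labels = torch.randint(0, 2, (64,)).to(device)

optimizer.zero_grad()
loss = criterion(model(inputs), labels)   # error against the labels
loss.backward()                           # gradients w.r.t. weights and bias
optimizer.step()                          # update weights and bias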

1.1 - PyTorch-v1.5.0

The modules, model, and related information used for PyTorch-v1.5.0 training on the cluster are given below.

1.1.1 - Module Load

Environment Preparation:

module load GCC/9.5.0 OpenMPI PyTorch-CANN/1.5.0

Note: If you want to train a model with mixed precision, you also need to load the Apex module:

module load apex
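
With Apex loaded, mixed-precision training is typically enabled by wrapping the model and optimizer with amp.initialize and backpropagating through a scaled loss. The sketch below shows the standard Apex AMP pattern; the opt_level choice and the toy model are assumptions, not cluster-specific code.

import torch
import torch.nn as nn
from apex import amp

device = torch.device("npu:0")
model = nn.Linear(10, 2).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

# Wrap model and optimizer for mixed precision. "O2" (mostly FP16) is a
# common choice, but the right opt_level depends on the model.
model, optimizer = amp.initialize(model, optimizer, opt_level="O2")

inputs = torch.randn(64, 10).to(device)
labels = torch.randint(0, 2, (64,)).to(device)

optimizer.zero_grad()
loss = criterion(model(inputs), labels)
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()                # backward pass on the scaled loss
optimizer.step()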

1.1.2 - Model Training

Training Script:

cat <<EOF > batchscript.sh
#!/bin/bash
#SBATCH --partition=a800-9000
#SBATCH --time=00:10:00
#SBATCH --ntasks=1
#SBATCH --nodes=1
# Display availability of NPUs
npu-smi info
<Training command>
EOF

sbatch batchscript.sh

1.1.3 - Example Usage

Environment Preparation:

module load GCC/9.5.0 OpenMPI PyTorch-CANN/1.5.0 apex
git clone https://gitee.com/tianyu__zhou/pytorch_lenet_on_npu.git
cd pytorch_lenet_on_npu

Training Script:

cat <<EOF > batchscript.sh
#!/bin/bash
#SBATCH --partition=a800-9000
#SBATCH --time=00:10:00
#SBATCH --ntasks=1
#SBATCH --nodes=1
npu-smi info                  # display NPU availability
export RANK_SIZE=1            # total number of NPUs used by this job
python3 train_npu.py --epochs 10 --batch-size 64 --device_id 0
EOF
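
The --device_id flag selects which NPU the script binds to. How train_npu.py implements this is not shown here, but a common pattern with the Ascend PyTorch adapter looks like the sketch below (a hedged illustration, not the repository's actual code):

import argparse
import torch

# Hypothetical sketch of binding a training script to one NPU;
# torch.npu is provided by the Ascend PyTorch adapter.
parser = argparse.ArgumentParser()
parser.add_argument("--device_id", type=int, default=0)
args = parser.parse_args()

torch.npu.set_device("npu:{}".format(args.device_id))  # bind process to one NPU
device = torch.device("npu:{}".format(args.device_id))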

Run the Script:

sbatch batchscript.sh
>>> Submitted batch job 1079
cat slurm-1079.out
>>> [...]

2 - Inference

Model inference is the process of using a trained model to infer a result from live data. Ascend 310 chips support only inference.

Note:
  • Ascend 310 chips are much smaller than Ascend 910 chips, which makes it easier to bring inference solutions into real-life use, with lower power consumption and a more affordable price. You can discover more from the link.

2.1 - Online Inference

Online inference means running inference without converting the model: TensorFlow, PyTorch, and MindSpore models are used in their original form. While Huawei Ascend 310 chips support only TensorFlow and MindSpore models for online inference, Ascend 910 chips support TensorFlow, PyTorch, and MindSpore models.
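
In PyTorch terms, online inference is just a forward pass through the unconverted model. A minimal sketch, assuming the Ascend adapter's "npu" device and a torchvision ResNet-50 (illustrative choices, not cluster requirements):

import torch
import torchvision.models as models

device = torch.device("npu:0")
model = models.resnet50(pretrained=True).to(device)  # original PyTorch model, no conversion
model.eval()                          # inference mode: dropout off, BatchNorm frozen

with torch.no_grad():                 # no gradients needed for inference
    batch = torch.randn(1, 3, 224, 224).to(device)   # stand-in for live data
    logits = model(batch)
    prediction = logits.argmax(dim=1)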

2.1.1 - PyTorch-v1.5.0

The modules, model, and related information used for PyTorch-v1.5.0 online inference on the cluster are given below.

2.1.2 - Module Load

Environment Preparation:

module load GCC/9.5.0 OpenMPI PyTorch-CANN/1.5.0

2.1.3 - Model Inference

Inference Script:

cat <<EOF > batchscript.sh
#!/bin/bash
#SBATCH --partition=a800-9000
#SBATCH --time=00:10:00
#SBATCH --ntasks=1
#SBATCH --nodes=1
npu-smi info                  # display NPU availability
<Inference command>
EOF

sbatch batchscript.sh

2.1.4 - Example Usage

Example Repo: ResNet-50

Environment Preparation:

module load GCC/9.5.0 OpenMPI PyTorch-CANN/1.5.0
  • The code we will use for inference is in the repository README. Open a new Python file and paste the code into it:

    vim resnet50_infer_for_pytorch.py    # paste the code here

  • Visit Ascend ModelZoo and click Download Model to download a pre-trained ResNet-50 model; the script below restores it through the --resume flag (see the checkpoint-loading sketch after this list).
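
For reference, the usual PyTorch pattern for restoring such a checkpoint looks like the sketch below. The "state_dict" key is an assumption based on common ImageNet-style checkpoints; the script from the README may differ in detail.

import torch
import torchvision.models as models

model = models.resnet50()
checkpoint = torch.load(
    "./ResNet50_for_Pytorch_1.4_model/resnet50_pytorch_1.4.pth.tar",
    map_location="cpu",               # load to CPU first, then move to the NPU
)
state_dict = checkpoint.get("state_dict", checkpoint)  # tolerate both layouts
# Note: checkpoints saved from DataParallel models may prefix keys with
# "module." and need renaming before load_state_dict.
model.load_state_dict(state_dict)
model = model.to("npu:0").eval()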

Inference Script:

cat <<EOF > batchscript.sh
#!/bin/bash
#SBATCH --partition=a800-9000
#SBATCH --time=00:10:00
#SBATCH --ntasks=1
#SBATCH --nodes=1
npu-smi info                  # display NPU availability
python3 resnet50_infer_for_pytorch.py --data ./data/ --npu 0 --epochs 90 --resume ./ResNet50_for_Pytorch_1.4_model/resnet50_pytorch_1.4.pth.tar
EOF

Run the script:

sbatch batchscript.sh
>>> Submitted batch job 1079
cat slurm-1079.out
>>> [...]