==============================================================================
PyTorch Environment Setup & User Guide
==============================================================================

1 - Training
==============================================================================

Training a model simply means learning (determining) good values for all the
weights and the bias from labeled examples. Huawei Ascend 910 chips are mainly
developed for high-performance training, but they also support inference.

1.1 - PyTorch-v1.5.0
-----------------------

The modules, example model, and other information used for PyTorch-v1.5.0
training on the specified cluster are given below.

1.1.1 - Module Load
-----------------------

Environment Preparation::

    module load GCC/9.5.0 OpenMPI PyTorch-CANN/1.5.0

Note: If you want to train a model with mixed precision, you also need to load
the Apex module::

    module load apex

1.1.2 - Model Training
-----------------------

Training Script::

    cat << EOF > batchscript.sh
    #!/bin/bash
    #SBATCH --partition=a800-9000
    #SBATCH --time=00:10:00
    #SBATCH --ntasks=1
    #SBATCH --nodes=1

    # Displays availability of the NPUs
    npu-smi info
    EOF

    sbatch batchscript.sh

1.1.3 - Example Usage
-----------------------

- Example Repo: LENET_

.. _LENET: https://gitee.com/tianyu__zhou/pytorch_lenet_on_npu

Environment Preparation::

    module load GCC/9.5.0 OpenMPI PyTorch-CANN/1.5.0 apex
    git clone https://gitee.com/tianyu__zhou/pytorch_lenet_on_npu.git
    cd pytorch_lenet_on_npu

Training Script::

    cat << EOF > batchscript.sh
    #!/bin/bash
    #SBATCH --partition=a800-9000
    #SBATCH --time=00:10:00
    #SBATCH --ntasks=1
    #SBATCH --nodes=1

    npu-smi info
    export RANK_SIZE=1
    python3 train_npu.py --epochs 10 --batch-size 64 --device_id 0
    EOF

Run the Script::

    sbatch batchscript.sh
    >>> Submitted batch job 1079

    cat slurm-1079.out
    >>> [...]

2 - Inference
==============================================================================

Model inference is the process of using a trained model to infer a result from
live data. Ascend 310 chips support only inference.

Note:

- Ascend 310 chips are much smaller than Ascend 910 chips; they are designed
  to bring inference solutions into real-life use more easily, with lower
  power consumption and a more affordable price. You can discover more from
  the link_.

.. _link: https://www.hiascend.com/hardware/product

2.1 - Online Inference
-----------------------

Online inference means running inference without converting the model:
TensorFlow, PyTorch and MindSpore models are used in their original form.
While Huawei Ascend 310 chips support TensorFlow and MindSpore models for
online inference, Ascend 910 chips support TensorFlow, PyTorch and MindSpore
models.

2.1.1 - PyTorch-v1.5.0
-----------------------

The modules, example model, and other information used for PyTorch-v1.5.0
online inference on the specified cluster are given below.

2.1.2 - Module Load
-----------------------

Environment Preparation::

    module load GCC/9.5.0 OpenMPI PyTorch-CANN/1.5.0

2.1.3 - Model Inference
-----------------------

Inference Script::

    cat << EOF > batchscript.sh
    #!/bin/bash
    #SBATCH --partition=a800-9000
    #SBATCH --time=00:10:00
    #SBATCH --ntasks=1
    #SBATCH --nodes=1

    npu-smi info
    EOF

    sbatch batchscript.sh

2.1.4 - Example Usage
-----------------------

- Example Repo: ResNet_-50

.. _ResNet: https://gitee.com/ascend/pytorch/blob/master/docs/en/PyTorch%20Online%20Inference%20Guide/PyTorch%20Online%20Inference%20Guide.md#sample-code

Environment Preparation::

    module load GCC/9.5.0 OpenMPI PyTorch-CANN/1.5.0

- The code we will use for inference is in the repository's README. Open a new
  Python file and copy the code there (a minimal sketch of the essential steps
  is shown after this list)::

      vim resnet50_infer_for_pytorch.py   # paste the README code here

- Visit Ascend ModelZoo_ and click "Download Model" to download a pre-trained
  ResNet-50 model.

.. _ModelZoo: https://www.hiascend.com/software/modelzoo
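The block below is a minimal sketch of the essential steps such an
online-inference script performs; it is not the code from the README. It
assumes that the PyTorch-CANN/1.5.0 build exposes the NPU device type through
torch.npu and that torchvision is available; the checkpoint path and file
layout are illustrative::

    # minimal_online_infer.py -- a hedged sketch, NOT the README script.
    import torch
    import torchvision.models as models

    device = "npu:0"                 # NPU 0, matching "--npu 0" below
    torch.npu.set_device(device)     # assumes the Ascend build provides torch.npu

    # Build ResNet-50 and load the pre-trained weights downloaded from
    # ModelZoo (the checkpoint path is illustrative; adjust it to your
    # download location).
    model = models.resnet50()
    checkpoint = torch.load("resnet50_pytorch_1.4.pth.tar", map_location="cpu")
    state_dict = checkpoint.get("state_dict", checkpoint)
    # Strip a possible "module." prefix left over from distributed training.
    state_dict = {k.replace("module.", "", 1): v for k, v in state_dict.items()}
    model.load_state_dict(state_dict)
    model = model.to(device).eval()

    # Forward pass on a dummy ImageNet-sized batch to verify the NPU setup.
    dummy = torch.rand(1, 3, 224, 224).to(device)
    with torch.no_grad():
        logits = model(dummy)
    print("Output shape:", logits.shape)   # expected: torch.Size([1, 1000])

The full README script additionally handles data loading and accuracy
evaluation, which is why the command below also passes --data and --epochs.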
Inference Script::

    cat << EOF > batchscript.sh
    #!/bin/bash
    #SBATCH --partition=a800-9000
    #SBATCH --time=00:10:00
    #SBATCH --ntasks=1
    #SBATCH --nodes=1

    npu-smi info
    python3 resnet50_infer_for_pytorch.py --data ./data/ --npu 0 --epochs 90 \
        --resume ./ResNet50_for_Pytorch_1.4_model/resnet50_pytorch_1.4.pth.tar
    EOF

Run the script::

    sbatch batchscript.sh
    >>> Submitted batch job 1079

    cat slurm-1079.out
    >>> [...]
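As a closing note, the "--data ./data/" argument in the inference command
above typically points to an ImageNet-style directory tree. The block below is
a minimal sketch of how scripts like resnet50_infer_for_pytorch.py commonly
read such a directory with torchvision; it is not code from the repository,
and the directory names are illustrative::

    import torch
    import torchvision.datasets as datasets
    import torchvision.transforms as transforms

    # Validation images laid out as ./data/val/<class_name>/<image>.JPEG
    val_dir = "./data/val"

    val_dataset = datasets.ImageFolder(
        val_dir,
        transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225]),
        ]))

    val_loader = torch.utils.data.DataLoader(
        val_dataset, batch_size=64, shuffle=False, num_workers=4)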