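Running PyTorch Model Training Experiments on Chinese Domestic Accelerator Cards in the OpenI Community

In 2023, I successfully ran nine PyTorch-based computer vision (CV) models in the OpenI (启智) community: AlexNet, CBAM, GoogLeNet, ResNet50, ECA_MobileNet_V2, InceptionV3, DPN92, LeNet, and SqueezeNet. My experimental steps and an analysis of the results follow.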

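I. Dataset Preparation

To keep time and cost manageable, I chose the relatively small imagenet-tiny dataset, which contains 1000 classes with at least 20 images per class. The dataset totals 2.9GB.

Dataset link: imagenet2012_tiny - OpenI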

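II. Creating the OpenI Task

I created a debugging task on the OpenI intelligent computing network with the following configuration:

Image: iluvatar-pytorch1.13.1-bi-v100
Resource specification: ILUVATAR-GPGPU: 1BI-V100, CPU: 30, memory: 64GB
Dataset: imagenet-tiny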

III. Loading the Models and Extracting the Dataset

Loading the models

```bash
cd
git clone -b openi-task12 https://gitee.com/deep-spark/deepsparkhub
cp -r deepsparkhub /code
```

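Extracting the dataset

Be sure to extract the dataset in the /dataset directory rather than in the remotely mounted /code directory, where extraction is much slower.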

```bash
cd /dataset
unzip imagenet.zip
```
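
The same extraction can also be scripted in Python with the standard zipfile module. This is only a minimal sketch, not part of the original workflow: the function name extract_dataset is hypothetical, and the archive name and /dataset path are taken from the commands above.

```python
import zipfile
from pathlib import Path
from typing import List


def extract_dataset(archive: str, target_dir: str) -> List[str]:
    """Extract `archive` into `target_dir` and return the archive's member names."""
    target = Path(target_dir)
    # Create the target directory if the task has not mounted it yet.
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
        return zf.namelist()
```

Calling extract_dataset("/dataset/imagenet.zip", "/dataset") would mirror the unzip command; as with unzip, the target should be a local directory rather than the remotely mounted /code.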

IV. Model Training Steps and Results

1. AlexNet

Step 1: Installing

pip3 install torch
pip3 install torchvision

Sign up and log in on the official ImageNet website, then choose ‘Download’ to download the whole ImageNet dataset. In the later training steps, replace /path/to/imagenet with the path to your ImageNet dataset.

The ImageNet dataset path structure should look like:

imagenet
├── train
│   └── n01440764
│       ├── n01440764_10026.JPEG
│       └── ...
├── train_list.txt
├── val
│   └── n01440764
│       ├── ILSVRC2012_val_00000293.JPEG
│       └── ...
└── val_list.txt
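
One quick way to confirm that an extracted dataset matches this layout is to walk it from Python. The following is a minimal sketch using only the standard library. The helper name summarize_imagenet_dir is hypothetical, and it assumes the images carry the .JPEG extension shown in the tree.

```python
from pathlib import Path
from typing import Dict


def summarize_imagenet_dir(root: str) -> Dict[str, Dict[str, int]]:
    """Return {split: {class_dir: image_count}} for the train and val splits."""
    root_path = Path(root)
    summary: Dict[str, Dict[str, int]] = {}
    for split in ("train", "val"):
        split_dir = root_path / split
        if not split_dir.is_dir():
            raise FileNotFoundError(f"missing directory: {split_dir}")
        # One sub-directory per class, e.g. n01440764, holding that class's images.
        summary[split] = {
            class_dir.name: sum(1 for _ in class_dir.glob("*.JPEG"))
            for class_dir in sorted(split_dir.iterdir())
            if class_dir.is_dir()
        }
    # The two list files sit next to the train and val directories.
    for list_file in ("train_list.txt", "val_list.txt"):
        if not (root_path / list_file).is_file():
            raise FileNotFoundError(f"missing file: {root_path / list_file}")
    return summary
```

With the result in hand, len(summary["train"]) gives the number of classes found, and min(summary["train"].values()) gives the smallest number of training images in any class.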

Step 2: Training

cd start_scripts

One single GPU

bash train_alexnet_torch.sh --data-path /path/to/imagenet

Summary: After 5 epochs, AlexNet reached Acc@1 0.900 and Acc@5 4.400, taking about four and a half minutes.
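
Acc@1 and Acc@5 are top-k accuracies: a sample counts as correct if its true label is among the k highest-scoring classes. The pure-Python sketch below illustrates the metric on toy scores. It is not the evaluation code used by the training scripts; the function name topk_accuracy is hypothetical, and returning a percentage is an assumption about how the figures above are scaled.

```python
from typing import Sequence


def topk_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int], k: int) -> float:
    """Percentage of samples whose true label is among the k highest scores."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("at least one sample is required")
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores in this row.
        topk = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in topk
    return 100.0 * hits / len(labels)


# Toy example: 3 samples, 4 classes.
toy_scores = [
    [0.1, 0.6, 0.2, 0.1],  # best guess: class 1
    [0.5, 0.1, 0.3, 0.1],  # best guess: class 0, runner-up: class 2
    [0.3, 0.1, 0.1, 0.5],  # best guess: class 3, runner-up: class 0
]
toy_labels = [1, 2, 0]
print(topk_accuracy(toy_scores, toy_labels, k=1))  # only the first sample is a top-1 hit
print(topk_accuracy(toy_scores, toy_labels, k=2))  # all three labels are in the top 2
```

Read this way, an Acc@5 well above Acc@1 after only a few epochs simply means the correct class often appears among the top five guesses without yet being the first.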

2. CBAM

Model description

Official PyTorch code for “CBAM: Convolutional Block Attention Module” (ECCV 2018)

Step 1: Installing

pip3 install torch
pip3 install torchvision

Step 2: Training

ResNet50-based examples are included, with example scripts under the ./scripts/ directory. ImageNet data should be placed under ./data/ImageNet/ in folders named train and val.

# To train with CBAM (ResNet50 backbone)
# For 8 GPUs
python3 train_imagenet.py --ngpu 8 --workers 20 --arch resnet --depth 50 --epochs 100 --batch-size 256 --lr 0.1 --att-type CBAM --prefix RESNET50_IMAGENET_CBAM ./data/ImageNet
# For 1 GPU
python3 train_imagenet.py --ngpu 1 --workers 20 --arch resnet --depth 50 --epochs 100 --batch-size 64 --lr 0.1 --att-type CBAM --prefix RESNET50_IMAGENET_CBAM ./data/ImageNet

Summary: CBAM's accuracy was clearly higher than AlexNet's, and training took about six minutes.

3. ResNet50

Step 1: Installing
```
pip3 install -r requirements.txt
```
Sign up and log in on the official ImageNet website, then choose ‘Download’ to download the whole ImageNet dataset. In the later training steps, replace /path/to/imagenet with the path to your ImageNet dataset.

The ImageNet dataset path structure should look like:
```
imagenet
├── train
│   └── n01440764
│       ├── n01440764_10026.JPEG
│       └── ...
├── train_list.txt
├── val
│   └── n01440764
│       ├── ILSVRC2012_val_00000293.JPEG
│       └── ...
└── val_list.txt
```
Step 2: Training
Multiple GPUs on one machine (AMP)
Set the data path with export DATA_PATH=/path/to/imagenet. The following command trains on all available cards:
```
bash train_resnest50_amp_dist.sh
```
Summary: ResNet50 trained fairly quickly, taking about five minutes, and its accuracy was lower than AlexNet's.
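
The multi-GPU launch scripts in this walkthrough take the dataset location from the DATA_PATH environment variable, so it can help to validate the variable before starting a long run. The sketch below is a hypothetical pre-flight check and not part of deepsparkhub; the function name resolve_data_path is an assumption made for illustration.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_data_path(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the dataset directory named by DATA_PATH, or raise if it is unusable."""
    # Default to the real process environment; a dict can be passed for testing.
    env = os.environ if env is None else env
    value = env.get("DATA_PATH")
    if not value:
        raise RuntimeError("DATA_PATH is not set; run: export DATA_PATH=/path/to/imagenet")
    path = Path(value)
    if not path.is_dir():
        raise RuntimeError(f"DATA_PATH does not point to a directory: {path}")
    return path
```

Running resolve_data_path() in the same shell session as the export confirms that the variable is visible to child processes and points at an existing directory.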

4. GoogLeNet

Step 1: Installing

pip3 install torch
pip3 install torchvision

Sign up and log in on the official ImageNet website, then choose ‘Download’ to download the whole ImageNet dataset. In the later training steps, replace /path/to/imagenet with the path to your ImageNet dataset.

The ImageNet dataset path structure should look like:

imagenet
├── train
│   └── n01440764
│       ├── n01440764_10026.JPEG
│       └── ...
├── train_list.txt
├── val
│   └── n01440764
│       ├── ILSVRC2012_val_00000293.JPEG
│       └── ...
└── val_list.txt

Step 2: Training

One single GPU

python3 train.py --data-path /path/to/imagenet --model googlenet --batch-size 512

5. ECA_MobileNet_V2

Step 1: Installing
```
pip3 install -r requirements.txt
```
Sign up and log in on the official ImageNet website, then choose ‘Download’ to download the whole ImageNet dataset. In the later training steps, replace /path/to/imagenet with the path to your ImageNet dataset.

The ImageNet dataset path structure should look like:
```
imagenet
├── train
│   └── n01440764
│       ├── n01440764_10026.JPEG
│       └── ...
├── train_list.txt
├── val
│   └── n01440764
│       ├── ILSVRC2012_val_00000293.JPEG
│       └── ...
└── val_list.txt
```
Step 2: Training
Multiple GPUs on one machine (AMP)
Set the data path with export DATA_PATH=/path/to/imagenet. The following command trains on all available cards:
```
bash train_eca_mobilenet_v2_amp_dist.sh
```

6. InceptionV3

Step 1: Installing
```
pip3 install -r requirements.txt
```
Sign up and log in on the official ImageNet website, then choose ‘Download’ to download the whole ImageNet dataset. In the later training steps, replace /path/to/imagenet with the path to your ImageNet dataset.

The ImageNet dataset path structure should look like:
```
imagenet
├── train
│   └── n01440764
│       ├── n01440764_10026.JPEG
│       └── ...
├── train_list.txt
├── val
│   └── n01440764
│       ├── ILSVRC2012_val_00000293.JPEG
│       └── ...
└── val_list.txt
```
Step 2: Training
Multiple GPUs on one machine (AMP)
Set the data path with export DATA_PATH=/path/to/imagenet. The following command trains on all available cards:
```
bash train_inception_v3_amp_dist.sh
```

7. DPN92

Step 1: Installing
```
pip3 install -r requirements.txt
```
Sign up and log in on the official ImageNet website, then choose ‘Download’ to download the whole ImageNet dataset. In the later training steps, replace /path/to/imagenet with the path to your ImageNet dataset.

The ImageNet dataset path structure should look like:
```
imagenet
├── train
│   └── n01440764
│       ├── n01440764_10026.JPEG
│       └── ...
├── train_list.txt
├── val
│   └── n01440764
│       ├── ILSVRC2012_val_00000293.JPEG
│       └── ...
└── val_list.txt
```
Step 2: Training
Multiple GPUs on one machine (AMP)
Set the data path with export DATA_PATH=/path/to/imagenet. The following command trains on all available cards:
```
bash train_dpn92_amp_dist.sh
```

8. LeNet

Step 1: Installing

pip3 install torch
pip3 install torchvision

Sign up and log in on the official ImageNet website, then choose ‘Download’ to download the whole ImageNet dataset. In the later training steps, replace /path/to/imagenet with the path to your ImageNet dataset.

The ImageNet dataset path structure should look like:

imagenet
├── train
│   └── n01440764
│       ├── n01440764_10026.JPEG
│       └── ...
├── train_list.txt
├── val
│   └── n01440764
│       ├── ILSVRC2012_val_00000293.JPEG
│       └── ...
└── val_list.txt

Step 2: Training

One single GPU

python3 train.py --data-path /path/to/imagenet --model lenet

9. SqueezeNet

Step 1: Installing

pip3 install torch torchvision

Sign up and log in on the official ImageNet website, then choose ‘Download’ to download the whole ImageNet dataset. In the later training steps, replace /path/to/imagenet with the path to your ImageNet dataset.

The ImageNet dataset path structure should look like:

imagenet
├── train
│   └── n01440764
│       ├── n01440764_10026.JPEG
│       └── ...
├── train_list.txt
├── val
│   └── n01440764
│       ├── ILSVRC2012_val_00000293.JPEG
│       └── ...
└── val_list.txt

Step 2: Training

One single GPU

python3 train.py --data-path /path/to/imagenet --model squeezenet1_0 --lr 0.001

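V. Summary and Recommendations for the Iluvatar CoreX GPGPU

These experiments showed that the Iluvatar CoreX (天数智芯) GPGPU is well compatible with the PyTorch framework: no special changes to the CUDA device type were needed, and existing models adapted well. It is a domestic compute solution well suited as an alternative to NVIDIA. I look forward to the future development of domestic compute hardware.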
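
A short check like the following can confirm that the accelerator is visible through PyTorch's standard CUDA device interface, which is what the observation above relies on. It is a sketch under assumptions: it presumes the Iluvatar PyTorch build exposes the card under the usual "cuda" device name, as the experiments suggest, and it falls back to the CPU when PyTorch or a device is unavailable. The helper name pick_device is hypothetical.

```python
def pick_device(cuda_available: bool) -> str:
    """Choose the device string a training script would pass to torch.device."""
    return "cuda" if cuda_available else "cpu"


try:
    import torch  # expected to be provided by the task image

    cuda_available = torch.cuda.is_available()
except ImportError:
    # PyTorch is not installed in this environment, so use the CPU fallback.
    cuda_available = False

device = pick_device(cuda_available)
print(f"training would run on: {device}")
```

If this prints "cuda" inside the task, unmodified scripts that call model.to("cuda") should find the card without further changes.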
