Running PyTorch Model Training Experiments on Domestic Accelerator Cards in the OpenI Community
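In 2023, I successfully ran nine PyTorch-based computer vision (CV) models on the OpenI (启智) community platform: AlexNet, CBAM, GoogLeNet, ResNet50, ECA_MobileNet_V2, InceptionV3, DPN92, LeNet, and SqueezeNet. Below are my experimental steps and an analysis of the results.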
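I. Dataset Preparation
To keep time and cost manageable, I chose the relatively small imagenet-tiny dataset, which contains 1000 classes with at least 20 images per class. The dataset totals 2.9 GB.
Dataset address: imagenet2012_tiny - OpenI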
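The class and image counts quoted above can be sanity-checked once the dataset is extracted. The following is a minimal sketch, assuming the split directory holds one subdirectory per class with that class's images inside; `summarize_split` and the example path are hypothetical names, not part of the dataset's tooling.

```python
from pathlib import Path

# File extensions treated as images (an assumption for this sketch).
IMAGE_SUFFIXES = {".jpeg", ".jpg", ".png"}

def summarize_split(split_dir):
    """Return (number of classes, smallest per-class image count).

    Assumes one subdirectory per class; loose files in split_dir are ignored.
    """
    counts = {}
    for class_dir in sorted(Path(split_dir).iterdir()):
        if not class_dir.is_dir():
            continue
        counts[class_dir.name] = sum(
            1 for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
    min_count = min(counts.values()) if counts else 0
    return len(counts), min_count

# Hypothetical usage:
# n_classes, min_images = summarize_split("/dataset/imagenet/train")
# print(f"{n_classes} classes, at least {min_images} images per class")
```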
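II. Creating the OpenI Task
I created a debugging task on the OpenI intelligent computing network with the following configuration:
- Image: iluvatar-pytorch1.13.1-bi-v100
- Resource specification: ILUVATAR-GPGPU: 1BI-V100, CPU: 30, memory: 64GB
- Dataset: imagenet-tiny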
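III. Loading the Models and Extracting the Dataset
Loading the models
cd
git clone -b openi-task12 https://gitee.com/deep-spark/deepsparkhub
cp -r deepsparkhub /code
Extracting the dataset
Make sure to extract the dataset in the /dataset directory rather than in the remotely mounted /code directory, where extraction is much slower.
cd /dataset
unzip imagenet.zip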
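The same extraction can be scripted from Python when it is part of a larger setup step. This is a minimal sketch using the standard-library `zipfile` module; the helper name `extract_next_to_archive` is hypothetical, and the `/dataset/imagenet.zip` path is taken from the commands above.

```python
import zipfile
from pathlib import Path

def extract_next_to_archive(archive_path):
    """Extract a zip archive into the directory that contains it.

    Extracting beside the archive keeps the output in the same directory
    (here /dataset) instead of the remotely mounted /code directory.
    Returns the number of members in the archive.
    """
    archive = Path(archive_path)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(archive.parent)
        return len(zf.namelist())

# Example (path from the commands above):
# extract_next_to_archive("/dataset/imagenet.zip")
```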
IV. Model Training Steps and Results
1. AlexNet
Step 1: Installing
pip3 install torch
pip3 install torchvision
Sign up and log in on the ImageNet official website, then choose ‘Download’ to download the whole ImageNet dataset. Use your ImageNet path in place of /path/to/imagenet in the later training process. The ImageNet dataset path structure should look like:
imagenet
├── train
│   └── n01440764
│       ├── n01440764_10026.JPEG
│       └── ...
├── train_list.txt
├── val
│   └── n01440764
│       ├── ILSVRC2012_val_00000293.JPEG
│       └── ...
└── val_list.txt
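Before launching training, it can help to confirm that the dataset root contains the four top-level entries shown in the structure above. A minimal sketch; `missing_entries` is a hypothetical helper name, and `/path/to/imagenet` is the placeholder path used in the text.

```python
from pathlib import Path

# Top-level entries taken from the directory structure shown above.
EXPECTED_ENTRIES = ("train", "val", "train_list.txt", "val_list.txt")

def missing_entries(imagenet_root):
    """Return the expected top-level entries that are absent from the root."""
    root = Path(imagenet_root)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]

# Hypothetical usage:
# problems = missing_entries("/path/to/imagenet")
# if problems:
#     print("Missing:", ", ".join(problems))
```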
Step 2: Training
cd start_scripts
One single GPU
bash train_alexnet_torch.sh --data-path /path/to/imagenet
Summary: After 5 epochs, AlexNet reached Acc@1 0.900 and Acc@5 4.400, taking about four and a half minutes.
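For reference, Acc@1 and Acc@5 are top-1 and top-5 accuracy: the share of samples whose true label is the highest-scoring prediction, or is among the five highest-scoring predictions. The sketch below is a plain-Python illustration, not the training script's implementation. It assumes the reported values are percentages, which the summary does not state explicitly; under that assumption, uniform random guessing over 1000 classes would give about 0.1 for Acc@1 and 0.5 for Acc@5.

```python
def topk_accuracy(scores, labels, k):
    """Percentage of samples whose true label is among the k top-scoring classes.

    scores: list of per-class score lists, one per sample.
    labels: list of true class indices, one per sample.
    """
    hits = 0
    for sample_scores, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(sample_scores)),
                        key=lambda c: sample_scores[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(labels)

# Toy example with 3 classes and 4 samples.
scores = [
    [0.1, 0.7, 0.2],  # top class 1, true class 1
    [0.5, 0.3, 0.2],  # top class 0, true class 1
    [0.2, 0.3, 0.5],  # top class 2, true class 0
    [0.6, 0.3, 0.1],  # top class 0, true class 0
]
labels = [1, 1, 0, 0]
top1 = topk_accuracy(scores, labels, k=1)  # 2 of 4 samples hit
top2 = topk_accuracy(scores, labels, k=2)  # 3 of 4 samples hit
```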
2. CBAM
Model description
Official PyTorch code for “CBAM: Convolutional Block Attention Module (ECCV2018)”
Step 1: Installing
pip3 install torch
pip3 install torchvision
Step 2: Training
ResNet50-based examples are included. Example scripts are under the ./scripts/ directory. ImageNet data should be placed under ./data/ImageNet/ in folders named train and val.
# To train with CBAM (ResNet50 backbone)
# For 8 GPUs
python3 train_imagenet.py --ngpu 8 --workers 20 --arch resnet --depth 50 --epochs 100 --batch-size 256 --lr 0.1 --att-type CBAM --prefix RESNET50_IMAGENET_CBAM ./data/ImageNet
# For 1 GPU
python3 train_imagenet.py --ngpu 1 --workers 20 --arch resnet --depth 50 --epochs 100 --batch-size 64 --lr 0.1 --att-type CBAM --prefix RESNET50_IMAGENET_CBAM ./data/ImageNet
Summary: CBAM's accuracy was clearly higher than AlexNet's, and training took about six minutes.
3. ResNet50
Step 1: Installing
pip3 install -r requirements.txt
Sign up and log in on the ImageNet official website, then choose ‘Download’ to download the whole ImageNet dataset. Use your ImageNet path in place of /path/to/imagenet in the later training process. The ImageNet dataset path structure should look like:
imagenet
├── train
│   └── n01440764
│       ├── n01440764_10026.JPEG
│       └── ...
├── train_list.txt
├── val
│   └── n01440764
│       ├── ILSVRC2012_val_00000293.JPEG
│       └── ...
└── val_list.txt
Step 2: Training
Multiple GPUs on one machine (AMP)
Set the data path with:
export DATA_PATH=/path/to/imagenet
The following command uses all cards to train:
bash train_resnest50_amp_dist.sh
Summary: ResNet50 trained fairly quickly, taking about five minutes, with accuracy lower than AlexNet's.
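The durations quoted in these summaries are wall-clock times. One way to measure them consistently is a small wrapper, sketched below under the assumption that each run is launched as a single command; `run_timed` is a hypothetical helper, and the example command reuses the script name and the `DATA_PATH` variable from the steps above.

```python
import os
import subprocess
import time

def run_timed(command, extra_env=None):
    """Run a command and return (exit code, elapsed wall-clock seconds).

    extra_env: optional dict of environment variables added for this run.
    """
    env = dict(os.environ)
    env.update(extra_env or {})
    start = time.perf_counter()
    completed = subprocess.run(command, env=env)
    return completed.returncode, time.perf_counter() - start

# Example (names taken from the steps above):
# code, seconds = run_timed(["bash", "train_resnest50_amp_dist.sh"],
#                           {"DATA_PATH": "/path/to/imagenet"})
# print(f"exit code {code}, {seconds / 60:.1f} minutes")
```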
4. GoogLeNet
Step 1: Installing
pip3 install torch
pip3 install torchvision
Sign up and log in on the ImageNet official website, then choose ‘Download’ to download the whole ImageNet dataset. Use your ImageNet path in place of /path/to/imagenet in the later training process. The ImageNet dataset path structure should look like:
imagenet
├── train
│   └── n01440764
│       ├── n01440764_10026.JPEG
│       └── ...
├── train_list.txt
├── val
│   └── n01440764
│       ├── ILSVRC2012_val_00000293.JPEG
│       └── ...
└── val_list.txt
Step 2: Training
One single GPU
python3 train.py --data-path /path/to/imagenet --model googlenet --batch-size 512
5. ECA_MobileNet_V2
Step 1: Installing
pip3 install -r requirements.txt
Sign up and log in on the ImageNet official website, then choose ‘Download’ to download the whole ImageNet dataset. Use your ImageNet path in place of /path/to/imagenet in the later training process. The ImageNet dataset path structure should look like:
imagenet
├── train
│   └── n01440764
│       ├── n01440764_10026.JPEG
│       └── ...
├── train_list.txt
├── val
│   └── n01440764
│       ├── ILSVRC2012_val_00000293.JPEG
│       └── ...
└── val_list.txt
Step 2: Training
Multiple GPUs on one machine (AMP)
Set the data path with:
export DATA_PATH=/path/to/imagenet
The following command uses all cards to train:
bash train_eca_mobilenet_v2_amp_dist.sh
6. InceptionV3
Step 1: Installing
pip3 install -r requirements.txt
Sign up and log in on the ImageNet official website, then choose ‘Download’ to download the whole ImageNet dataset. Use your ImageNet path in place of /path/to/imagenet in the later training process. The ImageNet dataset path structure should look like:
imagenet
├── train
│   └── n01440764
│       ├── n01440764_10026.JPEG
│       └── ...
├── train_list.txt
├── val
│   └── n01440764
│       ├── ILSVRC2012_val_00000293.JPEG
│       └── ...
└── val_list.txt
Step 2: Training
Multiple GPUs on one machine (AMP)
Set the data path with:
export DATA_PATH=/path/to/imagenet
The following command uses all cards to train:
bash train_inception_v3_amp_dist.sh
7. DPN92
Step 1: Installing
pip3 install -r requirements.txt
Sign up and log in on the ImageNet official website, then choose ‘Download’ to download the whole ImageNet dataset. Use your ImageNet path in place of /path/to/imagenet in the later training process. The ImageNet dataset path structure should look like:
imagenet
├── train
│   └── n01440764
│       ├── n01440764_10026.JPEG
│       └── ...
├── train_list.txt
├── val
│   └── n01440764
│       ├── ILSVRC2012_val_00000293.JPEG
│       └── ...
└── val_list.txt
Step 2: Training
Multiple GPUs on one machine (AMP)
Set the data path with:
export DATA_PATH=/path/to/imagenet
The following command uses all cards to train:
bash train_dpn92_amp_dist.sh
8. LeNet
Step 1: Installing
pip3 install torch
pip3 install torchvision
Sign up and log in on the ImageNet official website, then choose ‘Download’ to download the whole ImageNet dataset. Use your ImageNet path in place of /path/to/imagenet in the later training process. The ImageNet dataset path structure should look like:
imagenet
├── train
│   └── n01440764
│       ├── n01440764_10026.JPEG
│       └── ...
├── train_list.txt
├── val
│   └── n01440764
│       ├── ILSVRC2012_val_00000293.JPEG
│       └── ...
└── val_list.txt
Step 2: Training
One single GPU
python3 train.py --data-path /path/to/imagenet --model lenet
9. SqueezeNet
Step 1: Installing
pip3 install torch torchvision
Sign up and log in on the ImageNet official website, then choose ‘Download’ to download the whole ImageNet dataset. Use your ImageNet path in place of /path/to/imagenet in the later training process. The ImageNet dataset path structure should look like:
imagenet
├── train
│   └── n01440764
│       ├── n01440764_10026.JPEG
│       └── ...
├── train_list.txt
├── val
│   └── n01440764
│       ├── ILSVRC2012_val_00000293.JPEG
│       └── ...
└── val_list.txt
Step 2: Training
One single GPU
python3 train.py --data-path /path/to/imagenet --model squeezenet1_0 --lr 0.001
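V. Summary and Recommendations for the Iluvatar CoreX (天数智芯) GPGPU
Through these experiments, I found that the Iluvatar CoreX GPGPU is well compatible with the PyTorch framework: no special changes to the CUDA device type were needed, and existing models ran with little adaptation. It is a domestic compute solution well suited to serve as an alternative to NVIDIA. I look forward to the future development of domestic compute hardware.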
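In practice, the compatibility described here means that device-selection code written for CUDA can stay as it is. A minimal sketch, assuming (as the experiments above suggest, though this post does not document it in detail) that the Iluvatar software stack exposes the card through PyTorch's standard `torch.cuda` interface; `pick_device_name` is a hypothetical helper that falls back to the CPU when PyTorch or an accelerator is unavailable.

```python
def pick_device_name():
    """Return "cuda" if PyTorch reports an available accelerator, else "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

device_name = pick_device_name()
# With PyTorch installed, a model would then be moved with:
# model = model.to(torch.device(device_name))
```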