Running Model Training Experiments on Domestic (Chinese-Made) Accelerator Cards with PyTorch in the OpenI Community


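In 2023, I successfully ran nine PyTorch-based computer vision (CV) models in the OpenI community: AlexNet, CBAM, GoogLeNet, ResNet50, ECA_MobileNet_V2, InceptionV3, DPN92, LeNet, and SqueezeNet. My experiment steps and an analysis of the results follow.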

I. Dataset Preparation

To keep time and cost down, I chose the relatively small imagenet-tiny dataset, which contains 1000 classes with at least 20 images per class. The dataset totals 2.9 GB.

Dataset: imagenet2012_tiny - OpenI
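
The class and image counts above can be checked once the dataset is extracted. A minimal sketch, assuming the extracted data follows the train/<class>/<image> layout shown in the model instructions later in this post; the helper name count_images_per_class and the path /dataset/imagenet/train are illustrative assumptions, not part of the dataset's tooling:

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpeg", ".jpg", ".png"}


def count_images_per_class(split_dir):
    """Return {class_name: image_count} for one split directory (e.g. train/)."""
    counts = {}
    for class_dir in sorted(Path(split_dir).iterdir()):
        if not class_dir.is_dir():
            continue  # skip plain files such as train_list.txt
        counts[class_dir.name] = sum(
            1 for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts


if __name__ == "__main__":
    split = Path("/dataset/imagenet/train")  # assumed location
    if split.is_dir():
        counts = count_images_per_class(split)
        if counts:
            print(f"{len(counts)} classes, "
                  f"min {min(counts.values())} images per class")
```

If the description above holds, this should report 1000 classes; how the images are divided between train and val is not stated in this post.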

II. Creating the OpenI Task

I created a debugging task on the OpenI intelligent computing network with the following configuration:

  • Image: iluvatar-pytorch1.13.1-bi-v100
  • Resource specification: ILUVATAR-GPGPU: 1BI-V100, CPU: 30, Memory: 64GB
  • Dataset: imagenet-tiny

III. Loading the Models and Extracting the Dataset

Loading the models

bash
cd
git clone -b openi-task12 https://gitee.com/deep-spark/deepsparkhub
cp -r deepsparkhub /code

Extracting the dataset

Extract the dataset in the /dataset directory, not in /code. The /code directory is mounted remotely, so extracting there is much slower.

bash
cd /dataset
unzip imagenet.zip
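
The same extraction can be scripted from Python. A minimal sketch using the standard-library zipfile module; the archive name imagenet.zip and the /dataset target come from the commands above, while the helper name extract_dataset is hypothetical:

```python
import zipfile
from pathlib import Path


def extract_dataset(archive, target_dir):
    """Extract a zip archive into target_dir and return its top-level entries."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
        # Collect the first path component of every archive member.
        top_level = sorted({name.split("/")[0] for name in zf.namelist()})
    return top_level


if __name__ == "__main__":
    archive = Path("/dataset/imagenet.zip")
    if archive.is_file():
        # Extract on the local /dataset disk, not in the remotely mounted /code.
        print(extract_dataset(archive, "/dataset"))
```

Returning the top-level entries makes it easy to see which directory name to pass to the training commands afterwards.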

IV. Training Steps and Results

1. AlexNet

  • Step 1: Installing

    pip3 install torch
    pip3 install torchvision
    

    Sign up and log in on the ImageNet official website, then choose ‘Download’ to download the whole ImageNet dataset. In the training commands that follow, replace /path/to/imagenet with your ImageNet path.

    The ImageNet dataset path structure should look like:

    imagenet
    ├── train
    │   └── n01440764
    │       ├── n01440764_10026.JPEG
    │       └── ...
    ├── train_list.txt
    ├── val
    │   └── n01440764
    │       ├── ILSVRC2012_val_00000293.JPEG
    │       └── ...
    └── val_list.txt
    

    Step 2: Training

    cd start_scripts
    

    One single GPU

    bash train_alexnet_torch.sh --data-path /path/to/imagenet
    
  • Summary: After 5 epochs, AlexNet reached Acc@1 0.900 and Acc@5 4.400, taking about four and a half minutes.
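
Acc@1 and Acc@5 are top-1 and top-5 accuracy: the share of validation images whose true label is the highest-scoring class, or is among the five highest-scoring classes. A minimal pure-Python sketch of the metric, assuming the values are reported as percentages (4.400 cannot be a fraction of 1); the function name topk_accuracy is illustrative and is not the training script's own code:

```python
def topk_accuracy(scores, labels, k=1):
    """Percentage of samples whose true label is among the k highest scores.

    scores: list of per-sample score lists (one score per class)
    labels: list of true class indices
    """
    if not scores or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and equal length")
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores in this row.
        topk = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in topk
    return 100.0 * hits / len(labels)


if __name__ == "__main__":
    scores = [
        [0.1, 0.7, 0.2],  # highest score: class 1
        [0.5, 0.3, 0.2],  # highest score: class 0
        [0.2, 0.3, 0.5],  # highest score: class 2
        [0.6, 0.3, 0.1],  # highest score: class 0
    ]
    labels = [1, 1, 2, 2]
    print(topk_accuracy(scores, labels, k=1))  # 50.0
    print(topk_accuracy(scores, labels, k=2))  # 75.0
```

For reference, uniform random guessing over 1000 classes gives 0.1 percent top-1 and 0.5 percent top-5.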

2. CBAM

  • Model description

    Official PyTorch code for “CBAM: Convolutional Block Attention Module” (ECCV 2018).

    Step 1: Installing

    pip3 install torch
    pip3 install torchvision
    

    Step 2: Training

    ResNet50-based examples are included. Example scripts are under the ./scripts/ directory. ImageNet data should be placed under ./data/ImageNet/ in folders named train and val.

    # To train with CBAM (ResNet50 backbone)
    # For 8 GPUs
    python3 train_imagenet.py --ngpu 8 --workers 20 --arch resnet --depth 50 --epochs 100 --batch-size 256 --lr 0.1 --att-type CBAM --prefix RESNET50_IMAGENET_CBAM ./data/ImageNet
    # For 1 GPU
    python3 train_imagenet.py --ngpu 1 --workers 20 --arch resnet --depth 50 --epochs 100 --batch-size 64 --lr 0.1 --att-type CBAM --prefix RESNET50_IMAGENET_CBAM ./data/ImageNet
    
  • Summary: CBAM reached noticeably higher accuracy than AlexNet and took about six minutes.
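
Because the CBAM script expects train and val folders under ./data/ImageNet, a quick pre-flight check can catch a misplaced dataset before a run starts. A minimal sketch, assuming one subdirectory per class in each split, as in the layout shown for AlexNet; check_imagenet_layout is a hypothetical helper, not part of the CBAM code:

```python
from pathlib import Path


def check_imagenet_layout(root):
    """Return a list of problems found under root; an empty list means OK."""
    root = Path(root)
    problems = []
    classes = {}
    for split in ("train", "val"):
        split_dir = root / split
        if not split_dir.is_dir():
            problems.append(f"missing directory: {split_dir}")
            continue
        classes[split] = {p.name for p in split_dir.iterdir() if p.is_dir()}
        if not classes[split]:
            problems.append(f"no class folders in: {split_dir}")
    # Only compare class sets when both splits were found.
    if len(classes) == 2 and classes["train"] != classes["val"]:
        problems.append("train and val contain different class folders")
    return problems


if __name__ == "__main__":
    for problem in check_imagenet_layout("./data/ImageNet"):
        print("WARNING:", problem)
```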

3. ResNet50

  • Step 1: Installing

    pip3 install -r requirements.txt
    

    Sign up and log in on the ImageNet official website, then choose ‘Download’ to download the whole ImageNet dataset. In the training step that follows, replace /path/to/imagenet with your ImageNet path.

    The ImageNet dataset path structure should look like:

    imagenet
    ├── train
    │   └── n01440764
    │       ├── n01440764_10026.JPEG
    │       └── ...
    ├── train_list.txt
    ├── val
    │   └── n01440764
    │       ├── ILSVRC2012_val_00000293.JPEG
    │       └── ...
    └── val_list.txt
    

    Step 2: Training

    Multiple GPUs on one machine (AMP)

    Set data path by export DATA_PATH=/path/to/imagenet. The following command uses all cards to train:

    bash train_resnest50_amp_dist.sh
    
  • Summary: ResNet50 trained fairly quickly, taking about five minutes, with accuracy lower than AlexNet's.
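
The *_amp_dist.sh scripts take the dataset location from the DATA_PATH environment variable. A minimal sketch of launching such a script from Python with that variable set for the child process only; the helper name run_with_data_path and the path /dataset/imagenet are assumptions for illustration:

```python
import os
import subprocess


def run_with_data_path(command, data_path):
    """Run command with DATA_PATH set in the child's environment only."""
    # Copy the current environment and add DATA_PATH; os.environ is untouched.
    env = dict(os.environ, DATA_PATH=str(data_path))
    # check=True raises CalledProcessError if the command exits non-zero.
    return subprocess.run(command, env=env, check=True)


if __name__ == "__main__":
    script = "train_resnest50_amp_dist.sh"  # script name from the step above
    if os.path.isfile(script):
        run_with_data_path(["bash", script], "/dataset/imagenet")
```

Setting the variable per process avoids a stale export carrying over when several models are trained from the same shell session.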

4. GoogLeNet

  • Step 1: Installing

    pip3 install torch
    pip3 install torchvision
    

    Sign up and log in on the ImageNet official website, then choose ‘Download’ to download the whole ImageNet dataset. In the training command that follows, replace /path/to/imagenet with your ImageNet path.

    The ImageNet dataset path structure should look like:

    imagenet
    ├── train
    │   └── n01440764
    │       ├── n01440764_10026.JPEG
    │       └── ...
    ├── train_list.txt
    ├── val
    │   └── n01440764
    │       ├── ILSVRC2012_val_00000293.JPEG
    │       └── ...
    └── val_list.txt
    

    Step 2: Training

    One single GPU

    python3 train.py --data-path /path/to/imagenet --model googlenet --batch-size 512
    

5. ECA_MobileNet_V2

  • Step 1: Installing

    pip3 install -r requirements.txt
    

    Sign up and log in on the ImageNet official website, then choose ‘Download’ to download the whole ImageNet dataset. In the training step that follows, replace /path/to/imagenet with your ImageNet path.

    The ImageNet dataset path structure should look like:

    imagenet
    ├── train
    │   └── n01440764
    │       ├── n01440764_10026.JPEG
    │       └── ...
    ├── train_list.txt
    ├── val
    │   └── n01440764
    │       ├── ILSVRC2012_val_00000293.JPEG
    │       └── ...
    └── val_list.txt
    

    Step 2: Training

    Multiple GPUs on one machine (AMP)

    Set data path by export DATA_PATH=/path/to/imagenet. The following command uses all cards to train:

    bash train_eca_mobilenet_v2_amp_dist.sh
    

6. InceptionV3

  • Step 1: Installing

    pip3 install -r requirements.txt
    

    Sign up and log in on the ImageNet official website, then choose ‘Download’ to download the whole ImageNet dataset. In the training step that follows, replace /path/to/imagenet with your ImageNet path.

    The ImageNet dataset path structure should look like:

    imagenet
    ├── train
    │   └── n01440764
    │       ├── n01440764_10026.JPEG
    │       └── ...
    ├── train_list.txt
    ├── val
    │   └── n01440764
    │       ├── ILSVRC2012_val_00000293.JPEG
    │       └── ...
    └── val_list.txt
    

    Step 2: Training

    Multiple GPUs on one machine (AMP)

    Set data path by export DATA_PATH=/path/to/imagenet. The following command uses all cards to train:

    bash train_inception_v3_amp_dist.sh
    

7. DPN92

  • Step 1: Installing

    pip3 install -r requirements.txt
    

    Sign up and log in on the ImageNet official website, then choose ‘Download’ to download the whole ImageNet dataset. In the training step that follows, replace /path/to/imagenet with your ImageNet path.

    The ImageNet dataset path structure should look like:

    imagenet
    ├── train
    │   └── n01440764
    │       ├── n01440764_10026.JPEG
    │       └── ...
    ├── train_list.txt
    ├── val
    │   └── n01440764
    │       ├── ILSVRC2012_val_00000293.JPEG
    │       └── ...
    └── val_list.txt
    

    Step 2: Training

    Multiple GPUs on one machine (AMP)

    Set data path by export DATA_PATH=/path/to/imagenet. The following command uses all cards to train:

    bash train_dpn92_amp_dist.sh
    

8. LeNet

  • Step 1: Installing

    pip3 install torch
    pip3 install torchvision
    

    Sign up and log in on the ImageNet official website, then choose ‘Download’ to download the whole ImageNet dataset. In the training command that follows, replace /path/to/imagenet with your ImageNet path.

    The ImageNet dataset path structure should look like:

    imagenet
    ├── train
    │   └── n01440764
    │       ├── n01440764_10026.JPEG
    │       └── ...
    ├── train_list.txt
    ├── val
    │   └── n01440764
    │       ├── ILSVRC2012_val_00000293.JPEG
    │       └── ...
    └── val_list.txt
    

    Step 2: Training

    One single GPU

    python3 train.py --data-path /path/to/imagenet --model lenet 
    

9. SqueezeNet

  • Step 1: Installing

    pip3 install torch torchvision
    

    Sign up and log in on the ImageNet official website, then choose ‘Download’ to download the whole ImageNet dataset. In the training command that follows, replace /path/to/imagenet with your ImageNet path.

    The ImageNet dataset path structure should look like:

    imagenet
    ├── train
    │   └── n01440764
    │       ├── n01440764_10026.JPEG
    │       └── ...
    ├── train_list.txt
    ├── val
    │   └── n01440764
    │       ├── ILSVRC2012_val_00000293.JPEG
    │       └── ...
    └── val_list.txt
    

    Step 2: Training

    One single GPU

    python3 train.py --data-path /path/to/imagenet --model squeezenet1_0 --lr 0.001
    

V. Summary and Recommendations on Using the Iluvatar CoreX GPGPU

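Through these experiments, I found that the Iluvatar CoreX GPGPU is well compatible with the PyTorch framework: no special changes to the CUDA device type were needed, and the existing models ran without adaptation. It is a domestic compute solution well suited to serve as an NVIDIA alternative. I look forward to the future development of domestic computing hardware.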
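
The compatibility point can be illustrated with ordinary device-selection code. A minimal sketch, assuming (as the experiments here suggest, but not verified against Iluvatar documentation) that the BI-V100 PyTorch build exposes the card through the standard torch.cuda interface; pick_device is a hypothetical helper that falls back to CPU when PyTorch or an accelerator is unavailable:

```python
def pick_device():
    """Return "cuda" when torch reports a usable accelerator, else "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed in this environment
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    device = pick_device()
    print("training device:", device)
    # Typical usage in a training script (requires torch):
    #   model = model.to(device)
    #   images, targets = images.to(device), targets.to(device)
```

Under that assumption, scripts written with the usual "cuda" device string run on the card unchanged.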


yg9538, September 7, 2024, 22:12