SLaK
[ICLR 2023] "More ConvNets in the 2020s: Scaling up Kernels Beyond 51x51 using Sparsity"; [ICML 2023] "Are Large Kernels Better Teachers than Transformers for ConvNets?"
Sparse Large Kernel Network - SLaK
Official PyTorch implementation of
(1) More ConvNets in the 2020s: Scaling up Kernels Beyond 51 x 51 using Sparsity, ICLR 2023.
Shiwei Liu, Tianlong Chen, Xiaohan Chen, Xuxi Chen, Qiao Xiao, Boqian Wu, Mykola Pechenizkiy, Decebal Mocanu, Zhangyang Wang
(2) Are Large Kernels Better Teachers than Transformers for ConvNets?, ICML 2023.
Tianjin Huang, Lu Yin, Zhenyu Zhang, Li Shen, Meng Fang, Mykola Pechenizkiy, Zhangyang Wang, Shiwei Liu
<p align="center"> <img src="https://github.com/Shiweiliuiiiiiii/SLaK/blob/main/SLaK.png" width="500" height="300"> </p>
We propose SLaK, a pure ConvNet model that, for the first time, scales convolutional kernels beyond 51x51.
Table of contents
- Installation
- Results of SLaK
- Results of large-2-small kernel Distillation
- Training of SLaK
- Downstream Transfer Code for Semantic Segmentation and Object Detection
- Training of large-2-small kernel distillation
Results and ImageNet-1K trained models
SLaK with 51x51 kernels trained on ImageNet-1K for 300 epochs
| name | resolution | kernel size | acc@1 | #params | FLOPs | model |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| ConvNeXt-T | 224x224 | 7x7 | 82.1 | 29M | 4.5G | ConvNeXt |
| ConvNeXt-S | 224x224 | 7x7 | 83.1 | 50M | 8.7G | ConvNeXt |
| ConvNeXt-B | 224x224 | 7x7 | 83.8 | 89M | 15.4G | ConvNeXt |
| SLaK-T | 224x224 | 51x51 | 82.5 | 30M | 5.0G | Google Drive |
| SLaK-S | 224x224 | 51x51 | 83.8 | 55M | 9.8G | Google Drive |
| SLaK-B | 224x224 | 51x51 | 84.0 | 95M | 17.1G | Google Drive |
SLaK-T with 31x31, 51x51, and 61x61 kernels trained on ImageNet-1K for 120 epochs
| name | resolution | kernel size | acc@1 | #params | FLOPs | model |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| SLaK-T | 224x224 | 31x31 | 81.5 | 30M | 4.8G | Surf Drive |
| SLaK-T | 224x224 | 51x51 | 81.6 | 30M | 5.0G | Surf Drive |
| SLaK-T | 224x224 | 61x61 | 81.5 | 31M | 5.2G | Surf Drive |
ConvNeXt distilled from SLaK via large-2-small kernel distillation on ImageNet-1K for 300 epochs
| name | resolution | kernel size | acc@1 | #params | FLOPs | model |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| ConvNeXt-T | 224x224 | 7x7 | 82.1 | 29M | 4.5G | ConvNeXt |
| ConvNeXt-S | 224x224 | 7x7 | 83.1 | 50M | 8.7G | ConvNeXt |
| ConvNeXt L2S-T | 224x224 | 7x7 | 83.1 | 29M | 4.5G | Surf Drive |
| ConvNeXt L2S-S | 224x224 | 7x7 | 84.3 | 50M | 8.7G | Surf Drive |
Installation
The code is tested with CUDA 11.3.1, cuDNN 8.2.0, and PyTorch 1.10.0 on A100 GPUs.
Dependency Setup
Create a new conda virtual environment:
conda create -n slak python=3.8 -y
conda activate slak
Install PyTorch >= 1.10.0. For example:
conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=11.3 -c pytorch -c conda-forge
Clone this repo and install required packages:
git clone https://github.com/Shiweiliuiiiiiii/SLaK.git
pip install timm tensorboardX six
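After installing the packages, an optional sanity check can confirm that the environment roughly matches the tested setup (PyTorch 1.10.0, CUDA 11.3, cuDNN 8.2); the versions printed on your machine may differ.

```python
# Optional environment check: print library versions and CUDA availability.
import torch
import torchvision
import timm

print("torch:", torch.__version__, "| torchvision:", torchvision.__version__, "| timm:", timm.__version__)
print("CUDA available:", torch.cuda.is_available(), "| CUDA version:", torch.version.cuda)
print("cuDNN version:", torch.backends.cudnn.version())
```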
To train SLaK, we follow RepLKNet and install the efficient large-kernel depthwise convolution (a PyTorch extension) provided by MegEngine:
- `cd cutlass/examples/19_large_depthwise_conv2d_torch_extension`, then run `./setup.py install --user`. If you get errors, (1) check your `CUDA_HOME`; (2) you might need to change the source code a bit to make tensors contiguous, see here for an example.
- A quick check: `python depthwise_conv2d_implicit_gemm.py`
- Add `WHERE_YOU_CLONED_CUTLASS/examples/19_large_depthwise_conv2d_torch_extension` to your `PYTHONPATH` so that you can `from depthwise_conv2d_implicit_gemm import DepthWiseConv2dImplicitGEMM` anywhere. You may then use `DepthWiseConv2dImplicitGEMM` as a replacement for `nn.Conv2d` (a minimal sanity-check sketch follows this list).
- `export LARGE_KERNEL_CONV_IMPL=WHERE_YOU_CLONED_CUTLASS/examples/19_large_depthwise_conv2d_torch_extension` so that SLaK will use the efficient implementation. Or you may simply modify the related code (`get_conv2d`) in `SLaK.py`.
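If you want to verify the extension before launching a full run, the following is a minimal sketch (not part of this repo). It assumes a CUDA GPU, the extension directory on your `PYTHONPATH`, and the constructor signature `DepthWiseConv2dImplicitGEMM(channels, kernel_size, bias)` as in the RepLKNet example; adjust if your version differs.

```python
# Hedged sanity check: compare the efficient implicit-GEMM depthwise convolution
# against an equivalent nn.Conv2d. Requires a CUDA GPU and the cutlass extension
# on PYTHONPATH; the constructor signature is assumed from the RepLKNet example.
import torch
import torch.nn as nn
from depthwise_conv2d_implicit_gemm import DepthWiseConv2dImplicitGEMM

channels, kernel_size = 64, 51
x = torch.randn(2, channels, 56, 56).cuda()

conv_fast = DepthWiseConv2dImplicitGEMM(channels, kernel_size, bias=False).cuda()

# Reference depthwise convolution with the same ("same"-padded) geometry and weights.
conv_ref = nn.Conv2d(channels, channels, kernel_size, padding=kernel_size // 2,
                     groups=channels, bias=False).cuda()
with torch.no_grad():
    conv_ref.weight.copy_(conv_fast.weight)
    # Outputs should agree up to numerical tolerance.
    print(torch.allclose(conv_fast(x), conv_ref(x), atol=1e-4))
```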
Training code
We provide ImageNet-1K training and fine-tuning commands here.
ImageNet-1K SLaK-T on a single machine
python -m torch.distributed.launch --nproc_per_node=4 main.py \
--Decom True --sparse --width_factor 1.3 -u 2000 --sparsity 0.4 --sparse_init snip --prune_rate 0.5 --growth random \
--epochs 300 --model SLaK_tiny --drop_path 0.1 --batch_size 128 \
--lr 4e-3 --update_freq 8 --model_ema true --model_ema_eval true \
--data_path /path/to/imagenet-1k --num_workers 40 \
--kernel_size 51 49 47 13 5 --output_dir /path/to/save_results
- To train/evaluate SLaK models, make sure that you add `--sparse --Decom True --kernel_size 51 49 47 13 5 --sparse_init snip` to your script. `--sparse`: enable the sparse model; `--sparsity`: model sparsity; `--width_factor`: model width; `-u`: adaptation frequency; `--prune_rate`: adaptation rate; `--kernel_size`: the long kernel edge of each of the 4 stages followed by the short kernel edge (see the conceptual sketch after this list).
- You can add `--use_amp true` to train in PyTorch's Automatic Mixed Precision (AMP).
- Use `--resume /path_or_url/to/checkpoint.pth` to resume training from a previous checkpoint; use `--auto_resume true` to auto-resume from the latest checkpoint in the specified output folder. To resume the training of sparse models, set `--sparse_init resume` to recover the masks.
- `--batch_size`: batch size per GPU; `--update_freq`: gradient accumulation steps.
- The effective batch size = `--nodes` * `--ngpus` * `--batch_size` * `--update_freq`. In the example above, the effective batch size is `1*4*128*8 = 4096`. You can adjust these four arguments together to keep the effective batch size at 4096 and avoid OOM issues, based on the model size, number of nodes, and GPU memory.
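The `--kernel_size` list and `--Decom True` reflect SLaK's kernel decomposition: instead of one square KxK depthwise kernel per stage, SLaK uses a pair of rectangular KxN and NxK depthwise kernels (here N = 5) whose outputs are summed. The sketch below is conceptual only; it omits the sparsity, width expansion, and normalization used in `SLaK.py`, and the class name is hypothetical.

```python
# Conceptual sketch (not the code in SLaK.py): a large 51x51 depthwise kernel
# approximated by a 51x5 and a 5x51 rectangular depthwise convolution pair.
import torch
import torch.nn as nn

class DecomposedLargeKernel(nn.Module):
    def __init__(self, dim, long_edge=51, short_edge=5):
        super().__init__()
        # "Same"-padded rectangular depthwise convolutions; summing them covers
        # the full long_edge x long_edge receptive field at much lower cost.
        self.kxn = nn.Conv2d(dim, dim, (long_edge, short_edge),
                             padding=(long_edge // 2, short_edge // 2),
                             groups=dim, bias=False)
        self.nxk = nn.Conv2d(dim, dim, (short_edge, long_edge),
                             padding=(short_edge // 2, long_edge // 2),
                             groups=dim, bias=False)

    def forward(self, x):
        return self.kxn(x) + self.nxk(x)

x = torch.randn(1, 96, 56, 56)
print(DecomposedLargeKernel(96)(x).shape)  # torch.Size([1, 96, 56, 56])
```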
ImageNet-1K SLaK-S on a single machine
python -m torch.distributed.launch --nproc_per_node=8 main.py \
--Decom True --sparse --width_factor 1.3 -u 100 --sparsity 0.4 --sparse_init snip --prune_rate 0.3 --growth random \
--epochs 300 --model SLaK_small --drop_path 0.4 --batch_size 64 \
--lr 4e-3 --update_freq 8 --model_ema true --model_ema_eval true \
--data_path /path/to/imagenet-1k --num_workers 40 \
--kernel_size 51 49 47 13 5 --output_dir /path/to/save_results
ImageNet-1K SLaK-B on a single machine
python -m torch.distributed.launch --nproc_per_node=16 main.py \
--Decom True --sparse --width_factor 1.3 -u 100 --sparsity 0.4 --sparse_init snip --prune_rate 0.3 --growth random \
--epochs 300 --model SLaK_base --drop_path 0.5 --batch_size 32 \
--lr 4e-3 --update_freq 8 --model_ema true --model_ema_eval true \
--data_path /path/to/imagenet-1k --num_workers 40 \
--kernel_size 51 49 47 13 5 --output_dir /path/to/save_results
To run ConvNeXt, simply set the kernel size as `--kernel_size 7 7 7 7 100`. (Make sure that the last number is larger than the first four numbers.)
Training code for large-kernel distillation
Distilling SLaK-S to ConvNeXt-S with NKD, 300 epochs
python -m torch.distributed.launch --nproc_per_node=4 main_KD.py \
--resume /path/to/SLaK-Small/checkpoint --Decom True --T 3.0 --width_factor 1.3 -u 2000 --distill_resume --lr_fd 3e-5 --epochs 300 --model SLaK_small --distill_type NKD --model_s SLaK_small --drop_path 0.1 --batch_size 64 --lr 4e-3 --update_freq 16 --model_ema true --model_ema_eval false \
--data_path /path/to/imagenet-1k --num_workers 40 \
--kernel_size 51 49 47 13 5 --output_dir /path/to/save_results
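For context on the `--T 3.0` flag: it is the softmax temperature used when matching the student to the teacher's soft predictions. The snippet below shows a generic temperature-scaled distillation loss only to illustrate what the temperature controls; it is not the NKD objective implemented in `main_KD.py`.

```python
# Illustration only: a generic temperature-scaled soft-label distillation loss.
# This is NOT the NKD loss used by main_KD.py; it only shows the role of T.
import torch
import torch.nn.functional as F

def soft_distillation_loss(student_logits, teacher_logits, T=3.0):
    # Soften both distributions with T; the T*T factor keeps gradient
    # magnitudes comparable across temperatures (Hinton et al., 2015).
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

student = torch.randn(8, 1000)
teacher = torch.randn(8, 1000)
print(soft_distillation_loss(student, teacher, T=3.0))
```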
Distilling SLaK-T to ConvNeXt-T with NKD, 300 epochs
outdir=/path/to/save_results
python -m torch.distributed.launch --nproc_per_node=4 main_KD.py \
--resume /path/to/SLaK-tiny/checkpoint --Decom True --T 3.0 --width_factor 1.3 -u 2000 --lr
