AdderNetCUDA
A CUDA implementation of AdderNet.
AdderNet training accelerated by CUDA.
Usage
cd adder_cuda
python setup.py install
cd ..
python main.py
Environment
PyTorch 1.10.0, CUDA 11.3
Benchmark
| version          | training time per batch (s) |
| ---------------- | --------------------------- |
| raw              | 1.61                        |
| torch.cdist      | 1.49                        |
| cuda_unoptimized | 0.4508                      |
| this work        | 0.3158                      |
The CUDA version of AdderNet is about 5× faster than the original (raw) version: 0.3158 s versus 1.61 s per batch. The cuda_unoptimized version appears to have bugs that prevent the model from converging; its speed is listed only for comparison. The experiment trained ResNet-20 on CIFAR-10 on an RTX 2080 Ti.
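The speedup figures follow directly from the per-batch times in the benchmark table. A minimal sketch of that arithmetic is below; the helper name `speedup_over` is made up for illustration and is not part of this repository.

```python
# Per-batch training times in seconds, copied from the benchmark table.
times = {
    "raw": 1.61,
    "torch.cdist": 1.49,
    "cuda_unoptimized": 0.4508,
    "this work": 0.3158,
}


def speedup_over(baseline, times):
    """Return {version: baseline_time / version_time} (hypothetical helper)."""
    base = times[baseline]
    return {name: base / t for name, t in times.items()}


speedups = speedup_over("raw", times)
for name, s in speedups.items():
    print(f"{name:>18}: {s:.2f}x")
```

Relative to the raw version this gives roughly 1.08x for torch.cdist, 3.57x for cuda_unoptimized, and 5.10x for this work.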
| Time(%) | Time     | Calls | Avg      | Min      | Max      | Name                                          |
| ------- | -------- | ----- | -------- | -------- | -------- | --------------------------------------------- |
| 48.57   | 30.4752s | 3920  | 7.7743ms | 162.70us | 12.271ms | CONV_BACKWARD                                 |
| 34.85   | 21.8686s | 19680 | 1.1112ms | 5.3770us | 11.827ms | _ZN2at6native27unrolled_elementwise_kernel... |
| 7.46    | 4.67901s | 5920  | 790.37us | 26.529us | 1.5841ms | CONV                                          |
| 2.24    | 1.40372s | 3920  | 358.09us | 31.298us | 845.80us | col2im_kernel                                 |
| 2.10    | 1.31882s | 36862 | 35.777us | 1.4720us | 276.24us | vectorized_elementwise_kernel                 |
| 1.43    | 900.03ms | 5920  | 152.03us | 7.9040us | 372.40us | im2col_kernel                                 |
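The table above shows the GPU time distribution for training one epoch. The backward kernel (CONV_BACKWARD) accounts for almost half of the time, so it is the most promising target if you would like to optimize the CUDA kernels further.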
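When modifying a kernel, it can help to compare its output against a slow but obviously correct reference on tiny inputs. The sketch below is not taken from this repository. It assumes the standard AdderNet formulation, in which each output value is the negative L1 distance between an input patch and a filter, with no padding. The function name `adder2d_reference` and the nested-list layout are illustrative choices only.

```python
def adder2d_reference(x, w, stride=1):
    """Slow pure-Python reference for an adder layer (illustrative only).

    x: input as nested lists with shape [C][H][W]
    w: filters as nested lists with shape [K][C][kh][kw]
    Returns nested lists with shape [K][H_out][W_out], where each value is
    -sum(|patch - filter|) over channels and kernel positions. No padding.
    """
    C, H, W = len(x), len(x[0]), len(x[0][0])
    K, kh, kw = len(w), len(w[0][0]), len(w[0][0][0])
    h_out = (H - kh) // stride + 1
    w_out = (W - kw) // stride + 1
    out = [[[0.0] * w_out for _ in range(h_out)] for _ in range(K)]
    for k in range(K):
        for i in range(h_out):
            for j in range(w_out):
                acc = 0.0
                for c in range(C):
                    for di in range(kh):
                        for dj in range(kw):
                            xv = x[c][i * stride + di][j * stride + dj]
                            acc += abs(xv - w[k][c][di][dj])
                out[k][i][j] = -acc  # negative L1 distance
    return out


# One 3x3 input channel and one 2x2 filter.
x = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
w = [[[[1, 0], [0, 1]]]]
result = adder2d_reference(x, w)
print(result)  # [[[-10.0, -14.0], [-22.0, -26.0]]]
```

A check of this kind would copy a small random input through both the CUDA kernel and the reference and compare the results within a floating-point tolerance.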
