MSUNet-v3

MSUNet-v3 is a deep convolutional neural network that won 4th place in the CIFAR-100 image classification task in the 2019 NeurIPS Google MicroNet Challenge.

It combines mixed depthwise convolution, quantization and sparsification to achieve a lightweight yet effective network architecture.

Results

|Model | Top-1 Test Accuracy | #Flops | #Parameters | Score |
|---|---|---|---|---|
|MSUNet-v1| 80.47% | 118.6 M | 0.2119 M | 0.01711 |
|MSUNet-v2| 80.30% | 97.01 M | 0.1204 M | 0.01255 |
|MSUNet-v3| 80.13% | 85.27 M | 0.09853 M | 0.01083 |
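
The Score column can be reproduced from the other two columns. The sketch below assumes the CIFAR-100 baseline budgets of the MicroNet Challenge (WideResNet-28-10 with 36.5 M parameters and 10.49 B math operations), as we recall them from the challenge rules. The function name `micronet_score` is ours and is not part of this repository.

```python
# Baseline budgets for the CIFAR-100 track (WideResNet-28-10).
# Assumption: values as recalled from the MicroNet Challenge rules.
BASELINE_PARAMS_M = 36.5     # millions of parameters
BASELINE_FLOPS_M = 10490.0   # millions of math operations (10.49 B)


def micronet_score(params_m: float, flops_m: float) -> float:
    """Normalized score: parameters and operations relative to the baseline.

    Both arguments are in millions, as in the results table. Lower is better.
    """
    return params_m / BASELINE_PARAMS_M + flops_m / BASELINE_FLOPS_M


if __name__ == "__main__":
    rows = {
        "MSUNet-v1": (0.2119, 118.6),
        "MSUNet-v2": (0.1204, 97.01),
        "MSUNet-v3": (0.09853, 85.27),
    }
    for name, (params_m, flops_m) in rows.items():
        print(f"{name}: {micronet_score(params_m, flops_m):.5f}")
```

Under these assumed baselines the three rows evaluate to the scores listed in the table, to the printed precision.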

We follow a train-then-prune pipeline: we first train a network with ternary weights, then prune it to further sparsify the squeeze-excitation and dw-conv layers while quantizing the weights to FP16.

The test accuracy in the training stage: [figure omitted]

The test accuracy in the pruning stage: [figure omitted]

Note that the model reached the target sparsity after 100 epochs.

Design details

MSUNet is built on four key techniques: 1) ternary conv layers, 2) sparse conv layers, 3) quantization and 4) a self-supervised consistency regularizer. These techniques are briefly described below.

We implement our model in PyTorch. Our repository is built on top of pytorch-image-models (by Ross Wightman).

Ternary convolutional layers
- Some convolutional layers are ternarized, i.e., the weights are either -1, 0 or +1.
- Although our implementation also supports binary weights, we find that ternary weights generally perform better. Therefore, we stick to ternary weights for some convolution layers.
- We follow an approach similar to "Training wide residual networks for deployment using a single bit for each weight" to represent the ternary weights: each layer uses $\sqrt{M}\cdot\mathrm{Ternary}(W)$, where $W$ is the weight, $\mathrm{Ternary}(W)$ quantizes the weights to $\{-1, 0, +1\}$, and $\sqrt{M}$ is an FP16 multiplier that scales all the weights in a particular convolutional layer.
- The code snippet in validate.py that implements the ternary operation is shown below:
```
import math
import torch

# Excerpt: `alpha` (ternarization threshold) and `args` (parsed options)
# are module-level globals in validate.py.
class ForwardSign(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        global alpha
        # Standardize the weights, then threshold at +/- alpha to get {-1, 0, +1}.
        x_ternary = (x - x.mean()) / x.std()
        ones = (x_ternary > alpha).type(torch.cuda.FloatTensor)
        neg_ones = -1 * (x_ternary < -alpha).type(torch.cuda.FloatTensor)
        x_ternary = ones + neg_ones
        # Per-layer scale factor, enlarged by numel / nonzeros to account for the zeros.
        multiplier = math.sqrt(2. / (x.shape[1] * x.shape[2] * x.shape[3]) * x_ternary.numel() / x_ternary.nonzero().size(0))
        if args.amp:
            return (x_ternary.type(torch.cuda.HalfTensor), torch.tensor(multiplier).type(torch.cuda.HalfTensor))
        else:
            return (x_ternary.type(torch.cuda.FloatTensor), torch.tensor(multiplier).type(torch.cuda.FloatTensor))
```
- As there is a scale factor for the weights and we are implementing fake quantization, we assume that the multiplication by the scale factor is performed after convolving the input with the ternary weights.
- Also note that as the ternary weights tend to be sparse in our implementation, we assume that they are compatible with sparse matrix storage and sparse math operation.
- Therefore, the overall FLOPs are counted from three parts: 1) sparse 1-bit (-1, 1) multiplications in the convolution; 2) sparse FP32 additions in the convolution; and 3) FP16 multiplications that apply the scale factor to the output of the layer.
- And the number of parameters is calculated from three parts: 1) 1-bit (-1,1) representation of the non-zero values in weights; 2) bitmask of the full weights; 3) an FP16 scale factor for each convolutional layer.
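
The ternarization and scale factor described above can be illustrated without a GPU. The following is a minimal pure-Python sketch that mirrors the arithmetic of the snippet on a flat list of weights. It is not the repository's implementation: the function name `ternarize`, the example weights and the threshold value `alpha=0.5` are hypothetical.

```python
import math
import statistics


def ternarize(weights, fan_in, alpha):
    """Map a flat list of weights to {-1, 0, +1} and compute the layer scale.

    weights: flat list of floats holding all weights of one conv layer
    fan_in:  c_in * k * k of that layer
    alpha:   threshold on the standardized weights (hypothetical value below)
    """
    mean = statistics.fmean(weights)
    std = statistics.stdev(weights)  # sample std, like torch.Tensor.std()
    ternary = []
    for w in weights:
        z = (w - mean) / std
        ternary.append(1 if z > alpha else -1 if z < -alpha else 0)
    nonzero = sum(1 for t in ternary if t != 0)
    if nonzero == 0:
        raise ValueError("alpha is too large: every weight was set to zero")
    # sqrt(2 / fan_in), enlarged by numel / nonzeros to account for the zeros.
    scale = math.sqrt(2.0 / fan_in * len(ternary) / nonzero)
    return ternary, scale


if __name__ == "__main__":
    # Toy layer with c_out=2, c_in=1, k=2, so fan_in = 4 and numel = 8.
    w = [-0.9, -0.1, 0.0, 0.1, 0.9, 0.0, 0.05, -0.05]
    t, s = ternarize(w, fan_in=4, alpha=0.5)
    print(t, s)
```

In this toy layer only the two largest-magnitude weights survive, so the scale becomes sqrt(2/4 * 8/2) = sqrt(2).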
Sparse Squeeze-excitation and dw-conv layers
- As in pytorch-image-models, we implement squeeze-and-excitation with 1x1 convolutions applied to features that retain their spatial dimensions, which is equivalent to the fully connected layer implementation.
- In order to make the weights sparse, we perform pruning on the weights of squeeze-excitation layers and dw-conv layers.
- Therefore, the additions and multiplications come from the sparse 1x1 conv or dw-conv, that is, 1) 16-bit multiplications and 2) 32-bit additions.
- The number of parameters comes from two parts: 1) the FP16 non-zero values of the weights; 2) a bitmask of the full weights.
Mixed Precision
- We implement mixed precision training using NVIDIA's apex.amp tool with opt_level=O2, which casts the model weights to FP16, patches the model's forward method to cast input data to FP16, and keeps batch normalization layers in FP32 during training for numerical stability.
- At test time, we use apex.amp with opt_level=O3, which additionally casts the batch normalization layers to FP16 with little effect on the test accuracy.
Cutmix + self-supervised consistency regularizer
- During training, we use CutMix as a data augmentation technique. In addition, we propose a self-supervised consistency regularizer that, without using label information, enforces feature-level consistency between the features of a CutMix data point and the mixed features of the individual data points. We found that it helps the network predict consistent soft labels at CutMix points, and we observed a further accuracy improvement during training.
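
To make the idea of the regularizer concrete, here is a minimal pure-Python sketch on 1-D "images". It rests on assumptions that the text above does not state: that the individual features are mixed linearly with the CutMix ratio `lam`, and that the distance is a mean squared error. The names `cutmix_1d`, `consistency_loss` and the toy feature functions are hypothetical.

```python
def cutmix_1d(x_a, x_b, start, length):
    """Paste x_b[start:start+length] into a copy of x_a (1-D stand-in for a box)."""
    mixed = list(x_a)
    mixed[start:start + length] = x_b[start:start + length]
    lam = 1.0 - length / len(x_a)  # fraction of x_a that survives
    return mixed, lam


def consistency_loss(features, x_a, x_b, start, length):
    """MSE between features of the CutMix input and the lam-mixed features.

    No labels are involved: only the feature extractor and the two inputs.
    """
    mixed, lam = cutmix_1d(x_a, x_b, start, length)
    f_mix, f_a, f_b = features(mixed), features(x_a), features(x_b)
    target = [lam * a + (1.0 - lam) * b for a, b in zip(f_a, f_b)]
    return sum((m - t) ** 2 for m, t in zip(f_mix, target)) / len(target)


if __name__ == "__main__":
    x_a, x_b = [1.0, 1.0, 1.0, 1.0], [3.0, 3.0, 3.0, 3.0]
    mean_pool = lambda x: [sum(x) / len(x)]  # linear toy features
    max_pool = lambda x: [max(x)]            # non-linear toy features
    print(consistency_loss(mean_pool, x_a, x_b, start=1, length=2))
    print(consistency_loss(max_pool, x_a, x_b, start=1, length=2))
```

With the linear toy features the loss vanishes on this input, while the non-linear ones are penalized. In training, such a term would be added to the supervised CutMix loss with some weight, which is likewise an assumption here.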

Scoring

The flops and parameters are counted by their respective forward hooks.
As stated previously, layers with ternary weights are counted as sparse 1-bit multiplications, sparse 32-bit additions and sparse 1-bit matrix storage. We include the following code in counting.py to calculate the sparsity, bit mask and quantization divider:

    # For nn.Conv2d:    d is the quantization denominator for non-ternary/binary weights and operations
    #                   bd is the quantization denominator for ternary/binary weights and operations
    
    # For 32-bit full precision:    d==1
    # For 16-bit half precision:    d==2
    # Note that the additions inside GEMM are considered FP32 additions
    d = get_denominator(m, input, output)

    if isinstance(m, nn.Conv2d):
        # Normal convolution, depthwise convolution, and 1x1 pointwise convolution,
        # with sparse and/or ternary/binary weights are all handled in this block.

        c_out, c_in, k, k2 = m.weight.size()
        # Square kernel expected
        assert k == k2, 'The kernel is not square.'

        if hasattr(m, '_get_weight'):
            # A module that has a _get_weight attribute is ternarized.

            # Ternary weight is considered as sparse binary weights,
            # so we use a quantization denominator 32 for multiplication and storage.
            bd = 32 # denominator for binary mults and parameters
            if binarizable == 'T':
                # Using ternary quantization
                #print('Using Ternary weights')

                # Since ternary weights are considered as sparse binary weights,
                # we do have to store a bit mask to represent sparsity.
                local_param_count += c_out * c_in * k * k / 32
                sparsity = (m._get_weight('weight')[0].numel() - m._get_weight('weight')[0].nonzero().size(0)) / m._get_weight('weight')[0].numel()

                # Since our ternary/binary weights are scaled by a global factor in each layer,
                # we do have to store a FP32/FP16 digit to represent it.
                local_param_count += 1 / d # The scale factor
            elif binarizable == 'B':
                # Using binary quantization
                # Although we support binary quantization, we prefer to use ternary quantization.
                #print('Using Binary weights')
                # The FP32/FP16 scale factor
                local_param_count += 1 / d
                sparsity = 0
            else:
                raise ValueError('Option args.binarizable is incorrect')

            # Since our ternary/binary weights are scaled by a global factor, sqrt(M), in each layer,
            # which can be considered as multiplying a scale factor on the output of the sparse binary convolution.
            # We count it as FP32/FP16 multiplication on the output.
            local_flop_mults += np.prod(output.size()) / d
  
    ...
  
        # Number of parameters
        # For sparse parameters:                sparsity > 0
        # For dense parameters:                 sparsity=0
        # For 1-bit binary parameters:          bd==32
        # For 32-bit full precision parameters: bd==1
        # For 16-bit half precision parameters: bd==d==2
        # For depthwise convolution:            c_in==1
        local_param_count += c_out * c_in * k * k / bd * (1-sparsity)

        # Number of multiplications in convolution
        # For sparse multiplication:                sparsity > 0
        # For dense multiplication:                 sparsity=0
        # For 1-bit binary multiplication:          bd==32
        # For 32-bit full precision multiplication: bd==1
        # For 16-bit half precision parameters:     bd
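
As a worked example of the parameter-counting rules above, the following sketch computes the storage of a single sparse layer in 32-bit-equivalent units. It is a simplified stand-in for the forward hook in counting.py, not a copy of it. The function name `conv_param_count` and the layer sizes are hypothetical.

```python
def conv_param_count(numel, nonzeros, value_bits, has_scale=False, scale_bits=16):
    """Parameter count of one sparse conv layer in 32-bit-equivalent units.

    numel:      total number of weights (c_out * c_in * k * k)
    nonzeros:   number of weights that survive ternarization or pruning
    value_bits: bits per stored non-zero value (1 for ternary, 16 for FP16)
    has_scale:  True for ternary layers, which store one scale factor per layer
    """
    bitmask = numel / 32                  # 1 bit per weight position
    values = nonzeros * value_bits / 32   # stored non-zero values
    scale = scale_bits / 32 if has_scale else 0.0
    return bitmask + values + scale


if __name__ == "__main__":
    # Ternary 3x3 conv, 64 -> 64 channels, half of the weights non-zero.
    print(conv_param_count(64 * 64 * 3 * 3, 18432, value_bits=1, has_scale=True))
    # Pruned FP16 depthwise 3x3 conv with 64 channels, half of the weights non-zero.
    print(conv_param_count(64 * 1 * 3 * 3, 288, value_bits=16))
```

For the ternary layer this gives 1152 (bitmask) + 576 (1-bit values) + 0.5 (FP16 scale) = 1728.5, and for the pruned FP16 layer 18 + 144 = 162.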
