FastSegFormer
[ISSN 0168-1699, COMPUT ELECTRON AGR 2024] FastSegFormer: A knowledge distillation-based method for real-time semantic segmentation of surface defects in navel oranges.
This is the official repository for our work: FastSegFormer (PDF)
News
This work was accepted for publication in the journal Computers and Electronics in Agriculture on December 29, 2023.
Highlights
- Performance of different models on navel orange dataset (test set) against their detection speed on RTX3060:
- Performance of different models on navel orange dataset (test set) against their parameters:
Updates
- [x] The training and testing code is available here. (April/25/2023)
- [x] Created a PyQt interface for navel orange defect segmentation. (May/10/2023)
- [x] Produced a 30 FPS navel orange assembly-line simulation video. (May/13/2023)
- [x] Added yolov8n-seg and yolov8-seg instance segmentation training, testing, and prediction results. Jump to (December/10/2023)
Demos
- Some demos of the segmentation performance of our proposed FastSegFormer: original image (left), label image (middle), and FastSegFormer-P prediction (right). The original images include enhanced images.
- A demo of navel orange video segmentation: original video (left) and detection video (right). The actual detection reaches 45~55 FPS using half-precision (FP16) weight quantization and multi-threaded processing. (The reported speed includes the total latency of image pre-processing, inference, and post-processing.) The navel orange defect image and video detection UI is available at FastSegFormer-pyqt.
Overview
- An overview of the architecture of our proposed FastSegFormer-P. The architecture of FastSegFormer-E is derived from FastSegFormer-P by replacing the backbone network with EfficientFormerV2-S0.

- An overview of the proposed multi-resolution knowledge distillation. (To resolve the mismatch in size and channel count between the teacher and student networks' feature maps, the teacher's feature maps are down-sampled by bilinear interpolation, and the student's feature maps are passed through point-wise convolutions to increase the number of channels.)

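The alignment step above can be sketched in PyTorch. This is an illustrative sketch, not the repository's code: the tensor shapes are hypothetical, and only the two operations named in the text (bilinear down-sampling of the teacher map, a point-wise 1×1 convolution on the student map) are shown.

```python
import torch
import torch.nn.functional as F
from torch import nn

# Hypothetical shapes: the teacher feature map is larger spatially and wider
# in channels than the student's, as described for multi-resolution KD.
t_feat = torch.randn(2, 128, 64, 64)   # teacher: B x C_t x H_t x W_t
s_feat = torch.randn(2, 32, 28, 28)    # student: B x C_s x H_s x W_s

# Teacher side: bilinear down-sampling to the student's spatial size.
t_aligned = F.interpolate(t_feat, size=s_feat.shape[2:], mode="bilinear",
                          align_corners=False)

# Student side: point-wise (1x1) convolution to raise the channel count.
proj = nn.Conv2d(s_feat.shape[1], t_feat.shape[1], kernel_size=1)
s_aligned = proj(s_feat)

# Both are now B x C_t x H_s x W_s, so a feature-distillation loss can compare them.
assert t_aligned.shape == s_aligned.shape
```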
P&KL loss:
$$ L_{logits}(\text{S}) = \frac{1}{W_{s}\times H_{s}}(k_1t^2 \sum_{i \in R}\text{KL}(q_i^s, q_i^t) + (1 - k_1)\sum_{i \in R}\text{MSE}(p_i^s, p_i^t)) $$
Where $q_{i}^s$ represents the class probability of the $i$-th pixel output by the simple network S, $q_{i}^t$ represents the class probability of the $i$-th pixel output by the complex network T, $\text{KL}(\cdot)$ represents the Kullback-Leibler divergence, $p_{i}^s$ represents the $i$-th pixel output of the simple network S, $p_{i}^t$ represents the $i$-th pixel output of the complex network T, $\text{MSE}(\cdot)$ represents the mean squared error, $R=\{1,2,\ldots,W_s\times H_s\}$ represents the set of all pixels, and $t$ represents the temperature coefficient. In our experiments, $t=2$ and $k_1=0.5$.
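A minimal PyTorch sketch of this logits-distillation loss, assuming (as is standard in KD, though not spelled out here) that the softened probabilities $q$ are temperature-scaled softmaxes of the raw logits $p$; the function name and the per-batch averaging are our own:

```python
import torch
import torch.nn.functional as F

def pkl_loss(s_logits, t_logits, temp=2.0, k1=0.5):
    """Hypothetical sketch of the P&KL loss.

    s_logits, t_logits: B x C x H x W raw outputs of student S and teacher T.
    """
    b, c, h, w = s_logits.shape
    # Class probabilities softened by the temperature t (softmax over classes).
    log_q_s = F.log_softmax(s_logits / temp, dim=1)
    q_t = F.softmax(t_logits / temp, dim=1)
    # Sum of per-pixel KL divergences, averaged over the W_s x H_s pixels.
    kl = F.kl_div(log_q_s, q_t, reduction="sum") / (b * h * w)
    # Sum of per-pixel squared errors on the raw logits p.
    mse = F.mse_loss(s_logits, t_logits, reduction="sum") / (b * h * w)
    # k1 balances the two terms; t^2 rescales the KL gradient as usual in KD.
    return k1 * temp ** 2 * kl + (1.0 - k1) * mse
```

Both terms vanish when the student exactly matches the teacher, so the loss is zero at that point and non-negative elsewhere.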
NFD loss:
$$ L_{n}^{NFD} = \sum_{i=1}^n \frac{1}{W_s\times H_s} L_2(\text{Normal}(F_{i}^t), \text{Normal}(F_{i}^s)) $$
Where $n$ represents the number of intermediate feature maps, $W_s$ and $H_s$ represent the width and height of the simple model's feature maps, $L_2(\cdot)$ represents the Euclidean distance between feature maps, $F_{i}^t$ represents the $i$-th feature map generated by the complex network T, $F_{i}^s$ represents the $i$-th feature map generated by the simple network S, and $\text{Normal}$ represents the normalization of the feature maps over $(W, H)$; $\text{Normal}(\cdot)$ is given as follows:
$$ \bar{F} = \frac{1}{\sigma}(F - u) $$
where $F$ represents the original feature map, $\bar{F}$ represents the transformed feature map, and $u$ and $\sigma$ represent the mean and standard deviation of the features.
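The NFD loss can be sketched as follows. This is an illustrative reading of the two equations above, not the repository's implementation: we normalize each map to zero mean and unit standard deviation over $(H, W)$, take the squared L2 distance scaled by $1/(W_s\times H_s)$, and sum over the $n$ pairs of (already spatially and channel-wise aligned) feature maps. The `eps` term and function names are our own.

```python
import torch

def normalize_feat(feat, eps=1e-6):
    """Normal(.): zero-mean, unit-std normalization of each map over (H, W)."""
    mu = feat.mean(dim=(2, 3), keepdim=True)
    sigma = feat.std(dim=(2, 3), keepdim=True)
    return (feat - mu) / (sigma + eps)

def nfd_loss(s_feats, t_feats):
    """Hypothetical sketch of the normalized feature distillation (NFD) loss.

    s_feats / t_feats: lists of n aligned feature maps, each B x C x H_s x W_s.
    """
    loss = 0.0
    for f_s, f_t in zip(s_feats, t_feats):
        h, w = f_s.shape[2:]
        # Squared L2 distance between normalized maps, scaled by 1/(W_s * H_s).
        diff = normalize_feat(f_t) - normalize_feat(f_s)
        loss = loss + diff.pow(2).sum() / (h * w)
    return loss
```

Because both maps are normalized first, the loss compares the *shape* of the activations rather than their absolute scale, which is what makes teacher and student features comparable.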
Models
- Pretrained backbone network:
| Model (ImageNet-1K) | Input size | ckpt |
|:--------------------:|:---------------:|:--------:|
| EfficientFormerV2-S0 | $224\times 224$ | download |
| EfficientFormerV2-S1 | $224\times 224$ | download |
| PoolFormer-S12 | $224\times 224$ | download |
| PoolFormer-S24 | $224\times 224$ | download |
| PoolFormer-S36 | $224\times 224$ | download |
| PIDNet-S | $224\times 224$ | download |
| PIDNet-M | $224\times 224$ | download |
| PIDNet-L | $224\times 224$ | download |
- Teacher network:
| Model | Input size | mIoU(%) | mPA(%) | Params | GFLOPs | ckpt |
|:---------------:|:---------------:|:-------:|:------:|:------:|:------:|:--------:|
| Swin-T-Att-UNet | $512\times 512$ | 90.53 | 94.65 | 49.21M | 77.80 | download |
- FastSegFormer after fine-tuning and knowledge distillation:
| Model | Input size | mIoU(%) | mPA(%) | Params | GFLOPs | RTX3060(FPS) | RTX3050Ti(FPS) | ckpt | onnx |
|:---------------:|:---------------:|:-------:|:------:|:------:|:------:|:------------:|:--------------:|:--------:|:--------:|
| FastSegFormer-E | $224\times 224$ | 88.78 | 93.33 | 5.01M | 0.80 | 61 | 54 | download | download |
| FastSegFormer-P | $224\times 224$ | 89.33 | 93.78 | 14.87M | 2.70 | 108 | 93 | download | download |
Ablation study
All results and logs of our experiments are available in the logs dir, including the ablation study and comparisons with other lightweight models.
- The accuracy (mIoU) of FastSegFormer models with different network structures (PPM, MSP, and the image reconstruction branch) on the validation set:
- Knowledge distillation (KD) and fine-tuning (†):
| Model | mIoU(%) | mPA(%) | mPrecision(%) | Params | GFLOPs |
|:---------------:|:-------:|:------:|:-------------:|:------:|:------:|
| FastSegFormer-E | 86.51 | 91.63 | | | |
