# DMML2025

Reproduces the code implementation of Wang Feiming's group for the Machine Learning course at the School of Statistics and Data Science, Nankai University, applying conditional random fields (CRF) as a post-processing step for image segmentation.
## Image Segmentation Benchmark Suite

This is a comprehensive image segmentation evaluation project for comparing various types of segmentation models on shared datasets and reporting unified evaluation metrics. Our goal is to validate the practical effectiveness of CRF post-processing in image segmentation. The project ships evaluation modules for two default datasets, CrackForest and Pascal VOC 2012, with support for automatic download, dataset splitting, model training/fine-tuning, inference, metric calculation, and result archiving.
<table> <tr> <td align="center" style="width:50%"> <img src="figs/epoch_070_2007_007250_crf_comparison.png" width="95%"> </td> <td align="center" style="width:50%"> <img src="figs/epoch_070_2008_000197_crf_comparison.png" width="95%"> </td> </tr> </table>

## Key Features
- Supported data-driven baselines: CRF feature models, CNN, Transformer, DDP (Diffusion Model), Hybrid CNN-Transformer, CNN-CRF, and any model + CRF post-processing.
- Automated pipeline: Read configuration → Download/Load data → Build models → Evaluate → Export JSON/CSV results.
- Multi-metric evaluation: Pixel Accuracy, mIoU, Precision, Recall, F1, Dice, etc.
- Structured code: modular `src/segmentation_benchmark` package for easy extension with custom models or datasets.
- Unit tests covering basic components (metric calculation, data pipeline, registry).
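For reference, the relationship between these metrics can be sketched from a confusion matrix. This is a minimal NumPy illustration, not the project's actual `metrics` module:

```python
import numpy as np

def confusion_matrix(pred, target, num_classes):
    """Accumulate a (num_classes x num_classes) confusion matrix from label maps."""
    mask = (target >= 0) & (target < num_classes)
    idx = num_classes * target[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def pixel_accuracy(cm):
    """Fraction of pixels whose predicted class matches the ground truth."""
    return np.diag(cm).sum() / cm.sum()

def mean_iou(cm):
    """Mean over classes of intersection / union."""
    inter = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - inter
    return np.nanmean(inter / np.maximum(union, 1))

pred = np.array([[0, 1], [1, 1]])
target = np.array([[0, 1], [0, 1]])
cm = confusion_matrix(pred, target, num_classes=2)
print(pixel_accuracy(cm))  # 0.75
```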
## File Tree

```text
segmentation-benchmark/
├── configs/              # YAML configurations (default: crackforest_benchmark.yaml)
├── data/                 # Dataset download directory (auto-generated on first run)
├── scripts/              # Command-line scripts (download data, run benchmarks, etc.)
├── src/segmentation_benchmark/
│   ├── data/             # Dataset loading and splitting
│   ├── evaluation/       # Evaluators and registry
│   ├── metrics/          # Metric calculation
│   ├── models/           # Various segmentation model wrappers
│   └── utils/            # Configuration and path utilities
├── tests/                # Pytest test cases
├── reports/              # Evaluation outputs (auto-created)
├── artifacts/            # Training weights, etc. (placeholder directory)
├── requirements.txt      # Dependency list
└── pyproject.toml        # Package configuration
```
## Dataset Information

### CrackForest Dataset
- Dataset Name: CrackForest Dataset (contains 118 urban road crack images)
- Official Source: https://github.com/cuilimeng/CrackForest-dataset
- Usage License: This dataset is limited to non-commercial research purposes. Please follow the citation requirements in the project README when using it.
- Data Preparation: download manually by running `python scripts/download_crackforest.py`, or let the benchmark download it automatically on first run.
### Pascal VOC 2012 Dataset
- Dataset Name: Pascal VOC 2012 (contains semantic segmentation annotations for 21 classes)
- Official Source: http://host.robots.ox.ac.uk/pascal/VOC/voc2012/
- Usage License: The Pascal VOC dataset follows its official usage license, typically allowing use for academic research.
- Data Preparation: when `download: true` is set in the configuration file, the system automatically downloads the dataset via torchvision and saves it to the `data/voc/VOCdevkit/VOC2012/` directory.
- Note: the official download site is often unstable. If you need the dataset, please contact wangfeiming@mail.nankai.edu.cn and we will seek ways to make it publicly available.
The default configuration splits the dataset into Train:Val:Test = 60% : 20% : 20%. This can be customized via YAML configuration.
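A deterministic split along those lines can be sketched as follows. This is a hypothetical helper for illustration; the project's actual splitting logic lives in `src/segmentation_benchmark/data/`:

```python
import random

def split_dataset(items, ratios=(0.6, 0.2, 0.2), seed=42):
    """Shuffle reproducibly, then cut into train/val/test by the given ratios."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed keeps splits stable across runs
    n = len(items)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]

train, val, test = split_dataset(range(118))  # CrackForest has 118 images
print(len(train), len(val), len(test))  # 70 23 25
```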
## Installation
This project requires a Python 3.10 environment. Please note that `pydensecrf` needs to be compiled and installed separately, and its build requires an older Cython release (pin `Cython<3`, e.g. the 0.29.x series), so downgrade Cython before compiling.
```shell
python -m venv .venv
.\.venv\Scripts\activate
pip install --upgrade pip
pip install -r requirements.txt

# Install development dependencies
pip install -e .[dev]
```
Use the above commands in Windows PowerShell; adjust the virtual environment activation method for other platforms.
## Quick Start

1. Dataset Download (Optional):

   For the CrackForest dataset:

   ```shell
   python scripts/download_crackforest.py
   ```

   For the Pascal VOC 2012 dataset, the system downloads automatically on first run (requires `download: true` in the configuration). To trigger this manually, simply run the benchmark script; missing datasets are detected and downloaded.

2. Run Benchmark Tests:

   Using the CrackForest dataset:

   ```shell
   python scripts/run_benchmark.py --config configs/crackforest_benchmark.yaml
   ```

   Using the Pascal VOC 2012 dataset:

   ```shell
   python scripts/run_benchmark.py --config configs/voc_benchmark.yaml
   ```

   After the benchmark completes, evaluation metrics for all models are saved to the `reports/<run_name>/` directory:

   - `<model>_metrics.json`: detailed evaluation metrics for each model
   - `benchmark_summary.csv` / `benchmark_summary.json`: comparison summary tables for all models

3. Parameter Configuration:

   - `--device cuda`: run on GPU (requires a CUDA environment)
   - `--skip-train`: skip the training/fine-tuning stage for all models and only run inference evaluation
   - `--save-predictions`: save each model's prediction masks as `.npy` files
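The flags above suggest a command-line interface along these lines. This is a sketch of how such a parser might look, not the actual code of `scripts/run_benchmark.py`:

```python
import argparse

def build_parser():
    """Hypothetical parser mirroring the benchmark script's documented flags."""
    parser = argparse.ArgumentParser(description="Run a segmentation benchmark.")
    parser.add_argument("--config", required=True, help="Path to the YAML configuration")
    parser.add_argument("--device", default="cpu", help="e.g. 'cuda' for GPU execution")
    parser.add_argument("--skip-train", action="store_true",
                        help="Skip training/fine-tuning; inference-only evaluation")
    parser.add_argument("--save-predictions", action="store_true",
                        help="Save per-model prediction masks as .npy files")
    return parser

args = build_parser().parse_args(
    ["--config", "configs/crackforest_benchmark.yaml", "--device", "cuda", "--skip-train"]
)
print(args.device, args.skip_train, args.save_predictions)  # cuda True False
```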
## Model Architecture Overview
| Type | Registered Name | Description |
| ---- | --------------- | ----------- |
| Feature + CRF | `classical_crf` | Handcrafted features + random forest + DenseCRF |
| CNN | `fcn_resnet50`, `deeplabv3_resnet50` | Torchvision semantic segmentation backbones, fine-tunable |
| Transformer | `segformer_b0` | HuggingFace SegFormer-B0 model |
| Diffusion Style | `random_walker` | DDP (diffusion model) |
| Hybrid | `hybrid_unet_transformer` | Custom CNN + multi-head self-attention hybrid model |
| CNN-CRF | `cnn_crf` | CNN prediction + DenseCRF end-to-end combination |
| Any Model + CRF Post-processing | `crf_wrapper` | Wraps any registered model and appends DenseCRF post-processing |
All models are registered via `segmentation_benchmark.evaluation.registry` and can be easily extended.
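A model registry of this kind is typically a name-to-builder map. The following is a minimal sketch of the pattern; the actual API of `evaluation.registry` may differ:

```python
_REGISTRY = {}

def register(name):
    """Decorator that records a model builder under a registered name."""
    def decorator(builder):
        _REGISTRY[name] = builder
        return builder
    return decorator

def build(name, **params):
    """Look up a registered builder and construct the model with params."""
    if name not in _REGISTRY:
        raise KeyError(f"Unknown model builder: {name!r}")
    return _REGISTRY[name](**params)

@register("dummy_model")
def build_dummy(threshold=0.5):
    # A stand-in "model"; a real builder would return a segmentation model.
    return {"threshold": threshold}

model = build("dummy_model", threshold=0.3)
print(model)  # {'threshold': 0.3}
```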
## CRF Post-processing Evaluation

This project supports adding DenseCRF post-processing evaluation to any registered model. CRF post-processing can improve the smoothness and accuracy of segmentation boundaries. In the configuration file, use the `crf_wrapper` builder to wrap any base model:
```yaml
models:
  # Base model
  - name: fcn_resnet50
    builder: fcn_resnet50
    params:
      pretrained: true
      finetune_epochs: 50

  # CRF post-processing version of the same model
  - name: fcn_resnet50_crf_post
    builder: crf_wrapper
    params:
      base_builder: fcn_resnet50   # Base model to wrap
      base_params:                 # Base model parameters
        pretrained: true
        finetune_epochs: 0         # Usually reuse the pretrained model; no fine-tuning
      crf_params:                  # CRF post-processing parameters
        iterations: 5              # CRF iteration count
        gaussian_sxy: 3            # Gaussian smoothing parameter
        bilateral_sxy: 80          # Bilateral filter spatial parameter
        bilateral_srgb: 13         # Bilateral filter color parameter
```
CRF Post-processing Parameter Description:

- `iterations`: CRF inference iteration count (default 5; more iterations may improve results at the cost of computation time)
- `gaussian_sxy`: spatial standard deviation for Gaussian smoothing (default 3)
- `bilateral_sxy`: spatial standard deviation for bilateral filtering (default 50-80)
- `bilateral_srgb`: color standard deviation for bilateral filtering (default 13)
- `compat_gaussian`: Gaussian compatibility weight (default 3)
- `compat_bilateral`: bilateral compatibility weight (default 10)
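To give a feel for what `iterations` and `compat_gaussian` control, here is a deliberately simplified mean-field inference in pure NumPy. The project itself uses DenseCRF (via `pydensecrf`); this sketch replaces the Gaussian pairwise kernel with a crude 3x3 box filter and omits the bilateral term entirely:

```python
import numpy as np

def smooth(prob):
    """3x3 box filter standing in for the Gaussian pairwise kernel."""
    padded = np.pad(prob, ((0, 0), (1, 1), (1, 1)), mode="edge")
    out = np.zeros_like(prob)
    for dy in range(3):
        for dx in range(3):
            out += padded[:, dy:dy + prob.shape[1], dx:dx + prob.shape[2]]
    return out / 9.0

def meanfield_crf(unary_probs, iterations=5, compat_gaussian=3.0):
    """Simplified mean-field inference for a Potts model: each pixel is
    rewarded for agreeing with its smoothed neighbourhood, then renormalised."""
    logp = np.log(np.clip(unary_probs, 1e-8, 1.0))
    q = unary_probs.copy()  # shape (num_classes, H, W)
    for _ in range(iterations):
        q = np.exp(logp + compat_gaussian * smooth(q))
        q /= q.sum(axis=0, keepdims=True)
    return q.argmax(axis=0)

# A 5x5 map that is confidently class 1 except one noisy pixel in the middle.
probs = np.full((2, 5, 5), 0.1)
probs[1] = 0.9
probs[:, 2, 2] = [0.8, 0.2]  # unary alone would label this pixel class 0
refined = meanfield_crf(probs)
print(refined[2, 2])  # 1: the noisy pixel is smoothed away
```

This illustrates why CRF post-processing cleans up isolated misclassifications and boundary noise: the pairwise term pulls each pixel toward the consensus of its neighbourhood while the unary term anchors confident regions.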
The current configuration file has added CRF post-processing evaluation versions for all major models (FCN, DeepLabV3, SegFormer, Hybrid UNet), and you can directly run benchmark tests for comparison.
## Automatic Checkpoint Management
This framework supports intelligent checkpoint management, automatically saving and loading trained models:
- Auto-save: after training completes, models are automatically saved to the `artifacts/checkpoints/` directory
- Auto-load: if the configuration is unchanged, previously trained models are automatically loaded on the next run, skipping the training stage
- CRF Post-processing Auto-matching: CRF post-processing versions automatically find and reuse trained base-model checkpoints
How It Works:

- Each checkpoint gets a unique hash derived from the model configuration (model name, number of classes, learning rate, training epochs, etc.)
- When configurations match exactly, the corresponding checkpoint is loaded automatically
- The CRF wrapper finds matching base-model checkpoints while ignoring the `finetune_epochs` parameter
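The matching rule described above can be sketched with a configuration hash. Field names here are illustrative; the project's actual hashing scheme may include different fields:

```python
import hashlib
import json

def checkpoint_key(config, ignore=()):
    """Stable hash of a model configuration, optionally ignoring some fields."""
    relevant = {k: v for k, v in sorted(config.items()) if k not in ignore}
    blob = json.dumps(relevant, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:16]

trained = {"name": "fcn_resnet50", "num_classes": 2, "lr": 1e-4, "finetune_epochs": 50}
wrapped = {"name": "fcn_resnet50", "num_classes": 2, "lr": 1e-4, "finetune_epochs": 0}

# An exact match fails because finetune_epochs differs...
print(checkpoint_key(trained) == checkpoint_key(wrapped))  # False
# ...but ignoring finetune_epochs, as the CRF wrapper does, finds the base checkpoint.
print(checkpoint_key(trained, ignore=("finetune_epochs",))
      == checkpoint_key(wrapped, ignore=("finetune_epochs",)))  # True
```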
Example:

```yaml
# First run: train and save a checkpoint
- name: fcn_resnet50
  builder: fcn_resnet50
  params:
    pretrained: true
    finetune_epochs: 50   # Train for 50 epochs; saved automatically afterwards

# Second run: the checkpoint is loaded automatically and training is skipped
- name: fcn_resnet50_crf_post
  builder: crf_wrapper
  params:
    base_builder: fcn_resnet50
    base_params:
      pretrained: true
      finetune_epochs: 0  # Finds and loads the trained checkpoint automatically
```
**Manual Checkp
