
Evolutionary Multimodal Reasoning via Hierarchical Semantic Representation for Intent Recognition

This repository provides the official PyTorch implementation of the research paper:

Evolutionary Multimodal Reasoning via Hierarchical Semantic Representation for Intent Recognition (Accepted by CVPR2026).

1. Introduction

Multimodal intent recognition aims to infer human intents by jointly modeling various modalities, playing a pivotal role in real-world dialogue systems. However, current methods struggle to model the hierarchical semantics underlying complex intents and lack the capacity for self-evolving reasoning over multimodal representations. To address these issues, we propose HIER, a novel method that integrates HIerarchical semantic representation with Evolutionary Reasoning built on a multimodal large language model (MLLM).

2. Dependencies

Our implementation is based on LLaMA-Factory and performs LoRA fine-tuning for training and evaluation.

We recommend using Anaconda to create the Python environment and install required libraries:

conda create -n hier python=3.12 -y
conda activate hier
pip install -U torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
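After installing, a quick sanity check can confirm the interpreter version and whether PyTorch is importable. This snippet is an illustrative addition, not part of the repository, and it deliberately avoids failing when PyTorch is absent:

```python
# Illustrative environment check (not part of this repository).
import sys
from importlib.util import find_spec

def env_report() -> dict:
    """Report whether the Python version is recent enough and torch is importable."""
    return {
        "python_ok": sys.version_info >= (3, 9),
        "torch_installed": find_spec("torch") is not None,
    }

if __name__ == "__main__":
    for key, value in env_report().items():
        print(f"{key}: {value}")
```

If `torch_installed` is `False`, re-run the `pip install` commands above inside the activated `hier` environment.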

3. Usage

3.1 Data

The data can be downloaded from the following link:

https://drive.google.com/drive/folders/1nCkhkz72F6ucseB73XVbqCaDG-pjhpSS

3.2 Configuration Files

All configuration files for the different model × dataset combinations are located under:

/LLaMA-Factory/examples/train_lora/

The YAML files follow a clear naming convention, e.g.:

qwen2vl_lora_sft_mintrec2.yaml
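For reference, a minimal config in this style might look like the sketch below. The field names follow common LLaMA-Factory conventions, but the model path, dataset name, and hyperparameter values here are placeholders rather than values taken from this repository:

```yaml
### Illustrative sketch of a LoRA SFT config; all values are placeholders.
model_name_or_path: Qwen/Qwen2-VL-7B-Instruct
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
dataset: mintrec2              # dataset registered in data/dataset_info.json
template: qwen2_vl
cutoff_len: 2048
output_dir: saves/qwen2vl/lora/sft_mintrec2
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
bf16: true
```

Consult the actual YAML files in `/LLaMA-Factory/examples/train_lora/` for the exact fields and values used in our experiments.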

3.3 Run Training / Testing

# Training
CUDA_VISIBLE_DEVICES=0,1 llamafactory-cli train examples/train_lora/qwen2vl_lora_sft_mintrec2.yaml
# Testing / Evaluation
CUDA_VISIBLE_DEVICES=0,1 llamafactory-cli train examples/train_lora/qwen2vl_lora_test_mintrec2.yaml

4. Model

The overall model architecture is shown below:

(Figure: overview of the HIER architecture)

5. Experimental Results

(Figure: experimental results)

6. Citation

If you are interested in this work and would like to use the code or results in this repository, please star the repository and cite:

@misc{zhou2026evolutionarymultimodalreasoninghierarchical,
      title={Evolutionary Multimodal Reasoning via Hierarchical Semantic Representation for Intent Recognition}, 
      author={Qianrui Zhou and Hua Xu and Yunjin Gu and Yifan Wang and Songze Li and Hanlei Zhang},
      year={2026},
      eprint={2603.03827},
      archivePrefix={arXiv},
      primaryClass={cs.MM},
      url={https://arxiv.org/abs/2603.03827}, 
}

7. Acknowledgements

Some of the code in this repository is built upon and adapted from LLaMA-Factory. We sincerely thank the authors and contributors for their open-source efforts.

If you have any questions or encounter issues, please open an issue and describe your environment, commands, and error logs as clearly as possible.
