MCLE
Official code for the paper "Towards More Faithful Natural Language Explanation Using Multi-Level Contrastive Learning in VQA".
Requirements
- PyTorch 2.3.0
- qwen-vl-utils (install with pip install "qwen-vl-utils[decord]==0.0.8")
- transformers (install with pip install transformers)
- peft (install with pip install peft)
Quick Start
Model Download
The fine-tuned checkpoint of Qwen2.5-VL-3B-Instruct on the VQA-X dataset can be downloaded here: Qwen-SFT. Place it in checkpoints/Qwen2.5-SFT.
Dataset
The demo training set can be found in dataset/NLE/Demo/train.json, where ratio > 0 marks a positive case and ratio < 0 marks a negative case.
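The sign convention above can be sketched as follows. This is an illustration only: the `ratio` field and its sign semantics come from this README, but the other field names are assumptions, so check dataset/NLE/Demo/train.json for the exact schema.

```python
# Hypothetical records mirroring the described train.json convention;
# only the `ratio` field and its sign semantics are from the README.
records = [
    {"question": "What kind of building is this?",
     "explanation": "Because it contains a large light at the top. "
                    "So the answer is lighthouse.",
     "ratio": 1.0},   # ratio > 0: positive case
    {"question": "What kind of building is this?",
     "explanation": "Because it has a lighthouse on top. "
                    "So the answer is ship.",
     "ratio": -1.0},  # ratio < 0: negative case
]

# Split cases by the sign of `ratio`, as the README describes.
positives = [r for r in records if r["ratio"] > 0]
negatives = [r for r in records if r["ratio"] < 0]
print(len(positives), len(negatives))  # 1 1
```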
The demo test set can be found in dataset/NLE/Demo/test.json. For the example image (shown in the repository):
the question is "What kind of building is this?"
the ground-truth is "Because it contains a large light at the top. So the answer is lighthouse."
Code
First, you can evaluate the test set with Qwen2.5-SFT, the checkpoint of Qwen2.5-VL-3B-Instruct fine-tuned on VQA-X:
python evaluate.py
The output is "Because it has a lighthouse on top. So the answer is ship.", which is inconsistent with the ground-truth.
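One way to see the inconsistency is to compare the final answers parsed from the "Because ... So the answer is X." pattern used by the explanations. The parser below is an illustration, not part of the repository's evaluation code:

```python
def parse_answer(explanation: str) -> str:
    """Extract the final answer from a 'Because ... So the answer is X.' string."""
    marker = "So the answer is "
    tail = explanation.split(marker, 1)[1]
    return tail.rstrip(".").strip()

ground_truth = ("Because it contains a large light at the top. "
                "So the answer is lighthouse.")
sft_output = "Because it has a lighthouse on top. So the answer is ship."

# The explanations agree on the lighthouse evidence, but the answers differ.
print(parse_answer(ground_truth))  # lighthouse
print(parse_answer(sft_output))    # ship
print(parse_answer(ground_truth) == parse_answer(sft_output))  # False
```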
Second, you can train the Qwen2.5-SFT model with Multi-Level Contrastive Learning:
python train.py
Third, you can evaluate the test set with MCLE, the checkpoint of Qwen2.5-SFT trained on the demo training set with Multi-Level Contrastive Learning:
python evaluate.py --ckpt output/MCLE/checkpoint-3
The output is "Because it has a lighthouse on top. So the answer is lighthouse.", which reduces the inconsistency of the multimodal LLM in VQA.
Training on VQA-X dataset
Model Download
We utilize Qwen2.5-VL-3B-Instruct, released by Alibaba, as the backbone of our model. The pretrained model can be downloaded here: Qwen2.5-VL-3B-Instruct. Place it in checkpoints/Qwen2.5-VL-3B-Instruct.
Dataset Download
The training set of the VQA-X dataset, with a portion of negative samples, can be found in dataset/NLE/VQA-X/train.json.
The test set of VQA-X can be found in dataset/NLE/VQA-X/test.json.
The image files can be downloaded here: COCO train2014 and val2014. Place them in dataset/NLE/VQA-X.
Code
Train the Qwen2.5-VL-3B-Instruct model with Multi-Level Contrastive Learning:
python train.py --model_path checkpoints/Qwen2.5-VL-3B-Instruct --train_path dataset/NLE/VQA-X/train.json --learning_rate 1e-5
Evaluate the model on the VQA-X test set with the checkpoint of Qwen2.5-VL-3B-Instruct trained on the VQA-X training set:
python evaluate.py --model_path checkpoints/Qwen2.5-VL-3B-Instruct --test_path dataset/NLE/VQA-X/test.json --ckpt output/MCLE/checkpoint-** --metric True