MCLE
Official code for the paper "Towards More Faithful Natural Language Explanation Using Multi-Level Contrastive Learning in VQA".
Requirements
- PyTorch 2.3.0
- qwen-vl-utils (install with pip install "qwen-vl-utils[decord]==0.0.8")
- transformers (install with pip install transformers)
- peft (install with pip install peft)
Quick Start
Model Download
The fine-tuned checkpoint of Qwen2.5-VL-3B-Instruct on the VQA-X dataset can be downloaded here: Qwen-SFT. Place it in checkpoints/Qwen2.5-SFT.
Dataset
The demo training set can be found in dataset/NLE/Demo/train.json, where ratio > 0 marks a positive case and ratio < 0 marks a negative case.
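The sign convention above can be sketched as follows. This is an illustration only: the `ratio` field and its sign semantics come from this README, but the other field names are assumptions, so check dataset/NLE/Demo/train.json for the exact schema.

```python
# Hypothetical records mirroring the described train.json convention;
# only the `ratio` field and its sign semantics are from the README.
records = [
    {"question": "What kind of building is this?",
     "explanation": "Because it contains a large light at the top. "
                    "So the answer is lighthouse.",
     "ratio": 1.0},   # ratio > 0: positive case
    {"question": "What kind of building is this?",
     "explanation": "Because it has a lighthouse on top. "
                    "So the answer is ship.",
     "ratio": -1.0},  # ratio < 0: negative case
]

# Split cases by the sign of `ratio`, as the README describes.
positives = [r for r in records if r["ratio"] > 0]
negatives = [r for r in records if r["ratio"] < 0]
print(len(positives), len(negatives))  # 1 1
```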
The demo test set can be found in dataset/NLE/Demo/test.json. For the example image (shown in the repository):
the question is "What kind of building is this?"
the ground-truth is "Because it contains a large light at the top. So the answer is lighthouse."
Code
First, you can evaluate the test set with Qwen2.5-SFT, the checkpoint of Qwen2.5-VL-3B-Instruct fine-tuned on VQA-X:
python evaluate.py
The output is "Because it has a lighthouse on top. So the answer is ship.", which is inconsistent with the ground-truth.
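One way to see the inconsistency is to compare the final answers parsed from the "Because ... So the answer is X." pattern used by the explanations. The parser below is an illustration, not part of the repository's evaluation code:

```python
def parse_answer(explanation: str) -> str:
    """Extract the final answer from a 'Because ... So the answer is X.' string."""
    marker = "So the answer is "
    tail = explanation.split(marker, 1)[1]
    return tail.rstrip(".").strip()

ground_truth = ("Because it contains a large light at the top. "
                "So the answer is lighthouse.")
sft_output = "Because it has a lighthouse on top. So the answer is ship."

# The explanations agree on the lighthouse evidence, but the answers differ.
print(parse_answer(ground_truth))  # lighthouse
print(parse_answer(sft_output))    # ship
print(parse_answer(ground_truth) == parse_answer(sft_output))  # False
```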
Second, you can train the Qwen2.5-SFT model with Multi-Level Contrastive Learning:
python train.py
Third, you can evaluate the test set with MCLE, the checkpoint of Qwen2.5-SFT trained on the demo training set with Multi-Level Contrastive Learning:
python evaluate.py --ckpt output/MCLE/checkpoint-3
The output is "Because it has a lighthouse on top. So the answer is lighthouse.", which reduces the inconsistency of the multimodal LLM in VQA.
Training on VQA-X dataset
Model Download
We utilize Qwen2.5-VL-3B-Instruct, released by Alibaba, as the backbone of our model. The pretrained model can be downloaded here: Qwen2.5-VL-3B-Instruct. Place it in checkpoints/Qwen2.5-VL-3B-Instruct.
Dataset Download
The training set of the VQA-X dataset, with a portion of negative samples, can be found in dataset/NLE/VQA-X/train.json.
The test set of VQA-X can be found in dataset/NLE/VQA-X/test.json.
The image files can be downloaded here: COCO train2014 and val2014. Place them in dataset/NLE/VQA-X.
Code
Train the Qwen2.5-VL-3B-Instruct model with Multi-Level Contrastive Learning:
python train.py --model_path checkpoints/Qwen2.5-VL-3B-Instruct --train_path dataset/NLE/VQA-X/train.json --learning_rate 1e-5
Evaluate the model on the VQA-X test set with the checkpoint of Qwen2.5-VL-3B-Instruct trained on the VQA-X training set:
python evaluate.py --model_path checkpoints/Qwen2.5-VL-3B-Instruct --test_path dataset/NLE/VQA-X/test.json --ckpt output/MCLE/checkpoint-** --metric True