PreSel

[CVPR 2025] An Implementation of the paper "Pre-Instruction Data Selection for Visual Instruction Tuning"

Generate Convert Improve

Install / Use

/learn @bardisafa/PreSel

About this skill

Quality Score

0/100

README

PreSel: Pre-Instruction Data Selection <br> for Visual Instruction Tuning

🌟 CVPR 2025 Highlight Paper 🌟

Bardia Safaei Faizan Siddiqui Jiacong Xu Vishal M. Patel Shao-Yuan Lo

Johns Hopkins University, Honda Research Institute USA

</div> <hr />

Release Notes

[06/08/2025]: 🔥 PreSel codebase is released. The selected 15% data and the finetuned models on these selected data can be downloaded now.

<hr />

Installation
- Prepare the Environment
Dataset Preparation
Usage
Finetuned Models & Selected Data
Evaluation
Citation

<hr />

Installation

1. Prepare the Environment

Please first install LLaVA：

cd PreSel
git clone https://github.com/haotian-liu/LLaVA.git

Then prepare the environment for LLaVA here.

Dataset Preparation

1. Download the Datasets

LLaVA-665K Dataset

For the LLaVA dataset, please download the LLaVA-665K dataset following the instructions from the LLaVA GitHub repository. This dataset is used for visual instruction tuning and contains a diverse set of visual-language examples.

Vision-FLAN Dataset

For the Vision-FLAN dataset, please download the data from the Vision-FLAN website. This dataset provides a comprehensive collection of visual-language tasks for instruction tuning.

After downloading the datasets, please place all data files in the /datasets directory.

2. Preprocess the Dataset

We first add a unique index for each instruction in the original dataset, to better identify each sample:

python data_process/preprocess.py \
    --raw_annotation_path datasets/your_dataset.json \
    --new_annotation_save_path datasets/processed_dataset.json

This script adds a unique identifier to each sample in your dataset, which is essential for the data selection process. The processed dataset will be saved to the specified path. We will be using the json files with the unique_idx included in the code.

Please note that as stated in the paper, for the LLaVA-1.5 dataset we remove the text-only instructions from the data, as our method focuses on selecting the images. You can either remove them yourself or use the already processed json file here.

3. Task Splits

For our method, we need to split the dataset into different tasks. We provide the task splits used in our experiments:

LLaVA-1.5 task splits: Download splits
Vision-FLAN dataset: Download splits

Place the downloaded and unzipped task split files in the data/ directory.

4. Reference Model Training

To estimate task importance values, we need a reference model trained on a small randomly selected reference dataset. You have two options:

Option 1: Use Our Pre-selected Reference Datasets

For LLaVA-1.5 and Vision-FLAN datasets, you can directly use our randomly selected reference datasets (5% of images and their corresponding instructions from each task):

LLaVA-1.5 reference data (randomly selected 5% images with instructions): Download JSON
Vision-FLAN reference data (randomly selected 5% images with instructions): Download JSON

Place the downloaded JSON files in the data/ directory.

Option 2: Create Your Own Reference Dataset

For custom datasets, you'll need to create a reference dataset by randomly sampling 5% of images along with their corresponding instructions from each task.

After preparing the reference dataset, fine-tune a LLaVA-7B model on it to obtain the reference model. For this step:

Fine-tune the LLaVA-7B model huggingface using LoRA training following the script provided here

This reference model will be used in later steps to estimate task-importance values.

Usage

1. Loss/Perplexity Calculations

First, process the reference data to remove the question parts of the instructions:

python data_process/remove_instruction.py \
    --input_path /data/round1_665k_notext.json \
    --output_path /data/round1_665k_notext_img_token.json

This will create a new file (/data/round1_665k_notext_img_token.json).

Then run the loss/perplexity calculations twice:

python presel/loss_ppl_calc.py \
    --data_path /data/round1_665k_notext.json \
    --model_path /PATH/TO/REFERENCE_MODEL \
    --image_folder /datasets \
    --output_file /data/loss_ppl_round1_665k_notext.json

python presel/loss_ppl_calc.py \
    --data_path /data/round1_665k_notext_img_token.json \
    --model_path /PATH/TO/REFERENCE_MODEL \
    --image_folder /datasets \
    --output_file /data/loss_ppl_round1_665k_notext_img_token.json

Replace /PATH/TO/REFERENCE_MODEL with the path to your reference model checkpoint.
Adjust --image_folder and --output_file as needed for your setup.

2. Task Importance Estimation

Run the following to get the estimated task-importance values required for our data selection approach:

python presel/llava_task_importance.py \
    --data_w_path /data/loss_ppl_round1_665k_notext.json \
    --data_wo_path /data/loss_ppl_round1_665k_notext_img_token.json \
    --reference_data_path /data/round1_665k_notext.json \
    --task_files_dir /data \
    --output_dir /data

3. Pre-Instruction Data Selection

First, we extract the visual features using DINOv2 model for each task (1 to 10 for the LLaVA dataset):

python data_process/extract_feats_665_dino.py --task_num TASK_NUM

Then run k-means clustering and sample selection:

python data_process/kmeans_clust.py --method typical

Finally, run the following command to finetune the model on the selected data. Make sure to set the BASE_DIR value appropriately. This code implements multi-round training where each round has a budget of 5% of the total data. Note that the results reported in the main paper correspond to round 3 (15% budget).

python presel/data_selection.py \
    --base_dir BASE_DIR \
    --method presel \
    --dataset_type llava

Running on the Vision-FLAN Dataset

For the Vision-FLAN dataset, the steps are similar to those for the LLaVA-1.5 dataset mentioned above. For "Loss/Perplexity Calculations", you can follow the same steps, but make sure to adjust the code to match the Vision-FLAN data format (e.g., JSON files, reference set, image folder, etc.).

For "Task Importance Estimation", you can directly download the estimated task importance values here and place it in /data directory.

For "Pre-Instruction Data Selection", first use the same script, data_process/extract_feats_665_dino.py, to extract VF features. Save the output as /data/dino_feats_vf/dino_feats_all_vf.pt. Then, run

python data_process/kmeans_clust_vf.py --method typical

Finally, run the following command to fine-tune the model on the selected Vision-FLAN data:

python presel/data_selection.py \
    --base_dir BASE_DIR \
    --method presel \
    --dataset_type vision_flan \
    --file_path /datasets/annotation_191-task_1k_add_idx.json

Finetuned Models & Selected Data (15%)

You can find our selected 15% subset of data via PreSel, as well as the fine-tuned models trained on it here:

| Dataset | 15% Selected Data by PreSel (JSON) | LLaVA-7B Model Finetuned | |---------|-----------------------------------|--------------------------| | LLaVA-1.5 | Download | Download | | Vision-FLAN | Download | Download |

Evaluation

Please follow the original LLaVA page and VLMEvalKit to evaluate models.

Citation

If you find this codebase useful for your research, please cite our

Related Skills

node-connect

344.1k

Diagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps

frontend-design

96.8k

Create distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.

openai-whisper-api

344.1k

Transcribe audio via OpenAI Audio Transcriptions API (Whisper).

qqbot-media

344.1k

QQBot 富媒体收发能力。使用 <qqmedia> 标签，系统根据文件扩展名自动识别类型（图片/语音/视频/文件）。

bardisafa

View profile

View on GitHub

GitHub Stars17

CategoryDevelopment

Updated3mo ago

Forks1

bardisafa/PreSel

Languages

Python

Security Score

90/100

Audited on Jan 1, 2026

No findings

PreSel

Install / Use

README

PreSel: Pre-Instruction Data Selection <br> for Visual Instruction Tuning

Release Notes

Contents

Installation

1. Prepare the Environment

Dataset Preparation

1. Download the Datasets

LLaVA-665K Dataset

Vision-FLAN Dataset

2. Preprocess the Dataset

3. Task Splits

4. Reference Model Training

Option 1: Use Our Pre-selected Reference Datasets

Option 2: Create Your Own Reference Dataset

Usage

1. Loss/Perplexity Calculations

2. Task Importance Estimation

3. Pre-Instruction Data Selection

Running on the Vision-FLAN Dataset

Finetuned Models & Selected Data (15%)

Evaluation

Citation

Related Skills