SkillAgentSearch skills...

PreSel

[CVPR 2025] An Implementation of the paper "Pre-Instruction Data Selection for Visual Instruction Tuning"

Install / Use

/learn @bardisafa/PreSel
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

<div align="center">

PreSel: Pre-Instruction Data Selection <br> for Visual Instruction Tuning

<img src="https://img.shields.io/badge/CVPR-2025-FFA500?style=for-the-badge&logo=google-scholar&logoColor=white">

🌟 CVPR 2025 Highlight Paper 🌟

Bardia SafaeiFaizan SiddiquiJiacong XuVishal M. PatelShao-Yuan Lo

Johns Hopkins University, Honda Research Institute USA

<a href='https://bardisafa.github.io/PreSel/'><img src='https://img.shields.io/badge/Project-Page-blue'></a> <a href='https://arxiv.org/abs/2503.07591'><img src='https://img.shields.io/badge/Paper-arXiv-red'></a>

</div> <hr />

Release Notes

  • [06/08/2025]: 🔥 PreSel codebase is released. The selected 15% data and the finetuned models on these selected data can be downloaded now.
<hr />

Contents

<hr />

Installation

1. Prepare the Environment

Please first install LLaVA:

cd PreSel
git clone https://github.com/haotian-liu/LLaVA.git

Then prepare the environment for LLaVA here.

Dataset Preparation

1. Download the Datasets

LLaVA-665K Dataset

For the LLaVA dataset, please download the LLaVA-665K dataset following the instructions from the LLaVA GitHub repository. This dataset is used for visual instruction tuning and contains a diverse set of visual-language examples.

Vision-FLAN Dataset

For the Vision-FLAN dataset, please download the data from the Vision-FLAN website. This dataset provides a comprehensive collection of visual-language tasks for instruction tuning.

After downloading the datasets, please place all data files in the /datasets directory.

2. Preprocess the Dataset

We first add a unique index for each instruction in the original dataset, to better identify each sample:

python data_process/preprocess.py \
    --raw_annotation_path datasets/your_dataset.json \
    --new_annotation_save_path datasets/processed_dataset.json

This script adds a unique identifier to each sample in your dataset, which is essential for the data selection process. The processed dataset will be saved to the specified path. We will be using the json files with the unique_idx included in the code.

Please note that as stated in the paper, for the LLaVA-1.5 dataset we remove the text-only instructions from the data, as our method focuses on selecting the images. You can either remove them yourself or use the already processed json file here.

3. Task Splits

For our method, we need to split the dataset into different tasks. We provide the task splits used in our experiments:

Place the downloaded and unzipped task split files in the data/ directory.

4. Reference Model Training

To estimate task importance values, we need a reference model trained on a small randomly selected reference dataset. You have two options:

Option 1: Use Our Pre-selected Reference Datasets

For LLaVA-1.5 and Vision-FLAN datasets, you can directly use our randomly selected reference datasets (5% of images and their corresponding instructions from each task):

  • LLaVA-1.5 reference data (randomly selected 5% images with instructions): Download JSON
  • Vision-FLAN reference data (randomly selected 5% images with instructions): Download JSON

Place the downloaded JSON files in the data/ directory.

Option 2: Create Your Own Reference Dataset

For custom datasets, you'll need to create a reference dataset by randomly sampling 5% of images along with their corresponding instructions from each task.

After preparing the reference dataset, fine-tune a LLaVA-7B model on it to obtain the reference model. For this step:

Fine-tune the LLaVA-7B model huggingface using LoRA training following the script provided here

This reference model will be used in later steps to estimate task-importance values.

Usage

1. Loss/Perplexity Calculations

First, process the reference data to remove the question parts of the instructions:

python data_process/remove_instruction.py \
    --input_path /data/round1_665k_notext.json \
    --output_path /data/round1_665k_notext_img_token.json

This will create a new file (/data/round1_665k_notext_img_token.json).


Then run the loss/perplexity calculations twice:

python presel/loss_ppl_calc.py \
    --data_path /data/round1_665k_notext.json \
    --model_path /PATH/TO/REFERENCE_MODEL \
    --image_folder /datasets \
    --output_file /data/loss_ppl_round1_665k_notext.json
python presel/loss_ppl_calc.py \
    --data_path /data/round1_665k_notext_img_token.json \
    --model_path /PATH/TO/REFERENCE_MODEL \
    --image_folder /datasets \
    --output_file /data/loss_ppl_round1_665k_notext_img_token.json
  • Replace /PATH/TO/REFERENCE_MODEL with the path to your reference model checkpoint.
  • Adjust --image_folder and --output_file as needed for your setup.

2. Task Importance Estimation

Run the following to get the estimated task-importance values required for our data selection approach:

python presel/llava_task_importance.py \
    --data_w_path /data/loss_ppl_round1_665k_notext.json \
    --data_wo_path /data/loss_ppl_round1_665k_notext_img_token.json \
    --reference_data_path /data/round1_665k_notext.json \
    --task_files_dir /data \
    --output_dir /data

3. Pre-Instruction Data Selection

First, we extract the visual features using DINOv2 model for each task (1 to 10 for the LLaVA dataset):

python data_process/extract_feats_665_dino.py --task_num TASK_NUM

Then run k-means clustering and sample selection:

python data_process/kmeans_clust.py --method typical

Finally, run the following command to finetune the model on the selected data. Make sure to set the BASE_DIR value appropriately. This code implements multi-round training where each round has a budget of 5% of the total data. Note that the results reported in the main paper correspond to round 3 (15% budget).

python presel/data_selection.py \
    --base_dir BASE_DIR \
    --method presel \
    --dataset_type llava

Running on the Vision-FLAN Dataset

For the Vision-FLAN dataset, the steps are similar to those for the LLaVA-1.5 dataset mentioned above. For "Loss/Perplexity Calculations", you can follow the same steps, but make sure to adjust the code to match the Vision-FLAN data format (e.g., JSON files, reference set, image folder, etc.).

For "Task Importance Estimation", you can directly download the estimated task importance values here and place it in /data directory.

For "Pre-Instruction Data Selection", first use the same script, data_process/extract_feats_665_dino.py, to extract VF features. Save the output as /data/dino_feats_vf/dino_feats_all_vf.pt. Then, run

python data_process/kmeans_clust_vf.py --method typical

Finally, run the following command to fine-tune the model on the selected Vision-FLAN data:

python presel/data_selection.py \
    --base_dir BASE_DIR \
    --method presel \
    --dataset_type vision_flan \
    --file_path /datasets/annotation_191-task_1k_add_idx.json

Finetuned Models & Selected Data (15%)

You can find our selected 15% subset of data via PreSel, as well as the fine-tuned models trained on it here:

| Dataset | 15% Selected Data by PreSel (JSON) | LLaVA-7B Model Finetuned | |---------|-----------------------------------|--------------------------| | LLaVA-1.5 | Download | Download | | Vision-FLAN | Download | Download |

Evaluation

Please follow the original LLaVA page and VLMEvalKit to evaluate models.

Citation

If you find this codebase useful for your research, please cite our

Related Skills

View on GitHub
GitHub Stars17
CategoryDevelopment
Updated3mo ago
Forks1

Languages

Python

Security Score

90/100

Audited on Jan 1, 2026

No findings