
Seeing the Forest and the Trees: Query-Aware Tokenizer for Long-Video Multimodal Language Models


🚀 Introduction

Despite recent advances in the video-understanding ability of multimodal large language models (MLLMs), long-video understanding remains a challenge. One of the main issues is that the number of vision tokens grows linearly with video length, causing an explosion in attention cost, memory, and latency. To address this challenge, we present the Query-aware Token Selector (QTSplus), a lightweight yet powerful visual token selection module that serves as an information gate between the vision encoder and the LLM.

Given a text query and video tokens, QTSplus dynamically selects the visual evidence most relevant to the query by (i) scoring visual tokens via cross-attention, (ii) predicting an instance-specific retention budget based on the complexity of the query, and (iii) keeping only the top-n tokens. A small re-encoder then preserves temporal order using absolute time information. Integrated into Qwen2.5-VL, QTSplus compresses the vision stream by up to 89% and reduces end-to-end latency by 28% on long videos.
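The three-step gating mechanism above can be sketched in a few lines of PyTorch. This is an illustrative sketch only, not the released implementation: the budget head here is a hypothetical stand-in for the learned module that predicts the retention budget, and all shapes, names, and the query-pooling scheme are assumptions.

```python
import torch

def qts_select(vision_tokens: torch.Tensor, query_emb: torch.Tensor,
               min_keep: int = 1):
    """Illustrative query-aware token selection (shapes and budget head assumed).

    vision_tokens: (T, d) video token embeddings
    query_emb:     (L, d) text-query token embeddings
    """
    d = vision_tokens.shape[-1]
    q = query_emb.mean(dim=0)              # pool the query tokens: (d,)
    # (i) score visual tokens against the query (scaled dot-product logits)
    scores = vision_tokens @ q / d ** 0.5  # (T,)
    # (ii) predict an instance-specific retention fraction from the query
    #      (made-up stand-in for the learned budget head)
    frac = torch.sigmoid(q.norm() / d ** 0.5).item()
    n = max(min_keep, int(frac * vision_tokens.shape[0]))
    # (iii) keep the top-n tokens, then restore temporal order
    #       before handing them to the small re-encoder
    keep = scores.topk(n).indices.sort().values
    return vision_tokens[keep], keep

tokens = torch.randn(120, 64)   # 120 video tokens, dim 64
query = torch.randn(8, 64)      # 8 query tokens
selected, kept = qts_select(tokens, query)
print(selected.shape, kept.shape)
```

Sorting the kept indices matters: `topk` orders tokens by score, but the re-encoder consumes them in temporal order, so the indices are re-sorted before gathering.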

QTSplus/
├── README.md, LICENSE, environment.sh
├── assets/                     # logos and figures for the project
├── config/                     # accelerate/deepspeed configs
├── evaluation/                 # inference & evaluation scripts
├── script/                     # example launch scripts
├── src/
│   ├── dataset/                # dataset loaders (VSCQ/VQA/LLaVA-Video-178K, etc.)
│   ├── model/                  # vision towers, QTS+ tokenizer, LM wrappers
│   ├── train/                  # training and fine‑tuning scripts
│   └── utils/                  # helpers (vision preprocessing, model split, etc.)
├── preprocess/                 # data-preprocessing utilities
└── verify/                     # smoke tests for models & pipelines

🚀 Quick Start

1. Download Pretrained Models

| Model | Download Link |
|-------|---------------|
| QTSplus-3B | HuggingFace |
| QTSplus-7B | HuggingFace |
| QTSplus-3B-FT | HuggingFace |

2. Inference Demo

import os
import glob
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from qwen_vl_utils import process_vision_info
from PIL import Image

# Function to build messages for video or image input
def build_messages(video: str | None, images_dir: str | None, prompt: str) -> list[dict]:
    if video:
        return [
            {
                "role": "user",
                "content": [
                    {"type": "video", "video": video, "max_pixels": 360 * 420, "fps": 1.0},
                    {"type": "text", "text": prompt or "Describe this video."},
                ],
            }
        ]
    if images_dir:
        image_list = sorted(glob.glob(os.path.join(images_dir, "*.jpeg")))
        if not image_list:
            image_list = sorted(glob.glob(os.path.join(images_dir, "*.jpg")))
        return [
                {
                    "role": "user",
                    "content": [
                        {"type": "video", "video": image_list},
                        {"type": "text", "text": prompt or "What is in these images?"},
                    ],
                }
            ]
    else:
        raise ValueError("Either video path or images directory must be provided.")

# Input Example
question = "What is happening in the video?"
video_path = "path/to/video.mp4"  # Set to None if using images
images_dir = None  

# Load model and processor
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float16

model = AutoModelForCausalLM.from_pretrained(
    "AlpachinoNLP/QTSplus-3B",
    trust_remote_code=True,
    local_files_only=True,
).to(dtype=dtype, device=device)

model.eval()

processor = AutoProcessor.from_pretrained(
    "AlpachinoNLP/QTSplus-3B", trust_remote_code=True, local_files_only=True
)

# Build messages for the input video or images
messages = build_messages(video_path, images_dir, question)
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)

inputs = processor(
    text=[text],
    images=None,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
inputs = inputs.to(dtype=dtype, device=device)  # match the model's dtype

# Extract and format the vision input for QTS+ model
pixel_values_videos = inputs.pop('pixel_values_videos', None)
video_grid_thw = inputs.pop('video_grid_thw', None)
inputs.pop('second_per_grid_ts', None)  # Remove unused parameter

# Format vision input as expected by QTS+ model
vision_input = None
if pixel_values_videos is not None and video_grid_thw is not None:
    vision_input = {
        'pixel_values_videos': pixel_values_videos,
        'video_grid_thw': video_grid_thw
    }
print("="*40)
# Build question_input_ids from the textual question only (avoid including system/vision tokens)
question_ids = processor.tokenizer(
    question,
    return_tensors="pt",
    add_special_tokens=False,
).input_ids.to(dtype=torch.long, device=device)

# Inference
generated_ids = model.generate(
    vision_input=vision_input,
    input_ids=inputs.input_ids,
    question_input_ids=question_ids,
    max_new_tokens=256,
)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
# Fallback: if trimming logic yields empty text (common when using inputs_embeds),
# decode the full sequences instead.
output_text = [
    txt if (txt is not None and txt.strip() != "") else processor.decode(ids, skip_special_tokens=True)
    for txt, ids in zip(output_text, generated_ids)
]
print(output_text[0])
print("="*40)

💿 Data

QTSplus-Dataset is designed to improve video understanding capabilities through three hierarchical datasets:

  • QTS-VSCQ1: A large dataset of visual single-choice questions synthesized by a text-only model (Qwen3-235B)
  • QTS-VSCQ2: A curated subset of QTS-VSCQ1, containing only questions that a vision-language model (Qwen2.5-VL) answers correctly
  • QTS-VQA: Free-form answers generated by a vision-language model for the questions in QTS-VSCQ2
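The QTS-VSCQ1 → QTS-VSCQ2 curation step above amounts to filtering on the VLM's own predictions. A minimal sketch, assuming each record carries `answer` and `prediction` fields (the field names are hypothetical, not taken from the released dataset):

```python
def curate_vscq2(records):
    """Split records by whether the VLM's prediction matches the gold answer.

    Field names ('answer', 'prediction') are assumed for illustration.
    """
    correct = [r for r in records if r["prediction"] == r["answer"]]
    wrong = [r for r in records if r["prediction"] != r["answer"]]
    return correct, wrong

records = [
    {"question": "What color is the shirt?", "answer": "B", "prediction": "B"},
    {"question": "How many people are there?", "answer": "C", "prediction": "A"},
]
kept, dropped = curate_vscq2(records)
print(len(kept), len(dropped))  # 1 1
```

The "correct" split becomes QTS-VSCQ2 (and the pool that QTS-VQA answers are generated for); the "wrong" split is retained separately, matching the correct/wrong example counts reported in the statistics below.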

Question Types

QTS-VSCQ1 includes 9 distinct question types to provide comprehensive coverage of video understanding capabilities:

| Type | Description | Example |
|------|-------------|---------|
| object_identity | Identifying objects present in the scene | "What object is visible on the table?" |
| attribute_color_material_shape | Attributes of objects (color, material, shape) | "What color is the person's shirt?" |
| text_in_scene | Text visible in the video | "What does the sign say?" |
| count_quantity | Counting objects or people | "How many people are in the scene?" |
| action_activity | Actions or activities being performed | "What activity is being performed?" |
| setting_location | Location or setting of the scene | "Where does this scene take place?" |
| temporal_order | Order of events or actions | "What happens first in the sequence?" |
| person_attribute | Attributes of people | "What is the person wearing?" |
| cause_effect_or_purpose | Causal relationships or purposes | "Why is the person doing this action?" |

Each question is assigned a difficulty level (easy, medium, or hard) based on the complexity of reasoning required.

Dataset Statistics

The QTSplus dataset includes three main components:

  • QTS-VSCQ1: Over 855,000 multiple-choice questions derived from video scene descriptions
  • QTS-VSCQ2: A curated subset of QTS-VSCQ1 containing only questions that Qwen2.5-VL answers correctly
    • 3B Model: 759,650 correct examples (train), 4,486 correct examples (eval), 89,851 wrong examples (train)
    • 7B Model: 771,218 correct examples (train), with improved accuracy (76.56% vs 22.24% for 3B)
  • QTS-VQA: Free-form answers generated for QTS-VSCQ2 questions
    • 3B Model: 544,138 correct examples (train), 342 wrong examples (train)
    • 7B Model: 399,548 correct examples (train), providing longer and more detailed answers

The dataset features:

  • Balanced distribution across answer choices (A, B, C, D at ~25% each)
  • Difficulty distribution: ~59% easy, ~40% medium, <1% hard questions
  • Question length: Average 58-59 characters per question
  • Evidence-grounded answers with explicit text support
  • For VQA: Free-form answers averaging 145-220 characters depending on model size
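Aggregate claims like the ~25% answer-choice balance are easy to check on a loaded split. A small sketch, again assuming a hypothetical `answer` field per record:

```python
from collections import Counter

def answer_distribution(records):
    """Fraction of examples per answer choice ('answer' field assumed)."""
    counts = Counter(r["answer"] for r in records)
    total = sum(counts.values())
    return {choice: counts[choice] / total for choice in sorted(counts)}

# Toy balanced sample: two examples per choice
sample = [{"answer": c} for c in "AABBCCDD"]
print(answer_distribution(sample))  # {'A': 0.25, 'B': 0.25, 'C': 0.25, 'D': 0.25}
```

The same pattern applies to the difficulty distribution (easy/medium/hard) by swapping the field name.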

🚄 Training

A. Setup

The repository is designed around a conda‑based Python 3.11 environment with a CUDA‑enabled GPU. The commands below are taken directly from environment.sh and provide a reproducible setup on recent Linux distributions.

  1. Create and activate the conda environment
conda create -n qtsplus python=3.11 -y
conda activate qtsplus
  2. Install toolchain and CUDA to …

No findings