[NeurIPS 25 @ ER] Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs

DHSA: Dynamic Hierarchical Sparse Attention

This repository contains the official implementation of Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs.

Overview

LLMs face efficiency limits from the quadratic cost of dense attention. Static sparse methods (e.g., sliding windows, global tokens) reduce computation but cannot adapt to content, while dynamic ones still rely on heuristics. DHSA is a data-driven plug-in that predicts attention sparsity on the fly without retraining. It adaptively segments input into variable-length chunks, computes chunk-level similarities, and refines them into token-level importance scores for efficient long-context modeling.
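The hierarchical idea can be sketched in a few lines: score chunks first, then refine chunk scores into token-level importance, and keep only a budgeted subset of keys. The function below is an illustrative toy under our own simplified scoring rule (mean-key chunk scores plus per-token similarity); the names and signature are ours, not the repository's API:

```python
import numpy as np

def chunk_to_token_scores(q, k, boundaries, budget):
    """Toy sketch of hierarchical sparsity estimation.

    q:          (d,) query vector
    k:          (n, d) key matrix
    boundaries: chunk start indices, e.g. [0, 3, 7] for a 10-token sequence
    budget:     number of key tokens to keep
    Returns the sorted indices of the kept tokens.
    """
    n = k.shape[0]
    starts = list(boundaries) + [n]
    spans = list(zip(starts[:-1], starts[1:]))
    # Chunk level: score each chunk by the similarity of q to its mean key.
    chunk_scores = np.array([q @ k[s:e].mean(axis=0) for s, e in spans])
    # Token level: refine each token's score with its own similarity to q.
    token_scores = np.empty(n)
    for cs, (s, e) in zip(chunk_scores, spans):
        token_scores[s:e] = cs + k[s:e] @ q
    # Keep only the top-`budget` tokens for sparse attention.
    return np.sort(np.argsort(token_scores)[-budget:])
```

The key property this illustrates is that attention cost is controlled by `budget` rather than by the full sequence length.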

<br> <p align="center"> <img src='https://raw.githubusercontent.com/xiongsiheng/DHSA/main/misc/Framework.png' width=650> </p> <br>

Key Features

  • Dynamic Boundary Prediction – Learns to segment input sequences into variable-length chunks based on semantic shifts.

  • Hierarchical Sparsity Prediction – Combines chunk-level and token-level attention estimation for efficient long-context modeling.

  • Plug-and-Play Integration – Works with existing Transformer layers without retraining base weights.
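To make the boundary-prediction feature concrete, here is a hypothetical stand-in for the learned predictor in boundary_predictor/: it starts a new chunk wherever the cosine similarity between adjacent token embeddings drops below a threshold, i.e. at a semantic shift. The thresholding rule and function name are our assumptions for illustration only:

```python
import numpy as np

def predict_boundaries(emb, threshold=0.5):
    """Start a new chunk where adjacent embeddings diverge.

    emb: (n, d) token embeddings.
    Returns the chunk start indices (position 0 always starts a chunk).
    """
    norms = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    # Cosine similarity of each adjacent embedding pair.
    sims = (norms[:-1] * norms[1:]).sum(axis=1)
    # A low-similarity gap between positions i and i+1 opens a chunk at i+1.
    return [0] + [i + 1 for i, s in enumerate(sims) if s < threshold]

# Two homogeneous blocks yield two chunks:
# predict_boundaries(np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]]))
# → [0, 2]
```

The real predictor is trained, so it can pick up shifts that a fixed cosine threshold would miss, but the interface idea is the same: embeddings in, variable-length chunk boundaries out.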

<br> <p align="center"> <img src='https://raw.githubusercontent.com/xiongsiheng/DHSA/main/misc/NIAH_gemma2.png' width=650> </p> <br>

Latency and Memory Comparison (Gemma2-2B on a single 24 GB GPU)

| Context Length | Method | Prefill Latency (s) | Prefill Peak Memory (GB) |
|----------------|--------|---------------------|--------------------------|
| 8k             | Dense  | 1.65                | 10.72                    |
|                | DHSA   | 1.19                | 6.91                     |
| 16k            | Dense  | -                   | OOM                      |
|                | DHSA   | 2.18                | 9.69                     |
| 32k            | Dense  | -                   | OOM                      |
|                | DHSA   | 4.51                | 16.99                    |

DHSA uses a 2k attention budget with a query chunk size of 256.
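As a back-of-the-envelope check of why a fixed budget helps, the helper below counts query-key pairs (proportional to attention score FLOPs) for dense attention versus a per-chunk budget, assuming each query chunk attends to at most `budget` keys. These counts are illustrative arithmetic, not measurements from the paper:

```python
def attention_pairs(n, budget=2048, q_chunk=256):
    """Query-key pair counts for dense vs. budgeted attention.

    n:       context length (assumed divisible by q_chunk)
    budget:  max keys each query chunk attends to
    q_chunk: query chunk size
    """
    dense = n * n                                      # every query sees every key
    sparse = (n // q_chunk) * q_chunk * min(budget, n)  # capped at `budget` keys
    return dense, sparse
```

At a 32k context this gives 1,073,741,824 dense pairs versus 67,108,864 budgeted ones, a 16x reduction that grows linearly with context length; at 1k (below the budget) the two coincide, which matches the intuition that the savings only kick in on long inputs.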

Repository Structure

DHSA/
├── boundary_predictor/
├── boundary_predictor_weights/
├── data/
├── results_longbench/
├── results_needle/
├── scripts/
├── utils/
├── eval_longbench.py
├── run_latency_test.py
├── run_longbench.py
├── run_needle_in_haystack.py
└── visualize_needle_in_haystack.py

Installation

# Clone the repository
git clone https://github.com/xiongsiheng/DHSA.git
cd DHSA

# Create and activate a conda environment with Python 3.12
conda create -n dhsa python=3.12 -y
conda activate dhsa

# Install dependencies
pip install -r requirements.txt

Quick Start

Boundary Prediction

You can directly download the predictor weights here.

If you wish to train the predictor from scratch, download the following datasets: Long Data Collections, TriviaQA, and ChatQA2, then use the training scripts provided in boundary_predictor/.

Needle-in-a-Haystack Test

To evaluate the model's ability to retrieve specific information from a long context, run:

bash scripts/run_needle_in_haystack.sh

To visualize the results, run:

bash scripts/run_visualize_needle_in_haystack.sh

Latency & Memory Comparison

To benchmark latency and memory usage against other methods, run:

bash scripts/run_latency_test.sh

LongBench

To test performance on the comprehensive LongBench suite, run:

bash scripts/run_longbench.sh

To evaluate the results, run:

bash scripts/run_eval_longbench_res.sh

Contact

If you have any inquiries, please feel free to raise an issue or reach out to sxiong45@gatech.edu.

Citation

@misc{xiong2025longcontextmodelingdynamichierarchical,
      title={Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs}, 
      author={Siheng Xiong and Joe Zou and Faramarz Fekri and Yae Jee Cho},
      year={2025},
      eprint={2510.24606},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.24606}, 
}