VulBench

This is a benchmark for evaluating the vulnerability discovery ability of automated approaches including Large Language Models (LLMs), deep learning methods and static analyzers

Generate Convert Improve

Install / Use

/learn @Hustcw/VulBench

About this skill

Quality Score

0/100

README

VulBench

This repository contains materials for the paper "How Far Have We Gone in Vulnerability Detection Using Large Language Model".

Usage

Suppose there is OpenAI-compatiable endpoint at http://localhost:8080, you can use the following command to evaluate the model.

# Alternatively, set OPENAI_API_KEY environment variable
export OPENAI_API_KEY=xxx
python3 query.py \
  --datasets d2a ctf magma big-vul devign \
  --db_path ./query_result.db \
  --api_endpoint http://localhost:8080 \
  --model 'Llama-2-7b-chat-hf' \
  --trials 5 \
  --do_binary_classification \
  --do_multiple_classification \
  --concurrency 100

Specifically, it will query the model Llama-2-7b-chat-hf with the datasets d2a, ctf, magma, big-vul, and devign. The results will be stored in the SQLite database query_result.db. The script will run 5 trials for each dataset and the script will send 100 requests concurrently.

After that, evaluate the result database with the following command.

python3 eval.py ./query_result.db [./query_result_1.db, ./query_result_2.db, ...]

It will generate a all_result.csv file, containing the different metrics for each dataset with different prompt type.

Extra Data

The compiled binaries (including fixed and unfixed, compiled with -g -fno-inline-functions -O2 to better represent the original individual vulnerable functions) and source code (with MAGMA_ENABLE_FIXES and MAGMA_ENABLE_CANARIES left untouched) are available at VulBench.

News

[2023/12/18] We release the raw datasets of VulBench.
[2024/3/19] We release the code for querying the model and evaluating the results.

Bibtex

If this work is helpful for your research, please consider citing the following BibTeX entry.

@article{gao2023far,
  title={How Far Have We Gone in Vulnerability Detection Using Large Language Models},
  author={Gao, Zeyu and Wang, Hao and Zhou, Yuchen and Zhu, Wenyu and Zhang, Chao},
  journal={arXiv preprint arXiv:2311.12420},
  year={2023}
}

Related Skills

proje

Interactive vocabulary learning platform with smart flashcards and spaced repetition for effective language acquisition.

YC-Killer

2.7k

A library of enterprise-grade AI agents designed to democratize artificial intelligence and provide free, open-source alternatives to overvalued Y Combinator startups. If you are excited about democratizing AI access & AI agents, please star ⭐️ this repository and use the link in the readme to join our open source AI research team.

best-practices-researcher

The most comprehensive Claude Code skills registry | Web Search: https://skills-registry-web.vercel.app

flutter-tutor

Flutter Learning Tutor Guide You are a friendly computer science tutor specializing in Flutter development. Your role is to guide the student through learning Flutter step by step, not to provide d

Hustcw

View profile

View on GitHub

GitHub Stars79

CategoryEducation

Updated28d ago

Forks10

Hustcw/VulBench

Languages

Security Score

95/100

Audited on Mar 9, 2026

No findings