VulChecker
A deep learning model for localizing bugs in C/C++ source code (USENIX'23)
Overview
In this repository you will find a Python implementation of VulChecker: a tool for detecting vulnerabilities (CWEs) in source code. From:
Mirsky Y, Macon G, Brown M, Yagemann C, Pruett M, Downing E, Mertoguno S, Lee W. "VulChecker: Graph-based Vulnerability Localization in Source Code", USENIX Security 23
If you use any derivative of this code in your work, please cite our publication.
This implementation supports cmake C/C++ projects only. It can be used to detect integer overflow (CWE-190), stack-based buffer overflow (CWE-121), heap-based buffer overflow (CWE-122), double free (CWE-415), and use-after-free (CWE-416) vulnerabilities.
What is VulChecker?
VulChecker is a tool that can precisely locate vulnerabilities in source code (down to the exact instruction) as well as classify their type (CWE). This is useful for developers who want to locate potential security risks in their code during development, even before the project is complete and deployed. The tool converts cmake C/C++ projects into a graph-based program representation called an ePDG. For each potential manifestation point in the project, a subgraph is extracted by crawling the ePDG up from that point. Finally, a graph-based neural network called Structure2Vec is used to classify which subgraphs yield actual vulnerabilities. This is repeated for each CWE, resulting in a separate classifier per CWE. The figure below illustrates how VulChecker works for a single CWE:
The tool also provides a means for data augmentation: although many labeled samples are required to train a robust model, it is hard to acquire line-level labeled samples of vulnerabilities from the wild. Therefore, the tool lets you augment the ePDGs of "clean" projects from the wild with the ePDGs of synthetic vulnerability datasets. In our research, we found that this is enough to train a model to detect vulnerabilities in the wild. However, whenever possible, it is recommended to include real vulnerabilities from the wild in the training data as well.
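As a toy illustration of this splicing idea, here is a minimal sketch in plain Python. The node names, dictionary layout, and `augment` helper are all invented for illustration; hector's real ePDG format and augmentation logic are more involved.

```python
# Toy sketch of ePDG augmentation: splice the nodes/edges of a small
# synthetic "vulnerable" graph (e.g., Juliet-style) into a clean
# project graph, attaching it at a chosen node. Labels: 1 = vulnerable.
clean = {"nodes": {"a": 0, "b": 0}, "edges": [("a", "b")]}
synthetic = {"nodes": {"s1": 1, "s2": 1}, "edges": [("s1", "s2")]}

def augment(clean_g, syn_g, attach_at):
    """Return a merged graph with the synthetic subgraph wired in at attach_at."""
    entry = next(iter(syn_g["nodes"]))  # treat the first synthetic node as the entry
    return {
        "nodes": {**clean_g["nodes"], **syn_g["nodes"]},
        "edges": clean_g["edges"] + syn_g["edges"] + [(attach_at, entry)],
    }

merged = augment(clean, synthetic, "b")
print(len(merged["nodes"]), len(merged["edges"]))  # 4 3
```

The point of the sketch is only the graph-splicing intuition: labeled vulnerable structure from a synthetic dataset is grafted into an otherwise clean real-world graph.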
Contents
In this README you will find chapters on the following topics:
- Installation instructions
- Detailed usage instructions
- Assets: How to access the assets (datasets, models, VM)
- Developer Notes
- Acknowledgements
Installation
This tool uses a pipeline of many different components to go from a C/C++ project all the way to a prediction from a deep learning model. For example, LLVM with a custom plugin is used to create the ePDGs with any provided labels. Setting up this pipeline is complex and takes a lot of time since LLVM must be compiled. Therefore, instead of performing a clean install (using the instructions below), you can use an Ubuntu VM we provide with VulChecker preinstalled. On the VM's desktop you will find some demo scripts.
The VM can be downloaded from here. Username: vulchecker, Password: vulchecker
Clean Install
The following are instructions for a clean install on Linux (tested on Ubuntu 20.04 and python 3.8.10)
Quick Start
You can use the install script in this repository (demos/) as a guide. However, we recommend that you read below for fuller instructions.
Components
VulChecker uses a number of components that must be installed. Here is a list of the components of VulChecker, which we maintain in separate repositories:
- VulChecker: the core library for processing data and training models. All operations with this library are performed through a command line tool called hector. https://github.com/ymirsky/VulChecker.git
- LLAP: a plugin to LLVM for extracting ePDGs from cmake C/C++ projects. https://github.com/michaelbrownuc/llap
- Structure2Vec: our PyTorch implementation of the graph-based neural network by Dai et al. https://github.com/gtri/structure2vec
- vulchecker-misc: a collection of helpful (optional) scripts, such as automatic labeling of Juliet samples. https://github.com/michaelbrownuc/vulchecker-misc
Step 1: Install the Python Libraries
It is recommended that you create and activate a Python environment before installing any of the libraries to avoid conflicts.
First get VulChecker (hector) and Structure2Vec:
git clone https://github.com/ymirsky/VulChecker.git
git clone https://github.com/gtri/structure2vec.git
The structure2vec library uses networkx, which requires cmake to be installed on the system. If you don't have it, you should install it now. It is also recommended to install python3-pip:
sudo apt install cmake
sudo apt install python3-pip
Now we can install the python libraries and Cython for optimized graph manipulation:
python3 -m pip install -U pip setuptools wheel
python3 -m pip install cython cmake
python3 -m pip install ./structure2vec
python3 -m pip --no-cache-dir install ./VulChecker
Check that VulChecker installed correctly by accessing the help option of the hector tool:
~$ hector --help
Usage: hector [OPTIONS] COMMAND [ARGS]...
Options:
--help Show this message and exit.
Commands:
augmentation Augment a real-world program with Juliet...
compile_for_train
configure Configure a codebase to be analyzed by HECTOR.
cross_validation
feature_stats
hyperopt Optimize hyperparameters.
lint Lint-check a codebase using HECTOR.
predict
preprocess Preprocess Program Dependence Graphs.
sample_data Downsample manifestation points.
stats
train
train_test_split
validate_data
visualize
Note :memo:: be sure to use --help on the commands to get further options and hints. E.g., hector preprocess --help
Step 2: Get LLVM and ninja
Next we need to obtain v10.0.0 of the LLVM compiler and ninja to work with the source code.
Install ninja:
sudo apt-get install -y ninja-build
Download LLVM:
cd ~
wget https://github.com/llvm/llvm-project/releases/download/llvmorg-10.0.0/llvm-project-10.0.0.tar.xz
tar xvf llvm-project-10.0.0.tar.xz
mv llvm-project-10.0.0 llvm-project
Step 3: Install LLVM with LLAP
Now we need to install the VulChecker plugin to the LLVM compiler (LLAP), which enables us to generate ePDGs from source code.
Download LLAP:
git clone https://github.com/michaelbrownuc/llap.git
Add the LLAP plugin to LLVM:
cp -R llap/src/* llvm-project/llvm/lib/Transforms/
Compile LLVM (and go get some coffee :coffee:, it will take a while):
cd llvm-project/
cmake -S ./llvm/ -B llvm-build -DCMAKE_BUILD_TYPE=Release
make -C llvm-build -j 16
make -C llvm-build install
cmake -S ./clang/ -B clang-build -DCMAKE_BUILD_TYPE=Release
make -C clang-build -j 16
make -C clang-build install
Important ⚠️: When using the hector tool, you will be asked to provide the path to LLAP to execute certain commands. If you installed LLVM in the home dir (as above) then the path to LLAP is:
~/llvm-project/llvm-build/lib
Usage
VulChecker follows a pipeline approach consisting of three segments:
- Data Preparation
- Model Training
- Execution
Data Preparation involves (1) preparing line-level labels for a C/C++ cmake project [optional], (2) converting the project into an ePDG using LLVM, (3) processing the ePDG by converting it into a collection of potential manifestation point subgraphs, and (4) collecting the processed projects into single dataset files [optional].
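Step (3) can be pictured as a backward reachability crawl over the graph. Below is a minimal sketch in plain Python; the node names and the child-to-parents dictionary are invented for illustration, and real ePDGs carry far richer instruction-level features.

```python
# Toy ePDG fragment: map each node to its predecessors (the "up" direction).
edges = {
    "manifestation": ["add", "load_b"],
    "add": ["load_a", "load_b"],
    "load_a": [],
    "load_b": [],
    "unrelated": ["load_a"],  # not reachable going up from the manifestation point
}

def crawl_up(start):
    """Collect every node reachable by walking edges upward from start."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, []))
    return seen

print(sorted(crawl_up("manifestation")))  # ['add', 'load_a', 'load_b', 'manifestation']
```

Note that nodes which do not feed into the manifestation point (here, "unrelated") are excluded, which is what keeps each subgraph focused on one candidate vulnerability.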
Model Training involves (1) extracting normalization parameters from the training dataset, (2) training a Structure2Vec model on the dataset, and (3) evaluating the model on a test set.
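To make step (1) of Model Training concrete, here is a minimal sketch of what "normalization parameters" typically means in this setting (z-score statistics computed over the training features). The exact statistics hector extracts are internal to the tool; the toy feature rows below are invented.

```python
# Compute per-feature mean/std on training data, then apply them to any row.
train_feats = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]]

cols = list(zip(*train_feats))
means = [sum(c) / len(c) for c in cols]
stds = [(sum((x - m) ** 2 for x in c) / len(c)) ** 0.5
        for c, m in zip(cols, means)]

def normalize(row):
    """Scale a feature row using the *training* statistics only."""
    return [(x - m) / s for x, m, s in zip(row, means, stds)]

print(normalize([3.0, 30.0]))  # the mean row maps to [0.0, 0.0]
```

The important design point is that the test set must be scaled with the training set's parameters, never with statistics computed on the test set itself.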
Execution involves (1) executing a trained model on a project and (2) acquiring the results. The project must first be preprocessed, following the same steps as in Data Preparation.
You can execute each of these steps using our command line tool called hector.
Important ⚠️: When executing each part of the pipeline, you must indicate which CWE the final model will be detecting. This is because the features and manifestation points are different for each CWE. This means that if you want to use a C/C++ project for all five CWEs, you will need to make five separate ePDGs of the project, etc.
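For example, preparing one project for every supported CWE means running the pipeline once per CWE. A shell sketch of that outer loop is below; the echo stands in for the real hector invocations, whose actual flags you should discover with --help.

```shell
# One pipeline pass per supported CWE; replace the echo with the
# actual configure/preprocess commands for your setup.
for cwe in 190 121 122 415 416; do
  echo "build ePDG of cmake_proj/ for CWE-$cwe"
done
```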
Running the Pipeline
Below is a detailed illustration of how the pipeline is used for a single CWE 'X':
In this dataflow diagram, we show how to (1) set up a training dataset that uses labels from a synthetic vulnerability dataset (e.g., Juliet), (2) evaluate the model on a labeled CVE dataset, and (3) execute other projects from the wild on the same model. Note that, although not required, a good model will also use labeled vulnerabilities from the wild (and not just synthetic vulnerabilities).
As examples of how to execute parts of this pipeline, you can take a look at the demo scripts in this repo (demos/), which show you how to process data, train models, and make predictions. The demos are written to run on the provided VM.
We will now explain in detail how to perform each of these steps.
(1) Collect Source Code
The first thing you need to do is collect C/C++ cmake projects for training and testing the model. You may already have a model (e.g., the ones we provide) and want to execute it on new projects as well. The source code of each project should be in a separate directory (e.g., cmake_proj/).
(2) Label the Projects
The projects you will be using for training and testing the model will have some labels. To label project cmake_proj/, you will need to make a file that indicates where the vulnerabilities manifest themselves in the source code. The labels file is a JSON array of objects (e.g., cmake_proj/labels.json).
Each object has three keys: filename, line_number
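A minimal sketch of writing such a labels file from Python (the file paths and line numbers below are invented examples, and only the filename and line_number keys named above are shown):

```python
import json

# Hypothetical labels for cmake_proj/: each entry marks where a
# vulnerability manifests in the source code.
labels = [
    {"filename": "src/parser.c", "line_number": 42},
    {"filename": "src/alloc.c", "line_number": 107},
]

with open("labels.json", "w") as f:  # e.g., cmake_proj/labels.json
    json.dump(labels, f, indent=2)

with open("labels.json") as f:       # round-trip check
    print(json.load(f) == labels)    # True
```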
