VulChecker
A deep learning model for localizing bugs in C/C++ source code (USENIX'23)
Overview
In this repository you will find a Python implementation of VulChecker: a tool for detecting vulnerabilities (CWEs) in source code. From:
Mirsky Y, Macon G, Brown M, Yagemann C, Pruett M, Downing E, Mertoguno S, Lee W. "VulChecker: Graph-based Vulnerability Localization in Source Code", USENIX Security 23
If you use any derivative of this code in your work, please cite our publication.
This implementation supports cmake C/C++ projects only. It can be used to detect integer overflow (CWE-190), stack-based buffer overflow (CWE-121), heap-based buffer overflow (CWE-122), double free (CWE-415), and use-after-free (CWE-416) vulnerabilities.
What is VulChecker?
VulChecker is a tool that can precisely locate vulnerabilities in source code (down to the exact instruction) as well as classify their type (CWE). This is useful for developers who want to locate potential security risks in their code during development, even before the project is complete and deployed. The tool converts cmake C/C++ projects into a graph-based program representation called an ePDG. For each potential manifestation point in the project, a subgraph is extracted by crawling the ePDG up from that point. Finally, a graph-based neural network called Structure2Vec is used to classify which subgraphs yield actual vulnerabilities. This is repeated for each CWE, resulting in a separate classifier per CWE. The figure below illustrates how VulChecker works for a single CWE:
The tool also provides a means for data augmentation: although many labeled samples are required to train a robust model, it is hard to acquire line-level labeled samples of vulnerabilities from the wild. Therefore, the tool lets you augment the ePDGs of "clean" projects from the wild with the ePDGs of synthetic vulnerability datasets. In our research, we found that this is enough to train a model to detect vulnerabilities in the wild. However, whenever possible, it is recommended to include real vulnerabilities from the wild in the training data as well.
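As a toy illustration of this splicing idea, here is a minimal sketch in plain Python. The node names, dictionary layout, and `augment` helper are all invented for illustration; hector's real ePDG format and augmentation logic are more involved.

```python
# Toy sketch of ePDG augmentation: splice the nodes/edges of a small
# synthetic "vulnerable" graph (e.g., Juliet-style) into a clean
# project graph, attaching it at a chosen node. Labels: 1 = vulnerable.
clean = {"nodes": {"a": 0, "b": 0}, "edges": [("a", "b")]}
synthetic = {"nodes": {"s1": 1, "s2": 1}, "edges": [("s1", "s2")]}

def augment(clean_g, syn_g, attach_at):
    """Return a merged graph with the synthetic subgraph wired in at attach_at."""
    entry = next(iter(syn_g["nodes"]))  # treat the first synthetic node as the entry
    return {
        "nodes": {**clean_g["nodes"], **syn_g["nodes"]},
        "edges": clean_g["edges"] + syn_g["edges"] + [(attach_at, entry)],
    }

merged = augment(clean, synthetic, "b")
print(len(merged["nodes"]), len(merged["edges"]))  # 4 3
```

The point of the sketch is only the graph-splicing intuition: labeled vulnerable structure from a synthetic dataset is grafted into an otherwise clean real-world graph.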
Contents
In this README you will find chapters on the following topics:
- Installation instructions
- Detailed usage instructions
- Assets: How to access the assets (datasets, models, VM)
- Developer Notes
- Acknowledgements
Installation
This tool uses a pipeline of many different components to go from a C/C++ project all the way to a prediction from a deep learning model. For example, LLVM with a custom plugin is used to create the ePDGs with any provided labels. Setting up this pipeline is complex and takes a lot of time since LLVM must be compiled. Therefore, instead of performing a clean install (using the instructions below), you can use an Ubuntu VM we provide with VulChecker preinstalled. On the VM's desktop you will find some demo scripts.
The VM can be downloaded from here. Username: vulchecker, Password: vulchecker
Clean Install
The following are instructions for a clean install on Linux (tested on Ubuntu 20.04 and python 3.8.10)
Quick Start
You can use the install script in this repository (demos/) as a guide. However, we recommend that you read below for fuller instructions.
Components
VulChecker uses a number of components that must be installed. Here is a list of the components of VulChecker, which we maintain in separate repositories:
- VulChecker: the core library for processing data and training models. All operations with this library are performed through a command line tool called hector. https://github.com/ymirsky/VulChecker.git
- LLAP: a plugin to LLVM for extracting ePDGs from cmake C/C++ projects. https://github.com/michaelbrownuc/llap
- Structure2Vec: our PyTorch implementation of the graph-based neural network by Dai et al. https://github.com/gtri/structure2vec
- vulchecker-misc: a collection of helpful (optional) scripts, such as automatic labeling of Juliet samples. https://github.com/michaelbrownuc/vulchecker-misc
Step 1: Install the Python Libraries
It is recommended that you create and activate a Python environment before installing any of the libraries to avoid conflicts.
First get VulChecker (hector) and Structure2Vec:
git clone https://github.com/ymirsky/VulChecker.git
git clone https://github.com/gtri/structure2vec.git
The structure2vec library uses networkx, which requires cmake to be installed on the system. If you don't have it, you should install it now. It is also recommended to install python3-pip:
sudo apt install cmake
sudo apt install python3-pip
Now we can install the python libraries and Cython for optimized graph manipulation:
python3 -m pip install -U pip setuptools wheel
python3 -m pip install cython cmake
python3 -m pip install ./structure2vec
python3 -m pip --no-cache-dir install ./VulChecker
Check that VulChecker installed correctly by accessing the help option of the hector tool:
~$ hector --help
Usage: hector [OPTIONS] COMMAND [ARGS]...
Options:
--help Show this message and exit.
Commands:
augmentation Augment a real-world program with Juliet...
compile_for_train
configure Configure a codebase to be analyzed by HECTOR.
cross_validation
feature_stats
hyperopt Optimize hyperparameters.
lint Lint-check a codebase using HECTOR.
predict
preprocess Preprocess Program Dependence Graphs.
sample_data Downsample manifestation points.
stats
train
train_test_split
validate_data
visualize
Note :memo:: be sure to use --help on the commands to get further options and hints. E.g., hector preprocess --help
Step 2: Get LLVM and ninja
Next we need to obtain v10.0.0 of the LLVM compiler and ninja to work with the source code.
Install ninja:
sudo apt-get install -y ninja-build
Download LLVM:
cd ~
wget https://github.com/llvm/llvm-project/releases/download/llvmorg-10.0.0/llvm-project-10.0.0.tar.xz
tar xvf llvm-project-10.0.0.tar.xz
mv llvm-project-10.0.0 llvm-project
Step 3: Install LLVM with LLAP
Now we need to install the VulChecker plugin to the LLVM compiler (LLAP), which enables us to generate ePDGs from source code.
Download LLAP:
git clone https://github.com/michaelbrownuc/llap.git
Add the LLAP plugin to LLVM:
cp -R llap/src/* llvm-project/llvm/lib/Transforms/
Compile LLVM (and go get some coffee :coffee:, it will take a while):
cd llvm-project/
cmake -S ./llvm/ -B llvm-build -DCMAKE_BUILD_TYPE=Release
make -C llvm-build -j 16
make -C llvm-build install
cmake -S ./clang/ -B clang-build -DCMAKE_BUILD_TYPE=Release
make -C clang-build -j 16
make -C clang-build install
Important ⚠️: When using the hector tool, you will be asked to provide the path to LLAP to execute certain commands. If you installed LLVM in the home dir (as above) then the path to LLAP is:
~/llvm-project/llvm-build/lib
Usage
VulChecker follows a pipeline approach consisting of three segments:
- Data Preparation
- Model Training
- Execution
Data Preparation involves (1) preparing line-level labels for a C/C++ cmake project [optional], (2) converting the project into an ePDG using LLVM, (3) processing the ePDG by converting it into a collection of potential manifestation point subgraphs, and (4) collecting the processed projects into single dataset files [optional].
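Step (3) can be pictured as a backward reachability crawl over the graph. Below is a minimal sketch in plain Python; the node names and the child-to-parents dictionary are invented for illustration, and real ePDGs carry far richer instruction-level features.

```python
# Toy ePDG fragment: map each node to its predecessors (the "up" direction).
edges = {
    "manifestation": ["add", "load_b"],
    "add": ["load_a", "load_b"],
    "load_a": [],
    "load_b": [],
    "unrelated": ["load_a"],  # not reachable going up from the manifestation point
}

def crawl_up(start):
    """Collect every node reachable by walking edges upward from start."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, []))
    return seen

print(sorted(crawl_up("manifestation")))  # ['add', 'load_a', 'load_b', 'manifestation']
```

Note that nodes which do not feed into the manifestation point (here, "unrelated") are excluded, which is what keeps each subgraph focused on one candidate vulnerability.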
Model Training involves (1) extracting normalization parameters from the training dataset, (2) training a Structure2Vec model on the dataset, and (3) evaluating the model on a test set.
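To make step (1) of Model Training concrete, here is a minimal sketch of what "normalization parameters" typically means in this setting (z-score statistics computed over the training features). The exact statistics hector extracts are internal to the tool; the toy feature rows below are invented.

```python
# Compute per-feature mean/std on training data, then apply them to any row.
train_feats = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]]

cols = list(zip(*train_feats))
means = [sum(c) / len(c) for c in cols]
stds = [(sum((x - m) ** 2 for x in c) / len(c)) ** 0.5
        for c, m in zip(cols, means)]

def normalize(row):
    """Scale a feature row using the *training* statistics only."""
    return [(x - m) / s for x, m, s in zip(row, means, stds)]

print(normalize([3.0, 30.0]))  # the mean row maps to [0.0, 0.0]
```

The important design point is that the test set must be scaled with the training set's parameters, never with statistics computed on the test set itself.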
Execution involves (1) executing a trained model on a project and (2) acquiring the results. The project must first be preprocessed, following the same steps as in Data Preparation.
You can execute each of these steps using our command line tool called hector.
Important ⚠️: When executing each part of the pipeline, you must indicate which CWE the final model will be detecting. This is because the features and manifestation points are different for each CWE. This means that if you want to use a C/C++ project for all five CWEs, you will need to make five separate ePDGs of the project, etc.
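For example, preparing one project for every supported CWE means running the pipeline once per CWE. A shell sketch of that outer loop is below; the echo stands in for the real hector invocations, whose actual flags you should discover with --help.

```shell
# One pipeline pass per supported CWE; replace the echo with the
# actual configure/preprocess commands for your setup.
for cwe in 190 121 122 415 416; do
  echo "build ePDG of cmake_proj/ for CWE-$cwe"
done
```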
Running the Pipeline
Below is a detailed illustration of how the pipeline is used for a single CWE 'X':
In this dataflow diagram, we show how to (1) set up a training dataset that uses labels from a synthetic vulnerability dataset (e.g., Juliet), (2) evaluate the model on a labeled CVE dataset, and (3) execute other projects from the wild on the same model. Note that, although not required, a good model will also use labeled vulnerabilities from the wild (and not just synthetic vulnerabilities).
As examples of how to execute parts of this pipeline, you can take a look at the demo scripts in this repo (demos/), which show you how to process data, train models, and make predictions. The demos are written to run on the provided VM.
We will now explain in detail how to perform each of these steps.
(1) Collect Source Code
The first thing you need to do is collect C/C++ cmake projects for training and testing the model. You may already have a model (e.g., the ones we provide) and want to execute it on new projects as well. The source code of each project should be in a separate directory (e.g., cmake_proj/).
(2) Label the Projects
The projects you will be using for training and testing the model will have some labels. To label project cmake_proj/, you will need to make a file that indicates where the vulnerabilities manifest themselves in the source code. The labels file is a JSON array of objects (e.g., cmake_proj/labels.json).
Each object has three keys: filename, line_number
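A minimal sketch of writing such a labels file from Python (the file paths and line numbers below are invented examples, and only the filename and line_number keys named above are shown):

```python
import json

# Hypothetical labels for cmake_proj/: each entry marks where a
# vulnerability manifests in the source code.
labels = [
    {"filename": "src/parser.c", "line_number": 42},
    {"filename": "src/alloc.c", "line_number": 107},
]

with open("labels.json", "w") as f:  # e.g., cmake_proj/labels.json
    json.dump(labels, f, indent=2)

with open("labels.json") as f:       # round-trip check
    print(json.load(f) == labels)    # True
```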
