Devign

Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks

Implementation of the Devign model in Python, with code for processing the dataset and generating Code Property Graphs (CPGs).

This project is under development. For now, only the Abstract Syntax Tree (AST) is used for the graph embedding of code and for model training.

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

Install the necessary dependencies before running the project:

Software:
Python Libraries:

Notes


These notes might save you some time:

  • Changes to the configs.json structure need to be reflected in the configs.py script.
  • PyTorch Geometric has several dependencies that need to match, including PyTorch. Follow the installation steps on their website.
  • Joern processing can be slow and may even freeze your OS, depending on your system's specifications. If that happens, choose a smaller chunk size for splitting the dataset during the Create task by lowering the "slice_size" value under "create" in the configurations file configs.json.
  • In the "slice_size" file, the nodes are filtered and discarded if the size is greater than the limit configured.
  • When changing the number of nodes considered for processing, "nodes_dim" under "embed" needs to match "in_channels", under "devign" -> "model" -> "conv_args" -> "conv1d_1".
  • The embedding size is equal to the Word2Vec vector size plus 1.
  • When executing the Create task, a directory named joern is created and deleted automatically under 'project'\data\.
  • The dataset split for modeling during the Process task is done in src/data/datamanger.py. The sets are balanced and the train/val/test ratios are 0.8/0.1/0.1, respectively.
  • The script graph-for-funcs.sc queries the CPGs from Joern. The copy used here has a minor change that makes it possible to trace each generated CPG back to its source file. It was last reported failing because dependencies in Joern changed and an updated version of the script was needed; the updated script can presumably be found in Joern's latest version. See issue #3 for details, and follow the commit to understand the changes made previously.
  • The CPGs are saved in a JSON file. The function "joern_create" (line 48) prints the CPG created in Joern to a JSON file with ... .toString() |> "{json_out}", and that file is then processed by the function "json_process". Both functions are in the file devign/src/prepare/cpg_generator.py.
  • If you have trouble creating the CPG JSON file with Joern, try doing the same steps manually in Joern. Create a new project pointing to the dataset folder containing all the files, query the CPG with the built-in graph-for-funcs.sc script, then export the result to a file with .toString() |>. Joern commands are quite easy to understand and Joern has good support on Gitter.
  • Tested on Ubuntu 18.04/19.04
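
The configuration constraints in the notes above can be checked mechanically before starting a long run. The following is a minimal sketch, not code from the project: the key names "slice_size", "nodes_dim" and "in_channels" and their nesting come from the notes, while the example values, the helper names check_nodes_dim, embedding_size and iter_slices, and the way the dataset is chunked are assumptions made for illustration.

```python
import json

# Hypothetical excerpt of configs.json. Only the key names and nesting are
# taken from the notes above; every value is made up for illustration.
EXAMPLE_CONFIG = json.loads("""
{
  "create": {"slice_size": 100},
  "embed": {"nodes_dim": 205},
  "devign": {"model": {"conv_args": {"conv1d_1": {"in_channels": 205}}}}
}
""")


def check_nodes_dim(config):
    """Return a list of problems; an empty list means the values match."""
    nodes_dim = config["embed"]["nodes_dim"]
    conv1d_1 = config["devign"]["model"]["conv_args"]["conv1d_1"]
    problems = []
    if nodes_dim != conv1d_1["in_channels"]:
        problems.append(
            f'"nodes_dim" is {nodes_dim} but "in_channels" of "conv1d_1" '
            f'is {conv1d_1["in_channels"]}'
        )
    return problems


def embedding_size(w2v_vector_size):
    """Embedding size as stated in the notes: Word2Vec vector size plus 1."""
    return w2v_vector_size + 1


def iter_slices(items, slice_size):
    """Yield consecutive chunks of at most slice_size items (assumed chunking)."""
    for start in range(0, len(items), slice_size):
        yield items[start:start + slice_size]


if __name__ == "__main__":
    print(check_nodes_dim(EXAMPLE_CONFIG) or "nodes_dim matches in_channels")
    print("embedding size for 100-dimensional vectors:", embedding_size(100))
    slice_size = EXAMPLE_CONFIG["create"]["slice_size"]
    chunks = iter_slices(list(range(250)), slice_size)
    print("chunk sizes for 250 functions:", [len(chunk) for chunk in chunks])
```

A smaller "slice_size" produces more, smaller chunks, which is the trade-off the Joern note describes: each Joern run handles less data at the cost of more runs.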

Setup


For now, this project is not pip installable. That will be implemented once there are proper use cases for it.

This section gives the steps, explanations and examples for getting the project running.

1) Clone this repo

$ git clone https://github.com/epicosy/devign.git

2) Install Prerequisites

3) Configure the project

Verify that you have the correct directory structure by matching it against the "paths" in the configurations file configs.json. The dataset-related files that are generated are saved under those paths.
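
One way to perform this check is a short script. This is a minimal sketch, not part of the project: it assumes that "paths" in configs.json is a flat mapping of names to directories given relative to the project root, which may not match the actual layout of the file, and the helper missing_paths is hypothetical.

```python
import json
import tempfile
from pathlib import Path


def missing_paths(config_file, project_root):
    """Return (name, directory) pairs from "paths" that are not existing directories.

    Assumes "paths" is a flat mapping of names to directories relative to
    project_root; adapt this if configs.json is structured differently.
    """
    config = json.loads(Path(config_file).read_text(encoding="utf-8"))
    root = Path(project_root)
    return [
        (name, directory)
        for name, directory in config["paths"].items()
        if not (root / directory).is_dir()
    ]


def demo():
    """Run the check against a throwaway project with one directory missing."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "data" / "raw").mkdir(parents=True)
        # Hypothetical "paths" entries, for illustration only.
        config = {"paths": {"raw": "data/raw/", "cpg": "data/cpg/"}}
        config_file = root / "configs.json"
        config_file.write_text(json.dumps(config), encoding="utf-8")
        return missing_paths(config_file, root)


if __name__ == "__main__":
    for name, directory in demo():
        print(f"missing directory for '{name}': {directory}")
```

Against a real checkout, the equivalent call would be missing_paths("configs.json", ".") from the project root, provided the assumption about the "paths" layout holds.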

4) Joern

This step is only necessary for the Create task. Follow the instructions on Joern's documentation page and install Joern's command line tools under 'project'\joern\joern-cli\.

Structure


├── LICENSE
├── README.md                       <- The top-level README for developers using this project.
├── data
│   ├── cpg                         <- Dataset with CPGs.
│   ├── input                       <- Canonical dataset for modeling.
│   ├── model                       <- Trained models.
│   ├── raw                         <- The original, immutable data dump.
│   ├── tokens                      <- Tokens dataset files generated from the raw data functions.
│   └── w2v                         <- Word2Vec model files for initial embeddings.
│
├── joern
│   ├── joern-cli                   <- Joern command line tools for creating and analyzing code property graphs.
│   └── graph-for-funcs.sc          <- Script that returns, in JSON format, the AST, CFG, and PDG for each method
│                                       contained in the loaded CPG.
│
├── src                             <- Source code for use in this project.
│   ├── __init__.py                 <- Makes src a Python package.
│   │
│   ├── data                        <- Data handling scripts.
│   │   ├── __init__.py             <- Makes data a Python package.
│   │   └── datamanger.py           <- Module for the most essential operations on the dataset.
│   │
│   ├── prepare                     <- Package for CPG generation and representation.
│   │   ├── __init__.py             <- Makes prepare a Python package.
│   │   ├── cpg_client_wrapper.py   <- Simple class wrapper for the CpgClient that interacts with the Joern REST server.
│   │   ├── cpg_generator.py        <- Ad-hoc script for creating CPGs with Joern and processing the results.
│   │   └── embeddings.py           <- Module that embeds the graph nodes into node features.
│   │
│   ├── process                     <- Scripts for modeling and predictions.
│   │   ├── __init__.py             <- Makes process a Python package.
│   │   ├── devign.py               <- Module that implements the devign model.
│   │   ├── loader_step.py          <- Module for one epoch of iteration over the dataset.
│   │   ├── model.py                <- Module that implements the devign neural network.
│   │   ├── modeling.py             <- Module for training the model and making predictions.
│   │   ├── step.py                 <- Module that performs a forward step on a batch for train/val/test loop.
│   │   └── stopping.py             <- Module that performs early stopping.
│   │
│   │
│   └── utils                       <- Package with helper components like functions and classes, used across 
│       │                              the project.
│       ├── __init__.py             <- Makes utils a Python package.
│       ├── log.py                  <- Module for logging module messages.
│       ├── functions               <- Auxiliary functions for processing.
│       │   ├── __init__.py         <- Makes functions a Python package.
│       │   ├── cpg.py              <- Module with auxiliary functions for CPGs.
│       │   ├── digraph.py          <- Module for creating digraphs from nodes.
│       │   └── parase.py           <- Module for parsing source code into tokens.
│       │ 
│       └── objects                 <- Auxiliary data classes with basic methods.
│           ├── __init__.py         <- Makes objects a Python package.
│           ├── cpg                 <- Auxiliary data classes for representing and handling the JSON graphs.
│           │   ├── __init__.py     
│           │   ├── ast.py
│           │   ├── edge.py
│           │   ├── function.py
│           │   ├── node.py
│           │   └── properties.py
│           │
│           ├── input_dataset.py    <- Custom wrapper for Torch Dataset.
│           ├── metrics.py          <- Module for evaluating the results.
│           └── stats.py            <- Module for handling raw results.
│       
│
├── configs.py                      <- Configuration management script.
├── configs.json                    <- Project configurations used by main.py. 
└── main.py                         <- Main script file that joins the modules into executable tasks. 

Usage

Dataset


The dataset used is the partial dataset released by the authors. The dataset is handled with Pandas, and the file src/data/datamanger.py contains wrapper functions for the most essential operations.

A small sample of 994 entries from the original dataset is available for testing purposes. The sample dataset contains functions from the FFmpeg project with a maximum of 287 nodes per function. For each task, the necessary dataset files are available under the respective folders.

For example, the datasets with the graphs constituting the CPG for the functions are available under data/cpg.

Fields

| project | commit_id | target | func |
|--------:|:---------:|-------:|------|
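
Records with these fields can be split by their "target" label. The Notes state that the sets are balanced with train/val/test ratios of 0.8/0.1/0.1; the sketch below shows one way to obtain such a split by dividing each "target" class separately. It uses only the standard library rather than Pandas, reads "balanced" as "each class is split with the same ratios" (an assumption), and split_by_target and make_toy_records are hypothetical helpers, not the functions in src/data/datamanger.py.

```python
import random


def split_by_target(records, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split records into train/val/test, applying the ratios per target class."""
    rng = random.Random(seed)
    by_target = {}
    for record in records:
        by_target.setdefault(record["target"], []).append(record)

    train, val, test = [], [], []
    for group in by_target.values():
        group = list(group)
        rng.shuffle(group)
        n_train = round(len(group) * ratios[0])
        n_val = round(len(group) * ratios[1])
        train.extend(group[:n_train])
        val.extend(group[n_train:n_train + n_val])
        # Whatever remains goes to the test set, so no record is dropped.
        test.extend(group[n_train + n_val:])
    return train, val, test


def make_toy_records(per_class=100):
    """Build made-up records with the four dataset fields, for illustration."""
    return [
        {"project": "FFmpeg", "commit_id": f"{target}-{i}", "target": target,
         "func": "void f(void) {}"}
        for target in (0, 1)
        for i in range(per_class)
    ]


if __name__ == "__main__":
    train, val, test = split_by_target(make_toy_records())
    print(len(train), len(val), len(test))
```

Splitting each class on its own keeps the proportion of each "target" value the same in all three sets, at the cost of slightly uneven set sizes when a class count is not divisible by ten.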
