# SrcMarker: Dual-Channel Source Code Watermarking via Scalable Code Transformations

This repository provides scripts to reproduce the major experiments in the paper *SrcMarker: Dual-Channel Source Code Watermarking via Scalable Code Transformations* (to appear in IEEE S&P 2024).
## Table of Contents

- Overview of this repository
- Getting Started
- Running the Experiments
- Citation
## Overview of this repository

- Training and evaluation scripts
  - `script_*.sh`: shell scripts to reproduce the experiments.
  - `train_main.py`: the main Python training script.
  - `eval_*.py`: various Python evaluation scripts.
  - Please refer to the Running the Experiments section for instructions on running these files.
- Training procedures
  - `metrics/` contains metrics used in our experiments, mainly migrated from CodeXGLUE.
  - `models/` contains implementations of our models (Transformer and GRU).
  - `trainers/` contains the main training procedure, wrapped as trainer classes.
- Code transformations
  - `mutable_tree/` contains our implementation of MutableAST. MutableAST is also available at this GitHub repository.
  - `natgen_transformer/` contains the transformations in NatGen, migrated from NatGen's GitHub repository.
  - `ropgen_transformer/` contains the transformations in RopGen, migrated from RopGen's GitHub repository.
- Other files include data pre-processing and utility functions.
## Getting Started

### Setting up the Environment

#### Installing Python Packages and Dependencies

You will (of course) need Python to execute the code:

- Python 3 (we use Python 3.10)

The following packages are required to run the main training and evaluation scripts:

- PyTorch (we use PyTorch 1.12)
- tree-sitter
- Huggingface Transformers
- tqdm
- inflection
- sctokenizer

The following packages are optional and only required by certain experiment scripts:

- SrcML
  - only required if running the transform pipeline provided by RopGen
#### Building tree-sitter Parsers

:warning: Please use tree-sitter==0.20 or tree-sitter==0.21. Newer versions of tree-sitter introduce breaking changes that are incompatible with our code.

We use tree-sitter for MutableAST construction, syntax checking and CodeBLEU computation. Follow the steps below to build a parser for tree-sitter.

Note that our current implementation of MutableAST is based on specific versions of the tree-sitter parsers. The latest parsers might have updated their grammars, which could be incompatible with MutableAST. Therefore, please check out the commits specified in the shell script below, or MutableAST might break.
```bash
# create a directory to store sources
mkdir tree-sitter
cd tree-sitter

# clone parser repositories and pin them to known-good commits
git clone https://github.com/tree-sitter/tree-sitter-java.git
cd tree-sitter-java
git checkout 6c8329e2da78fae78e87c3c6f5788a2b005a4afc
cd ..

git clone https://github.com/tree-sitter/tree-sitter-cpp.git
cd tree-sitter-cpp
git checkout 0e7b7a02b6074859b51c1973eb6a8275b3315b1d
cd ..

git clone https://github.com/tree-sitter/tree-sitter-javascript.git
cd tree-sitter-javascript
git checkout f772967f7b7bc7c28f845be2420a38472b16a8ee
cd ..

# go back to parent dir
cd ..

# run python script to build the parsers
python build_treesitter_langs.py ./tree-sitter

# the built parser will be put under ./parser/languages.so
```
### Datasets

#### GitHub-C and GitHub-Java

The pre-processed GitHub-C and GitHub-Java datasets (originally available here) are included in this repository, under `./datasets/github_c_funcs` and `./datasets/github_java_funcs`.

#### MBXP

The MBXP datasets are originally available at amazon-science/mxeval. The filtered MBXP datasets used in our project are also included in this repository.
#### CodeSearchNet

The CSN datasets are available on the CodeSearchNet project site. Since the CSN datasets are relatively large, they are not included here. Follow the steps below to process the dataset after downloading it.

1. Follow the instructions on CodeXGLUE (code summarization task) to filter the dataset.
2. Run `dataset_filter.py` to filter out samples with grammar errors or unsupported features.

```bash
python dataset_filter.py java <path_to_your_csn_jsonl>
```

The results will be stored as `<filename>_filtered.jsonl`. Rename the output to `train.jsonl`, `valid.jsonl` or `test.jsonl` depending on the split, and put the three files under `./datasets/csn_java` or `./datasets/csn_js`.
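The filtering step can be sketched as a generic JSONL filter. This is a minimal illustration, not the repository's actual implementation: the real `dataset_filter.py` checks grammar and supported features via tree-sitter, which the hypothetical `is_valid` predicate stands in for here.

```python
import json

def filter_jsonl(in_path, out_path, is_valid):
    """Copy samples from in_path to out_path, keeping only those
    for which is_valid(sample) returns True."""
    kept = dropped = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            sample = json.loads(line)
            if is_valid(sample):
                fout.write(json.dumps(sample) + "\n")
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```

In the real script, the predicate would parse each sample's `code` field and reject samples whose parse tree contains errors.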
#### Wrapping Up

After all datasets are processed, the final directory should look like this:

- datasets
  - github_c_funcs
    - train.jsonl
    - valid.jsonl
    - test.jsonl
  - github_java_funcs
    - train.jsonl
    - valid.jsonl
    - test.jsonl
  - csn_java
    - train.jsonl
    - valid.jsonl
    - test.jsonl
  - csn_js
    - train.jsonl
    - valid.jsonl
    - test.jsonl
  - mbcpp
    - test.jsonl
  - mbjp
    - test.jsonl
  - mbjsp
    - test.jsonl

Note that the dataset directory must match the structure listed above; otherwise the data-loading modules will not be able to locate the datasets.
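If you want to sanity-check the layout before training, a small (hypothetical) checker along these lines will do; it is not part of the repository, just a convenience sketch of the structure listed above.

```python
import os

# Expected dataset layout, mirroring the directory tree above
EXPECTED = {
    "github_c_funcs": ["train.jsonl", "valid.jsonl", "test.jsonl"],
    "github_java_funcs": ["train.jsonl", "valid.jsonl", "test.jsonl"],
    "csn_java": ["train.jsonl", "valid.jsonl", "test.jsonl"],
    "csn_js": ["train.jsonl", "valid.jsonl", "test.jsonl"],
    "mbcpp": ["test.jsonl"],
    "mbjp": ["test.jsonl"],
    "mbjsp": ["test.jsonl"],
}

def missing_files(root):
    """Return the list of expected dataset files missing under root."""
    missing = []
    for dataset, splits in EXPECTED.items():
        for split in splits:
            path = os.path.join(root, dataset, split)
            if not os.path.isfile(path):
                missing.append(path)
    return missing
```

Run it with `missing_files("./datasets")`; an empty list means the layout is complete.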
### Preprocessing

Several preprocessing steps are required before running the training or evaluation scripts. These steps provide metadata for the subsequent training/evaluation scripts.

- Collect variable names from the datasets. The script scans through all functions in each dataset and collects their substitutable variable names (local variables and formal parameters). Results will be stored in `./datasets/variable_names_<dataset>.json`.

```bash
# python collect_variable_names_jsonl.py <dataset>
python collect_variable_names_jsonl.py csn_js
python collect_variable_names_jsonl.py csn_java
python collect_variable_names_jsonl.py github_c_funcs
python collect_variable_names_jsonl.py github_java_funcs
python collect_variable_names_jsonl.py mbcpp
python collect_variable_names_jsonl.py mbjp
python collect_variable_names_jsonl.py mbjsp
```
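To give a feel for what "collecting substitutable variable names" means: the real script walks the MutableAST to find local variables and formal parameters, but the idea can be roughly illustrated with a naive regex over simple Java-style declarations. This is purely illustrative and is not how the repository does it.

```python
import re

# Naive illustration: match "<type> <identifier>" for a few primitive types.
# The actual implementation traverses the parse tree instead of using regexes.
DECL_RE = re.compile(r"\b(?:int|long|float|double|boolean|char|String)\s+([A-Za-z_]\w*)")

def collect_variable_names(code):
    """Return a sorted, de-duplicated list of candidate variable names."""
    return sorted(set(DECL_RE.findall(code)))
```

For example, `collect_variable_names("void f(int a) { int total = 0; double rate = 1.5; }")` yields `['a', 'rate', 'total']`, i.e. the formal parameter and the two locals.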
- Collect all feasible transforms. The script enumerates all feasible transformation combinations for each function in the dataset. Results will be stored in `./datasets/feasible_transforms_<dataset>.json` and `./datasets/transforms_per_file_<dataset>.json`. Note that this process can take a while, especially for CSN-Java.

```bash
# python collect_feasible_transforms_jsonl.py <dataset>
python collect_feasible_transforms_jsonl.py csn_js
python collect_feasible_transforms_jsonl.py csn_java
python collect_feasible_transforms_jsonl.py github_c_funcs
python collect_feasible_transforms_jsonl.py github_java_funcs
python collect_feasible_transforms_jsonl.py mbcpp
python collect_feasible_transforms_jsonl.py mbjp
python collect_feasible_transforms_jsonl.py mbjsp
```
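Conceptually, enumerating feasible combinations amounts to taking a Cartesian product over the variants of each transform that applies to a function. The sketch below (a hypothetical `enumerate_combinations` helper, not the repository's actual code) shows why the count grows quickly and why CSN-Java takes a while.

```python
from itertools import product

def enumerate_combinations(applicable):
    """applicable: dict mapping transform name -> list of feasible variant ids.
    Returns one dict per combination of variants (Cartesian product)."""
    names = sorted(applicable)
    return [dict(zip(names, combo))
            for combo in product(*(applicable[n] for n in names))]
```

For a function where two variants of a loop transform and three naming styles apply, `enumerate_combinations({"loop": [0, 1], "naming": [0, 1, 2]})` yields six combinations.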
## Running the Experiments
- Training
- Evaluations
- MutableAST benchmark
### Training

`train_main.py` is responsible for all training tasks. Refer to the `parse_args()` function in `train_main.py` for more details on the arguments.
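For orientation, a plausible reconstruction of part of `parse_args()`, based on the flags used in the examples below, might look like this. This is an assumption for illustration; the actual argument list (defaults, choices, and additional flags) in `train_main.py` may differ.

```python
import argparse

def parse_args(argv=None):
    # Hypothetical subset of the training arguments; defaults are guesses
    # taken from the example invocations in this README.
    parser = argparse.ArgumentParser()
    parser.add_argument("--lang", choices=["c", "cpp", "java", "javascript"])
    parser.add_argument("--dataset", type=str)
    parser.add_argument("--dataset_dir", type=str)
    parser.add_argument("--n_bits", type=int, default=4)
    parser.add_argument("--epochs", type=int, default=25)
    parser.add_argument("--log_prefix", type=str, default="")
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--model_arch", choices=["gru", "transformer"])
    parser.add_argument("--shared_encoder", action="store_true")
    parser.add_argument("--varmask_prob", type=float, default=0.5)
    parser.add_argument("--seed", type=int, default=42)
    return parser.parse_args(argv)
```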
Here are some examples.
```bash
# training a 4-bit GRU model on CSN-Java
python train_main.py \
    --lang=java \
    --dataset=csn_java \
    --dataset_dir=./datasets/csn_java \
    --n_bits=4 \
    --epochs=25 \
    --log_prefix=4bit_gru_srcmarker \
    --batch_size 64 \
    --model_arch=gru \
    --shared_encoder \
    --varmask_prob 0.5 \
    --seed 42

# training a 4-bit Transformer model on CSN-JavaScript
python train_main.py \
    --lang=javascript \
    --dataset=csn_js \
    --dataset_dir=./datasets/csn_js \
    --n_bits=4 \
    --epochs=25 \
    --log_prefix=4bit_transformer_srcmarker \
    --batch_size 64 \
    --model_arch=transformer \
    --shared_encoder \
    --varmask_prob 0.5 \
    --seed 42
```
Alternatively, you can use `script_train.sh` to conveniently start training. However, you may have to manually modify some of the arguments in it.

```bash
# by default, this trains a 4-bit GRU model on a designated dataset;
# you have to manually change some variables in the script
# (such as the model architecture and checkpoint name) to run different tasks
# source script_train.sh <dataset>
source script_train.sh csn_java
```
Checkpoints will be saved in ./ckpts.
### Evaluation

#### Main Evaluation Script

`eval_main.py` is responsible for most of the evaluation tasks.
