# SrcMarker: Dual-Channel Source Code Watermarking via Scalable Code Transformations

This repository provides scripts to reproduce the major experiments in the paper *SrcMarker: Dual-Channel Source Code Watermarking via Scalable Code Transformations* (to appear in IEEE S&P 2024).
## Table of Contents

- Overview of this repository
- Getting Started
- Running the Experiments
- Citation
## Overview of this repository

- Training and evaluation scripts
  - `script_*.sh`: shell scripts to reproduce the experiments.
  - `train_main.py`: the main Python training script.
  - `eval_*.py`: various Python evaluation scripts.
  - Please refer to the Running the Experiments section for instructions on running these files.
- Training procedures
  - `metrics/` contains metrics used in our experiments, mainly migrated from CodeXGLUE.
  - `models/` contains implementations of our models (Transformer and GRU).
  - `trainers/` contains the main training procedure, wrapped as trainer classes.
- Code transformations
  - `mutable_tree/` contains our implementation of MutableAST. MutableAST is also available at this GitHub repository.
  - `natgen_transformer/` contains the transformations in NatGen, migrated from NatGen's GitHub repository.
  - `ropgen_transformer/` contains the transformations in RopGen, migrated from RopGen's GitHub repository.
- Other files include data pre-processing and utility functions.
## Getting Started

### Setting up the Environment

#### Installing Python Packages and Dependencies

You will (of course) need Python to execute the code:

- Python 3 (we use Python 3.10)

The following packages are required to run the main training and evaluation scripts:

- PyTorch (we use PyTorch 1.12)
- tree-sitter
- Huggingface Transformers
- tqdm
- inflection
- sctokenizer

The following packages are optional and only required by certain experiment scripts:

- SrcML
  - only required if running the transform pipeline provided by RopGen
#### Building tree-sitter Parsers

:warning: Please use tree-sitter==0.20 or tree-sitter==0.21. Newer versions of tree-sitter introduce breaking changes that are incompatible with our code.

We use tree-sitter for MutableAST construction, syntax checking and CodeBLEU computation. Follow the steps below to build a parser for tree-sitter.

Note that our current implementation of MutableAST is based on specific versions of the tree-sitter parsers. The latest parsers might have updated their grammars, which could be incompatible with MutableAST. Therefore, please check out the commits specified in the shell script below, or MutableAST might break.
```bash
# create a directory to store sources
mkdir tree-sitter
cd tree-sitter

# clone parser repositories and pin them to known-good commits
git clone https://github.com/tree-sitter/tree-sitter-java.git
cd tree-sitter-java
git checkout 6c8329e2da78fae78e87c3c6f5788a2b005a4afc
cd ..

git clone https://github.com/tree-sitter/tree-sitter-cpp.git
cd tree-sitter-cpp
git checkout 0e7b7a02b6074859b51c1973eb6a8275b3315b1d
cd ..

git clone https://github.com/tree-sitter/tree-sitter-javascript.git
cd tree-sitter-javascript
git checkout f772967f7b7bc7c28f845be2420a38472b16a8ee
cd ..

# go back to parent dir
cd ..

# run python script to build the parsers
python build_treesitter_langs.py ./tree-sitter

# the built parser will be put under ./parser/languages.so
```
### Datasets

#### GitHub-C and GitHub-Java

The pre-processed GitHub-C and GitHub-Java datasets (originally available here) are included in this repository, under `./datasets/github_c_funcs` and `./datasets/github_java_funcs`.

#### MBXP

The MBXP datasets are originally available at amazon-science/mxeval. The filtered MBXP datasets used in our project are also included in this repository.
#### CodeSearchNet

The CSN datasets are available on the CodeSearchNet project site. Since the CSN datasets are relatively large, they are not included here. Follow the steps below to process the dataset after downloading it.

1. Follow the instructions on CodeXGLUE (code summarization task) to filter the dataset.
2. Run `dataset_filter.py` to filter out samples with grammar errors or unsupported features.

```bash
python dataset_filter.py java <path_to_your_csn_jsonl>
```

The results will be stored as `<filename>_filtered.jsonl`. Rename the output to `train.jsonl`, `valid.jsonl` or `test.jsonl` depending on the split, and put the three files under `./datasets/csn_java` or `./datasets/csn_js`.
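The filtering step can be sketched as a generic JSONL filter. This is a minimal illustration, not the repository's actual implementation: the real `dataset_filter.py` checks grammar and supported features via tree-sitter, which the hypothetical `is_valid` predicate stands in for here.

```python
import json

def filter_jsonl(in_path, out_path, is_valid):
    """Copy samples from in_path to out_path, keeping only those
    for which is_valid(sample) returns True."""
    kept = dropped = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            sample = json.loads(line)
            if is_valid(sample):
                fout.write(json.dumps(sample) + "\n")
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```

In the real script, the predicate would parse each sample's `code` field and reject samples whose parse tree contains errors.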
#### Wrapping Up

After all datasets are processed, the final directory should look like this:

- datasets
  - github_c_funcs
    - train.jsonl
    - valid.jsonl
    - test.jsonl
  - github_java_funcs
    - train.jsonl
    - valid.jsonl
    - test.jsonl
  - csn_java
    - train.jsonl
    - valid.jsonl
    - test.jsonl
  - csn_js
    - train.jsonl
    - valid.jsonl
    - test.jsonl
  - mbcpp
    - test.jsonl
  - mbjp
    - test.jsonl
  - mbjsp
    - test.jsonl

Note that the dataset directory must match the structure listed above; otherwise the data-loading modules will not be able to locate the datasets.
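If you want to sanity-check the layout before training, a small (hypothetical) checker along these lines will do; it is not part of the repository, just a convenience sketch of the structure listed above.

```python
import os

# Expected dataset layout, mirroring the directory tree above
EXPECTED = {
    "github_c_funcs": ["train.jsonl", "valid.jsonl", "test.jsonl"],
    "github_java_funcs": ["train.jsonl", "valid.jsonl", "test.jsonl"],
    "csn_java": ["train.jsonl", "valid.jsonl", "test.jsonl"],
    "csn_js": ["train.jsonl", "valid.jsonl", "test.jsonl"],
    "mbcpp": ["test.jsonl"],
    "mbjp": ["test.jsonl"],
    "mbjsp": ["test.jsonl"],
}

def missing_files(root):
    """Return the list of expected dataset files missing under root."""
    missing = []
    for dataset, splits in EXPECTED.items():
        for split in splits:
            path = os.path.join(root, dataset, split)
            if not os.path.isfile(path):
                missing.append(path)
    return missing
```

Run it with `missing_files("./datasets")`; an empty list means the layout is complete.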
### Preprocessing

Several preprocessing steps are required before running the training or evaluation scripts. These steps provide metadata for the subsequent training/evaluation scripts.

- Collect variable names from the datasets. The script scans through all functions in each dataset and collects their substitutable variable names (local variables and formal parameters). Results will be stored in `./datasets/variable_names_<dataset>.json`.

```bash
# python collect_variable_names_jsonl.py <dataset>
python collect_variable_names_jsonl.py csn_js
python collect_variable_names_jsonl.py csn_java
python collect_variable_names_jsonl.py github_c_funcs
python collect_variable_names_jsonl.py github_java_funcs
python collect_variable_names_jsonl.py mbcpp
python collect_variable_names_jsonl.py mbjp
python collect_variable_names_jsonl.py mbjsp
```
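To give a feel for what "collecting substitutable variable names" means: the real script walks the MutableAST to find local variables and formal parameters, but the idea can be roughly illustrated with a naive regex over simple Java-style declarations. This is purely illustrative and is not how the repository does it.

```python
import re

# Naive illustration: match "<type> <identifier>" for a few primitive types.
# The actual implementation traverses the parse tree instead of using regexes.
DECL_RE = re.compile(r"\b(?:int|long|float|double|boolean|char|String)\s+([A-Za-z_]\w*)")

def collect_variable_names(code):
    """Return a sorted, de-duplicated list of candidate variable names."""
    return sorted(set(DECL_RE.findall(code)))
```

For example, `collect_variable_names("void f(int a) { int total = 0; double rate = 1.5; }")` yields `['a', 'rate', 'total']`, i.e. the formal parameter and the two locals.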
- Collect all feasible transforms. The script enumerates all feasible transformation combinations for each function in the dataset. Results will be stored in `./datasets/feasible_transforms_<dataset>.json` and `./datasets/transforms_per_file_<dataset>.json`. Note that this process can take a while, especially for CSN-Java.

```bash
# python collect_feasible_transforms_jsonl.py <dataset>
python collect_feasible_transforms_jsonl.py csn_js
python collect_feasible_transforms_jsonl.py csn_java
python collect_feasible_transforms_jsonl.py github_c_funcs
python collect_feasible_transforms_jsonl.py github_java_funcs
python collect_feasible_transforms_jsonl.py mbcpp
python collect_feasible_transforms_jsonl.py mbjp
python collect_feasible_transforms_jsonl.py mbjsp
```
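Conceptually, enumerating feasible combinations amounts to taking a Cartesian product over the variants of each transform that applies to a function. The sketch below (a hypothetical `enumerate_combinations` helper, not the repository's actual code) shows why the count grows quickly and why CSN-Java takes a while.

```python
from itertools import product

def enumerate_combinations(applicable):
    """applicable: dict mapping transform name -> list of feasible variant ids.
    Returns one dict per combination of variants (Cartesian product)."""
    names = sorted(applicable)
    return [dict(zip(names, combo))
            for combo in product(*(applicable[n] for n in names))]
```

For a function where two variants of a loop transform and three naming styles apply, `enumerate_combinations({"loop": [0, 1], "naming": [0, 1, 2]})` yields six combinations.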
## Running the Experiments
- Training
- Evaluations
- MutableAST benchmark
### Training

`train_main.py` is responsible for all training tasks. Refer to the `parse_args()` function in `train_main.py` for more details on the arguments.
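For orientation, a plausible reconstruction of part of `parse_args()`, based on the flags used in the examples below, might look like this. This is an assumption for illustration; the actual argument list (defaults, choices, and additional flags) in `train_main.py` may differ.

```python
import argparse

def parse_args(argv=None):
    # Hypothetical subset of the training arguments; defaults are guesses
    # taken from the example invocations in this README.
    parser = argparse.ArgumentParser()
    parser.add_argument("--lang", choices=["c", "cpp", "java", "javascript"])
    parser.add_argument("--dataset", type=str)
    parser.add_argument("--dataset_dir", type=str)
    parser.add_argument("--n_bits", type=int, default=4)
    parser.add_argument("--epochs", type=int, default=25)
    parser.add_argument("--log_prefix", type=str, default="")
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--model_arch", choices=["gru", "transformer"])
    parser.add_argument("--shared_encoder", action="store_true")
    parser.add_argument("--varmask_prob", type=float, default=0.5)
    parser.add_argument("--seed", type=int, default=42)
    return parser.parse_args(argv)
```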
Here are some examples.
```bash
# training a 4-bit GRU model on CSN-Java
python train_main.py \
    --lang=java \
    --dataset=csn_java \
    --dataset_dir=./datasets/csn_java \
    --n_bits=4 \
    --epochs=25 \
    --log_prefix=4bit_gru_srcmarker \
    --batch_size 64 \
    --model_arch=gru \
    --shared_encoder \
    --varmask_prob 0.5 \
    --seed 42

# training a 4-bit Transformer model on CSN-JavaScript
python train_main.py \
    --lang=javascript \
    --dataset=csn_js \
    --dataset_dir=./datasets/csn_js \
    --n_bits=4 \
    --epochs=25 \
    --log_prefix=4bit_transformer_srcmarker \
    --batch_size 64 \
    --model_arch=transformer \
    --shared_encoder \
    --varmask_prob 0.5 \
    --seed 42
```
Alternatively, you can use `script_train.sh` to conveniently start training. However, you may have to manually modify some of the arguments in it.

```bash
# by default, this trains a 4-bit GRU model on a designated dataset;
# you have to manually change some variables in the script
# (such as the model architecture and checkpoint name) to run different tasks
# source script_train.sh <dataset>
source script_train.sh csn_java
```
Checkpoints will be saved in ./ckpts.
### Evaluation

#### Main Evaluation Script

`eval_main.py` is responsible for most of the evaluation tasks.
