ProvCTDG

Implementation of the paper: "Anomaly Detection in Continuous-Time Temporal Provenance Graphs". The code follows the CTDG framework https://github.com/gravins/CTDG-learning-framework.

Anomaly Detection in Continuous-Time Temporal Provenance Graphs

Official code repository for the paper "Anomaly Detection in Continuous-Time Temporal Provenance Graphs", which was accepted to the Temporal Graph Learning Workshop @ NeurIPS 2023.

Please consider citing us:

@inproceedings{
reha2023anomaly,
title={Anomaly Detection in Continuous-Time Temporal Provenance Graphs},
author={Jakub Reha and Giulio Lovisotto and Michele Russo and Alessio Gravina and Claas Grohnfeldt},
booktitle={Temporal Graph Learning Workshop @ NeurIPS 2023},
year={2023},
url={https://openreview.net/forum?id=88tGIxxhsf}
}

Requirements

Note: we assume Miniconda/Anaconda and Docker are installed. The proper Python version is installed during the first step of the following procedure.

Install the required packages and create the environment with the create_env.sh script (the environment it creates is named ctdg_pyg):

./create_env.sh

Alternatively, create the environment from the yml file:

conda env create -f conda_export_env.yml
conda activate ctdg_pyg

Other dependencies: dask, gqlalchemy, docker, natsorted, tqdm, wandb

We provide pre-processed, labeled data with extracted features here, together with the extracted ground truth. If you use this data, you can skip steps 1 to 6 and start directly at step 7.

1. Download data (Optional)
2. Data preprocessing (Optional)
3. Visualize data (Optional)
4. Data labelling (Optional)
5. Feature extraction (Optional)
6. Aggregate ground truth (Optional)
7. Visualize the ground truth (Optional)
8. Train and Test (model selection)
9. Anomaly detection

1. Download data

DARPA TC ENGAGEMENT 3

google drive

Option 1

Download gdrive and set up the service account.
Run:

while true; do gdrive download <folder_id> --recursive --skip --service-account <credentials_path>.json && break; done

Option 2

This option can only download 50 files at a time, and Google may further limit the number of downloads:

gdown https://drive.google.com/drive/folders/<folder_id> --folder --remaining-ok --no-check-certificate

2. Data preprocessing

See darpa_preprocessing

3. Visualize data

Install the Memgraph Python client:

pip install gqlalchemy

If you encounter problems during installation, you might need to set:

# problem related to OpenSSL
export C_INCLUDE_PATH=<environment_path>/include/

or reinstall cmake (link) and then try again.

Then start one Memgraph container per dataset (our code was tested with memgraph/memgraph-platform:2.6.5-memgraph2.5.2-lab2.4.0-mage1.6):

docker run -itd --name trace -p 7688:7687 -p 7445:7444 -p 3001:3000 -v <data_path>:/data memgraph/memgraph-platform:2.6.5-memgraph2.5.2-lab2.4.0-mage1.6

docker run -itd --name theia -p 7689:7687 -p 7446:7444 -p 3002:3000 -v <data_path>:/data memgraph/memgraph-platform:2.6.5-memgraph2.5.2-lab2.4.0-mage1.6

Assumed file structure:

<data_path>
│
└─── theia
│       procfact.txt 
│       socketfact.txt 
│       filefact.txt 
│       edgefact_tmp_x.txt 
│   
└─── trace
        ...

python upload_to_memgraph.py --dataset_local data_path/dataset_name --dataset_memgraph memgraph_data_path/dataset_name --port 7688

When loading edges into the Memgraph database, we split the .csv files containing the edges into smaller chunks. This is done because loading smaller files is much faster than loading big files (this is probably a bug in Memgraph and might be fixed in the future).
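The chunking idea can be sketched in a few lines of Python. This is only an illustration, not the code in upload_to_memgraph.py: the function name split_edge_file, the chunk size, and the assumption that edge files have no header row are all hypothetical.

```python
from pathlib import Path
from typing import List


def split_edge_file(src: Path, out_dir: Path, lines_per_chunk: int = 100_000) -> List[Path]:
    """Split a line-oriented edge file into files of at most `lines_per_chunk` lines.

    Assumption: the file has no header row, so lines can be distributed as-is.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks: List[Path] = []
    out = None
    with src.open("r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            # Start a new chunk file every `lines_per_chunk` lines.
            if i % lines_per_chunk == 0:
                if out is not None:
                    out.close()
                path = out_dir / f"{src.stem}_{len(chunks):05d}{src.suffix}"
                out = path.open("w", encoding="utf-8")
                chunks.append(path)
            out.write(line)
    if out is not None:
        out.close()
    return chunks
```

Calling split_edge_file(Path("edgefact_tmp_x.txt"), Path("split")) would write the chunks into a 'split' subdirectory, mirroring the first step of the script.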

Uploading TRACE or THEIA takes ~10 min.

The steps of the script:

split edge files into smaller chunks and save inside the 'split' subdirectory
delete all contents of the Memgraph database
upload nodes
upload edges
delete nodes with 0-degree
create a snapshot of the database

In your browser go to:

localhost:3001

Example of a Memgraph query (Memgraph uses the Cypher query language, the same as Neo4j):

MATCH (n)-[r]->(m) RETURN * LIMIT 1000
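The same query can also be issued from Python through the gqlalchemy client installed above. The sketch below assumes the trace container started earlier, whose Bolt port is mapped to 7688; the helper names build_sample_query and fetch_sample are hypothetical.

```python
def build_sample_query(limit: int = 1000) -> str:
    """Build the Cypher query that returns up to `limit` (n, r, m) matches."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return f"MATCH (n)-[r]->(m) RETURN * LIMIT {int(limit)}"


def fetch_sample(host: str = "127.0.0.1", port: int = 7688, limit: int = 1000):
    """Run the sample query against a running Memgraph instance.

    Assumption: `port` is the host port mapped to the container's Bolt port 7687.
    """
    # Imported lazily so build_sample_query works without gqlalchemy installed.
    from gqlalchemy import Memgraph

    memgraph = Memgraph(host=host, port=port)
    # execute_and_fetch yields one dict per result row, keyed by variable name.
    return list(memgraph.execute_and_fetch(build_sample_query(limit)))
```

For the theia container, the port would be 7689 instead.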

To change the query timeout, modify the --query-execution-timeout-sec parameter in the memgraph.conf file, copy the file into the Docker container, and restart the container:

docker cp memgraph.conf trace:/etc/memgraph/memgraph.conf
docker restart trace
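Editing the flag by hand is easy, but if you script the setup, a small helper like the one below can rewrite it. This is a sketch under the assumption that memgraph.conf holds one `--flag=value` entry per line; the function name set_flag and the timeout value in the usage note are illustrative only.

```python
def set_flag(conf_text: str, flag: str, value: str) -> str:
    """Return `conf_text` with `--flag=value` set.

    An existing entry for the flag is replaced; otherwise a new line is appended.
    Assumption: the config file uses one `--flag=value` entry per line.
    """
    new_line = f"--{flag}={value}"
    lines = conf_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        if line.strip().startswith(f"--{flag}="):
            lines[i] = new_line
            replaced = True
    if not replaced:
        lines.append(new_line)
    return "\n".join(lines) + "\n"
```

For example, set_flag(text, "query-execution-timeout-sec", "1800") applied to the contents of memgraph.conf produces the text to write back before running the docker cp command above.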

To use our Graph Style, copy the contents of memgraph_graph_style.json into the Graph Style Editor in your browser.

4. Data labelling

See Memgraph installation and data upload above (3. Visualize data).

Then follow darpa_labelling.

5. Feature extraction

This step extracts node features, aggregates edges (see the detailed explanation in our paper), and computes new unique hashes for the edges:

python process_data.py --dataset TRACE --save_folder save_folder --ground_truth_folder ground_truth_folder --graph_folder graph_folder

6. Aggregate ground truth

The ground truth contains labels for the raw edges. If you want to visualize the ground truth or compute the statistics, aggregation is necessary:

cd darpa_labelling/tools
python aggregate.py --ground_truth_folder ground_truth_folder --edges_path edges_folder/edges.csv --dataset TRACE

7. Visualize the ground truth

If you did not execute steps 1 to 6, extract the datasets we provide into the DATA folder (e.g., /somepath/data/darpa_datasets).

To upload the ground truth or the predictions of a model, use:

cd src
python upload_to_memgraph.py --ground_truth --port 7688 --dataset TRACE --ground_truth_path ../groundtruth/ --local_path_to_memgraph /memgraph/

Note that uploading is quite slow, so the code only supports uploading the aggregated ground truth and predictions (step 5, Feature extraction, is therefore required).

8. Train and Test Link Prediction (model selection)

The datasets in the DATA folder are split (ext_roll column in edges.csv) into train (0), validation (1) and test (2) sets such that the test set contains all the malicious attacks. The remaining data is split temporally into train and validation sets in the ratio 0.85/0.15.
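To make the split concrete, here is a minimal sketch of one way such an ext_roll assignment could be computed. It is not the code used to build the provided datasets: it assumes that the test set starts at the first malicious edge, and the function name assign_ext_roll and the (timestamp, is_malicious) input format are hypothetical.

```python
from typing import List, Sequence, Tuple

TRAIN, VAL, TEST = 0, 1, 2


def assign_ext_roll(edges: Sequence[Tuple[float, bool]], train_ratio: float = 0.85) -> List[int]:
    """Assign an ext_roll value to each (timestamp, is_malicious) edge.

    Assumption: every edge from the first malicious timestamp onward is test data,
    which guarantees that the test set contains all malicious edges.
    """
    malicious_times = [t for t, bad in edges if bad]
    cutoff = min(malicious_times) if malicious_times else float("inf")

    # Indices of the edges before the cutoff, ordered by time.
    rest = sorted(
        (i for i, (t, _) in enumerate(edges) if t < cutoff),
        key=lambda i: edges[i][0],
    )
    n_train = int(round(len(rest) * train_ratio))

    roles = [TEST] * len(edges)
    for rank, i in enumerate(rest):
        # The earliest `train_ratio` share is train, the remainder validation.
        roles[i] = TRAIN if rank < n_train else VAL
    return roles
```

Because the split is temporal, no training edge is later than any validation edge, which avoids leaking future information into training.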

Before training, set the hyperparameters and other settings (OHD-TGN, DIR-TGN, Hetero-TGN, HGT-TGN) in conf.py.

Hetero-TGN: 'hetero_gnn' = [True], 'hetero_transformer' = [False]
HGT-TGN: 'hetero_gnn' = [True], 'hetero_transformer' = [True]
OHD-TGN: 'one_hot_dir' = [True]
DIR-TGN: 'dir_GNN' = [True]

For the other hyperparameters, the code runs model selection if multiple values are provided. Each combination of hyperparameter values is called a 'configuration' and is assigned a unique 'conf_id'.
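The idea of expanding value lists into configurations can be sketched as a Cartesian product. This is an illustration only: how conf.py actually enumerates and numbers configurations is an assumption here, and the lr entry is a hypothetical hyperparameter.

```python
from itertools import product
from typing import Any, Dict, List


def enumerate_configurations(grid: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Expand a dict of value lists into one dict per combination, each with a conf_id."""
    keys = list(grid)
    confs = []
    for conf_id, values in enumerate(product(*(grid[k] for k in keys))):
        conf = dict(zip(keys, values))
        conf["conf_id"] = conf_id  # assumption: ids are assigned in enumeration order
        confs.append(conf)
    return confs


grid = {
    "hetero_gnn": [True],
    "hetero_transformer": [True, False],  # HGT-TGN and Hetero-TGN
    "lr": [1e-3, 1e-4],                   # hypothetical hyperparameter
}
confs = enumerate_configurations(grid)
```

With this grid, confs holds four configurations, since the single-valued entry does not multiply the count.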

The code can train multiple configurations in parallel using Ray (--parallelism). After the validation loss converges (--patience), inference is performed on the test set, and the resulting prediction scores are saved as .csv files in <save_dir>/<model>/ckpt/ for each configuration and random seed (--num_runs). Note that these prediction scores correspond to the predicted probability of edge existence. Model selection results (all information about the individual configurations and the scores for all metrics on the train, validation and test sets) are saved to <save_dir>/<model>/model_selection_results.csv. When evaluating on the test set, the malicious edges are masked out. Model checkpoints are also saved in <save_dir>/<model>/ckpt/.
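The role of --patience can be illustrated with the usual early-stopping rule: stop once the validation loss has not improved for `patience` consecutive epochs. The sketch below shows that common scheme; the exact criterion used in main.py is not stated here, and the function name is hypothetical.

```python
from typing import Iterable, Tuple


def best_epoch_with_patience(val_losses: Iterable[float], patience: int) -> Tuple[int, int]:
    """Return (best_epoch, stop_epoch) for a stream of per-epoch validation losses.

    Training stops at the first epoch that is `patience` epochs past the best one.
    Both values are -1 if no losses are given.
    """
    best_loss = float("inf")
    best_epoch = -1
    stop_epoch = -1
    for epoch, loss in enumerate(val_losses):
        stop_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, stop_epoch
```

The checkpoint from best_epoch would then be the natural one to use for inference on the test set.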

We use Weights & Biases for logging.

Note that if a dataset is run for the first time, cache files are created in <dataset_path>/temporal_processed/ and <dataset_path>/delta_t_stats.pkl/. Therefore, if something within a dataset is changed later on, please remove the cached files, so they can be computed again on the updated dataset.
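Removing the cache can be scripted. The sketch below deletes the two cache entries named above inside a given dataset folder and leaves everything else untouched; the function name clear_dataset_cache is hypothetical, and the code handles either entry being a file or a directory since that is not certain from the paths alone.

```python
import shutil
from pathlib import Path
from typing import List


def clear_dataset_cache(dataset_path: Path) -> List[Path]:
    """Delete cached artifacts in `dataset_path` and return what was removed."""
    removed = []
    for name in ("temporal_processed", "delta_t_stats.pkl"):
        target = dataset_path / name
        if target.is_dir():
            shutil.rmtree(target)
        elif target.exists():
            target.unlink()
        else:
            continue  # nothing cached under this name
        removed.append(target)
    return removed
```

The next run on that dataset then recomputes the cache from the updated files.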

A. Train CTDHG

python -u main.py --data_name darpa_theia_0to24 --model TGN --version temporal --parallelism 5 --epochs 50 --batch 200 --save_dir experiment_name --data_dir DATA --num_runs 5 --patience 5 --wandb --no_check_link_existence > out_experiment_name 2> err_experiment_name

B. Train Graph Baseline

python -u main.py --data_name darpa_theia_0to24 --model RGCN --version static --parallelism 5 --epochs 50 --batch 200 --save_dir experiment_name --data_dir DATA --num_runs 5 --patience 5 --wandb --no_check_link_existence > out_experiment_name 2> err_experiment_name

C. Train MLP

python -u main.py --data_name darpa_theia_0to24 --model MLP --version temporal --parallelism 5 --epochs 50 --batch 200 --save_dir experiment_name --data_dir DATA --num_runs 5 --patience 5 --wandb --no_check_link_existence > out_experiment_name 2> err_experiment_name

9. Anomaly detection

After you have trained a model, you can compute the anomaly scores with the following script:

python -u anomaly_detection.py --prediction_folder prediction_folder --ground_truth_path path_to_ground_truth --save_folder save_folder --model_name TGN --dataset THEIA --conf_id 0 --wandb
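Since the saved prediction scores are probabilities of edge existence, one simple convention is to treat an unlikely edge as anomalous and score it as one minus that probability. This is an assumption for illustration, not necessarily what anomaly_detection.py computes, and both function names are hypothetical.

```python
from typing import List, Sequence


def anomaly_scores(existence_probs: Sequence[float]) -> List[float]:
    """Map edge-existence probabilities to anomaly scores (assumed to be 1 - p)."""
    return [1.0 - p for p in existence_probs]


def top_k_anomalies(existence_probs: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k edges with the highest anomaly score."""
    scores = anomaly_scores(existence_probs)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]
```

Ranking edges this way allows the top-scoring ones to be compared against the ground truth labels.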
