MGTAB

MGTAB: A Multi-Relational Graph-Based Twitter Account Detection Benchmark

Introduction

MGTAB is the first standardized graph-based benchmark for stance and bot detection. MGTAB contains 10,199 expert-annotated users and 7 types of relationships, ensuring high-quality annotation and diversified relations. For more details, please refer to the MGTAB paper.

Distribution of labels in annotations.

<table>
<thead>
<tr> <td colspan=3 align="center">Stance</td> <td colspan=3 align="center">Bot</td> </tr>
</thead>
<tbody>
<tr> <td colspan=1 align="center">Label</td> <td colspan=1 align="center">Count</td> <td colspan=1 align="center">Percentage</td> <td colspan=1 align="center">Label</td> <td colspan=1 align="center">Count</td> <td colspan=1 align="center">Percentage</td> </tr>
<tr> <td colspan=1 align="center">neutral</td> <td colspan=1 align="center">3,776</td> <td colspan=1 align="center">37.02</td> <td colspan=1 align="center">human</td> <td colspan=1 align="center">7,451</td> <td colspan=1 align="center">73.06</td> </tr>
<tr> <td colspan=1 align="center">against</td> <td colspan=1 align="center">3,637</td> <td colspan=1 align="center">35.66</td> <td colspan=1 align="center">bot</td> <td colspan=1 align="center">2,748</td> <td colspan=1 align="center">26.94</td> </tr>
<tr> <td colspan=1 align="center">support</td> <td colspan=1 align="center">2,786</td> <td colspan=1 align="center">27.32</td> <td colspan=3 align="center"> </td> </tr>
</tbody>
</table>

MGTAB contains 10,199 expert-annotated users. MGTAB-large adds 400,000 unlabelled users to these.
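
The percentages in the label table follow directly from the counts, and both tasks are imbalanced to some degree (the bot task more so). The sketch below recomputes the percentages and derives inverse-frequency class weights. The weighting scheme is a common heuristic chosen here for illustration; it is an assumption, not something the MGTAB code is known to use, and the function names are hypothetical.

```python
# Counts copied from the label-distribution table.
STANCE_COUNTS = {"neutral": 3776, "against": 3637, "support": 2786}
BOT_COUNTS = {"human": 7451, "bot": 2748}


def percentages(counts):
    """Return each label's share of the total in percent, rounded to 2 places."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.items()}


def inverse_frequency_weights(counts):
    """Weights proportional to 1 / frequency.

    Scaled so that a perfectly balanced dataset would give every class 1.0.
    Assumption: this heuristic is illustrative, not taken from the MGTAB code.
    """
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


if __name__ == "__main__":
    print(percentages(STANCE_COUNTS))
    print(percentages(BOT_COUNTS))
    print(inverse_frequency_weights(BOT_COUNTS))
```

With these weights the minority `bot` class counts for more than the majority `human` class in a weighted loss.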

Multiple relations in the MGTAB.

Our proposed dataset has seven types of user relationships.

<table>
<thead>
<tr> <td colspan=8 align="center">MGTAB</td> </tr>
</thead>
<tbody>
<tr> <td colspan=1 align="center">Edge type</td> <td colspan=1 align="center">followers</td> <td colspan=1 align="center">friends</td> <td colspan=1 align="center">mention</td> <td colspan=1 align="center">reply</td> <td colspan=1 align="center">quoted</td> <td colspan=1 align="center">URL</td> <td colspan=1 align="center">hashtag</td> </tr>
<tr> <td colspan=1 align="center">Numbers</td> <td colspan=1 align="center">308,120</td> <td colspan=1 align="center">412,575</td> <td colspan=1 align="center">114,516</td> <td colspan=1 align="center">223,466</td> <td colspan=1 align="center">77,631</td> <td colspan=1 align="center">263,800</td> <td colspan=1 align="center">300,000</td> </tr>
<tr> <td colspan=8 align="center">MGTAB-large</td> </tr>
<tr> <td colspan=1 align="center">Edge type</td> <td colspan=1 align="center">followers</td> <td colspan=1 align="center">friends</td> <td colspan=1 align="center">mention</td> <td colspan=1 align="center">reply</td> <td colspan=1 align="center">quoted</td> <td colspan=1 align="center">URL</td> <td colspan=1 align="center">hashtag</td> </tr>
<tr> <td colspan=1 align="center">Numbers</td> <td colspan=1 align="center">31,990,488</td> <td colspan=1 align="center">49,668,723</td> <td colspan=1 align="center">7,135,192</td> <td colspan=1 align="center">1,018,834</td> <td colspan=1 align="center">182,296</td> <td colspan=1 align="center">51,281</td> <td colspan=1 align="center">7,950,896</td> </tr>
</tbody>
</table>
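
To make the idea of choosing a subset of relations concrete, here is a minimal sketch of filtering a typed edge list by relation index, in the spirit of the `--relation_select` flag used by the training scripts. It assumes that relation indices 0 to 6 follow the column order of the edge-type table; the README does not state the mapping, so treat `RELATION_NAMES` and `select_relations` as hypothetical.

```python
# Assumed index order: the column order of the edge-type table.
# The mapping actually used by the MGTAB scripts may differ.
RELATION_NAMES = ["followers", "friends", "mention", "reply", "quoted", "URL", "hashtag"]


def select_relations(edges, edge_types, relation_select):
    """Keep only the edges whose relation index is in relation_select.

    edges:       list of (source, target) node-index pairs
    edge_types:  list of relation indices, one per edge
    Returns the kept edges and their relation indices renumbered from 0,
    so that a model sees a contiguous range of relation ids.
    """
    if len(edges) != len(edge_types):
        raise ValueError("edges and edge_types must have the same length")
    remap = {old: new for new, old in enumerate(sorted(set(relation_select)))}
    kept_edges, kept_types = [], []
    for edge, rel in zip(edges, edge_types):
        if rel in remap:
            kept_edges.append(edge)
            kept_types.append(remap[rel])
    return kept_edges, kept_types


if __name__ == "__main__":
    # Toy graph, not real MGTAB data.
    edges = [(0, 1), (1, 2), (2, 0), (0, 3)]
    edge_types = [0, 1, 2, 1]
    print(select_relations(edges, edge_types, [0, 1]))
```

Renumbering the kept relations is a design choice that suits relational layers expecting ids in `0..num_relations-1`; a homogeneous model such as GCN could simply ignore the returned types.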

Environment

python 3.7
scikit-learn 1.0.2
torch 1.8.1+cu111
torch_cluster 1.5.9
torch_scatter 2.0.6
torch_sparse 0.6.9
torch_spline_conv 1.2.1
torch-geometric 2.0.4
pytorch-lightning 1.5.0

Train Model

To start the training process:

Train GNN models

python MGTAB-GNN.py  --task stance --model GCN --relation_select 0 1 --random_seed 0 1 2 3 4
python MGTAB-GNN.py  --task bot --model RGCN --relation_select 0 1 --random_seed 0 1 2 3 4
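
When several task and model combinations need to be run, the command lines above can be assembled programmatically. The sketch below only builds the argument list from the flags shown in this README (`--task`, `--model`, `--relation_select`, `--random_seed`); the helper name `build_gnn_command` is hypothetical, and it does not check which model names the script accepts.

```python
def build_gnn_command(task, model, relation_select, random_seed, script="MGTAB-GNN.py"):
    """Build the argument list for one training run.

    Only flags that appear in this README are used. The list can be passed
    to subprocess.run(cmd, check=True) from the repository root.
    """
    if task not in ("stance", "bot"):
        raise ValueError("task must be 'stance' or 'bot'")
    cmd = ["python", script, "--task", task, "--model", model, "--relation_select"]
    cmd += [str(r) for r in relation_select]
    cmd += ["--random_seed"] + [str(s) for s in random_seed]
    return cmd


if __name__ == "__main__":
    seeds = [0, 1, 2, 3, 4]
    # The two runs shown in the README.
    for task, model in (("stance", "GCN"), ("bot", "RGCN")):
        print(" ".join(build_gnn_command(task, model, [0, 1], seeds)))
```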

Train Machine Learning models

python MGTAB-ML.py  --task stance --models_list 1 2 3  --random_seed 0 1 2 3 4
python MGTAB-ML.py  --task bot --models_list 4 5 6 7  --random_seed 0 1 2 3 4

Train GNN models in parallel on multiple GPUs

python GNN_sample_large.py  --task bot --relation_select 0 1 2 3 4 5 6 --model RGT --GPU_num 4
python GNN_sample_large.py  --task bot --relation_select 0 1 2 3 4 --model SHGN --GPU_num 4
python GNN_sample_large.py  --task stance --relation_select 0 1 --model GCN --GPU_num 4
python GNN_sample_large.py  --task stance --relation_select 0 --model GAT --GPU_num 4

Baseline performance

Stance detection performance on MGTAB

| methods | type | accuracy | precision | recall | f1-score |
| ------------------- | ---- | -------------------- | -------------------- | -------------------- | -------------------- |
| AdaBoost | F | 74.59 ${1.41}$ | 74.60 ${1.35}$ | 74.02 ${1.61}$ | 73.88 ${1.47}$ |
| Random Forest | F | 79.62 ${0.68}$ | 80.04 ${0.43}$ | 78.83 ${0.98}$ | 79.04 ${0.82}$ |
| Decision Tree | F | 66.92 ${0.93}$ | 66.34 ${1.02}$ | 66.23 ${1.06}$ | 66.03 ${0.84}$ |
| SVM | F | 81.23 ${0.66}$ | 81.40 ${0.71}$ | 80.86 ${1.09}$ | 80.71 ${0.78}$ |
| KNN | F | 76.25 ${1.32}$ | 75.54 ${1.41}$ | 75.70 ${1.37}$ | 75.48 ${1.37}$ |
| Logistic Regression | F | 79.51 ${1.00}$ | 79.33 ${0.98}$ | 78.83 ${1.17}$ | 78.98 ${1.11}$ |
| GCN | G | 81.35 ${0.58}$ | 81.08 ${0.30}$ | 80.19 ${0.56}$ | 80.08 ${0.56}$ |
| GraphSAGE | G | 83.33 ${1.22}$ | 82.52 ${1.63}$ | 83.45 ${0.63}$ | 82.72 ${1.34}$ |
| GAT | G | 82.19 ${1.23}$ | 81.72 ${1.19}$ | 81.68 ${1.16}$ | 81.04 ${1.24}$ |
| HGT | G | 83.29 ${0.44}$ | 81.63 ${0.58}$ | 81.51 ${0.76}$ | 81.82 ${0.34}$ |
| S-HGN | G | 85.32 ${0.53}$ | 83.93 ${0.67}$ | 83.65 ${0.65}$ | 84.42 ${0.43}$ |
| BotRGCN | G | 84.71 ${1.43}$ | 83.43 ${1.23}$ | 84.08 ${0.94}$ | 84.30 ${1.44}$ |
| RGT | G | 87.78 ${0.43}$ | 85.22 ${0.89}$ | 84.40 ${0.74}$ | 86.86 ${0.43}$ |

Bot detection performance on MGTAB

| methods | type | accuracy | precision | recall | f1-score |
| ------------------- | ---- | -------------------- | -------------------- | -------------------- | -------------------- |
| AdaBoost | F | 90.12 ${0.92}$ | 88.51 ${1.33}$ | 89.10 ${0.92}$ | 87.71 ${1.10}$ |
| Random Forest | F | 89.52 ${0.44}$ | 88.92 ${0.49}$ | 86.72 ${1.15}$ | 86.83 ${0.53}$ |
| Decision Tree | F | 87.13 ${0.51}$ | 83.81 ${0.72}$ | 83.39 ${1.06}$ | 83.70 ${0.74}$ |
| SVM | F | 88.68 ${1.40}$ | 85.73 ${1.84}$ | 85.73 ${1.84}$ | 85.31 ${1.73}$ |
| KNN | F | 85.78 ${0.84}$ | 82.28 ${1.22}$ | 80.49 ${0.64}$ | 81.28 ${0.66}$ |
| Logistic Regression | F | 88.49 ${1.31}$ | 85.69 ${1.69}$ | 84.41 ${1.96}$ | 84.97 ${1.67}$ |
| GCN | G | 85.81 ${1.32}$ | 77.40 ${2.12}$ | 84.37 ${1.73}$ | 78.33 ${1.67}$ |
| GraphSAGE | G | 88.71 ${1.24}$ | 85.33 ${1.83}$ | 86.15 ${2.55}$ | 85.44 ${1.08}$ |
| GAT | G | 86.96 ${1.28}$ | 79.71 ${2.96}$ | 84.88 ${1.52}$ | 82.33 ${2.12}$ |
| HGT | G | 90.28 ${0.29}$ | 85.35 ${0.33}$ | 85.97 ${0.41}$ | 87.52 ${0.37}$ |
| S-HGN | G | 91.42 ${0.43}$ | 87.40 ${0.67}$ | 86.73 ${0.64}$ | 88.72 ${0.58}$ |
| BotRGCN | G | 89.60 ${0.82}$ | 85.21 ${1.81}$ | 87.07 ${1.38}$ | 87.16 ${0.74}$ |
| RGT | G | 92.12 ${0.37}$ | 88.08 ${0.43}$ | 86.64 ${0.25}$ | 90.41 ${0.47}$ |
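
Each cell in the baseline tables pairs a score with a smaller number in braces. Given that the training commands run five random seeds, that number is presumably the spread across runs, although the README does not say so explicitly. Under that assumption, a cell could be produced from per-seed scores as sketched below. The use of the sample standard deviation is also an assumption, and the scores in the example are invented for illustration, not MGTAB results.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, sample standard deviation) of per-seed scores, rounded to 2 places."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a spread")
    return round(mean(scores), 2), round(stdev(scores), 2)


def format_cell(scores):
    """Format per-seed scores in the same 'mean ${std}$' style as the tables."""
    avg, spread = summarize_runs(scores)
    return "{:.2f} ${{{:.2f}}}$".format(avg, spread)


if __name__ == "__main__":
    # Hypothetical accuracies for seeds 0..4, made up for this example.
    hypothetical_accuracy = [80.0, 81.0, 82.0, 83.0, 84.0]
    print(format_cell(hypothetical_accuracy))
```

If the population standard deviation is wanted instead, `statistics.pstdev` can be swapped in for `stdev`.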

Licensing

The MGTAB dataset is released under the CC BY-NC-ND 4.0 license. The code in the MGTAB evaluation framework is released under the MIT license.

Datasets download

For SemEval-2016 T6, visit the SemEval2016 repository. For SemEval-2019 T7, visit the SemEval2019 GitHub repository. For TwiBot-20, visit the TwiBot-20 GitHub repository. For TwiBot-22, visit the TwiBot-22 GitHub repository. For other bot detection datasets, visit the Bot Repository.

MGTAB is available at Google Drive. MGTAB-large (contains 400,000 unlabeled us
