On Training Robust PDF Malware Classifiers

Code for our paper On Training Robust PDF Malware Classifiers (Usenix Security'20) Yizheng Chen, Shiqi Wang, Dongdong She, Suman Jana

Blog Posts

Monotonic Malware Classifiers (5 min read)

Gmail's malicious document classifier can still be trivially evaded (3 min read)

Dataset

Full PDF dataset

Available here at contagio: "16,800 clean and 11,960 malicious files for signature testing and research."

Training and Testing datasets

We split the PDFs into 70% train and 30% test. Then, we used the Hidost feature extractor to extract structural paths features, with the default compact path option. We obtained the following training and testing data.

| | Training PDFs | Testing PDFs | |---|---|---| | Malicious | 6,896 | 3,448 | | Benign | 6,294 | 2,698 |

The hidost structural paths are here.

The extracted training and testing libsvm files are here. The 500 seed malware samples with network activities from EvadeML are in the test set.

500 seed malware hash list.. Put these PDFs under data/500_seed_pdfs/.

Models

The following models are TensorFlow checkpoints, except that two ensemble models need additional wrappers.

| Checkpoint | Model | |---|---| | baseline_checkpoint | Baseline | | baseline_adv_delete_one | Adv Retrain A | | baseline_adv_insert_one | Adv Retrain B | | baseline_adv_delete_two | Adv Retrain C | | baseline_adv_insert_rootallbutone | Adv Retrain D | | baseline_adv_combine_two | Adv Retrain A+B | | adv_del_twocls | Ensemble A+B Base Learner | | adv_keep_twocls | Ensemble D Base Learner | | robust_delete_one | Robust A | | robust_insert_one | Robust B | | robust_delete_two | Robust C | | robust_insert_allbutone | Robust D | | robust_monotonic | Robust E | | robust_combine_two_v2_e18 | Robust A+B | | robust_combine_three_e17 | Robust A+B+E |

The following are XGBoost tree ensemble models.

| Binary | Model | |---|---| | model_10learner_test.bin | Monotonic Classifier, 10 learners | | model_100learner.bin | Monotonic Classifier, 100 learners | | model_1000learner.bin | Monotonic Classifier, 1000 learners | | model_2000learner.bin | Monotonic Classifier, 2000 learners |

Training Code

To train and evaluate the VRAs of baseline model, adv retrain models, ensemble models, XGBoost monotonic models, and robust models, see README under train/.

Attacks in the Paper

See README under attack/.

After running our adaptive attack based on EvadeML against the Robust A+B+E model for three weeks, we were not able to fully evade the model to generate functional evasive PDF malware variants. As shown in the Figure above, the estimated robust accuracy against adaptive attacks can be reduced to 0% for Monotonic 100 and Robust A+B models, but not Robust A+B+E model. We hope researchers can design stronger attacks to evade our Robust A+B+E model.

Using the EvadeML framework, our adaptive strategies against this model are:

Move Exploit Attack. The monotonic property (Property E) forces the attacker to delete objects from the malware, but deletion could remove the exploit. Therefore, we implement a new mutation to move the exploit around to different trigger points in the PDF.
Scatter Attack. To evade Robust A+B and Robust A+B+E, we insert and delete more objects under different subtrees. We keep track of past insertion and deletion operations separately, and prioritize new insertion and deletion operations to target a different subtree.

MalGAN Attack Evaluation

Please check out this MalGAN attack evaluation against our robust models by Zeyi.

Pdfclassifier

Install / Use

README