Analysis Tools
This repository contains all the scripts needed to reproduce the results published in the paper: "Obfuscation Revealed: Electromagnetic obfuscated malware classification".
Overview
.
├── README.md
├── requirements.txt
├── run_dl_on_selected_bandwidth.sh    #> script to run the DL evaluation for all
│                                      #  scenarios on the (full) testing dataset
│                                      #  (available on Zenodo) using pre-computed models
├── run_dl_on_reduced_dataset.sh       #> script to run the DL training on a
│                                      #  reduced dataset (350 samples per
│                                      #  executable, available on Zenodo)
├── run_ml_on_reduced_dataset.sh       #> script to run the end-to-end ML analysis
│                                      #  on a reduced dataset (350 samples per
│                                      #  executable, available on Zenodo)
├── run_ml_on_selected_bandwidth.sh    #> script to run the ML classification for
│                                      #  all scenarios on the pre-computed testing
│                                      #  dataset (available on Zenodo)
├── update_lists.sh                    #> script to update the location of the traces
│                                      #  in the lists
│
├── ml_analysis
│   ├── evaluate.py                    #> code for the LDA + {NB, SVM} analysis on the
│   │                                  #  reduced dataset (raw_data_reduced_dataset)
│   ├── NB.py                          #> naïve Bayes classifier with known model
│   │                                  #  (traces_selected_bandwidth)
│   ├── SVM.py                         #> support vector machine with known model
│   │                                  #  (traces_selected_bandwidth)
│   ├── log-evaluation_reduced_dataset.txt      #> output log file for the ML evaluation
│   │                                           #  on the reduced dataset
│   └── log-evaluation_selected_bandwidth.txt   #> output log file for the ML evaluation
│                                               #  using the pre-computed models
│
├── dl_analysis
│   ├── evaluate.py                    #> code to run MLP and CNN predictions using pretrained models
│   ├── training.py                    #> code to train MLP and CNN models and store them
│   │                                  #  according to best validation accuracy
│   ├── evaluation_log_DL.txt          #> output log file with stored accuracies on the testing dataset
│   ├── training_log_reduced_dataset_mlp.txt    #> output log file with stored validation accuracies
│   │                                           #  on the reduced dataset for the MLP over all scenarios and bandwidths
│   └── training_log_reduced_dataset_cnn.txt    #> output log file with stored validation accuracies
│                                               #  on the reduced dataset for the CNN over all scenarios and bandwidths
│
├── list_selected_bandwidth            #> lists of the files used for training,
│   │                                  #  validation and testing (all in one file)
│   │                                  #  for each scenario (but only the testing
│   │                                  #  data are available). Lists associated with
│   │                                  #  the selected-bandwidth dataset
│   ├── files_lists_tagmap=executable_classification.npy
│   ├── files_lists_tagmap=novelty_classification.npy
│   ├── files_lists_tagmap=packer_identification.npy
│   ├── files_lists_tagmap=virtualization_identification.npy
│   ├── files_lists_tagmap=family_classification.npy
│   ├── files_lists_tagmap=obfuscation_classification.npy
│   └── files_lists_tagmap=type_classification.npy
│
├── list_reduced_dataset               #> lists of the files used for training,
│   │                                  #  validation and testing (all in one file)
│   │                                  #  for each scenario. Lists associated with
│   │                                  #  the reduced dataset
│   ├── files_lists_tagmap=executable_classification.npy
│   ├── files_lists_tagmap=novelty_classification.npy
│   ├── files_lists_tagmap=packer_identification.npy
│   ├── files_lists_tagmap=virtualization_identification.npy
│   ├── files_lists_tagmap=family_classification.npy
│   ├── files_lists_tagmap=obfuscation_classification.npy
│   └── files_lists_tagmap=type_classification.npy
│
└── pre-processings                    #> code used to preprocess the raw traces so
    │                                  #  that the evaluations can be run
    ├── list_manipulation.py           #> split traces into {learning, testing,
    │                                  #  validation} sets
    ├── accumulator.py                 #> compute the sum and the sum of squares (to
    │                                  #  be able to recompute the NICVs quickly)
    ├── nicv.py                        #> compute the NICVs
    ├── corr.py                        #> compute the Pearson coefficient (an
    │                                  #  alternative to the NICV)
    ├── displayer.py                   #> display NICVs, correlations, traces...
    ├── signal_processing.py           #> some signal processing routines (STFT, ...)
    ├── bandwidth_extractor.py         #> extract bandwidths, based on the NICV
    │                                  #  results, and create a new dataset
    └── tagmaps                        #> all tagmaps used to label the data
        │                              #  (used to create the lists)
        ├── executable_classification.csv
        ├── family_classification.csv
        ├── novelties_classification.csv
        ├── obfuscation_classification.csv
        ├── packer_identification.csv
        ├── type_classification.csv
        └── virtualization_identification.csv
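The repository's own nicv.py is not reproduced here, but as a rough illustration: the NICV at each point of a trace is Var(E[X|Y]) / Var(X), i.e. the variance of the per-class means normalized by the total variance, which is why sums and sums of squares are enough to recompute it. A minimal NumPy sketch (the function name and layout are ours, not the repository's):

```python
import numpy as np

def nicv(traces, labels):
    """Normalized Inter-Class Variance per trace point.

    NICV = Var_y( E[X | Y=y] ) / Var(X). The per-class sums below are
    exactly the kind of accumulators that let NICV be recomputed quickly.
    """
    traces = np.asarray(traces, dtype=np.float64)
    labels = np.asarray(labels)
    classes = np.unique(labels)

    # Per-class sum and count (incrementally updatable accumulators).
    sums = np.stack([traces[labels == c].sum(axis=0) for c in classes])
    counts = np.array([(labels == c).sum() for c in classes], dtype=np.float64)

    class_means = sums / counts[:, None]                  # E[X | Y=y]
    weights = counts / counts.sum()
    grand_mean = (weights[:, None] * class_means).sum(axis=0)

    between = (weights[:, None] * (class_means - grand_mean) ** 2).sum(axis=0)
    total = traces.var(axis=0)                            # Var(X)
    return between / total
```

A point that perfectly separates the classes yields NICV = 1; a point independent of the label yields NICV = 0.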
Getting Started
Python 3.6
To be able to run the analyses, you need to install Python 3.6 and the required packages:
pip install -r requirements.txt
Data
The testing dataset (spectrograms) used in the paper can be downloaded from the following website:
https://zenodo.org/record/5414107
File lists
In order to update the location of the data you previously downloaded inside
the lists, please run the script update_lists.sh:
./update_lists.sh [directory where the lists are stored] [directory where the downloaded spectrograms are stored]
This must be applied to the directories list_selected_bandwidth and list_reduced_dataset,
respectively associated with the datasets traces_selected_bandwidth.zip and raw_data_reduced_dataset.zip.
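update_lists.sh is the supported way to do this; purely as an illustration of what the update amounts to, rewriting a path prefix in a list could look like the following sketch (the flat-array layout and prefixes here are hypothetical, not the actual layout of the files_lists_tagmap=*.npy files):

```python
import numpy as np

# Hypothetical layout: a flat array of absolute trace paths.
old_prefix = "/old/location"
new_prefix = "/data/spectrograms"
paths = np.array([f"{old_prefix}/trace_{i}.npy" for i in range(3)])

# Replace only the leading occurrence of the old prefix in each path.
updated = np.array([p.replace(old_prefix, new_prefix, 1) for p in paths])
```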
Machine Learning (ML)
To run all the machine learning experiments, you can use
the scripts run_ml_on_reduced_dataset.sh and run_ml_on_selected_bandwidth.sh:
./run_ml_on_selected_bandwidth.sh [directory where the lists are stored] [directory where the models are stored] [directory where the accumulated data are stored (precomputed in pretrained_models/ACC)]
The results are stored in the file ml_analysis/log-evaluation_selected_bandwidth.txt.
./run_ml_on_reduced_dataset.sh
The results are stored in the file ml_analysis/log-evaluation_reduced_dataset.txt.
The directory ml_analysis contains the code needed for the ML classification.
evaluate.py
usage: evaluate.py [-h]
[--lists PATH_LISTS]
[--mean_size MEAN_SIZES]
[--log-file LOG_FILE]
[--acc PATH_ACC]
[--nb_of_bandwidth NB_OF_BANDWIDTH]
[--time_limit TIME_LIMIT]
[--metric METRIC]
optional arguments:
-h, --help show this help message and exit
--lists PATH_LISTS Absolute path to a file containing the lists
--mean_size MEAN_SIZES Size of each mean
--log-file LOG_FILE Absolute path to the file to save results
--acc PATH_ACC Absolute path of the accumulators directory
--nb_of_bandwidth NB_OF_BANDWIDTH Number of bandwidths to extract
--time_limit TIME_LIMIT Percentage of time to keep (from the beginning)
--metric METRIC Metric used to select bandwidths: {nicv, corr}_{mean, max}
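The --nb_of_bandwidth and --metric options describe a ranking step: each candidate bandwidth gets a score (e.g. the mean or max of the NICV over its frequency bins) and the top-scoring bands are kept. A toy sketch of that idea, with hypothetical shapes and a made-up NICV vector (this is not the repository's evaluate.py code):

```python
import numpy as np

# Hypothetical setup: 100 frequency bins grouped into bands of 10 bins;
# one NICV value per bin (random here, just for illustration).
rng = np.random.default_rng(0)
nicv_per_bin = rng.random(100)
band_size, nb_of_bandwidth = 10, 3

# "nicv_mean" metric: score each band by its mean NICV, keep the top bands.
band_scores = nicv_per_bin.reshape(-1, band_size).mean(axis=1)
selected = np.argsort(band_scores)[::-1][:nb_of_bandwidth]
```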
NB.py
usage: NB.py [-h]
[--lists PATH_LISTS]
[--model_lda MODEL_LDA]
[--model_nb MODEL_NB]
--mean_size MEAN_SIZES Size of each mean
[--log-file LOG_FILE]
[--time_limit TIME_LIMIT]
[--acc PATH_ACC]
optional arguments:
-h, --help show this help message and exit
--lists PATH_LISTS Absolute path to a file containing the lists
--model_lda MODEL_LDA Absolute path to the file where the LDA model has been previously saved
--model_nb MODEL_NB Absolute path to the file where the NB model has been previously saved
--mean_size MEAN_SIZES Size of each mean
--log-file LOG_FILE Absolute path to the file to save results
--time_limit TIME_LIMIT Percentage of time to keep (from the beginning)
--acc PATH_ACC Absolute path of the accumulators directory
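NB.py applies the previously saved LDA projection followed by a naïve Bayes classifier. A toy scikit-learn sketch of that pipeline on synthetic data (this is only an analogous construction, not the repository's code; it assumes the Gaussian variant of naïve Bayes):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for spectrogram features: 300 samples, 50 features, 3 classes.
X, y = make_classification(n_samples=300, n_features=50, n_informative=10,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# LDA projects to at most n_classes - 1 dimensions, then NB classifies.
model = make_pipeline(LinearDiscriminantAnalysis(), GaussianNB())
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
```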
read_logs.py
usage: read_logs.py [-h]
[--path PATH]
[--plot PATH_TO_PLOT]
optional arguments:
-h, --help show this help message and exit
--path PATH Absolute path to the log file
--plot PATH_TO_PLOT Absolute path to save the plot
SVM.py
usage: SVM.py [-h]
[--lists PATH_LISTS]
[--model_lda MODEL_LDA]
[--model_svm MODEL_SVM]
[--mean_size MEAN_SIZES]
[--log-file LOG_FILE]
[--time_limit TIME_LIMIT]
[--acc PATH_ACC]
optional arguments:
-h, --help show this help message and exit
--lists PATH_LISTS Absolute path to a file containing the lists
--model_lda MODEL_LDA Absolute path to the file where the LDA model has been previously saved
--model_svm MODEL_SVM Absolute path to the file where the SVM model has been previously saved
--mean_size MEAN_SIZES Size of each mean
--log-file LOG_FILE Absolute path to the file to save results
--time_limit TIME_LIMIT Percentage of time to keep (from the beginning)
--acc PATH_ACC Absolute path of the accumulators directory