Analysis Tools
This repository contains all the scripts needed to reproduce the results published in the paper: "Obfuscation Revealed: Electromagnetic obfuscated malware classification".
Overview
.
├── README.md
├── requirements.txt
├── run_dl_on_selected_bandwidth.sh    #> script to run the DL evaluation for all
│                                      #  scenarios on the (full) testing dataset
│                                      #  (available on Zenodo) using pre-computed models
├── run_dl_on_reduced_dataset.sh       #> script to run the DL training on a
│                                      #  reduced dataset (350 samples per
│                                      #  executable, available on Zenodo)
├── run_ml_on_reduced_dataset.sh       #> script to run the end-to-end ML analysis
│                                      #  on a reduced dataset (350 samples per
│                                      #  executable, available on Zenodo)
├── run_ml_on_selected_bandwidth.sh    #> script to run the ML classification for
│                                      #  all scenarios on the pre-computed testing
│                                      #  dataset (available on Zenodo)
├── update_lists.sh                    #> script to update the location of the traces
│                                      #  in the lists
│
├── ml_analysis
│   ├── evaluate.py                    #> code for the LDA + {NB, SVM} analysis on the
│   │                                  #  reduced dataset (raw_data_reduced_dataset)
│   ├── NB.py                          #> naïve Bayes classifier with known model
│   │                                  #  (traces_selected_bandwidth)
│   ├── SVM.py                         #> support vector machine with known model
│   │                                  #  (traces_selected_bandwidth)
│   ├── log-evaluation_reduced_dataset.txt      #> output log file for the ML evaluation
│   │                                           #  on the reduced dataset
│   └── log-evaluation_selected_bandwidth.txt   #> output log file for the ML evaluation
│                                               #  using the pre-computed models
│
├── dl_analysis
│   ├── evaluate.py                    #> code to run MLP and CNN predictions using pretrained models
│   ├── training.py                    #> code to train MLP and CNN models and store them
│   │                                  #  according to best validation accuracy
│   ├── evaluation_log_DL.txt          #> output log file with stored accuracies on the testing dataset
│   ├── training_log_reduced_dataset_mlp.txt    #> output log file with stored validation accuracies
│   │                                           #  on the reduced dataset for the MLP over all scenarios and bandwidths
│   └── training_log_reduced_dataset_cnn.txt    #> output log file with stored validation accuracies
│                                               #  on the reduced dataset for the CNN over all scenarios and bandwidths
│
├── list_selected_bandwidth            #> lists of the files used for training,
│   │                                  #  validation and testing (all in one file)
│   │                                  #  for each scenario (but only the testing
│   │                                  #  data are available). Lists associated with
│   │                                  #  the selected-bandwidth dataset
│   ├── files_lists_tagmap=executable_classification.npy
│   ├── files_lists_tagmap=novelty_classification.npy
│   ├── files_lists_tagmap=packer_identification.npy
│   ├── files_lists_tagmap=virtualization_identification.npy
│   ├── files_lists_tagmap=family_classification.npy
│   ├── files_lists_tagmap=obfuscation_classification.npy
│   └── files_lists_tagmap=type_classification.npy
│
├── list_reduced_dataset               #> lists of the files used for training,
│   │                                  #  validation and testing (all in one file)
│   │                                  #  for each scenario. Lists associated with
│   │                                  #  the reduced dataset
│   ├── files_lists_tagmap=executable_classification.npy
│   ├── files_lists_tagmap=novelty_classification.npy
│   ├── files_lists_tagmap=packer_identification.npy
│   ├── files_lists_tagmap=virtualization_identification.npy
│   ├── files_lists_tagmap=family_classification.npy
│   ├── files_lists_tagmap=obfuscation_classification.npy
│   └── files_lists_tagmap=type_classification.npy
│
└── pre-processings                    #> code used to preprocess the raw traces so
    │                                  #  that the evaluations can be run
    ├── list_manipulation.py           #> split traces into {learning, testing,
    │                                  #  validation} sets
    ├── accumulator.py                 #> compute the sum and the sum of squares (to
    │                                  #  be able to recompute the NICVs quickly)
    ├── nicv.py                        #> compute the NICVs
    ├── corr.py                        #> compute the Pearson coefficient (an
    │                                  #  alternative to the NICV)
    ├── displayer.py                   #> display NICVs, correlations, traces...
    ├── signal_processing.py           #> some signal processing routines (STFT, ...)
    ├── bandwidth_extractor.py         #> extract bandwidths, based on the NICV
    │                                  #  results, and create a new dataset
    └── tagmaps                        #> all tagmaps used to label the data
        │                              #  (used to create the lists)
        ├── executable_classification.csv
        ├── family_classification.csv
        ├── novelties_classification.csv
        ├── obfuscation_classification.csv
        ├── packer_identification.csv
        ├── type_classification.csv
        └── virtualization_identification.csv
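The repository's own nicv.py is not reproduced here, but as a rough illustration: the NICV at each point of a trace is Var(E[X|Y]) / Var(X), i.e. the variance of the per-class means normalized by the total variance, which is why sums and sums of squares are enough to recompute it. A minimal NumPy sketch (the function name and layout are ours, not the repository's):

```python
import numpy as np

def nicv(traces, labels):
    """Normalized Inter-Class Variance per trace point.

    NICV = Var_y( E[X | Y=y] ) / Var(X). The per-class sums below are
    exactly the kind of accumulators that let NICV be recomputed quickly.
    """
    traces = np.asarray(traces, dtype=np.float64)
    labels = np.asarray(labels)
    classes = np.unique(labels)

    # Per-class sum and count (incrementally updatable accumulators).
    sums = np.stack([traces[labels == c].sum(axis=0) for c in classes])
    counts = np.array([(labels == c).sum() for c in classes], dtype=np.float64)

    class_means = sums / counts[:, None]                  # E[X | Y=y]
    weights = counts / counts.sum()
    grand_mean = (weights[:, None] * class_means).sum(axis=0)

    between = (weights[:, None] * (class_means - grand_mean) ** 2).sum(axis=0)
    total = traces.var(axis=0)                            # Var(X)
    return between / total
```

A point that perfectly separates the classes yields NICV = 1; a point independent of the label yields NICV = 0.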
Getting Started
Python 3.6
To be able to run the analyses, you need to install Python 3.6 and the required packages:
pip install -r requirements.txt
Data
The testing dataset (spectrograms) used in the paper can be downloaded from the following website:
https://zenodo.org/record/5414107
File lists
In order to update the location of the data you previously downloaded inside
the lists, please run the script update_lists.sh:
./update_lists.sh [directory where the lists are stored] [directory where the downloaded spectrograms are stored]
This must be applied to the directories list_selected_bandwidth and list_reduced_dataset,
respectively associated with the datasets traces_selected_bandwidth.zip and raw_data_reduced_dataset.zip.
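update_lists.sh is the supported way to do this; purely as an illustration of what the update amounts to, rewriting a path prefix in a list could look like the following sketch (the flat-array layout and prefixes here are hypothetical, not the actual layout of the files_lists_tagmap=*.npy files):

```python
import numpy as np

# Hypothetical layout: a flat array of absolute trace paths.
old_prefix = "/old/location"
new_prefix = "/data/spectrograms"
paths = np.array([f"{old_prefix}/trace_{i}.npy" for i in range(3)])

# Replace only the leading occurrence of the old prefix in each path.
updated = np.array([p.replace(old_prefix, new_prefix, 1) for p in paths])
```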
Machine Learning (ML)
To run all the machine learning experiments, you can use
the scripts run_ml_on_reduced_dataset.sh and run_ml_on_selected_bandwidth.sh:
./run_ml_on_selected_bandwidth.sh [directory where the lists are stored] [directory where the models are stored] [directory where the accumulated data are stored (precomputed in pretrained_models/ACC)]
The results are stored in the file ml_analysis/log-evaluation_selected_bandwidth.txt.
./run_ml_on_reduced_dataset.sh
The results are stored in the file ml_analysis/log-evaluation_reduced_dataset.txt.
The directory ml_analysis contains the code needed for the ML classification.
evaluate.py
usage: evaluate.py [-h]
[--lists PATH_LISTS]
[--mean_size MEAN_SIZES]
[--log-file LOG_FILE]
[--acc PATH_ACC]
[--nb_of_bandwidth NB_OF_BANDWIDTH]
[--time_limit TIME_LIMIT]
[--metric METRIC]
optional arguments:
-h, --help show this help message and exit
--lists PATH_LISTS Absolute path to a file containing the lists
--mean_size MEAN_SIZES Size of each mean
--log-file LOG_FILE Absolute path to the file to save results
--acc PATH_ACC Absolute path of the accumulators directory
--nb_of_bandwidth NB_OF_BANDWIDTH Number of bandwidths to extract
--time_limit TIME_LIMIT Percentage of time to keep (from the beginning)
--metric METRIC Metric used to select bandwidths: {nicv, corr}_{mean, max}
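The --nb_of_bandwidth and --metric options describe a ranking step: each candidate bandwidth gets a score (e.g. the mean or max of the NICV over its frequency bins) and the top-scoring bands are kept. A toy sketch of that idea, with hypothetical shapes and a made-up NICV vector (this is not the repository's evaluate.py code):

```python
import numpy as np

# Hypothetical setup: 100 frequency bins grouped into bands of 10 bins;
# one NICV value per bin (random here, just for illustration).
rng = np.random.default_rng(0)
nicv_per_bin = rng.random(100)
band_size, nb_of_bandwidth = 10, 3

# "nicv_mean" metric: score each band by its mean NICV, keep the top bands.
band_scores = nicv_per_bin.reshape(-1, band_size).mean(axis=1)
selected = np.argsort(band_scores)[::-1][:nb_of_bandwidth]
```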
NB.py
usage: NB.py [-h]
[--lists PATH_LISTS]
[--model_lda MODEL_LDA]
[--model_nb MODEL_NB]
--mean_size MEAN_SIZES Size of each mean
[--log-file LOG_FILE]
[--time_limit TIME_LIMIT]
[--acc PATH_ACC]
optional arguments:
-h, --help show this help message and exit
--lists PATH_LISTS Absolute path to a file containing the lists
--model_lda MODEL_LDA Absolute path to the file where the LDA model has been previously saved
--model_nb MODEL_NB Absolute path to the file where the NB model has been previously saved
--mean_size MEAN_SIZES Size of each mean
--log-file LOG_FILE Absolute path to the file to save results
--time_limit TIME_LIMIT Percentage of time to keep (from the beginning)
--acc PATH_ACC Absolute path of the accumulators directory
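NB.py applies the previously saved LDA projection followed by a naïve Bayes classifier. A toy scikit-learn sketch of that pipeline on synthetic data (this is only an analogous construction, not the repository's code; it assumes the Gaussian variant of naïve Bayes):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for spectrogram features: 300 samples, 50 features, 3 classes.
X, y = make_classification(n_samples=300, n_features=50, n_informative=10,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# LDA projects to at most n_classes - 1 dimensions, then NB classifies.
model = make_pipeline(LinearDiscriminantAnalysis(), GaussianNB())
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
```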
read_logs.py
usage: read_logs.py [-h]
[--path PATH]
[--plot PATH_TO_PLOT]
optional arguments:
-h, --help show this help message and exit
--path PATH Absolute path to the log file
--plot PATH_TO_PLOT Absolute path to save the plot
SVM.py
usage: SVM.py [-h]
[--lists PATH_LISTS]
[--model_lda MODEL_LDA]
[--model_svm MODEL_SVM]
[--mean_size MEAN_SIZES]
[--log-file LOG_FILE]
[--time_limit TIME_LIMIT]
[--acc PATH_ACC]
optional arguments:
-h, --help show this help message and exit
--lists PATH_LISTS Absolute path to a file containing the lists
--model_lda MODEL_LDA Absolute path to the file where the LDA model has been previously saved
--model_svm MODEL_SVM Absolute path to the file where the SVM model has been previously saved
--mean_size MEAN_SIZES Size of each mean
--log-file LOG_FILE Absolute path to the file to save results
--time_limit TIME_LIMIT Percentage of time to keep (from the beginning)
--acc PATH_ACC Absolute path of the accumulators directory