FunctionSimSearch

FunctionSimSearch - Example C++ code to demonstrate how to do SimHash-based similarity search over CFGs extracted from disassemblies.

Getting started for the lazy (using Docker)

Make sure you have Docker installed. Then do:

docker build -t functionsimsearch .

After the container is built, you can run all relevant commands by doing

sudo docker run -it --rm -v $(pwd):/pwd functionsimsearch COMMAND ARGUMENTS_TO_COMMAND
sudo docker run -it --rm -v $(pwd):/pwd functionsimsearch disassemble --format=ELF --input=/bin/tar

The last command should dump the disassembly of /bin/tar to stdout.

Getting started for the less lazy (build from source)

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

This code has a few external dependencies. The dependencies are:

CMake, for building
DynInst 9.3, built and the relevant shared libraries installed
(In order to build DynInst, you may need to build libdwarf from scratch with --enable-shared)
PE-parse, a C++ PE parser: https://github.com/trailofbits/pe-parse.git
PicoSHA2, a C++ SHA256 hash library: https://github.com/okdshin/PicoSHA2.git
SPII, a C++ library for automatic differentiation & optimization: https://github.com/PetterS/spii.git
JSON.hpp, a C++ library for using JSON: https://github.com/nlohmann/json.git
GoogleTest, a C++ unit testing library
GFlags, a C++ library for handling command line parameters

Installing

You should be able to build on a Debian stretch machine by running the following bash script in the directory where you checked out everything:

./build_dependencies.sh

The script does the following:

#!/bin/bash
source_dir=$(pwd)

# Install gtest and gflags. It's a bit fidgety, but works:
sudo apt-get install libgtest-dev libgflags-dev libz-dev libelf-dev cmake g++
sudo apt-get install libboost-system-dev libboost-date-time-dev libboost-thread-dev
cd /usr/src/gtest
sudo cmake ./CMakeLists.txt
sudo make
sudo cp *.a /usr/lib

# Now get the other dependencies
cd $source_dir
mkdir third_party
cd third_party

# Download Dyninst.
wget https://github.com/dyninst/dyninst/archive/v9.3.2.tar.gz
tar xvf ./v9.3.2.tar.gz
# Download PicoSHA, pe-parse, SPII and the C++ JSON library.
git clone https://github.com/okdshin/PicoSHA2.git
git clone https://github.com/trailofbits/pe-parse.git
git clone https://github.com/PetterS/spii.git
mkdir json
mkdir json/src
cd json/src
wget https://github.com/nlohmann/json/releases/download/v3.1.2/json.hpp 
cd ../..

# Build PE-Parse.
cd pe-parse
cmake ./CMakeLists.txt
make -j8
cd ..

# Build SPII.
cd spii
cmake ./CMakeLists.txt
make -j8
sudo make install
cd ..

# Build Dyninst
cd dyninst-9.3.2
cmake ./CMakeLists.txt
make -j8
sudo make install
sudo ldconfig
cd ..

# Finally build functionsimsearch tools
cd ..
make -j8

This should build the executables listed below. On Debian stretch and later, you may have to add '-fPIC' to the pe-parse CMakeLists.txt so that the generated library can be relocated.

Running the tests

You can run the tests by doing:

cd tests
./runtests
./slowtests

Note that the tests use relative paths and assume that the tests directory is your current working directory, so running

tests/runtests

will not work.
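
If you would rather start the tests from the repository root, a small wrapper can change into the tests directory only for the duration of the run. This is a minimal sketch; the helper name run_from_dir is hypothetical and not part of this repository.

```shell
# Hypothetical helper: run a command from inside a given directory so
# that relative paths used by the command resolve as it expects.
run_from_dir() {
  # The subshell keeps the caller's working directory unchanged.
  ( dir="$1"; shift; cd "$dir" && "$@" )
}

# Usage, assuming the repository root is the current directory:
#   run_from_dir tests ./runtests
#   run_from_dir tests ./slowtests
```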

Also be aware that many of the tests are expensive to run: expect the full test suite to use all your CPU for a few minutes, and the suite of slow tests may keep your computer busy for hours.

Running the tools

At the moment, the following executables will be built (in alphabetical order):

addfunctionstoindex

./addfunctionstoindex -format=ELF -input=/bin/tar -index=./function_search.index -minimum_function_size=5 -weights=./weights.txt
./addfunctionstoindex -format=PE -input=~/sources/mpengine/engine/mpengine.dll -index=./function_search.index -minimum_function_size=5

Disassemble the specified input file, find functions with more than 5 basic blocks, calculate the SimHash for each such function and add it to the search index file.

addsinglefunctiontoindex

./addsinglefunctiontoindex -format=ELF -input=/bin/tar -index=./function_search.index -function_address=0x40deadb -weights=./weights.txt
./addsinglefunctiontoindex -format=PE -input=~/sources/mpengine/engine/mpengine.dll -index=./function_search.index -function_address=0x40deadb

Disassemble the specified input file, then find the function at the specified address and add it to the search index. Incurs the full cost of disassembling the entire executable, so use with care.

createfunctionindex

./createfunctionindex -index=./function_search.index

Creates a file to use for the function similarity search index. Most likely the first command you want to run.

disassemble

./disassemble -format=ELF -input=/bin/tar
./disassemble -format=PE -input=~/sources/mpengine/engine/mpengine.dll

Disassemble the specified file and dump the disassembly to stdout. The input file can either be a 32-bit/64-bit ELF, or a 32-bit PE file. Adding support for 64-bit PE is easy and will be done soon.

dotgraphs

./dotgraphs -format=ELF -input=/bin/tar -output=/tmp/graphs
./dotgraphs -format=PE -input=~/sources/mpengine/engine/mpengine.dll -output=/tmp/graphs

Disassemble the specified file and write the CFGs as dot files to the specified directory.

dumpfunctionindex

./dumpfunctionindex -index=./function_search.index

Dumps the content of the search index to text. The content consists of 5 text columns:

| HashID | SimHash First Part | SimHash Second Part | Executable ID | Address |
| ------ | ------------------ | ------------------- | ------------- | ------- |
| ...    | ...                | ...                 | ...           | ...     |

dumpfunctionindexinfo

./dumpfunctionindexinfo -index=./function_search.index

Prints information about the index file - how much space is used, how much space is left, how many functions are indexed etc.

Example output:

[!] FileSize: 537919488 bytes, FreeSpace: 36678432 bytes
[!] Indexed 270065 functions, total index has 7561820 elements

dumpsinglefunctionfeatures

./dumpsinglefunctionfeatures -format=ELF -input=/bin/tar -function_address=0x43AB900

Disassembles the input file, finds the relevant function, and dumps the 64-bit IDs of the features that will be used for the SimHash calculation to stdout. You will probably not need this command unless you experiment with the machine learning features in the codebase.

evalsimhashweights

./evalsimhashweights -data=datadirectory -weights=./weights.txt

Evaluates the specified weight file on the labeled data in the specified data directory. Refer to the tutorial about weight learning for details.

functionfingerprints

./functionfingerprints -format=ELF -input=/bin/tar -minimum_function_size=5 -verbose=true
./functionfingerprints -format=PE -input=~/sources/mpengine/engine/mpengine.dll -minimum_function_size=5 -verbose=false

Disassembles the target file and all functions therein. If the last argument (verbose) is set to "false", this tool will simply dump the SimHash hashes of the functions in the specified executable to stdout, in the format:

FileID:Address SimHashA SimHashB

If verbose is set to "true", the tool will dump the feature IDs of the features that enter the SimHash calculation, so the output will look like:

FileID:Address Feature1 Feature2 ... FeatureN
FileID:Address FeatureM ... FeatureK

The features themselves are 128-bit hashes. The output of the tool in verbose mode is used to create training data for the machine learning components.
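
The non-verbose output is easy to post-process with standard tools. As a rough sketch, the function below reads lines in the "FileID:Address SimHashA SimHashB" format from stdin and lists every SimHash that is shared by more than one function. The function name find_identical_simhashes is hypothetical, and the hash fields are treated as opaque strings because their exact encoding is not specified here.

```shell
# Hypothetical helper: group functions that carry an identical SimHash.
# Input (stdin):  FileID:Address SimHashA SimHashB
# Output:         SimHashA SimHashB: FileID:Address FileID:Address ...
find_identical_simhashes() {
  awk 'NF == 3 {
    key = $2 " " $3
    count[key]++
    if (key in members) {
      members[key] = members[key] " " $1
    } else {
      members[key] = $1
    }
  }
  END {
    # Only report hashes that occur for at least two functions.
    for (key in count) {
      if (count[key] > 1) print key ": " members[key]
    }
  }' | sort
}

# Usage:
#   ./functionfingerprints -format=ELF -input=/bin/tar \
#     -minimum_function_size=5 -verbose=false | find_identical_simhashes
```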

graphhashes

./graphhashes -format=ELF -input=/bin/tar
./graphhashes -format=PE -input=~/sources/mpengine/engine/mpengine.dll

Disassemble the specified file and write a hash of the CFG structure of each disassembled function to stdout. These hashes encode only the graph structure and completely ignore any mnemonics; as such they are not very useful on small graphs.

growfunctionindex

./growfunctionindex -index=./function_search.index -size_to_grow=512

Expand the search index file by 512 megabytes. Index files unfortunately cannot be dynamically resized, so when one nears being full, it is a good idea to grow it.
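
To decide when to grow an index, you can compute the free space as a percentage from the output of dumpfunctionindexinfo. The sketch below assumes the output has exactly the "FileSize: N bytes, FreeSpace: M bytes" shape shown in the example output of that tool. The function name index_free_percent and the 10% threshold in the usage comment are illustrative choices, not part of the tools.

```shell
# Hypothetical helper: read dumpfunctionindexinfo output on stdin and
# print the free space as a whole-number percentage (rounded down).
index_free_percent() {
  awk '/FileSize:/ && /FreeSpace:/ {
    # The number follows directly after each label.
    for (i = 1; i < NF; i++) {
      if ($i == "FileSize:")  size = $(i + 1)
      if ($i == "FreeSpace:") free = $(i + 1)
    }
  }
  END {
    if (size + 0 > 0) printf "%d\n", (free * 100) / size
    else exit 1   # no usable FileSize line was found
  }'
}

# Usage: grow by 512 megabytes once less than 10% is free.
#   pct=$(./dumpfunctionindexinfo -index=./function_search.index | index_free_percent)
#   if [ "$pct" -lt 10 ]; then
#     ./growfunctionindex -index=./function_search.index -size_to_grow=512
#   fi
```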

matchfunctionsfromindex

./matchfunctionsfromindex -format=ELF -input=/bin/tar -index=./function_search.index -minimum_function_size=5 -max_matches=10 -minimum_percentage=0.90 -weights=./weights_to_use.txt
./matchfunctionsfromindex -format=PE -input=~/sources/mpengine/engine/mpengine.dll -index=./function_search.index -minimum_function_size=5 -max_matches=10 -minimum_percentage=.90

Disassemble the specified input file, and for each function with more than 5 basic blocks, retrieve the top-10 most similar functions from the search index. Each match must be at least 90% similar.
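
To get a feeling for what a similarity threshold means, the sketch below scores two hashes as one minus the fraction of differing bits, which is the usual way SimHash values are compared. This is an assumption for illustration: the exact formula the tool uses is not stated here. The function name simhash_similarity is hypothetical, and both inputs must be hex strings of equal length without a 0x prefix.

```shell
# Hypothetical helper: similarity of two equal-length hex strings,
# computed as 1 - (differing bits / total bits).
simhash_similarity() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    # bits[n] holds the 4-bit pattern of hex digit number n (1-based).
    split("0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111", bits, " ")
    hexdigits = "0123456789abcdef"
    a = tolower(a); b = tolower(b)
    if (length(a) == 0 || length(a) != length(b)) exit 1
    differing = 0
    for (i = 1; i <= length(a); i++) {
      na = index(hexdigits, substr(a, i, 1))
      nb = index(hexdigits, substr(b, i, 1))
      if (na == 0 || nb == 0) exit 1   # not a hex digit
      # Compare the two digits bit by bit.
      for (j = 1; j <= 4; j++) {
        if (substr(bits[na], j, 1) != substr(bits[nb], j, 1)) differing++
      }
    }
    printf "%.4f\n", 1 - differing / (4 * length(a))
  }'
}

# Usage: under this assumed measure, a 128-bit hash that differs from
# another in 12 bits scores 1 - 12/128 = 0.90625 and would pass a 0.90 cutoff.
#   simhash_similarity ff00 ff01
```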

trainsimhashweights

./trainsimhashweights -data=/tmp/datadir -train_steps=500 -weights=./trained_weights.txt

A command line tool to infer feature weights from examples. Uses the data in the specified data directory, trains for 500 iterations (using L-BFGS), and then writes the resulting weights to the specified file.

End-to-end tutorial: How to build an index of vulnerable functions to scan for

Let's assume that weights have been trained already, and placed in a file called "trained_weights_500.txt".

# Create a new index file.
bin/createfunctionindex --index="./trained.index"

# Grow the index to be 2 gigs in size.
bin/growfunctionindex --index="./trained.index" --size_to_grow=2048

# Add DLLs with interesting functions to the search index.
for dll in $(find -iname \*.dll); do \
  bin/addfunctionstoindex -format=PE -index=./trained.index -weights=./trained_weights_500.txt --input $(pwd)/$dll; done

#
