Functionsimsearch
Some C++ example code to demonstrate how to perform code similarity searches using SimHashing.
FunctionSimSearch
FunctionSimSearch - Example C++ code to demonstrate how to do SimHash-based similarity search over CFGs extracted from disassemblies.
Getting started for the lazy (using Docker)
Make sure you have Docker installed. Then do:
docker build -t functionsimsearch .
After the container is built, you can run all relevant commands by doing
sudo docker run -it --rm -v $(pwd):/pwd functionsimsearch COMMAND ARGUMENTS_TO_COMMAND
sudo docker run -it --rm -v $(pwd):/pwd functionsimsearch disassemble --format=ELF --input=/bin/tar
The last command should dump the disassembly of /bin/tar (resolved inside the container) to stdout.
Getting Started for the less lazy (build from source)
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
Prerequisites
This code has a few external dependencies. The dependencies are:
- CMake, for building
- DynInst 9.3, built and the relevant shared libraries installed
- (In order to build DynInst, you may need to build libdwarf from scratch with --enable-shared)
- PE-parse, a C++ PE parser: https://github.com/trailofbits/pe-parse.git
- PicoSHA2, a C++ SHA256 hash library: https://github.com/okdshin/PicoSHA2.git
- SPII, a C++ library for automatic differentiation & optimization: https://github.com/PetterS/spii.git
- JSON.hpp, a C++ library for using JSON: https://github.com/nlohmann/json.git
- GoogleTest, a C++ unit testing library
- GFlags, a C++ library for handling command line parameters
Installing
You should be able to build on a Debian stretch machine by running the following bash script in the directory where you checked out everything:
./build_dependencies.sh
The script does the following:
#!/bin/bash
source_dir=$(pwd)
# Install gtest and gflags. It's a bit fidgety, but works:
sudo apt-get install libgtest-dev libgflags-dev libz-dev libelf-dev cmake g++
sudo apt-get install libboost-system-dev libboost-date-time-dev libboost-thread-dev
cd /usr/src/gtest
sudo cmake ./CMakeLists.txt
sudo make
sudo cp *.a /usr/lib
# Now get the other dependencies
cd $source_dir
mkdir third_party
cd third_party
# Download Dyninst.
wget https://github.com/dyninst/dyninst/archive/v9.3.2.tar.gz
tar xvf ./v9.3.2.tar.gz
# Download PicoSHA, pe-parse, SPII and the C++ JSON library.
git clone https://github.com/okdshin/PicoSHA2.git
git clone https://github.com/trailofbits/pe-parse.git
git clone https://github.com/PetterS/spii.git
mkdir json
mkdir json/src
cd json/src
wget https://github.com/nlohmann/json/releases/download/v3.1.2/json.hpp
cd ../..
# Build PE-Parse.
cd pe-parse
cmake ./CMakeLists.txt
make -j8
cd ..
# Build SPII.
cd spii
cmake ./CMakeLists.txt
make -j8
sudo make install
cd ..
# Build Dyninst
cd dyninst-9.3.2
cmake ./CMakeLists.txt
make -j8
sudo make install
sudo ldconfig
cd ..
# Finally build functionsimsearch tools
cd ..
make -j8
This should build the relevant executables. On Debian stretch and later, you may have to add '-fPIC' to the pe-parse CMakeLists.txt so that the generated library can be relocated.
Running the tests
You can run the tests by doing:
cd tests
./runtests
./slowtests
Note that the tests use relative directories, assuming that you actually changed your directory, so running
tests/runtests
will not work.
Also be aware that a fair number of the tests are expensive to run: expect the full test suite to use all of your CPU for a few minutes, and the suite of slow tests may keep your computer busy for hours.
Running the tools
At the moment, the following executables will be built (in alphabetical order):
addfunctionstoindex
./addfunctionstoindex -format=ELF -input=/bin/tar -index=./function_search.index -minimum_function_size=5 -weights=./weights.txt
./addfunctionstoindex -format=PE -input=~/sources/mpengine/engine/mpengine.dll -index=./function_search.index -minimum_function_size=5
Disassemble the specified input file, find functions with more than 5 basic blocks, calculate the SimHash for each such function and add it to the search index file.
addsinglefunctiontoindex
./addsinglefunctiontoindex -format=ELF -input=/bin/tar -index=./function_search.index -function_address=0x40deadb -weights=./weights.txt
./addsinglefunctiontoindex -format=PE -input=~/sources/mpengine/engine/mpengine.dll -index=./function_search.index -function_address=0x40deadb
Disassemble the specified input file, then find the function at the specified address and add it to the search index. Incurs the full cost of disassembling the entire executable, so use with care.
createfunctionindex
./createfunctionindex -index=./function_search.index
Creates a file to use for the function similarity search index. Most likely the first command you want to run.
disassemble
./disassemble -format=ELF -input=/bin/tar
./disassemble -format=PE -input=~/sources/mpengine/engine/mpengine.dll
Disassemble the specified file and dump the disassembly to stdout. The input file can either be a 32-bit/64-bit ELF, or a 32-bit PE file. Adding support for 64-bit PE is easy and will be done soon.
dotgraphs
./dotgraphs -format=ELF -input=/bin/tar -output=/tmp/graphs
./dotgraphs -format=PE -input=~/sources/mpengine/engine/mpengine.dll -output=/tmp/graphs
Disassemble the specified file and write the CFGs as dot files to the specified directory.
dumpfunctionindex
./dumpfunctionindex -index=./function_search.index
Dumps the content of the search index as text. The content consists of 5 text columns:
| HashID | SimHash First Part | SimHash Second Part | Executable ID | Address |
| ------ | ------------------ | ------------------- | ------------- | ------- |
| ...    | ...                | ...                 | ...           | ...     |
dumpfunctionindexinfo
./dumpfunctionindexinfo -index=./function_search.index
Prints information about the index file - how much space is used, how much space is left, how many functions are indexed etc.
Example output:
[!] FileSize: 537919488 bytes, FreeSpace: 36678432 bytes
[!] Indexed 270065 functions, total index has 7561820 elements
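The two example lines above are enough for some back-of-the-envelope arithmetic. The sketch below is not part of the tools: `IndexInfo` and the helper names are hypothetical, and the capacity estimate assumes that space use grows linearly with the number of indexed functions, which this README does not state.

```cpp
#include <cassert>
#include <cstdint>

// Numbers as printed by dumpfunctionindexinfo (hypothetical container type).
struct IndexInfo {
  uint64_t file_size_bytes;
  uint64_t free_space_bytes;
  uint64_t indexed_functions;
  uint64_t index_elements;
};

// Bytes of the index file that are already in use.
uint64_t UsedBytes(const IndexInfo& info) {
  assert(info.free_space_bytes <= info.file_size_bytes);
  return info.file_size_bytes - info.free_space_bytes;
}

// Fraction of the index file that is still free, in [0, 1].
double FreeFraction(const IndexInfo& info) {
  assert(info.file_size_bytes > 0);
  return static_cast<double>(info.free_space_bytes) / info.file_size_bytes;
}

// Average number of index elements stored per indexed function.
double ElementsPerFunction(const IndexInfo& info) {
  assert(info.indexed_functions > 0);
  return static_cast<double>(info.index_elements) / info.indexed_functions;
}

// Rough estimate of how many more functions fit into the free space.
// Assumption: every function costs the same average number of bytes.
double EstimatedRemainingFunctions(const IndexInfo& info) {
  assert(info.indexed_functions > 0);
  const double bytes_per_function =
      static_cast<double>(UsedBytes(info)) / info.indexed_functions;
  return info.free_space_bytes / bytes_per_function;
}
```

For the example output this gives 501241056 used bytes, about 6.8% free space, exactly 28 index elements per function, and room for somewhat under 20,000 more functions under the linear assumption.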
dumpsinglefunctionfeatures
./dumpsinglefunctionfeatures -format=ELF -input=/bin/tar -function_address=0x43AB900
Disassembles the input file, finds the relevant function, and dumps the 64-bit IDs of the features that will be used for the SimHash calculation to stdout. You will probably not need this command unless you experiment with the machine learning features in the codebase.
evalsimhashweights
./evalsimhashweights -data=datadirectory -weights=./weights.txt
Evaluates the specified weight file on the labeled data in the given data directory. Refer to the tutorial about weight learning for details.
functionfingerprints
./functionfingerprints -format=ELF -input=/bin/tar -minimum_function_size=5 -verbose=true
./functionfingerprints -format=PE -input=~/sources/mpengine/engine/mpengine.dll -minimum_function_size=5 -verbose=false
Disassembles the target file and all functions therein. If the last argument (verbose) is set to "false", this tool will simply dump the SimHash hashes of the functions in the specified executable to stdout, in the format:
FileID:Address SimHashA SimHashB
If verbose is set to "true", the tool will dump the feature IDs of the features that enter the SimHash calculation, so the output will look like:
FileID:Address Feature1 Feature2 ... FeatureN
FileID:Address FeatureM ... FeatureK
The features themselves are 128-bit hashes. The output of the tool in verbose mode is used to create training data for the machine learning components.
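To make the step from feature IDs to a final hash more concrete, here is a minimal sketch of a generic weighted SimHash over 128-bit feature hashes. It is the textbook construction, not necessarily what the code in this repository does. `Hash128`, `WeightedFeature` and `WeightedSimHash` are hypothetical names, and splitting the value into two 64-bit halves merely mirrors the `SimHashA SimHashB` output format.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// A 128-bit value stored as two 64-bit halves (hypothetical layout).
struct Hash128 {
  uint64_t first;
  uint64_t second;
};

// A feature hash together with its weight (hypothetical type).
struct WeightedFeature {
  Hash128 id;
  double weight;
};

// Generic weighted SimHash: every feature votes on every output bit.
// A set bit in the feature adds its weight, a clear bit subtracts it.
Hash128 WeightedSimHash(const std::vector<WeightedFeature>& features) {
  std::array<double, 128> votes{};  // Zero-initialized vote counters.
  for (const WeightedFeature& feature : features) {
    for (int bit = 0; bit < 128; ++bit) {
      const uint64_t half = bit < 64 ? feature.id.first : feature.id.second;
      const bool is_set = ((half >> (bit % 64)) & 1u) != 0;
      votes[bit] += is_set ? feature.weight : -feature.weight;
    }
  }
  // An output bit is 1 exactly when the weighted vote for it is positive.
  Hash128 result{0, 0};
  for (int bit = 0; bit < 128; ++bit) {
    if (votes[bit] > 0.0) {
      const uint64_t mask = uint64_t{1} << (bit % 64);
      if (bit < 64) {
        result.first |= mask;
      } else {
        result.second |= mask;
      }
    }
  }
  return result;
}
```

With all weights equal to 1 this is a per-bit majority vote. Larger weights let one feature outvote several others, which is presumably what a weights file is meant to tune.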
graphhashes
./graphhashes -format=ELF -input=/bin/tar
./graphhashes -format=PE -input=~/sources/mpengine/engine/mpengine.dll
Disassemble the specified file and write a hash of the CFG structure of each disassembled function to stdout. These hashes encode only the graph structure and completely ignore any mnemonics; as such they are not very useful on small graphs.
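To illustrate what "only the graph structure" can mean, and why small graphs are a problem, here is a deliberately simple sketch that hashes nothing but the sorted (in-degree, out-degree) pairs of the basic blocks. This is not the algorithm graphhashes uses. `StructuralHash`, `MixFnv1a` and the edge-list input are assumptions made for the example.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// (source block address, target block address); hypothetical input format.
using Edge = std::pair<uint64_t, uint64_t>;

// Mixes one 64-bit value into an FNV-1a style running hash, byte by byte.
uint64_t MixFnv1a(uint64_t hash, uint64_t value) {
  for (int i = 0; i < 8; ++i) {
    hash ^= (value >> (8 * i)) & 0xFF;
    hash *= 0x100000001b3ULL;  // 64-bit FNV prime.
  }
  return hash;
}

// Hashes the multiset of (in-degree, out-degree) pairs of all blocks.
// Addresses and instructions never enter the hash, only the shape does.
uint64_t StructuralHash(const std::vector<Edge>& edges) {
  std::map<uint64_t, std::pair<uint64_t, uint64_t>> degrees;  // addr -> (in, out)
  for (const Edge& edge : edges) {
    ++degrees[edge.first].second;  // Outgoing edge of the source block.
    ++degrees[edge.second].first;  // Incoming edge of the target block.
  }
  std::vector<std::pair<uint64_t, uint64_t>> shape;
  for (const auto& entry : degrees) {
    shape.push_back(entry.second);
  }
  std::sort(shape.begin(), shape.end());  // Make the result order-independent.

  uint64_t hash = 0xcbf29ce484222325ULL;  // 64-bit FNV offset basis.
  hash = MixFnv1a(hash, shape.size());
  for (const auto& degree : shape) {
    hash = MixFnv1a(hash, degree.first);
    hash = MixFnv1a(hash, degree.second);
  }
  return hash;
}
```

Two functions with the same CFG shape get the same value regardless of their addresses or instructions. Every simple if/else diamond, for instance, collapses into a single hash, which is why such hashes say little about small functions.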
growfunctionindex
./growfunctionindex -index=./function_search.index -size_to_grow=512
Expand the search index file by 512 megabytes. Index files unfortunately cannot be dynamically resized, so when one nears being full, it is a good idea to grow it.
matchfunctionsfromindex
./matchfunctionsfromindex -format=ELF -input=/bin/tar -index=./function_search.index -minimum_function_size=5 -max_matches=10 -minimum_percentage=0.90 -weights=./weights_to_use.txt
./matchfunctionsfromindex -format=PE -input=~/sources/mpengine/engine/mpengine.dll -index=./function_search.index -minimum_function_size=5 -max_matches=10 -minimum_percentage=.90
Disassemble the specified input file, and for each function with more than 5 basic blocks, retrieve the top-10 most similar functions from the search index. Each match must be at least 90% similar.
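One plausible reading of "at least 90% similar" is the fraction of bits on which two 128-bit SimHashes agree. The sketch below assumes exactly that; the tool's actual scoring may differ, and `SimHashValue`, `HammingDistance`, `Similarity` and `IsMatch` are hypothetical names.

```cpp
#include <bitset>
#include <cstdint>

// A 128-bit SimHash as two 64-bit parts, mirroring the index dump columns.
struct SimHashValue {
  uint64_t first_part;
  uint64_t second_part;
};

// Number of bit positions (0..128) in which the two hashes differ.
int HammingDistance(const SimHashValue& a, const SimHashValue& b) {
  return static_cast<int>(
      std::bitset<64>(a.first_part ^ b.first_part).count() +
      std::bitset<64>(a.second_part ^ b.second_part).count());
}

// Assumed similarity measure: fraction of the 128 bits that agree.
double Similarity(const SimHashValue& a, const SimHashValue& b) {
  return 1.0 - HammingDistance(a, b) / 128.0;
}

// True if the pair reaches the given threshold, e.g. 0.90.
bool IsMatch(const SimHashValue& a, const SimHashValue& b,
             double minimum_percentage) {
  return Similarity(a, b) >= minimum_percentage;
}
```

Under this assumption a threshold of 0.90 tolerates at most 12 differing bits out of 128: 12 differing bits give a similarity of 0.90625, while 13 give 0.8984375.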
trainsimhashweights
./trainsimhashweights -data=/tmp/datadir -train_steps=500 -weights=./trained_weights.txt
A command line tool to infer feature weights from examples. Uses the data in the specified data directory, trains for 500 iterations (using L-BFGS), and then writes the resulting weights to the specified file.
End-to-end tutorial: How to build an index of vulnerable functions to scan for
Let's assume that weights have been trained already, and placed in a file called "trained_weights_500.txt".
# Create a new index file.
bin/createfunctionindex --index="./trained.index"
# Grow the index to be 2 gigs in size.
bin/growfunctionindex --index="./trained.index" --size_to_grow=2048
# Add DLLs with interesting functions to the search index.
for dll in $(find -iname \*.dll); do \
bin/addfunctionstoindex -format=PE -index=./trained.index -weights=./trained_weights_500.txt --input $(pwd)/$dll; done
#
