Cryptosym
SAT-based, MILP, and belief propagation preimage attacks on SHA-256 and other cryptographic hash functions
Bit relationships after 4-round SHA-256. 17806 nodes, 26383 edges.
Lossy "pseudo-hash" used to validate solver accuracy, showing relationship between input and output bits.
Description
This repository contains Python and C++ code which attempts to reverse one-way cryptographic hash functions, with specific focus on SHA-256. A hash function f can be thought of as an operation on bits X to produce output bits Y: f(X) = Y. Given knowledge of Y and how f works, we want to find some bits X' such that f(X') = Y. This is commonly known as a preimage attack. Note that X does not necessarily need to equal X'.
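To build intuition for what a preimage attack means, here is a minimal sketch against a toy 8-bit "hash" where brute force is trivial. The function `toy_hash` is hypothetical and exists only for illustration; real hash outputs are far too large to enumerate this way.

```python
def toy_hash(x: int) -> int:
    # Hypothetical lossy 8-bit function, for illustration only
    return ((x * 37) ^ (x >> 3)) & 0xFF

def find_preimage(y: int) -> int:
    # Search all 8-bit inputs for some x' with f(x') = y
    for candidate in range(256):
        if toy_hash(candidate) == y:
            return candidate  # x' does not need to equal the original x
    raise ValueError("no preimage found")

y = toy_hash(42)
x_prime = find_preimage(y)
assert toy_hash(x_prime) == y
```

Note that `find_preimage` may return a different input than the one originally hashed; any colliding input counts as a successful preimage.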
A successful preimage attack would have serious implications for basically the entire Internet, the financial world, and the national defense of major governments. Hash functions are used in all kinds of domains: from Bitcoin mining and transactions, to HTTPS encryption, to the storage of user passwords in server databases.
I've spent a long time (too long!) trying to solve this "impossible" problem using a variety of methods, detailed below. As a disclaimer: I do not claim that any of these methods break the security of the full 64-round SHA-256 hash function. It's probably still infeasible. Prove me wrong :)
Installation
It is not necessary to install everything listed here. If you just want to run certain parts of the code (e.g. C++ vs. Python) you can skip some steps. I developed everything on a Mac, so most things will translate well to Linux but maybe not to Windows.
Python
Use Python 3.6 with Anaconda. Start by creating and activating a new environment:
conda create -n preimage python=3.6
conda activate preimage
python --version # Should show 3.6
Install project dependencies:
python -m pip install -r requirements.txt
python -m pip install -i https://pypi.gurobi.com gurobipy
Not all of the solving methods will work out-of-the-box from here. If you want certain solvers, you will need to install them individually:
For CPLEX on Mac OS X, make sure to do:
export PYTHONPATH=/Applications/CPLEX_Studio1210/cplex/python/3.6/x86-64_osx
C++
You need to install cmake and yaml-cpp. Then, to compile the code in this repository, navigate to the belief_propagation directory and run:
$ ./kompile
Quickstart
Generate datasets:
$ ./generate_all_hashes.sh
Choose a dataset:
$ ls -la ./data
drwxr-xr-x 27 trevphil staff 864 Nov 3 23:04 .
drwxr-xr-x 15 trevphil staff 480 Nov 22 12:29 ..
drwxr-xr-x 11 trevphil staff 352 Nov 3 22:54 addConst_d1
drwxr-xr-x 11 trevphil staff 352 Nov 3 22:53 add_d1
drwxr-xr-x 11 trevphil staff 352 Nov 3 22:54 andConst_d1
drwxr-xr-x 10 trevphil staff 320 Nov 3 22:54 sha256_d1
...
The "d1" at the end means that the difficulty level is "1". The full SHA-256 algorithm uses a difficulty of 64 (because it has 64 rounds). Run a solver on the dataset of your choice:
$ python -m optimization.main data/addConst_d1 --solver ortools_cp
To see the available solvers and usage of the tool:
$ python -m optimization.main --help
usage: main.py [-h]
               [--solver {gradient,gnc,cplex_milp,cplex_cp,ortools_cp,ortools_milp,gurobi_milp,minisat,crypto_minisat}]
               dataset

Hash reversal via optimization

positional arguments:
  dataset               Path to the dataset directory

optional arguments:
  -h, --help            Show this help message and exit
  --solver {gradient,gnc,cplex_milp,cplex_cp,ortools_cp,ortools_milp,gurobi_milp,minisat,crypto_minisat}
                        The solving technique
To run the C++ belief propagation code after having compiled it with ./kompile, choose one of the config files and run the following command from the belief_propagation directory:
$ ./main config/config_file_name.yaml
Writing your own hash function
Take a look at the existing hash functions in dataset_generation/hash_funcs.py. For example, this function simply adds a constant value to the input:
class AddConst(SymbolicHash):
    def hash(self, hash_input: SymBitVec, difficulty: int):
        n = len(hash_input)
        A = SymBitVec(0x4F65D4D99B70EF1B, size=n)
        return hash_input + A
Implement your own hash function class which inherits from SymbolicHash and has the function hash(...), and add your class to the dictionary returned by the function hash_algorithms() in dataset_generation/hash_funcs.py. The following operations are supported for primitives (SymBitVec objects) in the hash function:
- AND: C = A & B
- OR: C = A | B
- XOR: C = A ^ B
- NOT (aka INV for "inverse"): B = ~A
- Addition: C = A + B (A, B, and C should have the same number of bits; overflow is ignored)
- Shift left: B = (A << n) (B will have the same number of bits as A)
- Shift right: B = (A >> n) (B will have the same number of bits as A)
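These operations mirror the semantics of plain Python integers restricted to a fixed word size. The sketch below uses ordinary ints (not `SymBitVec`) just to illustrate the fixed-width behavior; in particular, "overflow is ignored" corresponds to masking the result back to N bits.

```python
N = 8                 # word size in bits (chosen arbitrarily here)
MASK = (1 << N) - 1   # keeps every result N bits wide

A, B = 0b1100_1010, 0b1010_0101

and_ = A & B            # bitwise AND
or_  = A | B            # bitwise OR
xor  = A ^ B            # bitwise XOR
inv  = ~A & MASK        # NOT must be masked to stay N bits wide
add  = (A + B) & MASK   # overflow beyond N bits is ignored
shl  = (A << 3) & MASK  # shift left keeps the same width
shr  = A >> 3           # shift right keeps the same width

# Addition wraps modulo 2^N, matching "overflow is ignored"
assert add == (0b1100_1010 + 0b1010_0101) % 256
```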
To generate a dataset using your hash function, run a command like the following:
$ python -m dataset_generation.generate --num-samples 64 --num-input-bits 64 --hash-algo my_custom_hash --visualize --difficulty 4
The dataset is explained more in-depth below. To see the usage of the dataset generation tool, run:
python -m dataset_generation.generate --help
usage: generate.py [-h] [--data-dir DATA_DIR] [--num-samples NUM_SAMPLES]
                   [--num-input-bits NUM_INPUT_BITS]
                   [--hash-algo {add,addConst,andConst,invert,lossyPseudoHash,nonLossyPseudoHash,orConst,sha256,shiftLeft,shiftRight,xorConst}]
                   [--difficulty DIFFICULTY] [--visualize]
                   [--hash-input HASH_INPUT] [--pct-val PCT_VAL]
                   [--pct-test PCT_TEST]

Hash reversal dataset generator

optional arguments:
  -h, --help            Show this help message and exit
  --data-dir DATA_DIR   Path to the directory where the dataset should be
                        stored
  --num-samples NUM_SAMPLES
                        Number of samples to use in the dataset
  --num-input-bits NUM_INPUT_BITS
                        Number of bits in each input message to the hash
                        function
  --hash-algo {add,addConst,andConst,invert,lossyPseudoHash,nonLossyPseudoHash,orConst,sha256,shiftLeft,shiftRight,xorConst}
                        Choose the hashing algorithm to apply to the input
                        data
  --difficulty DIFFICULTY
                        SHA-256 difficulty (an integer between 1 and 64
                        inclusive)
  --visualize           Visualize the symbolic graph of bit dependencies
  --hash-input HASH_INPUT
                        Give input message in hex to simply print the hash of
                        the input
  --pct-val PCT_VAL     Percent of samples used for validation dataset
  --pct-test PCT_TEST   Percent of samples used for test dataset
How it works
Dataset generation
The first thing we need in order to approach this problem is to find out how the hash function f operates on input bits X to produce output bits Y without making any assumptions about X. To this end, I make the hash functions operate on symbolic bit vectors, i.e. SymBitVec objects. These objects track the relationship between input and output bits for all computations in the hash function, e.g. if C = A & B, we track the relationship that bit C is the result of AND-ing A and B. Ultimately after all of the computations, we will have the relationship between each bit of Y and each bit of X in the function f(X) = Y.
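The tracking idea can be sketched with a minimal symbolic bit class. This is hypothetical code, not the repository's `SymBitVec` implementation: each unknown bit gets a unique variable index, and each gate output records which bits produced it, so the full input-to-output relationship can be recovered by walking the graph.

```python
class SymBit:
    """Toy symbolic bit: records the gate and inputs that produced it."""
    _next_index = 0

    def __init__(self, op=None, inputs=()):
        self.index = SymBit._next_index  # unique variable id for this bit
        SymBit._next_index += 1
        self.op = op                     # e.g. "AND"; None for input bits
        self.inputs = inputs             # the SymBits this bit depends on

    def __and__(self, other):
        # C = A & B: create a new bit that remembers its provenance
        return SymBit(op="AND", inputs=(self, other))

a, b = SymBit(), SymBit()   # two unknown input bits
c = a & b                   # symbolic AND: no concrete value computed
assert c.op == "AND" and c.inputs == (a, b)
```

A real implementation additionally stores constant values, applies the simplifications described below, and handles whole bit vectors rather than single bits.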
Some simplifications can be made during this process. For example, let's say that bit A is a bit from the unknown input X and bit B is a constant in the hash algorithm equal to 0. Well then we know that C = A & 0 = 0 so C = 0 no matter the value of A. Some more simplifications for operations on single bits are listed below:
B = A & 1 = A
B = A | 0 = A
B = A | 1 = 1
B = A ^ 1 = ~A
B = A ^ 0 = A
B = A ^ A = 0
B = A & A = A | A = A
These simplifications help to reduce the size of the symbolic representation of the hash function, since the output bit B is sometimes a constant or equal to the unknown input A. When this happens, we don't need to introduce a new unknown variable.
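The constant-folding idea can be sketched as follows. This is a hypothetical helper, not the repository's code: a bit is 0, 1, or `None` (unknown), and a gate only yields a genuinely new unknown when no simplification applies. (For brevity the sketch does not track inversions like B = A ^ 1 = ~A.)

```python
def and_bits(a, b):
    # Returns 0, 1, the surviving operand, or None (new unknown)
    if a == 0 or b == 0:
        return 0       # A & 0 = 0, regardless of the other operand
    if a == 1:
        return b       # 1 & B = B
    if b == 1:
        return a       # A & 1 = A
    return None        # both unknown: result stays unknown

def or_bits(a, b):
    if a == 1 or b == 1:
        return 1       # A | 1 = 1
    if a == 0:
        return b       # 0 | B = B
    if b == 0:
        return a       # A | 0 = A
    return None

# An unknown input AND a constant 0 collapses to 0: no new variable needed
assert and_bits(None, 0) == 0
assert or_bits(None, 1) == 1
```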
Furthermore, the problem can be made easier to handle by reducing all operations (XOR, OR, addition) to only using AND and INV logic gates. For example, C = A ^ B is equivalent to:
X = ~(A & B)
Y = ~(A & X)
Z = ~(B & X)
C = ~(Y & Z)
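The four-gate construction above can be checked exhaustively over all single-bit inputs, using `1 - v` as single-bit NOT:

```python
# Verify that the AND/INV decomposition computes XOR for every input pair
for A in (0, 1):
    for B in (0, 1):
        X = 1 - (A & B)   # X = ~(A & B)
        Y = 1 - (A & X)   # Y = ~(A & X)
        Z = 1 - (B & X)   # Z = ~(B & X)
        C = 1 - (Y & Z)   # C = ~(Y & Z)
        assert C == (A ^ B)
```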
This does introduce intermediate variables, but critically, the AND and INV operations can be linearized and also represented in the con
