Cryptosym
SAT-based, MILP, and belief propagation preimage attacks on SHA-256 and other cryptographic hash functions
Bit relationships after 4-round SHA-256. 17806 nodes, 26383 edges.
Lossy "pseudo-hash" used to validate solver accuracy, showing relationship between input and output bits.
Description
This repository contains Python and C++ code which attempts to reverse one-way cryptographic hash functions, with specific focus on SHA-256. A hash function f can be thought of as an operation on bits X to produce output bits Y: f(X) = Y. Given knowledge of Y and how f works, we want to find some bits X' such that f(X') = Y. This is commonly known as a preimage attack. Note that X does not necessarily need to equal X'.
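To build intuition for what a preimage attack means, here is a minimal sketch against a toy 8-bit "hash" where brute force is trivial. The function `toy_hash` is hypothetical and exists only for illustration; real hash outputs are far too large to enumerate this way.

```python
def toy_hash(x: int) -> int:
    # Hypothetical lossy 8-bit function, for illustration only
    return ((x * 37) ^ (x >> 3)) & 0xFF

def find_preimage(y: int) -> int:
    # Search all 8-bit inputs for some x' with f(x') = y
    for candidate in range(256):
        if toy_hash(candidate) == y:
            return candidate  # x' does not need to equal the original x
    raise ValueError("no preimage found")

y = toy_hash(42)
x_prime = find_preimage(y)
assert toy_hash(x_prime) == y
```

Note that `find_preimage` may return a different input than the one originally hashed; any colliding input counts as a successful preimage.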
A successful preimage attack would have serious implications for basically the entire Internet, the financial world, and the national defense of major governments. Hash functions are used in all kinds of domains: from Bitcoin mining and transactions, to HTTPS encryption, to the storage of user passwords in server databases.
I've spent a long time (too long!) trying to solve this "impossible" problem using a variety of methods, detailed below. As a disclaimer: I do not claim that any of these methods break the security of the full 64-round SHA-256 hash function. It's probably still infeasible. Prove me wrong :)
Installation
It is not necessary to install everything listed here. If you just want to run certain parts of the code (e.g. C++ vs. Python) you can skip some steps. I developed everything on a Mac, so most things will translate well to Linux but maybe not to Windows.
Python
Use Python 3.6 with Anaconda. Start by creating and activating a new environment:
conda create -n preimage python=3.6
conda activate preimage
python --version # Should show 3.6
Install project dependencies:
python -m pip install -r requirements.txt
python -m pip install -i https://pypi.gurobi.com gurobipy
Not all of the solving methods will work out-of-the-box from here. If you want certain solvers, you will need to install them individually:
For CPLEX on Mac OS X, make sure to do:
export PYTHONPATH=/Applications/CPLEX_Studio1210/cplex/python/3.6/x86-64_osx
C++
You need to install cmake and yaml-cpp. Then, to compile the code in this repository, navigate to the belief_propagation directory and run:
$ ./kompile
Quickstart
Generate datasets:
$ ./generate_all_hashes.sh
Choose a dataset:
$ ls -la ./data
drwxr-xr-x 27 trevphil staff 864 Nov 3 23:04 .
drwxr-xr-x 15 trevphil staff 480 Nov 22 12:29 ..
drwxr-xr-x 11 trevphil staff 352 Nov 3 22:54 addConst_d1
drwxr-xr-x 11 trevphil staff 352 Nov 3 22:53 add_d1
drwxr-xr-x 11 trevphil staff 352 Nov 3 22:54 andConst_d1
drwxr-xr-x 10 trevphil staff 320 Nov 3 22:54 sha256_d1
...
The "d1" at the end means that the difficulty level is "1". The full SHA-256 algorithm uses a difficulty of 64 (because it has 64 rounds). Run a solver on the dataset of your choice:
$ python -m optimization.main data/addConst_d1 --solver ortools_cp
To see the available solvers and usage of the tool:
$ python -m optimization.main --help
usage: main.py [-h]
               [--solver {gradient,gnc,cplex_milp,cplex_cp,ortools_cp,ortools_milp,gurobi_milp,minisat,crypto_minisat}]
               dataset

Hash reversal via optimization

positional arguments:
  dataset               Path to the dataset directory

optional arguments:
  -h, --help            Show this help message and exit
  --solver {gradient,gnc,cplex_milp,cplex_cp,ortools_cp,ortools_milp,gurobi_milp,minisat,crypto_minisat}
                        The solving technique
To run the C++ belief propagation code after having compiled it with ./kompile, choose one of the config files and run the following command from the belief_propagation directory:
$ ./main config/config_file_name.yaml
Writing your own hash function
Take a look at the existing hash functions in dataset_generation/hash_funcs.py. For example, this function simply adds a constant value to the input:
class AddConst(SymbolicHash):
    def hash(self, hash_input: SymBitVec, difficulty: int):
        n = len(hash_input)
        A = SymBitVec(0x4F65D4D99B70EF1B, size=n)
        return hash_input + A
Implement your own hash function class which inherits from SymbolicHash and has the function hash(...), and add your class to the dictionary returned by the function hash_algorithms() in dataset_generation/hash_funcs.py. The following operations are supported for primitives (SymBitVec objects) in the hash function:
- AND: C = A & B
- OR: C = A | B
- XOR: C = A ^ B
- NOT (aka INV for "inverse"): B = ~A
- Addition: C = A + B (A, B, and C should have the same number of bits; overflow is ignored)
- Shift left: B = (A << n) (B will have the same number of bits as A)
- Shift right: B = (A >> n) (B will have the same number of bits as A)
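These operations mirror the semantics of plain Python integers restricted to a fixed word size. The sketch below uses ordinary ints (not `SymBitVec`) just to illustrate the fixed-width behavior; in particular, "overflow is ignored" corresponds to masking the result back to N bits.

```python
N = 8                 # word size in bits (chosen arbitrarily here)
MASK = (1 << N) - 1   # keeps every result N bits wide

A, B = 0b1100_1010, 0b1010_0101

and_ = A & B            # bitwise AND
or_  = A | B            # bitwise OR
xor  = A ^ B            # bitwise XOR
inv  = ~A & MASK        # NOT must be masked to stay N bits wide
add  = (A + B) & MASK   # overflow beyond N bits is ignored
shl  = (A << 3) & MASK  # shift left keeps the same width
shr  = A >> 3           # shift right keeps the same width

# Addition wraps modulo 2^N, matching "overflow is ignored"
assert add == (0b1100_1010 + 0b1010_0101) % 256
```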
To generate a dataset using your hash function, run a command like the following:
$ python -m dataset_generation.generate --num-samples 64 --num-input-bits 64 --hash-algo my_custom_hash --visualize --difficulty 4
The dataset is explained more in-depth below. To see the usage of the dataset generation tool, run:
python -m dataset_generation.generate --help
usage: generate.py [-h] [--data-dir DATA_DIR] [--num-samples NUM_SAMPLES]
                   [--num-input-bits NUM_INPUT_BITS]
                   [--hash-algo {add,addConst,andConst,invert,lossyPseudoHash,nonLossyPseudoHash,orConst,sha256,shiftLeft,shiftRight,xorConst}]
                   [--difficulty DIFFICULTY] [--visualize]
                   [--hash-input HASH_INPUT] [--pct-val PCT_VAL]
                   [--pct-test PCT_TEST]

Hash reversal dataset generator

optional arguments:
  -h, --help            Show this help message and exit
  --data-dir DATA_DIR   Path to the directory where the dataset should be
                        stored
  --num-samples NUM_SAMPLES
                        Number of samples to use in the dataset
  --num-input-bits NUM_INPUT_BITS
                        Number of bits in each input message to the hash
                        function
  --hash-algo {add,addConst,andConst,invert,lossyPseudoHash,nonLossyPseudoHash,orConst,sha256,shiftLeft,shiftRight,xorConst}
                        Choose the hashing algorithm to apply to the input
                        data
  --difficulty DIFFICULTY
                        SHA-256 difficulty (an integer between 1 and 64
                        inclusive)
  --visualize           Visualize the symbolic graph of bit dependencies
  --hash-input HASH_INPUT
                        Give input message in hex to simply print the hash of
                        the input
  --pct-val PCT_VAL     Percent of samples used for validation dataset
  --pct-test PCT_TEST   Percent of samples used for test dataset
How it works
Dataset generation
The first thing we need in order to approach this problem is to find out how the hash function f operates on input bits X to produce output bits Y without making any assumptions about X. To this end, I make the hash functions operate on symbolic bit vectors, i.e. SymBitVec objects. These objects track the relationship between input and output bits for all computations in the hash function, e.g. if C = A & B, we track the relationship that bit C is the result of AND-ing A and B. Ultimately after all of the computations, we will have the relationship between each bit of Y and each bit of X in the function f(X) = Y.
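The tracking idea can be sketched with a minimal symbolic bit class. This is hypothetical code, not the repository's `SymBitVec` implementation: each unknown bit gets a unique variable index, and each gate output records which bits produced it, so the full input-to-output relationship can be recovered by walking the graph.

```python
class SymBit:
    """Toy symbolic bit: records the gate and inputs that produced it."""
    _next_index = 0

    def __init__(self, op=None, inputs=()):
        self.index = SymBit._next_index  # unique variable id for this bit
        SymBit._next_index += 1
        self.op = op                     # e.g. "AND"; None for input bits
        self.inputs = inputs             # the SymBits this bit depends on

    def __and__(self, other):
        # C = A & B: create a new bit that remembers its provenance
        return SymBit(op="AND", inputs=(self, other))

a, b = SymBit(), SymBit()   # two unknown input bits
c = a & b                   # symbolic AND: no concrete value computed
assert c.op == "AND" and c.inputs == (a, b)
```

A real implementation additionally stores constant values, applies the simplifications described below, and handles whole bit vectors rather than single bits.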
Some simplifications can be made during this process. For example, let's say that bit A is a bit from the unknown input X and bit B is a constant in the hash algorithm equal to 0. Well then we know that C = A & 0 = 0 so C = 0 no matter the value of A. Some more simplifications for operations on single bits are listed below:
B = A & 1 = A
B = A | 0 = A
B = A | 1 = 1
B = A ^ 1 = ~A
B = A ^ 0 = A
B = A ^ A = 0
B = A & A = A | A = A
These simplifications help to reduce the size of the symbolic representation of the hash function, since the output bit B is sometimes a constant or equal to the unknown input A. When this happens, we don't need to introduce a new unknown variable.
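The constant-folding idea can be sketched as follows. This is a hypothetical helper, not the repository's code: a bit is 0, 1, or `None` (unknown), and a gate only yields a genuinely new unknown when no simplification applies. (For brevity the sketch does not track inversions like B = A ^ 1 = ~A.)

```python
def and_bits(a, b):
    # Returns 0, 1, the surviving operand, or None (new unknown)
    if a == 0 or b == 0:
        return 0       # A & 0 = 0, regardless of the other operand
    if a == 1:
        return b       # 1 & B = B
    if b == 1:
        return a       # A & 1 = A
    return None        # both unknown: result stays unknown

def or_bits(a, b):
    if a == 1 or b == 1:
        return 1       # A | 1 = 1
    if a == 0:
        return b       # 0 | B = B
    if b == 0:
        return a       # A | 0 = A
    return None

# An unknown input AND a constant 0 collapses to 0: no new variable needed
assert and_bits(None, 0) == 0
assert or_bits(None, 1) == 1
```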
Furthermore, the problem can be made easier to handle by reducing all operations (XOR, OR, addition) to only using AND and INV logic gates. For example, C = A ^ B is equivalent to:
X = ~(A & B)
Y = ~(A & X)
Z = ~(B & X)
C = ~(Y & Z)
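The four-gate construction above can be checked exhaustively over all single-bit inputs, using `1 - v` as single-bit NOT:

```python
# Verify that the AND/INV decomposition computes XOR for every input pair
for A in (0, 1):
    for B in (0, 1):
        X = 1 - (A & B)   # X = ~(A & B)
        Y = 1 - (A & X)   # Y = ~(A & X)
        Z = 1 - (B & X)   # Z = ~(B & X)
        C = 1 - (Y & Z)   # C = ~(Y & Z)
        assert C == (A ^ B)
```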
This does introduce intermediate variables, but critically, the AND and INV operations can be linearized and also represented in the con
