# NEMO: Modeling Password Guessability Using Markov Models
## tl;dr
This is our ongoing effort to use Markov models to build probabilistic password models. Common use cases include:
- Strength estimation
- Guessing
- (Adaptive) Natural Language Encoders
- ...
## WARNING
- This is research-quality code that should only be used for a proof of concept (PoC).
- We share this code in the hope that the research community can benefit from it. Please share your code, too! :heart_eyes:
- We recommend running this software using PyPy (see performance stats below).
## About NEMO
The scope of this project is not limited to passwords; this software has also been used in the context of other human-chosen secrets like Emoji, PINs, and Android unlock patterns.
The architecture of the software is inspired by OMEN. More background information about OMEN can be found here and here. An excellent Python implementation of OMEN, called py_omen, by Matthew Weir (@lakiw) can be found here.
## Differences from OMEN
OMEN makes use of so-called levels (a form of binning); this implementation does not. Thus, efficient enumeration of password candidates (guessing passwords the way OMEN does) is not possible out of the box if the key space becomes too big. However, because of the non-binned output, this software has other advantages; for example, it can produce more accurate strength estimates.
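To illustrate what a non-binned model buys you, here is a minimal sketch of strength estimation with a 3-gram model. The function and dictionary names are hypothetical, not NEMO's actual API; `ip`, `cp`, and `ep` are assumed to map n-grams to probabilities learned from a training corpus:

```python
from math import log2

# Hypothetical sketch of non-binned strength estimation with a 3-gram model.
# ip, cp, and ep map n-grams to probabilities; these names are illustrative,
# not NEMO's actual API.
def strength_bits(password, ip, cp, ep, n=3):
    """Return the -log2 probability of a password (higher = stronger)."""
    bits = -log2(ip[password[:n - 1]])           # initial (n-1)-gram, e.g. "pa"
    for i in range(len(password) - n + 1):       # conditional n-grams
        bits += -log2(cp[password[i:i + n]])
    bits += -log2(ep[password[-(n - 1):]])       # final (n-1)-gram, e.g. "rd"
    return bits
```

Because the probabilities stay exact, two passwords that a level-based model would place in the same bin can still be ranked against each other.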
## Overview: Markov Model-Based Password Guessing
- In 2005, Arvind Narayanan and Vitaly Shmatikov proposed the use of Markov models to overcome some problems of dictionary-based password guessing attacks in their work *Fast Dictionary Attacks on Passwords Using Time-Space Tradeoff*. The idea behind Markov models is based on the observation that subsequent tokens, such as letters in a text, are rarely chosen independently and can often be accurately modeled based on a short history of tokens.
- In 2008, the popular password cracker John the Ripper introduced a `--markov` mode. More details can be found here, here, and here. Simon Marechal (@bartavelle) compared this Markov model-based approach with various other guessing techniques in his work *Advances in Password Cracking*.
- In 2010, Dell’Amico et al. used a Markov model-based approach to guess passwords in their work *Measuring Password Strength: An Empirical Analysis*.
- In 2012, Castelluccia et al. (2012) and Dürmuth et al. (2015) improved the concept by generating password candidates according to their occurrence probabilities, i.e., by guessing the most likely passwords first. Please refer to their works *Adaptive Password-Strength Meters from Markov Models*, *OMEN: Faster Password Guessing Using an Ordered Markov Enumerator*, and *When Privacy Meets Security: Leveraging Personal Information for Password Cracking* for more details.
- In 2014, Ma et al. discussed other sources of improvement, such as smoothing, backoff models, and issues related to data sparsity, in their excellent work *A Study of Probabilistic Password Models*.
- In 2015, Matteo Dell’Amico and Maurizio Filippone published their work on *Monte Carlo Strength Evaluation: Fast and Reliable Password Checking*. Their backoff Markov model can be found on GitHub, too. :heart_eyes:
- In 2015, Ur et al. compared various password cracking methods in their work *Measuring Real-World Accuracies and Biases in Modeling Password Guessability*. For Markov model-based attacks, they used a copy of Ma et al.'s code, which is now available via Carnegie Mellon University's Password Guessability Service (PGS), where it is called "Markov Model: wordlist-order5-smoothed."
- In 2016, Melicher et al. compared their RNN-based approach to a Markov model in their work *Fast, Lean, and Accurate: Modeling Password Guessability Using Neural Networks*. While some details are missing, their model can be found on GitHub, too. :heart_eyes:
## Publications
In the past, we used different versions of this code in the following publications: :bowtie:
- IEEE SP 2019: *Reasoning Analytically About Password-Cracking Software* (`Markov: Multi`)
- ACM CCS 2018: *On the Accuracy of Password Strength Meters* (`ID: 4B/4C Markov (Single/Multi)`)
- ACM CCS 2016: *On the Security of Cracking-Resistant Password Vaults* (`Markov Model`)
A simpler version of this code has been used for other user-chosen secrets such as Emoji and Android unlock patterns.
## Design Decisions
Warning: Markov models are memory-eating monsters! :smiling_imp:
We use three copies of a data structure (in the past: Python `OrderedDict`s; today: plain Python lists) to store the frequencies of the n-grams in the training corpus. We use:
- IP: initial probabilities (n-grams of length `ngram_size - 1`)
- CP: conditional probabilities (n-grams of length `ngram_size`)
- EP: end probabilities (n-grams of length `ngram_size - 1`)
Here is an example for 3-grams:

```
password   PW
pa         IP    (some literature uses this annotation: ^pa)
pas        CP1
ass        CP2
ssw        CP3
swo        CP4
wor        CP5
ord        CP6
rd         EP    (some literature uses this annotation: rd$)
```
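The decomposition above can be sketched in a few lines of Python (an illustrative helper, not part of NEMO's code):

```python
# Decompose a password into its IP, CP, and EP n-grams (here: 3-grams).
def ngrams(password, n=3):
    ip = password[:n - 1]                                      # initial (n-1)-gram
    cp = [password[i:i + n] for i in range(len(password) - n + 1)]
    ep = password[-(n - 1):]                                   # final (n-1)-gram
    return ip, cp, ep

ip, cp, ep = ngrams("password")
# ip == "pa", cp == ["pas", "ass", "ssw", "swo", "wor", "ord"], ep == "rd"
```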
### How Big Are They?
```
IP: alphabet_length ^ (ngram_size - 1)
CP: alphabet_length ^ ngram_size
EP: alphabet_length ^ (ngram_size - 1)
```
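As a worked example, assume a 72-character alphabet and 4-grams (both values are illustrative; the actual alphabet depends on your training data):

```python
# Table sizes for an assumed 72-character alphabet and 4-grams.
alphabet_length, ngram_size = 72, 4

ip_size = alphabet_length ** (ngram_size - 1)   # 72^3 = 373,248 entries
cp_size = alphabet_length ** ngram_size         # 72^4 = 26,873,856 entries
ep_size = alphabet_length ** (ngram_size - 1)   # 72^3 = 373,248 entries
```

The CP table dominates, which is why higher n-gram orders quickly become memory-hungry.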
### Some Details for Those Interested :nerd_face:
**n-gram size:** Currently, we support 2-, 3-, 4-, and 5-grams. The higher the order of the Markov chain, the more accurate the model becomes. Unfortunately, this also introduces the risk of overfitting and data sparsity: if one does not have enough training data, e.g., when using the model with Android unlock patterns, transition probabilities estimated from very small counts become too noisy. While we only support fixed-order Markov chains, we recommend Dell’Amico and Filippone's backoff model for variable-order Markov chains.
**Smoothing:** Currently, we only support additive smoothing (adding 1 to the counts), also known as Laplace smoothing.
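In a sketch (illustrative code, not NEMO's actual implementation), add-one smoothing over n-gram counts looks like this:

```python
# Additive (Laplace) smoothing: add 1 to every count before normalizing,
# so unseen n-grams receive a small non-zero probability instead of 0.
# Function and parameter names are illustrative.
def laplace_smooth(counts, num_possible_ngrams):
    total = sum(counts.values()) + num_possible_ngrams  # +1 for each possible n-gram
    probs = {gram: (count + 1) / total for gram, count in counts.items()}
    unseen = 1 / total  # probability assigned to any n-gram never seen in training
    return probs, unseen

probs, unseen = laplace_smooth({"aa": 3, "ab": 1}, num_possible_ngrams=4)
# probs == {"aa": 0.5, "ab": 0.25}, unseen == 0.125
```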
**Alphabet:** We tested this software with ASCII passwords only. Using non-ASCII passwords likely requires dropping Python 2 support first. Hint: You can use the `info.py` script in the `utils` folder to determine the alphabet.
## Development
In early versions of this code, we made heavy use of Python's `(Ordered)Dict` classes. Fun fact: as of Python 3.7, plain dictionaries always preserve insertion order. :)
```
cp_dict_full:
    key: "aaaa", value: 0.0071192...
    key: "aaab", value: 0.0034128...
    ...
```
A few months later, we optimized the memory consumption by only storing n-grams that actually occur in the training corpus. If a rare n-gram like the 4-gram `o9py` does not occur in the training file, we returned a very small default probability instead. This helped a lot to reduce the required memory; still, like Google Chrome, our solution easily occupied more than 20GB of RAM. :poop:
```
cp_dict_sparse:
    key: "assw", value: 0.0838103...
    key: "sswo", value: 0.0954193...
    ...
```
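The sparse lookup described above amounts to a dictionary lookup with a fallback. The default value `1e-10` below is an illustrative placeholder, not NEMO's actual constant:

```python
DEFAULT_PROB = 1e-10  # illustrative fallback, not NEMO's actual default

cp_dict_sparse = {"assw": 0.0838103, "sswo": 0.0954193}

def cp_lookup(ngram):
    # Only n-grams that occur in the corpus are stored; everything else
    # falls back to the tiny default probability.
    return cp_dict_sparse.get(ngram, DEFAULT_PROB)

cp_lookup("o9py")  # unseen 4-gram -> 1e-10
```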
Thus, we decided to refactor the code again, further limiting the required memory to approx. 16GB of RAM today. We now use simple Python lists to store the n-gram probabilities in memory. However, this forces
