Doxer Stylometric Data Mining Library 🕵️

Agatha Christie quote: "Very few of us are what we seem."

Quick-start

Clone this repository:

git clone https://github.com/goldmonkey21/doxer

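Add doxer to your PATH via your .bashrc file (adjust the path below to wherever you cloned the repository; sudo is not needed to edit your own .bashrc):

cd ~
vim .bashrc
# append this line to .bashrc:
export PATH=$PATH:/home/flak/Documents/prog/.git/doxer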

Find Satoshi's identity in the btcforum folder:

cd btcforum
doxer.py -t 3satoshi

Output for Satoshi:

0.0010055437039908227 -> 237lachesis_forum.txt                      
0.001015254105998512 -> 224GavinAndresen_forum.txt
224GavinAndresen

👍 Australian-born programmer Gavin Andresen is Satoshi!

Find the author of the original Q-source gospel:

cd christiantexts
doxer.py -t qsource

Output for Q-source:

0.001427588237734938 -> matthew-web_christian.txt 
0.0015145750641581285 -> luke-web_christian.txt
0.0015699158741751618 -> thomas-layton_christian.txt
thomas-layton

👍 The Gospel of Thomas and the original Christian gospel were written by a Gnostic!

Find the author of Daniel (Old Testament text):

cd lxx
doxer.py -t Daniel

Output for Daniel:

0.0012709272493254052 -> KingsI_lxx.txt
0.001307254569569858 -> Genesis_lxx.txt
Genesis

👍 The books of Daniel and Genesis were written by the same person, pushing the date of the entire Old Testament to the Hellenistic era!

Find the author of the anonymous novel Clara in the benchmark folder:

cd novels_english
doxer.py -t Anon-Clara1864

Output for Clara novel:

0.000670058804150311 -> Blackmore-Lorna1869_english.txt
0.0007114990852890783 -> Blackmore-Erema1877_english.txt
0.0007436772047638017 -> Cbronte-Jane1847_english.txt
0.0007649839175935351 -> Cbronte-Villette1853_english.txt
Blackmore-Lorna1869

👍 And that is the correct answer... Blackmore wrote the novels Clara and Lorna!

Done!

Introduction

Simple Stylometry in Terminal

Let us start this research project with a quick word about Agatha Christie (as pictured in the quote above). I have always likened my work as a data miner to that of one of Christie's most famous characters, Miss Marple. Far from being just a lone spinster, Miss Marple outwits some of the cleverest criminals simply because she has read enough crime books to gain something of a sixth sense for their goings-on. With that, I encourage you to take my argument seriously and even to install my software on your own computer. The Python library is, after all, self-contained and needs few extra imports. With that said, let's uncover the identity of Satoshi and maybe learn a few more lessons about data mining along the way.

What I wanted more than anything else was a stylometry program that could run easily from my Linux terminal and gather robust stylometric results without the need for a GPU. I lay feverishly in bed trying to solve this problem while also surviving a cold. I came up with an algorithm that, when I awoke, I immediately translated into code so as to solve most of this problem once and for all.

If you look at the diagram above, you will see that Doxer is able to identify the famous Russian writer Gorky when you simply type doxer.py -t Gorky-Mat. The result is an immediate match without much work being put into it. The algorithm is thus unsupervised and can also work anywhere in the terminal if you add it to your .bashrc file.

And by way of embarrassment, I admit that I made a typo in the image above, writing 200 instead of 2,000. I analyzed all substantial texts between 1 and 2,000 on the Bitcoin forum, adding an extra Adam Back text just for the fans of his authorship.

Moving on to another line in the diagram above, you can see Doxer identifying a text labeled Anon-Clara1864. This file name means that the text is titled Clara and was published in 1864. Doxer immediately uncovers the author's identity and correctly identifies Blackmore as the culprit. Here is the wiki page about the book, which you can check for yourself:

https://en.wikipedia.org/wiki/Clara_Vaughan

This program can be used for a multitude of stylometry tasks, but I will limit the scope of this case study to the mystery of Satoshi Nakamoto, the inventor of Bitcoin. I will use Doxer to finalize my analysis, but I will additionally employ a Random Forest on an Amazon EC2 instance to reduce the list of candidates to a manageable (yet reasonable) number.

In short, Doxer is a unique-word analyzer that finds the words two texts share with each other and with no other text in the corpus. You take an unknown text and compare it one by one with each of the candidate texts, counting how many words each candidate shares only with the unknown text. For example, the texts by Satoshi and Gavin may use particular words (or ngrams) that no other text in the corpus uses. If this count, divided by the average overlap between Gavin and all the other texts, is the highest among all candidates, then it is reasonable to suggest that, of those candidates, the most likely author is Gavin. Of course, the algorithm will not work as well if you feed it a million texts, because the overlap of words will be dispersed over the entire corpus. It is therefore necessary to reduce the corpus first so as to analyze only those texts that are already similar in style to the unknown text.
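The counting step described above could be sketched roughly as follows. This is a minimal illustration, not Doxer's actual code: the function name exclusive_overlap, the whitespace tokenizer, and the four sample texts are all assumptions made up for the example.

```python
from collections import Counter


def exclusive_overlap(texts):
    """For every pair of texts, collect the words found in exactly those two texts.

    `texts` maps a label to its raw string. Tokens are split on whitespace,
    so punctuation stays attached to the words (an assumption of this sketch).
    """
    vocab = {name: set(body.split()) for name, body in texts.items()}
    # Number of texts in which each word occurs at least once.
    doc_freq = Counter(word for words in vocab.values() for word in words)
    names = sorted(vocab)
    overlap = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            # A word is exclusive to the pair if both texts contain it
            # and no third text does.
            overlap[(a, b)] = {w for w in vocab[a] & vocab[b] if doc_freq[w] == 2}
    return overlap


# Made-up sample texts, purely for illustration.
corpus = {
    "unknown": "a back-of-the-envelope estimate of the block size",
    "gavin": "my back-of-the-envelope guess about the fee",
    "hal": "the proof of work is running",
    "adam": "hashcash uses the proof idea",
}
pairs = exclusive_overlap(corpus)
print(pairs[("gavin", "unknown")])
```

Here "the" is dropped because every text contains it, while "back-of-the-envelope" survives because only two of the texts use it.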

Here is a quick toy example. Let's imagine for a moment that Satoshi used 50 words that only Gavin and he shared. Then let's say that Gavin shared 20 words only with Craig, 20 words only with Hal, and 20 words only with Adam Back. The Doxer score would thus be 50 / ( (20 + 20 + 20) / 3 ), leaving us with a final Satoshi-Gavin score of 50 / 20 = 2.5. As you can see in this toy example, Gavin shares more unique words with Satoshi than with anyone else. We then repeat this process for every one of the candidate texts and classify the candidate with the highest score as Satoshi Nakamoto.
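The arithmetic of the toy example can be checked in a few lines of Python. The helper doxer_score is a hypothetical name used only for illustration, and the handling of a zero baseline is an assumption, since the text above does not say what Doxer does in that case.

```python
def doxer_score(shared_with_unknown, shared_with_others):
    """Exclusive overlap with the unknown text, divided by the candidate's
    average exclusive overlap with the other candidate texts."""
    baseline = sum(shared_with_others) / len(shared_with_others)
    if baseline == 0:
        # Assumption of this sketch: with no baseline overlap the ratio is
        # undefined, so any exclusive overlap counts as an unbounded score.
        return float("inf") if shared_with_unknown else 0.0
    return shared_with_unknown / baseline


# Gavin: 50 words shared only with Satoshi; 20 each with Craig, Hal and Adam Back.
score = doxer_score(50, [20, 20, 20])
print(score)  # 2.5
```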

Additionally, I've added a nifty feature, the -o switch, that prints out the words that the winner shares with the unknown text. When I ran a one-word gram analysis (the default setting) on the Bitcoin forum, I found that the winner, Gavin Andresen, shared the odd phrase 'back-of-the-envelope' only with Satoshi. As you can see, Doxer leaves punctuation intact and tries to retain as much information as possible so as to find intricate results.

To overcome the problem of dispersion mentioned earlier, Doxer runs a quick Burrows' Delta to find the nearest neighbors of the query text. The list of top deltas can then be cut down to a predetermined number with the -r option followed by that number. For example, doxer.py -t 3satoshi -r 3 will cut the dataset of over 600 texts down to the nearest 3 texts so as to find more interesting unique words among these likely authors. Keep in mind that the Burrows' Delta measure in the current program does not include z-scores, because I created a Random Forest to act as a reduce() function for the algorithm instead, so as to get the best result possible. The amazing thing about the Random Forest is that it can reject all candidates, so that Doxer doesn't have to deal with rubbish texts. I found this most useful when analyzing the Bitcoin whitepaper against around 50 other whitepapers. Every model of the Forest in fact rejected the other whitepapers, and thus I didn't have to waste time analyzing the closest neighbor. They were all rejected in one fell swoop!
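The reduction step might look something like the sketch below. It assumes a simplified Delta (the mean absolute difference of relative word frequencies over the most frequent words, with no z-scores, in line with the note above); the real Doxer code may well differ. The names relative_freqs and nearest_candidates, the top_words parameter, and the sample texts are hypothetical.

```python
from collections import Counter


def relative_freqs(text, vocab):
    """Relative frequency of each vocabulary word in `text`."""
    tokens = text.split()
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on an empty text
    return [counts[w] / total for w in vocab]


def nearest_candidates(unknown, candidates, r, top_words=100):
    """Rank candidates by a simplified, z-score-free Delta and keep the nearest r."""
    # Vocabulary: the most frequent words across the unknown text and all candidates.
    corpus_counts = Counter()
    for body in [unknown, *candidates.values()]:
        corpus_counts.update(body.split())
    vocab = [w for w, _ in corpus_counts.most_common(top_words)]

    target = relative_freqs(unknown, vocab)
    deltas = {}
    for name, body in candidates.items():
        freqs = relative_freqs(body, vocab)
        distance = sum(abs(a - b) for a, b in zip(target, freqs))
        deltas[name] = distance / (len(vocab) or 1)
    # Smaller Delta means closer in style.
    return sorted(deltas, key=deltas.get)[:r]


# Made-up texts, purely for illustration.
unknown_text = "the the of a"
pool = {
    "x": "the the of a",
    "y": "zebra zebra zebra zebra",
    "z": "the of of a",
}
shortlist = nearest_candidates(unknown_text, pool, r=2)
print(shortlist)  # ['x', 'z']
```

The shortlist returned here plays the role of the reduced corpus that the unique-word comparison then works on.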

I took it upon myself to write my own feature-collecting function using skip grams so as to speed things up. I devised a crafty little function that puts gaps in the ngrams so that, regardless of the number of grams I collect, the data is always represented as 4-grams, thus making the algorithm scalable to whatever number of grams I desire. For example, a frequent character 4-gram is [t,h,e,n]. A frequent word 2-gram may be [of,the] or even [but,the]. My skip gram would reduce the gram [the,quick,brown,fox,jumped] to [the,quick,fox,jumped], because I'm applying the skip gram pattern [1,1,0,1,1], with the zero marking the dropped 'brown' gram. Here is an example of how Doxer calculates the skip grams:

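from doxer import Doxer

d = Doxer()

# Print the skip-gram pattern for each window size from 1 to 19 (output below).
# A 1 marks a position that is kept, a 0 marks a position that is skipped.
for y in range(1, 20):
    print(d.split([0 for x in range(y)]))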

[1]
[1, 1]
[1, 1, 1]
[1, 1, 1, 1]
[1, 1, 0, 1, 1]
[1, 0, 1, 0, 1, 1]
[1, 0, 1, 0, 0, 1, 1]
[1, 0, 1, 0, 0, 1, 0, 1]
[1, 0, 0, 1, 0, 0, 0, 1, 1]
[1, 0, 0, 1, 0, 0, 0, 1, 0, 1]
[1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1]
[1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1]
[1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1]
[1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1]
[1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
[1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1]
[1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
[1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1]
[1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1]
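To show how such a pattern is applied to text, here is a small sketch. It is not Doxer's own implementation: skip_grams is a hypothetical helper that slides a 0/1 pattern over a token list and keeps only the positions marked 1.

```python
def skip_grams(tokens, mask):
    """Slide `mask` over `tokens` and keep only the positions marked 1."""
    width = len(mask)
    grams = []
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        grams.append(tuple(tok for tok, keep in zip(window, mask) if keep))
    return grams


words = "the quick brown fox jumped over the lazy dog".split()
grams = skip_grams(words, [1, 1, 0, 1, 1])
print(grams[0])  # ('the', 'quick', 'fox', 'jumped')

# The same helper works on characters: a plain 4-gram uses the pattern [1, 1, 1, 1].
print(skip_grams(list("then"), [1, 1, 1, 1]))  # [('t', 'h', 'e', 'n')]
```

Because every pattern of length four or more in the list above contains exactly four ones, each window collapses to a 4-gram no matter how wide it is.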

General Usage

Find Satoshi in a folder of forum posts:

doxer.py -t 3satoshi -r 3

Benchmark one of the folders above with the candidates reduced to 10:

doxer.py -b -r 10

Find Satoshi using 4-character ngrams:

doxer.py -t 3satoshi -r 3 -c -n 4

Find Satoshi using 5-word ngrams:

doxer.py -t 3satoshi -r 3 -n 5

Bootstrapping

As already hinted at in the section above, stylometry has often suffered from a limitation of nearest neighbor analysis: it will attribute authorship to any old text when no truly worthy candidate is present. The answer to this unfortunate problem is bootstrapping.

And by way of example, let's say you conduct a nearest neighbor approach on a list of novels from one hundred years ago. You compare them all to Satoshi's forum posts and find that it returns the nearest neighbor being some obscure author that had nothing whatsoever to do with cryptocurrency. Can yo
