Overview

What is a DAWG/DAFSA?

A DAWG (Directed Acyclic Word Graph), also known as DAFSA (Deterministic Acyclic Finite State Automaton) is an interesting graph data structure used for storing strings with a minimal memory usage while maintaining very good query time for both search and prefix search. In the same way that the Trie data structure merges common prefixes, in a DAWG common suffixes are also merged.

A DAWG can be represented in a compact way as a sequence of values (usually 4 bytes values), being its size in disk almost the same as its size in memory. The size of a DAWG depends on the amount of common prefixes and suffixes that a list of words have, so the DAWG may or may not outperform GZIP in the compression rate. For example in my implementation the 2019 Collin's Scrabble wordlist with 279,496 words and a file size of 4,123 kB results in DAWG of 712 kB, while GZIP produces a 706 kB file; but the Polish Scrabble wordlist with 3,043,375 words and a file size of 38,827 kB results in a DAWG of 1,575 kB, while GZIP produces a 7,720 kB file.

Nevertheless, when reducing a Trie into a DAWG if all common suffixes are merged it may lead to a graph that has a reduced number of nodes, but that cannot be represented as a sequence of values, and therefore would take more space. Testing against other DAWG library I could see that for the same file the Trie was reduced to the half of the nodes of my implementation and was around 25-30% faster, but outputs a DAWG file that is 25-30% bigger.

Building a DAWG is an expensive process since the graph needs to be analyzed and processed. GZIP can be more than 5x times faster at times. However a DAWG is queryable while a GZIP file needs to be expanded and maybe processed into another data structure (eg: a hashset). Creating a Trie, which is the first step of creating a DAWG, has a time complexity of O(nm), where n is the number of words and m the average word length. Subsequent steps to build a DAWG merging common sufixes may involve multiple graph DFS passes depending on the implementation.

Queyring a DAWG has a time complexity of O(cw), where c is the number of different characters and w the word to search length. In the mentioned Collin's wordlist, 26 characters and the longest word has 15 characters, making the DAWG very fast. Having the words in a hashset provides a faster query time with O(1) time complexity, but its size in memory would be bigger and it would not be possible to do a prefix search.

This project

There are several algorithms to build a DAWG, some more efficient than others. For this purely self-didactic project, inspired in the idea of having a hash that represents a node and its descendency, I have developed a simple one also based on hashing that allows it not only to find nodes that are equivalent in both value and descendency, but also to quickly identify equivalent sibling lists. Hashing is fast but not bulletproof and a collision is possible, although I have tested current implementation in many files and used two different hash calculations combined, it is statistically possible. That is why I have included a method to check that the generated DAWG contains the expected words. Good news is that hashing is just used to classify nodes siblingness, and once the DAWG is built is not necessary anymore. In other words, once the DAWG is created there is no need to worry about collisions, and if you get a collision when building it you may need to salt a little bit the hashing algorithm.

This is a proof of concept and should not use in production unless you know what you are doing. However if you use it and get into a wordlist that fails during DAWG validation... let me know, I will be definitely curious :)

Building a memory optimized DAWG

Let's build a DAWG with the following words: COP, COPS, CUP, CUPS, HOP, HOPS, HUP, HUPS, TAP, TAPS, TOP, TOPS, TUP, TUPS.

The Trie

The first step is to build a Trie. I created some pictures to illustrate the process:

The gray node is the root and has no meaning beyond being the common parent for all the first letters.
Nodes with a bold circunferene are terminal nodes, meaning that a word exists in that location.
Green nodes are the last node in a list of sibblings, something we will need to know when building the final representation of the DAWG.
The orange rectangle points to where the next step is gonna happen.

Hashing nodes

Nodes are equivalent, and therefore mergeable, if they have the same terminal status, equivalent siblings and equivalent descendancy.

Performing those comparisons recursively each time is impractical, so as other DAWG algorithms my implementation relies on hashing to perform these comparisons in an efficient manner. For each node, multiple hashes are calculated:

A tuple with two hashes that represent the node value and its descendency. Both hashes use the node value, children count, distance to further leaf and its descendency accumulated hash, but they are calculated in different ways to minimize the chance of a collision. Both hashing algorithms were able to process all my test files when used in isolation, so the combination of both should work best.
A hash per possible sibbling list suffix. For a list of nodes ABC, hashes are generated for ABC, BC and C. Therefore if there s a list BC that is equivalent, ABC could be reused starting at index 1.

Also during this step, a cache is recording which is the bigger list that contains a specific sublist hash, so bigger lists can be reused to represent as many smaller lists as possible.

Merging nodes

Note this is not the actual way this implementation works. It just illustrates the type of merges that are allowed.

Starting from the bottom, parent nodes of a given node which have same destinations and brothers can be merged.

The S nodes can be merged.

The P nodes can be merged.

Equivalent sibling lists can be merged. This means that both sibling lists have the same destination and hiearchy.

When a sibbling list A has the same values and targets than a part of other sublist B, the merge is also possible, but only if A is a suffix of B. This is important because during serialization the last item of B will be marked as LastSibling and this must be true for A as well.

This would be the result graph after doing all the possible merges.

Store it as an array

The final step is to use an array of 4-byte values to represent the resulting graph.

Rather than representing the children of each node as a list of nodes (or pointers to nodes), we can have for each node a single pointer indicating which is the first child, and all the nodes following the first child are considered siblings until the last one is detected.

Then it is possible to represent the nodes as a sequence, where each node can have only a reference to another single node, and still be able to represent the graph.

To store each node as 4 byte value, the 32 bits are distributed as follows:

8 bits for the value: 256 different symbols can be used in the DAWG.
22 bits for the first child reference: 4,194,304 nodes can be stored in the DAWG.
1 bit to indicate if the node is terminal.
1 bit to indicate if the node is the last sibling.

The algorithm

The basic steps of this implementation are:

Build a trie.
Do bottom-up DFS generating all the hashes.
Remember the longest sibling list for each siblings hash.
Sort in descending order the list of longest sibling lists and number the nodes sequentially. Enqueue the nodes of these lists.
Write in the DAWG the children of the start node (which are indeed the first letters of every word in the source wordlist).
Write all the nodes in the queue to the DAWG using the provided number as array index.

What is good about it?

It is fast and the compression ratio is very good.

What is not good about it?

Relies only in hashing. A hash collision would produce an inconsistent DAWG. That is why a Dawg.Verify method is provided to detect an inconsistent DAWG.

To go through the DAWG requires to jump back and forward in the array, in order words, a child node may have a smaller array index than a parent, making it more difficult to detect a loop. An inconsistent DAWG may fall into a endless loop.

Using the library

If you want to give it a try, you can either use it as a library in an application, or install it directly from Nuget as a console tool.

Console tool

If you have dotnet 3.1 installed, you can install the tool directly from Nuget

Install the tool

$ dotnet tool install --global vtortola.Dawg.Tool --version 0.0.1-alpha
You can invoke the tool using the following command: dawg
Tool 'vtortola.dawg.tool' (version '0.0.1-alpha') was successfully installed.

Creating a DAWG from a wordlist file

To build a DAWG file from a wordlist just pipe the lines through it with the subcommand build and redirect it to the desired file.

$ cat espanol.txt | dawg build > espanol.dawg

$ ls -lh espanol*
-rw-r--r--  1 vtortola  users   169K Sep  5 18:39 espanol.dawg
-rw-r--r--  1 vtortola  users   1.9M Aug  1 01:45 espanol.txt

Creating a wordlist from a DAWG file

To obtain the wordlist from the DAWG file, just pipe the content through the tool with the unbuild command and redirect to a file.

$ cat espanol.dawg | dawg unbuild > espanol.copy.txt

However keep in mi

Dawg

Install / Use

README