SuffixTree
A well-structured, relatively compact implementation of Ukkonen's linear time suffix tree construction algorithm in C# 7+. With bonus detailed explanation on the algorithm.
Who is this useful for?
This repo could be useful if you're implementing Ukkonen's algorithm for the first time, or if you're looking for a well-structured, basic C# implementation to use as a basis for adding more advanced features.
Additionally, I've included below an extensive explanation of the algorithm, written in a purely pragmatic, non-formal manner.
Summary
This is an implementation of suffix tree construction that runs at O(n) time complexity, compared to the O(n^2) time complexity of naïve construction, making it viable for rapidly constructing suffix trees for longer strings. But this is Wikipedia stuff, so I won't waste your time with it.
The space complexity is of course linear as well, and the footprint of nodes is relatively small, but the entity bindings use normal 64-bit references, so the overall memory use currently doesn't compete with state-of-the-art implementations. I don't feel good about this, so I'll probably refactor it soon. At any rate, if minimal memory use is the primary concern, suffix arrays provide a better solution.
This implementation is based on my own observations and multiple cycles of refactoring to simplify the design. It turned out that my approach corresponds well to the one described in a relatively recent paper by N. J. Larsson et al., notably in updating the active point mid-phase to decouple the leaf node insertion from the branching insertion.
This approach leads to a lean extension phase with little branching; in my case specifically the following:
```csharp
private void ExtendTree(char c)
{
    _chars.Add(c);
    _needSuffixLink = null;
    _position++;
    _remainder++;

    while (_remainder > 0)
    {
        if (_AP.MoveDown(c))
            break;

        if (_AP.ActiveEdge != null)
            _AP.ActiveParent = InsertSplit(_AP);

        InsertLeaf(_AP, c);
        _remainder--;

        if (_remainder > 0)
            _AP.Rescan();
    }
}
```
Implementation details
There are a multitude of ways to implement suffix trees, especially with respect to the storage model and the representation of nodes/edges. This implementation has the following characteristics:
- Central hashtable-based storage: Inspired by Kurtz's paper, but for now implemented with reference type nodes. Performs better for large alphabets than e.g. linked-list or array-based storage.
- No explicit edge representation: I chose to forgo representing edges as separate entities, to cut down on complexity and the potential added cost of dereferencing; instead, the nodes themselves store the label boundaries.
- Externally stored suffix links: Suffix links are also stored in a central dictionary, outside of the nodes. This leads to a very lean node footprint, at the cost of some additional lookup time.
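A rough sketch of this storage layout might look like the following. Note that the type and member names here are my own illustrations for this README, not the repository's actual API:

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch of the storage model described above;
// all names are hypothetical, not the repository's actual types.
class Node
{
    public int Start;   // index of the first character of the incoming edge label
    public int End;     // index past the last label character (or an 'open' sentinel)
    // No child collection and no suffix link field here — both live in
    // central dictionaries, which keeps the node footprint lean.
}

class TreeStorage
{
    // Central hashtable storage: children keyed by
    // (parent node, first character of the edge label).
    public readonly Dictionary<(Node Parent, char First), Node> Children
        = new Dictionary<(Node, char), Node>();

    // Suffix links stored externally, outside the nodes.
    public readonly Dictionary<Node, Node> SuffixLinks
        = new Dictionary<Node, Node>();
}
```

Keying children by the edge label's first character gives constant-time child lookup regardless of alphabet size, which is the main advantage over linked-list child representations.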
I plan to further refactor this implementation, to decouple the components and make them modular, add new features, try different designs, etc. – if my time allows.
Tests
The implementation is unit-tested with NUnit. The testing currently consists of a few simple static tests for build/lookup, plus dynamic tests involving randomly constructed strings and randomly selected lookup substrings, running over thousands of cycles. I don't know of a way to exhaustively test suffix tree algorithms, but hopefully this dynamic testing will prove robust enough.
Very plain English explanation of Ukkonen's algorithm
I'm using the 'very plain' prefix jokingly, because I actually struggled to follow the widely referenced 'Plain English' StackOverflow explanation. While it's certainly a decent description, it took me days (and reading through several papers) to track down and understand the 'why' and 'how' of some key aspects of the algorithm. That struggle is why I decided to write the explanation below.
In the following sections there is ZERO mention of 'Rules' and 'Observations' (who needs rules, we're all cowboys here ;)), and everything is 100% focused on actually explaining/understanding the underlying reasons.
I tried to focus on aspects I found to be vague in other sources. Thus, I think the most useful part of this text is the discussion of the differences between rescanning and using suffix links, as this is what gave me a lot of headaches in my implementation.
Please note that I'm not an expert on this subject, so the following sections may contain inaccuracies (or worse). If you encounter any, feel free to open an issue.
Basic rationale
Originally, suffix trees could be built with quadratic ($O(n^2)$) time complexity, which is quite prohibitive for a lot of uses. Ukkonen's algorithm reduces this time complexity to linear, essentially by employing a few 'tricks', although if Ukkonen saw Arrested Development, he might insist these aren't tricks, but illusions (apologies for the digression).
With this algorithm we can extend the suffix tree character by character, from start to end, e.g. a suffix tree for the string ABCD can be built incrementally by adding A, B, C, and D. The resulting tree structure is the same as if we built it naively by adding all suffixes of the string one by one, i.e. ABCD, BCD, CD, and D.
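To make that equivalence concrete, here's a minimal sketch (not part of the repo) that naively enumerates the suffixes the finished tree must represent, regardless of which construction order was used:

```csharp
using System;
using System.Linq;

// Naive enumeration of all suffixes of "ABCD" — the exact set of strings
// any valid suffix tree for "ABCD" must represent.
class SuffixDemo
{
    static void Main()
    {
        var s = "ABCD";
        var suffixes = Enumerable.Range(0, s.Length)
                                 .Select(i => s.Substring(i))
                                 .ToArray();
        Console.WriteLine(string.Join(", ", suffixes)); // ABCD, BCD, CD, D
    }
}
```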
Open-ended leaf nodes and their limitations
The most fundamental and easy to understand 'trick' is to realize we can just leave the end of the suffixes 'open', i.e. referencing the current length of the string instead of setting the end to a static value. This way when we add additional characters, those characters will be implicitly added to all suffix labels, without having to visit and update all of them.
It's worth noting, though, that you don't have to use an actual reference type for the open end position. For simplicity you can use a regular int and just designate a special value (e.g. -1) that marks the position open.
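As a quick sketch of that sentinel idea (identifiers here are illustrative, not taken from the repo): a leaf's end is resolved against the current text length, so appending characters extends every leaf label implicitly, without touching any node.

```csharp
using System;

// Illustrative sketch of the open-end sentinel described above.
class OpenEndDemo
{
    const int OpenEnd = -1; // marks a leaf's end position as 'open'

    // A leaf's effective end always tracks the current text length.
    static int EffectiveEnd(int end, int textLength)
        => end == OpenEnd ? textLength : end;

    static void Main()
    {
        int leafEnd = OpenEnd;
        Console.WriteLine(EffectiveEnd(leafEnd, 4)); // 4
        // After appending more characters, the same leaf 'grows' for free:
        Console.WriteLine(EffectiveEnd(leafEnd, 8)); // 8
    }
}
```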
But this open ending of suffixes – for obvious reasons – works only for nodes that represent the end of the string, i.e. the leaf nodes in the tree structure. The branching operations we execute on the tree (the addition of new branch nodes and leaf nodes) won't propagate automatically everywhere they need to.
It probably goes without saying that repeated substrings don't appear explicitly in the tree, since the tree already contains them by virtue of their first occurrence. However, when a repeated substring ends by encountering a non-repeating character, we need to create a branching at that point to represent the divergence from that point onwards.
For example in case of the string 'ABCXABCY' (see below), a branching to X and Y needs to be added to three different suffixes, ABC, BC and C; otherwise it wouldn't be a valid suffix tree, and we couldn't find all substrings of the string by matching characters from the root downwards.
Once again, to emphasize – any operation we execute on a suffix in the tree needs to be reflected by its consecutive suffixes as well (e.g. ABC > BC > C), otherwise they simply cease to be valid suffixes.

But even if we accept that we have to do these manual updates, how do we know how many suffixes need to be updated? When we add the repeated character A (and the rest of the repeated characters in succession), we don't yet know when or where we'll need to split the suffix into two branches. The need to split is ascertained only when we encounter the first non-repeating character, in this case Y (instead of the X that already exists in the tree).
What we can do is to match the longest repeated string we can, and count how many of its suffixes we need to update later. This is what 'remainder' stands for.
The concept of 'remainder' and 'rescanning'
The variable remainder tells us how many repeated characters we added implicitly, without branching; i.e. how many suffixes we need to visit to repeat the branching operation once we've found the first character that we cannot match. This essentially equals how many characters 'deep' we are in the tree, measured from its root.
So, staying with the previous example of the string ABCXABCY, we match the repeated ABC part 'implicitly', incrementing remainder each time, which results in a remainder of 3. Then we encounter the non-repeating character 'Y'. Here we split the previously added ABCX into ABC->X and ABC->Y. Then we decrement remainder from 3 to 2, because we already took care of the ABC branching. Now we repeat the operation by matching the last 2 characters – BC – from the root to reach the point where we need to split, and we split BCX too, into BC->X and BC->Y. Again, we decrement remainder to 1 and repeat the operation, until the remainder is 0. Lastly, we need to add the current character (Y) itself to the root as well.
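This count can be illustrated with runnable (naive, non-repo) code: a suffix of the processed text was matched 'implicitly' exactly when the same string also occurs earlier in the text, and counting those suffixes yields the remainder of 3 at the moment Y arrives:

```csharp
using System;
using System.Collections.Generic;

// Naive illustration (not the repo's code) of how 'remainder' reaches 3
// for "ABCXABCY": just before the non-matching 'Y' arrives, the suffixes
// C, BC and ABC of the processed text all occur earlier in it, so they
// were matched implicitly, without branching.
class RemainderDemo
{
    static void Main()
    {
        var processed = "ABCXABC"; // the text processed before 'Y'
        var implicitSuffixes = new List<string>();

        for (int len = 1; len <= processed.Length; len++)
        {
            var suffix = processed.Substring(processed.Length - len);
            // Implicit if the same string occurs at an earlier position:
            if (processed.IndexOf(suffix, StringComparison.Ordinal) < processed.Length - len)
                implicitSuffixes.Add(suffix);
        }

        Console.WriteLine(implicitSuffixes.Count);              // 3
        Console.WriteLine(string.Join(", ", implicitSuffixes)); // C, BC, ABC
    }
}
```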
This operation, following the consecutive suffixes from the root simply to reach the point where we need to do an operation is what's called 'rescanning' in Ukkonen's algorithm, and typically this
