Cello
A string library
Install / Use
/learn @andreaferretti/CelloREADME
Cello

Cello is a library of succinct data structures, oriented in particular for string searching and other string operations.
Usually, searching for patterns in a string takes O(n) time, where n is
the length of the string. Indices can speedup the search, but take additional
space, which can be costly for very large strings. A data structure is called
succinct when it takes n + o(n) space, where n is the space needed to store
the data anyway. Hence succinct data structures can provide additional
operations with limited space overhead.
It turns out that strings admit succinct indices, which do not take much more
space than the string itself, but allow for O(k) substring search, where k
is the length of the substring. Usually, this is much shorter, and this
considerably improves search times. Cello provide such indices and many other
related string operations.
An example of usage would be:
let
x = someLongString
pattern = someShortString
index = searchIndex(x)
positions = index.search(pattern)
echo positions
Many intermediate data structures are constructed to provide such indices, though, and as they may be of independent interest, we describe them in the following.
Notice that a string here just stands for a (usually very long) sequence of symbols taken from a (usually small) alphabet. Prototypical examples include
- genomic data, where the alphabet is
A, C, G, Tor - time series, where each value is represented by a symbol, such as
HIGH,MEDIUM,LOW, orUP,DOWN - where only two values are available, it is often convenient to store the data as bit sequences to save space.
At the moment all operations are implemented on
type AnyString = string or seq[char] or Spill[char]
where spills are just memory-mapped sequences. The library may become generic in the future, although this is not a priority.
Notice that Cello is not Unicode-aware: think more of searching large genomic strings or symbolized time series, rather then using it for internationalized text, although I may consider Unicode operations in the future.
Versions
Cello recent version (>= 0.2) requires Nim >= 0.20. For usage with Nim up to 0.19.4, use Cello 0.1.6.
Basic operations
The most common operations that we implement on various kind of sequence data
are rank and select. We first describe them for sequences of bits, which are
the foundation we use to store more complex kind of data.
For bit sequences, rank(i) counts the number of 1 bits in the first i
places. The number of 0 bits can easily be obtained as i - rank(i). Viceversa,
select(i) finds the position of the i-th 1 bit in the sequence. In this
case, there is not an obvious relation to the position of the i-th 0 bit,
so we provide a similar operation select0(i).
To ensure that rank(select(i)) == i, we define select(i) to be 1-based,
that is, we count bits starting from 1.
As a reference, we implement rank and select on Nim built-in sets, so
that for instance the following is valid:
let x = { 13..27, 35..80 }
echo x.rank(16) # 3
echo x.select(3) # 16
More generally, one can define 'rank' and select for sequence of symbols
taken from a finite alphabet, relative to a certain symbol. Here, rank(c, i)
is the number of symbols equal to c among the first i symbols, and
select(c, i) is the position of the i-th symbol c in the sequence.
Again, we give a reference implementation for strings, so that the following is valid:
let x = "ABRACADABRA"
echo x.rank('A', 8) # 4
echo x.select('A', 4) # 8
Notice that in both cases, the implementation of rank and select is a
naive implementation which takes O(i) operations. More sophisticated data
structures allow to perform similar operations in constant (for rank) or
logarithmic (for select) time, by using indices. Succinct data structures
allow to do this using indices that take at most o(n) space in addition
to the sequence data itself, where n is the sequence length.
Data structures
We now describe the succinct data structures that will generalize the bitset and the string examples above. In doing so, we also need a few intermediate data structures that may be of independent interest.
Bit arrays
Bit arrays are a generalization of Nim default set collections. They can
be seen as an ordered sequence of bool, which are actually backed by a
seq[int]. We implement random access - both read and write - as well as
naive rank and select. An example follows:
var x = bits(13..27, 35..80)
echo x[12] # false
echo x[13] # true
x[12] = true # or incl(x, 12)
echo x[12] # true
x[12] = false
echo x.rank(16) # 3
echo x.select(3) # 16
echo x.select0(30) # 90
Int arrays
Int arrays are just integer sequences of fixed length. What distinguishes
them by the various types seq[uint64], seq[uint32], seq[uint16], seq[uint8]
is that the integers can have any length, such as 23.
They are backed by a bit array, and can be used to store many integer numbers
of which an upper bound is known without wasting space. For instance, a sequence
of positive numbers less that 512 can be backed by an int array where each
number has size 9. Using a seq[uint16] would almost double the space
consumption.
Most sequence operations are available, but they cannot go after the initial capacity. Here is an example:
var x = ints(200, 13) # 200 ints at most 2^13 - 1
x.add(123)
x.add(218)
x.add(651)
echo x[2] # 651
x[12] = 1234
echo x[12] # 1234
echo x.len # 13
echo x.capacity # 200
RRR
The RRR bit vector is the first of our collections
that is actually succinct. It consists of a bit arrays, plus two int arrays
that stores rank(i) values for various i, at different scales.
It can be created after a bit array, and allows constant time rank and
logarithmic time select and select0.
let b: BitArray = ...
let r = rrr(b)
echo r.rank(123456)
echo r.select(123456)
echo r.select0(123456)
To convince oneself that the structure really is succinct, stats(rrr) returns
a data structures that shows the space taken (in bits) by the bit array, as
well as the two auxiliary indices.
Wavelet tree
The wavelet tree is a tree constructed in the following way. An input string over a finite alphabet is given. The alphabet is split in two parts - the left and the right one, call them L and R.
For each character of the string, we use a 1 bit to denote that the character belongs to R and a 0 bit to denote that it belongs to L. In this way, we obtain a bit sequence. The node stores the bit sequence as an RRR structures, and has two children: the one to the left is the wavelet tree associated to the substring composed by the characters in L, taken in order, and similarly for the right child.
This structure allows to compute rank(c, i), where c is a character in the
alphabet, in time O(log(l)), and select(c, i) in time O(log(l)log(n))
where l is the size of the alphabet and n is the size of the string.
It also allows O(log(l)) random access to read elements of the string.
It can be used as follows:
let
x = "ACGGTACTACGAGAGTAGCAGTTTAGCGTAGCATGCTAGCG"
w = waveletTree(x)
echo x.rank('A', 20) # 7
echo x.select('A', 7) # 20
echo x[12] # 'G'
Rotated strings
The next ingredient that we need it the Burrows-Wheeler transform of a string. It can be implemented using string rotations, so that's what we implement first. It turns out that this implementation is too slow for our purposes, but rotated strings may be useful anyway, so we left them in.
A rotated strings is just a view over a string, rotated by a certain amount
and wrapping around the end of the string. If the underlying string is a var,
our implementation reuses that memory (which is then shared) to avoid the
copy of the string. We just implement random access and printing:
var
s = "The quick brown fox jumps around the lazy dog"
t = s.rotate(20)
echo t[10] # n
echo t[20] # u
t[18] = e
echo s # The quick brown fox jumps around the lezy dog
echo t # jumps around the lezy dogThe quick brown fox
Suffix array
The suffix array of a string is a permutation of the numbers from 0 up to the
string length excluded. The permutation is obtained by considering, for each
i, the suffix starting at i, and sorting these strings in lexicographical
order. The resulting order is the suffix array.
Here the suffix array is represented as an IntArray. It can be obtained as follows:
let
x = "this is a test."
y = suffixArray(x)
echo y # @[7, 4, 9, 14, 8, 11, 1, 5, 2, 6, 3, 12, 13, 10, 0]
Sorting the indices may be a costly operation. One can use the fact that the suffixes of a string are a quite special collection to produce more efficient algorithms. Other than the sort-based one, we offer the DC3 algorithm.
Notice that at the moment DC3 is not really optimized and may be neither space nor time efficient.
To use an alternative algorithm, just pass an additional parameter, of type
type SuffixArrayAlgorithm* {.pure.} = enum
Sort, DC3
like this
let
x = "this is a test."
y = suffixArray(x, SuffixArrayAlgorithm.DC3)
echo y # @[7, 4, 9, 14, 8, 11, 1, 5, 2, 6, 3, 12, 13, 10, 0]
Burrows-Wheeler transform
The [Burrows-Wheeler transform](h
