Xcdat: Fast compressed trie dictionary library

Xcdat is a C++17 header-only library of a fast compressed string dictionary based on an improved double-array trie structure described in the paper: Compressed double-array tries for string dictionaries supporting fast lookup, Knowledge and Information Systems, 2017, available here.

Features
Build instructions
Command line tools
Sample usage
API
Performance
Licensing
Todo
References

Features

Compressed string dictionary. Xcdat implements a (static) compressed string dictioanry that stores a set of strings (or keywords) in a compressed space while supporting several search operations [1,2]. For example, Xcdat can store an entire set of English Wikipedia titles at half the size of the raw data. (See Performance)
Fast and compact data structure. Xcdat employs the double-array trie [3] known as the fastest trie implementation. However, the double-array trie resorts to many pointers and consumes a large amount of memory. To address this, Xcdat applies the XCDA method [2] that represents the double-array trie in a compressed format while maintaining the fast searches.
Cache efficiency. Xcdat employs a minimal-prefix trie [4] that replaces redundant trie nodes into strings to reduce random access and to improve locality of references.
Dictionary encoding. Xcdat maps N distinct keywords into unique IDs from [0,N-1], and supports the two symmetric operations: lookup returns the ID corresponding to a given keyword; decode returns the keyword associated with a given ID. The mapping is so-called dictionary encoding (or domain encoding) and is fundamental in many DB applications as described by Martínez-Prieto et al [1] or Müller et al. [5].
Prefix search operations. Xcdat supports prefix search operations realized by trie search algorithms: prefix_search returns all the keywords contained as prefixes of a given string; predictive search returns all the keywords starting with a given string. These will be useful in many NLP applications such as auto completions [6], stemmed searches [7], or input method editors [8].
64-bit support. As mentioned before, since the double array is a pointer-based data structure, most double-array libraries use 32-bit pointers to reduce memory consumption, resulting in limiting the scale of the input dataset. On the other hand, the XCDA method allows Xcdat to represent 64-bit pointers without sacrificing memory efficiency.
Binary key support. In normal mode, Xcdat will use the \0 character as an end marker for each keyword. However, if the dataset include \0 characters, it will use bit flags instead of end markers, allowing the dataset to consist of binary keywords.
Memory mapping. Xcdat supports memory mapping, allowing data to be deserialized quickly without loading it into memory. Of course, deserialization by the loading is also supported.
Header only. The library consists only of header files, and you can easily install it.
Python binding. You can use Xcdat in Python3 via pybind11. (Visit the directory pybind)

Build instructions

You can download, compile, and install Xcdat with the following commands.

$ git clone https://github.com/kampersanda/xcdat.git
$ cd xcdat
$ mkdir build
$ cd build
$ cmake ..
$ make -j
$ make install

Or, since this library consists only of header files, you can easily install it by passing the include path to the directory include.

Requirements

You need to install a modern C++17 ready compiler such as g++ >= 7.0 or clang >= 4.0. For the build system, you need to install CMake >= 3.0 to compile the library.

The library requires that std::uint64_t exists. (This is true for nearly any target, even 32-bit ones.) The code has been tested only on Mac OS X and Linux. That is, this library considers only UNIX-compatible OS.

Python binding

Xcdat supports the Python binding via pybind11. The description can be found in the directory pybind.

Command line tools

Xcdat provides command line tools to build the dictionary and perform searches, which are inspired by marisa-trie. All the tools will print the command line options by specifying the parameter -h.

The tools employ the external libraries cmd_line_parser, mm_file, and tinyformat, which are contained in the repository.

`xcdat_build`

It builds the trie dictionary from a given dataset consisting of keywords separated by newlines. The following command builds the trie dictionary from dataset enwiki-titles.txt and writes the dictionary into file dic.bin.

$ xcdat_build enwiki-titles.txt dic.bin
Number of keys: 15955763
Number of trie nodes: 36439320
Number of DA units: 36515840
Memory usage in bytes: 1.64104e+08
Memory usage in MiB: 156.502

`xcdat_lookup`

It tests the lookup operation for a given dictionary. Given a query string via stdin, it prints the associated ID if found, or -1 otherwise.

$ xcdat_lookup dic.bin
Algorithm
1255938	Algorithm
Double_Array
-1	Double_Array

`xcdat_decode`

It tests the decode operation for a given dictionary. Given a query ID via stdin, it prints the corresponding keyword if the ID is in the range [0,N-1], where N is the number of stored keywords.

$ xcdat_decode dic.bin
1255938
1255938	Algorithm

`xcdat_prefix_search`

It tests the prefix_search operation for a given dictionary. Given a query string via stdin, it prints all the keywords contained as prefixes of a given string.

$ xcdat_prefix_search dic.bin
Algorithmic
6 found
57	A
798460	Al
1138004	Alg
1253024	Algo
1255938	Algorithm
1255931	Algorithmic

`xcdat_predictive_search`

It tests the predictive_search operation for a given dictionary. Given a query string via stdin, it prints the first n keywords starting with a given string, where n is one of the parameters.

$ xcdat_predictive_search dic.bin -n 3
Algorithm
263 found
1255938	Algorithm
1255944	Algorithm's_optimality
1255972	Algorithm_(C++)

`xcdat_enumerate`

It prints all the keywords stored in a given dictionary.

$ xcdat_enumerate dic.bin | head -3
0	!
107	!!
138	!!!

`xcdat_benchmark`

Xcdat provides the four dictionary types defined in xcdat.hpp. The tool measures the performances of them for a given dataset. To perform search operations, it randomly samples n queires from the dataset, where n is one of the parameters. It will help you determine the dictionary type.

$ xcdat_benchmark enwiki-titles.txt
** xcdat::trie_7_type **
Number of keys: 15955763
Memory usage in bytes: 1.70618e+08
Memory usage in MiB: 162.714
Construction time in seconds: 13.501
Lookup time in microsec/query: 0.5708
Decode time in microsec/query: 1.0846
** xcdat::trie_8_type **
Number of keys: 15955763
Memory usage in bytes: 1.64104e+08
Memory usage in MiB: 156.502
Construction time in seconds: 13.626
Lookup time in microsec/query: 0.6391
Decode time in microsec/query: 1.0531
** xcdat::trie_15_type **
Number of keys: 15955763
Memory usage in bytes: 2.05737e+08
Memory usage in MiB: 196.206
Construction time in seconds: 13.425
Lookup time in microsec/query: 0.3613
Decode time in microsec/query: 0.7044
** xcdat::trie_16_type **
Number of keys: 15955763
Memory usage in bytes: 2.15935e+08
Memory usage in MiB: 205.932
Construction time in seconds: 13.704
Lookup time in microsec/query: 0.3832
Decode time in microsec/query: 0.8362

Sample usage

sample/sample.cpp provides a sample usage.

#include <iostream>
#include <string>

#include <xcdat.hpp>

int main() {
    // Dataset of keywords
    std::vector<std::string> keys = {
        "AirPods",  "AirTag",  "Mac",  "MacBook", "MacBook_Air", "MacBook_Pro",
        "Mac_Mini", "Mac_Pro", "iMac", "iPad",    "iPhone",      "iPhone_SE",
    };

    // The input keys must be sorted and unique (already satisfied in this case).
    std::sort(keys.begin(), keys.end());
    keys.erase(std::unique(keys.begin(), keys.end()), keys.end());

    // The trie dictionary type from the four types
    using trie_type = xcdat::trie_8_type;
    // using trie_type = xcdat::trie_16_type;
    // using trie_type = xcdat::trie_7_type;
    // using trie_type = xcdat::trie_15_type;

    // The dictionary filename
    const char* tmp_filename = "dic.bin";

    // Build and save the trie dictionary.
    try {
        const trie_type trie(keys);
        xcdat::save(trie, tmp_filename);
    } catch (const xcdat::exception& ex) {
        std::cerr << ex.what() << std::endl;
        return 1;
    }

    // Load the trie dictionary on memory.
    const auto trie = xcdat::load<trie_type>(tmp_filename);

    // Or, you can set the continuous memory block via a memory-mapped file.
    // const auto trie = xcdat::mmap<trie_type>(mapped_data);

    // Basic statistics
    std::cout << "Number of keys: " << trie.num_keys() << std::endl;
    std::cout << "Number of trie nodes: " << trie.num_nodes() << std::endl;
    std::cout << "Number of DA units: " << trie.num_units() << std::endl;
    std::cout << "Memory usage in bytes: " << xcdat::memory_in_bytes(trie) << std::endl;

    // Lookup the ID for a query key.
    {
        const auto id = trie.lookup("Mac_Pro");

Xcdat

Install / Use

README