Xcdat: Fast compressed trie dictionary library
Xcdat is a C++17 header-only library that implements a fast compressed string dictionary based on an improved double-array trie structure described in the paper: Compressed double-array tries for string dictionaries supporting fast lookup, Knowledge and Information Systems, 2017, available here.
Table of contents
- Features
- Build instructions
- Command line tools
- Sample usage
- API
- Performance
- Licensing
- Todo
- References
Features
- Compressed string dictionary. Xcdat implements a (static) compressed string dictionary that stores a set of strings (or keywords) in a compressed space while supporting several search operations [1,2]. For example, Xcdat can store an entire set of English Wikipedia titles at half the size of the raw data. (See Performance)
- Fast and compact data structure. Xcdat employs the double-array trie [3] known as the fastest trie implementation. However, the double-array trie resorts to many pointers and consumes a large amount of memory. To address this, Xcdat applies the XCDA method [2] that represents the double-array trie in a compressed format while maintaining the fast searches.
- Cache efficiency. Xcdat employs a minimal-prefix trie [4] that replaces redundant trie nodes with strings to reduce random access and to improve locality of reference.
- Dictionary encoding. Xcdat maps N distinct keywords into unique IDs from [0,N-1], and supports the two symmetric operations: lookup returns the ID corresponding to a given keyword; decode returns the keyword associated with a given ID. The mapping is so-called dictionary encoding (or domain encoding) and is fundamental in many DB applications as described by Martínez-Prieto et al. [1] or Müller et al. [5].
- Prefix search operations. Xcdat supports prefix search operations realized by trie search algorithms: prefix_search returns all the keywords contained as prefixes of a given string; predictive_search returns all the keywords starting with a given string. These will be useful in many NLP applications such as auto completions [6], stemmed searches [7], or input method editors [8].
- 64-bit support. As mentioned before, since the double array is a pointer-based data structure, most double-array libraries use 32-bit pointers to reduce memory consumption, which limits the scale of the input dataset. On the other hand, the XCDA method allows Xcdat to represent 64-bit pointers without sacrificing memory efficiency.
- Binary key support. In normal mode, Xcdat will use the \0 character as an end marker for each keyword. However, if the dataset includes \0 characters, it will use bit flags instead of end markers, allowing the dataset to consist of binary keywords.
- Memory mapping. Xcdat supports memory mapping, allowing data to be deserialized quickly without loading it into memory. Of course, deserialization by loading is also supported.
- Header only. The library consists only of header files, and you can easily install it.
- Python binding. You can use Xcdat in Python3 via pybind11. (Visit the directory pybind)
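To make the double-array idea from the feature list more concrete, here is a minimal sketch of a plain, uncompressed double-array trie with BASE and CHECK arrays. It is not Xcdat's implementation: the class name ToyDoubleArray, the first-fit construction, and the use of label 0 as the end marker are assumptions made for illustration, and neither the XCDA compression [2] nor the minimal-prefix trie [4] is shown.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Marks an unused slot in the CHECK array (illustrative choice).
constexpr std::size_t kFree = static_cast<std::size_t>(-1);

// Hypothetical toy double-array trie, NOT Xcdat's internal layout.
// The child of node s by label c lives at base[s] + c, and the
// transition is valid only if check[base[s] + c] == s.
class ToyDoubleArray {
  public:
    // sorted_keys must be sorted, unique, and free of embedded '\0'.
    explicit ToyDoubleArray(const std::vector<std::string>& sorted_keys)
        : base_(1, 0), check_(1, kFree) {  // slot 0 is the root
        if (!sorted_keys.empty()) build(sorted_keys, 0, sorted_keys.size(), 0, 0);
    }

    bool contains(const std::string& key) const {
        std::size_t node = 0;
        for (std::size_t i = 0; i <= key.size(); ++i) {
            // After the last byte, follow the end marker (label 0).
            const unsigned char c = i < key.size() ? static_cast<unsigned char>(key[i]) : 0;
            const std::size_t child = base_[node] + c;
            if (child >= check_.size() || check_[child] != node) return false;
            node = child;
        }
        return true;
    }

  private:
    void build(const std::vector<std::string>& keys, std::size_t lo, std::size_t hi,
               std::size_t depth, std::size_t node) {
        // Group the key range [lo, hi) by the label at position depth.
        std::vector<unsigned char> labels;
        std::vector<std::size_t> starts;
        for (std::size_t i = lo; i < hi; ++i) {
            const unsigned char c =
                depth < keys[i].size() ? static_cast<unsigned char>(keys[i][depth]) : 0;
            if (labels.empty() || labels.back() != c) {
                labels.push_back(c);
                starts.push_back(i);
            }
        }
        starts.push_back(hi);

        // First-fit: the smallest base >= 1 whose child slots are all free.
        std::size_t b = 1;
        for (;; ++b) {
            bool ok = true;
            for (unsigned char c : labels) {
                const std::size_t slot = b + c;
                if (slot < check_.size() && check_[slot] != kFree) { ok = false; break; }
            }
            if (ok) break;
        }
        base_[node] = b;

        // Reserve every child slot before descending.
        for (unsigned char c : labels) {
            const std::size_t slot = b + c;
            if (slot >= check_.size()) {
                base_.resize(slot + 1, 0);
                check_.resize(slot + 1, kFree);
            }
            check_[slot] = node;
        }
        // Label 0 is the end marker: its child is a leaf, so do not recurse.
        for (std::size_t k = 0; k < labels.size(); ++k) {
            if (labels[k] != 0) build(keys, starts[k], starts[k + 1], depth + 1, b + labels[k]);
        }
    }

    std::vector<std::size_t> base_;
    std::vector<std::size_t> check_;
};
```

Each step of a lookup is one addition and one comparison on two arrays, which is where the speed comes from; the price is one BASE and one CHECK entry per slot, which is the memory overhead the feature list refers to.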
Build instructions
You can download, compile, and install Xcdat with the following commands.
$ git clone https://github.com/kampersanda/xcdat.git
$ cd xcdat
$ mkdir build
$ cd build
$ cmake ..
$ make -j
$ make install
Or, since this library consists only of header files, you can simply add the directory include to your compiler's include path.
Requirements
You need to install a modern C++17 ready compiler such as g++ >= 7.0 or clang >= 4.0. For the build system, you need to install CMake >= 3.0 to compile the library.
The library requires that std::uint64_t exists. (This is true for nearly any target, even 32-bit ones.) The code has been tested only on Mac OS X and Linux. That is, the library targets only UNIX-compatible operating systems.
Python binding
Xcdat supports the Python binding via pybind11. The description can be found in the directory pybind.
Command line tools
Xcdat provides command line tools to build the dictionary and perform searches, which are inspired by marisa-trie. Each tool prints its command line options when run with the parameter -h.
The tools employ the external libraries cmd_line_parser, mm_file, and tinyformat, which are contained in the repository.
xcdat_build
It builds the trie dictionary from a given dataset consisting of keywords separated by newlines. The following command builds the trie dictionary from dataset enwiki-titles.txt and writes the dictionary into file dic.bin.
$ xcdat_build enwiki-titles.txt dic.bin
Number of keys: 15955763
Number of trie nodes: 36439320
Number of DA units: 36515840
Memory usage in bytes: 1.64104e+08
Memory usage in MiB: 156.502
xcdat_lookup
It tests the lookup operation for a given dictionary. Given a query string via stdin, it prints the associated ID if found, or -1 otherwise.
$ xcdat_lookup dic.bin
Algorithm
1255938 Algorithm
Double_Array
-1 Double_Array
xcdat_decode
It tests the decode operation for a given dictionary. Given a query ID via stdin, it prints the corresponding keyword if the ID is in the range [0,N-1], where N is the number of stored keywords.
$ xcdat_decode dic.bin
1255938
1255938 Algorithm
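The symmetry between lookup and decode can be illustrated without the library. The following sketch is a stand-in built on a sorted std::vector, where the ID of a keyword is simply its rank; the class name ToyDictionary is hypothetical, and Xcdat itself assigns IDs through its trie, so the actual ID values will generally differ from these ranks.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for dictionary encoding (not Xcdat's API):
// N distinct keywords are mapped to IDs in [0, N-1] by sorted rank.
class ToyDictionary {
  public:
    explicit ToyDictionary(std::vector<std::string> keys) : keys_(std::move(keys)) {
        std::sort(keys_.begin(), keys_.end());
        keys_.erase(std::unique(keys_.begin(), keys_.end()), keys_.end());
    }

    // Returns the ID of key, or std::nullopt if key is not stored.
    std::optional<std::uint64_t> lookup(const std::string& key) const {
        const auto it = std::lower_bound(keys_.begin(), keys_.end(), key);
        if (it == keys_.end() || *it != key) return std::nullopt;
        return static_cast<std::uint64_t>(it - keys_.begin());
    }

    // Returns the keyword for id, or std::nullopt if id is outside [0, N-1].
    std::optional<std::string> decode(std::uint64_t id) const {
        if (id >= keys_.size()) return std::nullopt;
        return keys_[id];
    }

  private:
    std::vector<std::string> keys_;
};
```

For every stored keyword k, decode(*lookup(k)) yields k again, and for every valid ID i, lookup(*decode(i)) yields i; that round trip is the property the two command line tools above exercise.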
xcdat_prefix_search
It tests the prefix_search operation for a given dictionary. Given a query string via stdin, it prints all the keywords contained as prefixes of a given string.
$ xcdat_prefix_search dic.bin
Algorithmic
6 found
57 A
798460 Al
1138004 Alg
1253024 Algo
1255938 Algorithm
1255931 Algorithmic
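The semantics of prefix_search can be written down in a few lines over a sorted key list. The sketch below tests each prefix of the query with a binary search; the function name toy_prefix_search is hypothetical, and a trie would find the same keywords in a single walk along the query instead of repeated searches.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical illustration (not Xcdat's API): returns every stored keyword
// that is a prefix of query, from shortest to longest.
// sorted_keys must be sorted in ascending order.
std::vector<std::string> toy_prefix_search(const std::vector<std::string>& sorted_keys,
                                           const std::string& query) {
    std::vector<std::string> hits;
    for (std::size_t len = 0; len <= query.size(); ++len) {
        const std::string prefix = query.substr(0, len);
        if (std::binary_search(sorted_keys.begin(), sorted_keys.end(), prefix)) {
            hits.push_back(prefix);
        }
    }
    return hits;
}
```

With the keywords A, Al, Alg, Algo, Algorithm, Algorithmic, and Algorithms stored, the query Algorithmic matches the first six and skips Algorithms, mirroring the shape of the tool output above.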
xcdat_predictive_search
It tests the predictive_search operation for a given dictionary. Given a query string via stdin, it prints the first n keywords starting with a given string, where n is one of the parameters.
$ xcdat_predictive_search dic.bin -n 3
Algorithm
263 found
1255938 Algorithm
1255944 Algorithm's_optimality
1255972 Algorithm_(C++)
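predictive_search is the mirror image: instead of keywords that are prefixes of the query, it collects keywords that start with the query. A minimal sketch over a sorted key list follows; the names ToyPredictiveResult and toy_predictive_search are hypothetical, and the sketch reports both the total number of matches and the first n of them, as the tool output above does.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result type: total match count plus the first n matches.
struct ToyPredictiveResult {
    std::size_t total = 0;
    std::vector<std::string> first;
};

// Hypothetical illustration (not Xcdat's API): finds the keywords that start
// with prefix. sorted_keys must be sorted in ascending order, so all matches
// form one contiguous run beginning at lower_bound(prefix).
ToyPredictiveResult toy_predictive_search(const std::vector<std::string>& sorted_keys,
                                          const std::string& prefix, std::size_t n) {
    ToyPredictiveResult result;
    auto it = std::lower_bound(sorted_keys.begin(), sorted_keys.end(), prefix);
    for (; it != sorted_keys.end() && it->compare(0, prefix.size(), prefix) == 0; ++it) {
        if (result.first.size() < n) result.first.push_back(*it);
        ++result.total;
    }
    return result;
}
```

In a trie, the same result is obtained by walking down to the node reached by the query and enumerating its subtree, so the cost does not depend on how many unrelated keywords are stored.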
xcdat_enumerate
It prints all the keywords stored in a given dictionary.
$ xcdat_enumerate dic.bin | head -3
0 !
107 !!
138 !!!
xcdat_benchmark
Xcdat provides the four dictionary types defined in xcdat.hpp. The tool measures their performance for a given dataset. To perform search operations, it randomly samples n queries from the dataset, where n is one of the parameters. It will help you determine the dictionary type.
$ xcdat_benchmark enwiki-titles.txt
** xcdat::trie_7_type **
Number of keys: 15955763
Memory usage in bytes: 1.70618e+08
Memory usage in MiB: 162.714
Construction time in seconds: 13.501
Lookup time in microsec/query: 0.5708
Decode time in microsec/query: 1.0846
** xcdat::trie_8_type **
Number of keys: 15955763
Memory usage in bytes: 1.64104e+08
Memory usage in MiB: 156.502
Construction time in seconds: 13.626
Lookup time in microsec/query: 0.6391
Decode time in microsec/query: 1.0531
** xcdat::trie_15_type **
Number of keys: 15955763
Memory usage in bytes: 2.05737e+08
Memory usage in MiB: 196.206
Construction time in seconds: 13.425
Lookup time in microsec/query: 0.3613
Decode time in microsec/query: 0.7044
** xcdat::trie_16_type **
Number of keys: 15955763
Memory usage in bytes: 2.15935e+08
Memory usage in MiB: 205.932
Construction time in seconds: 13.704
Lookup time in microsec/query: 0.3832
Decode time in microsec/query: 0.8362
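A figure such as "Lookup time in microsec/query" can be obtained along the following lines: sample n queries at random from the dataset, time the lookups, and divide by n. This is only a sketch of the measurement idea, not the code of xcdat_benchmark; the names ToyTiming and toy_lookup_time, the fixed seed, and the callable lookup parameter are assumptions for illustration.

```cpp
#include <chrono>
#include <cstddef>
#include <random>
#include <string>
#include <vector>

// Hypothetical result of one timing run.
struct ToyTiming {
    double microsec_per_query = 0.0;
    std::size_t found = 0;  // number of queries for which lookup succeeded
};

// Hypothetical harness (not xcdat_benchmark itself): samples n queries
// uniformly at random from keys and times lookup(query) over all of them.
// lookup must be callable with a std::string and return something
// convertible to bool.
template <class Lookup>
ToyTiming toy_lookup_time(const std::vector<std::string>& keys, std::size_t n,
                          Lookup lookup, unsigned seed = 13) {
    ToyTiming timing;
    if (keys.empty() || n == 0) return timing;

    // Draw the queries before starting the clock so that sampling is not timed.
    std::mt19937_64 rng(seed);
    std::uniform_int_distribution<std::size_t> pick(0, keys.size() - 1);
    std::vector<std::string> queries;
    queries.reserve(n);
    for (std::size_t i = 0; i < n; ++i) queries.push_back(keys[pick(rng)]);

    const auto start = std::chrono::steady_clock::now();
    for (const std::string& q : queries) {
        // Counting hits keeps the compiler from discarding the loop.
        if (lookup(q)) ++timing.found;
    }
    const auto stop = std::chrono::steady_clock::now();

    timing.microsec_per_query =
        std::chrono::duration<double, std::micro>(stop - start).count() / static_cast<double>(n);
    return timing;
}
```

Because the queries are drawn from the stored keywords, every lookup should succeed, so found == n doubles as a sanity check on the dictionary under test.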
Sample usage
sample/sample.cpp provides a sample usage.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>
#include <xcdat.hpp>
int main() {
// Dataset of keywords
std::vector<std::string> keys = {
"AirPods", "AirTag", "Mac", "MacBook", "MacBook_Air", "MacBook_Pro",
"Mac_Mini", "Mac_Pro", "iMac", "iPad", "iPhone", "iPhone_SE",
};
// The input keys must be sorted and unique (already satisfied in this case).
std::sort(keys.begin(), keys.end());
keys.erase(std::unique(keys.begin(), keys.end()), keys.end());
// The trie dictionary type from the four types
using trie_type = xcdat::trie_8_type;
// using trie_type = xcdat::trie_16_type;
// using trie_type = xcdat::trie_7_type;
// using trie_type = xcdat::trie_15_type;
// The dictionary filename
const char* tmp_filename = "dic.bin";
// Build and save the trie dictionary.
try {
const trie_type trie(keys);
xcdat::save(trie, tmp_filename);
} catch (const xcdat::exception& ex) {
std::cerr << ex.what() << std::endl;
return 1;
}
// Load the trie dictionary on memory.
const auto trie = xcdat::load<trie_type>(tmp_filename);
// Or, you can set the continuous memory block via a memory-mapped file.
// const auto trie = xcdat::mmap<trie_type>(mapped_data);
// Basic statistics
std::cout << "Number of keys: " << trie.num_keys() << std::endl;
std::cout << "Number of trie nodes: " << trie.num_nodes() << std::endl;
std::cout << "Number of DA units: " << trie.num_units() << std::endl;
std::cout << "Memory usage in bytes: " << xcdat::memory_in_bytes(trie) << std::endl;
// Lookup the ID for a query key.
{
const auto id = trie.lookup("Mac_Pro");
