LSHBOX
A c++ toolbox of locality-sensitive hashing (LSH), provides several popular LSH algorithms, also support python and matlab.
Install / Use
/learn @RSIA-LIESMARS-WHU/LSHBOXREADME
LSHBOX-0.9
A C++ Toolbox of Locality-Sensitive Hashing for Large Scale Image Retrieval, Also Support Python and MATLAB.
Change Log
- 2015.07.04
A new LSH method, K-means Based Double-Bit Quantization for Hashing (KDBQ), was added into LSHBOX-0.9 on July 4th, 2015. We implement KDBQ by C++ but also provide MATLAB interface. And the Python interface will be added into LSHBOX-0.9 later. Other files related to KDBQ have been updated synchronously.
- 2015.06.04
A new LSH method, Double-Bit Quantization Hashing (DBQ), was added into LSHBOX-0.9 on June 4th, 2015. We implement DBQ by C++ but also provide MATLAB interface. And the Python interface will be added into LSHBOX-0.9 later. Other files related to DBQ have been updated synchronously.
Chapter 1 - Introduction
Locality-Sensitive Hashing (LSH) is an efficient method for large scale image retrieval, and it achieves great performance in approximate nearest neighborhood searching.
LSHBOX is a simple but robust C++ toolbox that provides several LSH algrithms, in addition, it can be integrated into Python and MATLAB languages. The following LSH algrithms have been implemented in LSHBOX, they are:
- LSH Based on Random Bits Sampling
- Random Hyperplane Hashing
- LSH Based on Thresholding
- LSH Based on p-Stable Distributions
- Spectral Hashing (SH)
- Iterative Quantization (ITQ)
- Double-Bit Quantization Hashing (DBQ)
- K-means Based Double-Bit Quantization Hashing (KDBQ)
There are two repositories for compilation and performance tests, they are:
- LSHBOX-3rdparty: 3rdparty of LSHBOX, it is for compilation
- LSHBOX-sample datasets: a dataset for performance tests
In addition, File-Based-ITQ is an File Based ITQ example for LSHBOX.
Part of the code depends on the C++11, So I think your compiler should support this feature. We tested LSHBOX with VS2010 in Windows 7/8 32bit/64bit and with g++ in Linux, Mac test will be done in the future. We hope that there are more people that join in the test or contribute more algrithms.
Please feel free to contact us [tanggefu@gmail.com, 20288ly@sina.cn or xiaozf@gmail.com] if you have any questions.
Chapter 2 - Compilation
LSHBOX is written by C++. And it also can be easily used in many contexts through the Python and MATLAB bindings provided with this toolbox.
LSHBOX is simple and easy to use. If you want to integrate LSHBOX into you application, it don't need compile. You only need to add the include directory or modify the program search path, then you can use this library directly in C, C++, Python or MATLAB.
If you want to test or contribute, CMAKE, a cross-platform, open-source build system, is usded to build some tools for the purpose. CMake can be downloaded from CMake' website.
In some cases, if you want or need to compile it by yourself with Python and MATLAB, please delete the comment of the last two lines in file CMakeLists.txt, and you will find the compiling progress of python must rely on Boost library or some part of this library. For more detailed information, you can view the document ./python/README.md.
During compilation, create a new directory named build in the main directory, then choose a appropriate compiler and switch to the build directory, finally, execute the following command according to your machine:
- Windows
cmake -DCMAKE_BUILD_TYPE=Release .. -G"NMake Makefiles"
nmake
- Linux & OS X
cmake ..
make
Chapter 3 - Usage
This chapter contains small examples of how to use the LSHBOX library from different programming languages (C++, Python and MATLAB).
For C++
/**
* @file itqlsh_test.cpp
*
* @brief Example of using Iterative Quantization LSH index for L2 distance.
*/
#include <lshbox.h>
int main(int argc, char const *argv[])
{
if (argc != 4)
{
std::cerr << "Usage: ./itqlsh_test data_file lsh_file benchmark_file" << std::endl;
return -1;
}
std::cout << "Example of using Iterative Quantization" << std::endl << std::endl;
typedef float DATATYPE;
std::cout << "LOADING DATA ..." << std::endl;
lshbox::timer timer;
lshbox::Matrix<DATATYPE> data(argv[1]);
std::cout << "LOAD TIME: " << timer.elapsed() << "s." << std::endl;
std::cout << "CONSTRUCTING INDEX ..." << std::endl;
timer.restart();
std::string file(argv[2]);
bool use_index = false;
lshbox::itqLsh<DATATYPE> mylsh;
if (use_index)
{
mylsh.load(file);
}
else
{
lshbox::itqLsh<DATATYPE>::Parameter param;
param.M = 521;
param.L = 5;
param.D = data.getDim();
param.N = 8;
param.S = 100;
param.I = 50;
mylsh.reset(param);
mylsh.train(data);
mylsh.hash(data);
}
mylsh.save(file);
std::cout << "CONSTRUCTING TIME: " << timer.elapsed() << "s." << std::endl;
std::cout << "LOADING BENCHMARK ..." << std::endl;
timer.restart();
lshbox::Matrix<DATATYPE>::Accessor accessor(data);
lshbox::Metric<DATATYPE> metric(data.getDim(), L1_DIST);
lshbox::Benchmark bench;
std::string benchmark(argv[3]);
bench.load(benchmark);
unsigned K = bench.getK();
lshbox::Scanner<lshbox::Matrix<DATATYPE>::Accessor> scanner(
accessor,
metric,
K
);
std::cout << "LOADING TIME: " << timer.elapsed() << "s." << std::endl;
std::cout << "RUNING QUERY ..." << std::endl;
timer.restart();
lshbox::Stat cost, recall;
lshbox::progress_display pd(bench.getQ());
for (unsigned i = 0; i != bench.getQ(); ++i)
{
mylsh.query(data[bench.getQuery(i)], scanner);
recall << bench.getAnswer(i).recall(scanner.topk());
cost << float(scanner.cnt()) / float(data.getSize());
++pd;
}
std::cout << "MEAN QUERY TIME: " << timer.elapsed() / bench.getQ() << "s." << std::endl;
std::cout << "RECALL : " << recall.getAvg() << " +/- " << recall.getStd() << std::endl;
std::cout << "COST : " << cost.getAvg() << " +/- " << cost.getStd() << std::endl;
// mylsh.query(data[0], scanner);
// std::vector<std::pair<float, unsigned> > res = scanner.topk().getTopk();
// for (std::vector<std::pair<float, unsigned> >::iterator it = res.begin(); it != res.end(); ++it)
// {
// std::cout << it->second << ": " << it->first << std::endl;
// }
// std::cout << "DISTANCE COMPARISON TIMES: " << scanner.cnt() << std::endl;
}
You can get the sample dataset audio.data from http://www.cs.princeton.edu/cass/audio.tar.gz, if the link is invalid, you can also get it from LSHBOX-sample-data.
FOR EXAMPLE, YOU CAN RUN THE FOLLOWING CODE IN COMMAND LINE AFTER BUILD ALL THE TOOLS:
> create_benchmark audio.data audio.ben 200 50
> itqlsh_test audio.data audio.itq audio.ben
NOTE1:
In our project, the format of the input file (such as audio.data, which is in float data type) is a binary file but not a text file, because binary file has many advantages. In LSHBOX/tools/create_test_data.cpp, we create a binary file with unsigned data type, from the process, you will find that the binary file is organized as the following format:
{Bytes of the data type} {The size of the vectors} {The dimension of the vectors} {All of the binary vector, arranged in turn}
For your application, you should also transform your dataset into this binary format.
NOTE2:
In addition, the dataset should be zero-centered, IT IS VERY IMPORTANT!
For Python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# pylshbox_example.py
import pylshbox
import numpy as np
print 'prepare test data'
float_mat = np.random.randn(100000, 192)
float_query = float_mat[0]
unsigned_mat = np.uint32(float_mat * 5)
unsigned_query = unsigned_mat[0]
print ''
print 'Test rbsLsh'
rbs_mat = pylshbox.rbslsh()
rbs_mat.init_mat(unsigned_mat.tolist(), '', 521, 5, 20, 5)
result = rbs_mat.query(unsigned_query.tolist(), 1)
indices, dists = result[0], result[1]
for i in range(len(indices)):
print indices[i], '\t', dists[i]
print ''
print 'Test rhpLsh'
rhp_mat = pylshbox.rhplsh()
rhp_mat.init_mat(float_mat.tolist(), '', 521, 5, 6)
result = rhp_mat.query(float_query.tolist(), 2, 10)
indices, dists = result[0], result[1]
for i in range(len(indices)):
print indices[i], '\t', dists[i]
print ''
print 'Test thLsh'
th_mat = pylshbox.thlsh()
th_mat.init_mat(float_mat.tolist(), '', 521, 5, 12)
result = th_mat.query(float_query.tolist(), 2, 10)
indices, dists = result[0], result[1]
for i in range(len(indices)):
print indices[i], '\t', dists[i]
print ''
print 'Test psdlsh with param.T = 1'
psdL1_mat = pylshbox.psdlsh()
psdL1_mat.init_mat(float_mat.tolist(), '', 521, 5, 1, 5)
result = psdL1_mat.query(float_query.tolist(), 2, 10)
indices, dists = result[0], result[1]
for i in range(len(indices)):
print indices[i], '\t', dists[i]
print ''
print 'Test psdlsh with param.T = 2'
psdL2_mat = pylshbox.psdlsh()
psdL2_mat.init_mat(float_mat.tolist(), '', 521, 5, 2, 0.5)
result = psdL2_mat.query(float_query.tolist(), 2, 10)
indices, dists = result[0], result[1]
for i in range(len(indices)):
print indices[i], '\t', dists[i]
print ''
print 'Test shLsh'
sh_mat = pylshbox.shlsh()
sh_mat.init_mat(float_mat.tolist(), '', 521, 5, 4, 100)
result = sh_mat.query(float_query.tolist(), 2, 10)
indices, dists =
