Libcluster

License: LGPL v3 (See COPYING and COPYING.LESSER)

Overview:

This library implements the following algorithms with variational Bayes learning procedures and efficient cluster splitting heuristics:

The Variational Dirichlet Process (VDP) [1, 2, 6]
The Bayesian Gaussian Mixture Model [3 - 6]
The Grouped Mixtures Clustering (GMC) model [6]
The Symmetric Grouped Mixtures Clustering (S-GMC) model [4 - 6]. This is referred to as Gaussian latent Dirichlet allocation (G-LDA) in [4, 5].
Simultaneous Clustering Model (SCM) for Multinomial Documents, and Gaussian Observations [5, 6].
Multiple-source Clustering Model (MCM) for clustering two observations, one of an image/document, and multiple of segments/words simultaneously [4 - 6].
And more clustering algorithms based on diagonal Gaussian, and Exponential distributions.

And also,

Various functions for evaluating means, standard deviations, covariance, primary Eigenvalues etc of data.
Extensible template interfaces for creating new algorithms within the variational Bayes framework.

An example of using the MCM to simultaneously cluster images and objects within images for unsupervised scene understanding. See [4 - 6] for more information.

Dependencies
Install Instructions
C++ Interface
Python Interface
General Usability Tips
References and Citing

DEPENDENCIES

Eigen version 3.0 or greater
Boost version 1.4.x or greater and devel packages (special math functions)
OpenMP, comes default with most compilers (may need a special version of LLVM).
CMake

For the python interface:

Python 2 or 3
Boost python and boost python devel packages (make sure you have version 2 or 3 for the relevant version of python)
Numpy (tested with v1.7)

INSTALL INSTRUCTIONS

For Linux and OS X -- I've never tried to build on Windows.

To build libcluster:

Make sure you have CMake installed, and Eigen and Boost preferably in the usual locations:

 /usr/local/include/eigen3/ or /usr/include/eigen3
 /usr/local/include/boost or /usr/include/boost

Make a build directory where you checked out the source if it does not already exist, then change into this directory,
```
 cd {where you checked out the source}
 mkdir build
 cd build
```

To build libcluster, run the following from the build directory:

 cmake ..
 make
 sudo make install

This installs:

 libcluster.h    /usr/local/include
 distributions.h /usr/local/include
 probutils.h     /usr/local/include
 libcluster.*    /usr/local/lib     (* this is either .dylib or .so)

Use the doxyfile in {where you checked out the source}/doc to make the documentation with doxygen:
```
 doxygen Doxyfile
```

NOTE: There are few options you can change using ccmake (or the cmake gui), these include:

BUILD_EXHAUST_SPLIT (toggle ON or OFF, default OFF) This uses the exhaustive cluster split heuristic [1, 2] instead of the greedy heuristic [4, 5] for all algorithms but the SCM and MCM. The greedy heuristic is MUCH faster, but does give different results. I have yet to determine whether it is actually worse than the exhaustive method (if it is, it is not by much). The SCM and MCM only use the greedy split heuristic at this stage.
BUILD_PYTHON_INTERFACE (toggle ON or OFF, default OFF) Build the python interface. This requires boost python, and also uses row-major storage to be compatible with python.
BUILD_USE_PYTHON3 (toggle ON or OFF, default ON) Use python 3 or 2 to build the python interface. Make sure you have the relevant python and boost python libraries installed!
CMAKE_INSTALL_PREFIX (default /usr/local) The default prefix for installing the library and binaries.
EIGEN_INCLUDE_DIRS (default /usr/include/eigen3) Where to look for the Eigen matrix library.

NOTE: On linux you may have to run sudo ldconfig before the system can find libcluster.so (or just reboot).

NOTE: On Red-Hat based systems, /usr/local/lib is not checked unless added to /etc/ld.so.conf! This may lead to "cannot find libcluster.so" errors.

C++ INTERFACE

All of the interfaces to this library are documented in include/libcluster.h. There are far too many algorithms to go into here, and I strongly recommend looking at the test/ directory for example usage, specifically,

cluster_test.cpp for the group mixture models (GMC etc)
scluster_test.cpp for the SCM
mcluster_test.cpp for the MCM

Here is an example for regular mixture models, such as the BGMM, which simply clusters some test data and prints the resulting posterior parameters to the terminal,


#include "libcluster.h"
#include "distributions.h"
#include "testdata.h"


//
// Namespaces
//

using namespace std;
using namespace Eigen;
using namespace libcluster;
using namespace distributions;


//
// Functions
//

// Main
int main()
{

  // Populate test data from testdata.h
  MatrixXd Xcat;
  vMatrixXd X;
  makeXdata(Xcat, X);

  // Set up the inputs for the BGMM
  Dirichlet weights;
  vector<GaussWish> clusters;
  MatrixXd qZ;

  // Learn the BGMM
  double F = learnBGMM(Xcat, qZ, weights, clusters, PRIORVAL, true);

  // Print the posterior parameters
  cout << endl << "Cluster Weights:" << endl;
  cout << weights.Elogweight().exp().transpose() << endl;

  cout << endl << "Cluster means:" << endl;
  for (vector<GaussWish>::iterator k=clusters.begin(); k < clusters.end(); ++k)
    cout << k->getmean() << endl;

  cout << endl << "Cluster covariances:" << endl;
  for (vector<GaussWish>::iterator k=clusters.begin(); k < clusters.end(); ++k)
    cout << k->getcov() << endl << endl;

  return 0;
}

Note that distributions.h has also been included. In fact, all of the algorithms in libcluster.h are just wrappers over a few key functions in cluster.cpp, scluster.cpp and mcluster.cpp that can take in arbitrary distributions as inputs, and so more algorithms potentially exist than enumerated in libcluster.h. If you want to create different algorithms, or define more cluster distributions (like categorical) have a look at inheriting the WeightDist and ClusterDist base classes in distributions.h. Depending on the distributions you use, you may also have to come up with a way to 'split' clusters. Otherwise you can create an algorithm with a random initial set of clusters like the MCM at the top level, which then variational Bayes will prune.

There are also some generally useful functions included in probutils.h when dealing with mixture models (such as the log-sum-exp trick).

PYTHON INTERFACE

Installation

Easy, follow the normal build instructions up to step (4) (if you haven't already), then from the build directory:

cmake ..
ccmake .

Make sure BUILD_PYTHON_INTERFACE is ON

make
sudo make install

This installs all the same files as step (4), as well as libclusterpy.so to your python staging directory, so it should be on your python path. I.e. just run

import libclusterpy

Trouble Shooting:

On Fedora 20/21 I have to append /usr/local/lib to the file /etc/ld.so.conf to make python find the compiled shared object.

Usage

Import the library as

import numpy as np
import libclusterpy as lc

Then for the mixture models, assuming X is a numpy array where X.shape is (N, D) -- N being the number of samples, and D being the dimension of each sample,

f, qZ, w, mu, cov = lc.learnBGMM(X)

where f is the final free energy value, qZ is a distribution over all of the cluster labels where qZ.shape is (N, K) and K is the number of clusters (each row of qZ sums to 1). Then w, mu and cov the expected posterior cluster parameters (see the documentation for details. Alternatively, tuning the prior argument can be used to change the number of clusters found,

f, qZ, w, mu, cov = lc.learnBGMM(X, prior=0.1)

This interface is common to all of the simple mixture models (i.e. VDP, BGMM etc).

For the group mixture models (GMC, SGMC etc) X is a list of arrays of size (Nj, D) (indexed by j), one for each group/album, X = [X_1, X_2, ...]. The returned qZ and w are also lists of arrays, one for each group, e.g.,

f, qZ, w, mu, cov = lc.learnSGMC(X)

The SCM again has a similar interface to the above models, but now X is a list of lists of arrays, X = [[X_11, X_12, ...], [X_21, X_22, ...], ...]. This specifically for modelling situations where X is a matrix of all of the features of, for example, N_ij segments in image ij in album j.

f, qY, qZ, wi, wij, mu, cov = lc.learnSCM(X)

Where qY is a list of arrays of top-level/image cluster probabilities, qZ is a list of lists of arrays of bottom-level/segment cluster probabilities. wi are the mixture weights (list of arrays) corresponding to the qY labels, and wij are the weights (list of lists of arrays) corresponding the qZ labels. This has two optional prior inputs, and a cluster truncation level (max number of clusters) for the top-level/image clusters,

f, qY, qZ, wi, wij, mu, cov = lc.learnSCM(X, trunc=10, dirprior=1,

Libcluster

Install / Use

README

Libcluster

TABLE OF CONTENTS

DEPENDENCIES

INSTALL INSTRUCTIONS

C++ INTERFACE

PYTHON INTERFACE

Installation

Usage