
dbMAP (diffusion-based Manifold Approximation and Projection)

Diffusion-based Manifold Approximation and Projection (dbMAP) is a fast, accurate, and modularized machine-learning framework that combines metric learning, diffusion harmonics, and dimensionality reduction. dbMAP is particularly useful for analyzing highly structured data, such as that produced by single-cell RNA sequencing assays. Although dbMAP was originally designed for the analysis and visualization of single-cell omics data, it is a general dimensionality-reduction approach grounded in solid discrete differential geometry, and it can be useful in virtually any field where the analysis of high-dimensional data is challenging.

dbMAP explores the use of the Laplace-Beltrami Operator (LBO) to geometrically describe the heterogeneity of a given high-dimensional dataset. By extending Diffusion Maps with a scalable, computationally efficient implementation of the algorithm, it learns a diffusion basis that approximates the LBO. In dbMAP, this adaptive diffusion basis is used to estimate the data's intrinsic dimensionality and is then multiscaled to account for all possible diffusion timescales. From this basis, it is possible to build a diffusion graph, which can be visualized with different layout optimizations. Originally, we devised the layout optimization step to be performed by UMAP with a multi-component Laplacian Eigenmaps (LE) initialization. Because the diffusion basis approximates the LBO, and the LE initialization and the seminal UMAP optimization build on the same spectral foundations, this leads to a geometrical consensus on the structure of the underlying data, providing fine visualizations. However, recently developed optimization methods, such as PaCMAP and pyMDE, can also be employed to lay out the diffusion graph. For more information on dbMAP, check our preprint.
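The core construction can be sketched in a few lines of NumPy. This is a minimal, illustrative diffusion-maps computation (anisotropic kernel normalization, eigendecomposition, and multiscaling), not dbMAP's actual optimized implementation:

```python
import numpy as np

# Toy data standing in for a high-dimensional dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Pairwise squared distances and a Gaussian affinity kernel
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / sq.mean())

# Anisotropic normalization (alpha = 1) to remove sampling-density effects,
# so the operator approximates the Laplace-Beltrami operator
d = K.sum(axis=1)
K = K / np.outer(d, d)

# Row-normalize into a Markov transition matrix and eigendecompose
P = K / K.sum(axis=1, keepdims=True)
evals, evecs = np.linalg.eig(P)
order = np.argsort(-evals.real)
evals, evecs = evals.real[order], evecs.real[:, order]

# Multiscale coordinates: weight each non-trivial eigenvector by
# lambda / (1 - lambda), summing contributions over all diffusion timescales
scale = evals[1:] / (1.0 - evals[1:])
diffusion_coords = evecs[:, 1:] * scale
print(diffusion_coords.shape)  # → (100, 99)
```

The multiscale weighting `lambda / (1 - lambda)` is what makes the representation independent of any single diffusion time `t`, since it equals the sum of `lambda**t` over all `t >= 1`.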

This implementation includes a flexible and extendable wrapper for nmslib for state-of-the-art approximate nearest-neighbors search, functions for fast computation of diffusion dynamics and multiscale diffusion maps into a diffusion basis, and faster implementations of adapted UMAP and PaCMAP layouts. Further documentation is available at Read the Docs.

Installation and dependencies

Prior to installing dbMAP, make sure you have scikit-build and cmake available in your system. These are required for installation.

  $> sudo apt-get install cmake
  $> pip3 install scikit-build

We're also going to need NMSlib for really fast approximate nearest-neighbor search, and Annoy for fast indexing:

 $> pip3 install nmslib annoy

You can read more about NMSlib here, and check the documentation on the available distances and spaces here. dbMAP implements functions derived from scikit-learn base transformers that make NMSlib more generally extendable to machine-learning workflows, and we are grateful to the nmslib community for their insights during this process. For now, the dependency on annoy is intended to support the PaCMAP optimization, but we are working on moving it all to NMSlib.
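With these dependencies in place, dbMAP itself can then be installed from PyPI (assuming the package is published under the name `dbmap`, as the `import dbmap` statements below suggest):

```shell
 $> pip3 install dbmap
```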

Using dbMAP

dbMAP consists of two main steps: an adaptive anisotropic reproduction of the initial input diffusion structure, followed by an accelerated UMAP or graph layout. dbMAP runs on NumPy arrays, pandas dataframes, and CSR or COO sparse matrices. The adaptive diffusion reduction is recommended over PCA if the data is significantly non-linear, and is useful for clustering and downstream analysis. The UMAP and graph layouts are also useful for big-data visualization. Here follow some examples of the algorithms implemented in dbMAP, including fast neighborhood search, adaptive multiscale diffusion maps, and accelerated UMAP and graph layouts:

1 - Fast approximate k-nearest-neighbors

dbMAP implements the NMSlibTransformer() class, which calls nmslib to perform fast and accurate approximate nearest-neighbor search. The NMSlibTransformer() class has several methods to compute and retrieve this information, and an additional function to measure its accuracy.

# Load some libraries:
from sklearn.datasets import load_digits
from scipy.sparse import csr_matrix
import dbmap as dm

# Load some data and convert to CSR for speed:
digits = load_digits()
data = csr_matrix(digits.data)

# Initialize the NMSlibTransformer() object and index the data:
anbrs = dm.ann.NMSlibTransformer() # Feel free to play with parameters
anbrs = anbrs.fit(data)

# Compute the knn_neighbors graph:
knn_graph = anbrs.transform(data)

# Compute indices, distances, gradient and knn_neighbors graph:
inds, dists, grad, knn = anbrs.ind_dist_grad(data)

# Test approximate-neighbors accuracy:
anbrs.test_efficiency(data)
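The accuracy check above amounts to a recall measurement: what fraction of the true nearest neighbors the approximate search recovers. The idea can be sketched generically with scikit-learn's exact search (this is an illustration of the metric, not dbMAP's internal routine; the "approximate" search here is deliberately crude, using only half of the features):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neighbors import NearestNeighbors

digits = load_digits()
X = digits.data
k = 10

# Exact k-NN as ground truth
true_idx = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(
    X, return_distance=False)

# A crude stand-in for an approximate search: exact k-NN on a random
# half of the features (real ANN methods like HNSW are far smarter)
rng = np.random.default_rng(0)
cols = rng.choice(X.shape[1], X.shape[1] // 2, replace=False)
approx_idx = NearestNeighbors(n_neighbors=k).fit(X[:, cols]).kneighbors(
    X[:, cols], return_distance=False)

# Mean per-query recall: fraction of true neighbors recovered
recall = np.mean([
    len(set(t) & set(a)) / k for t, a in zip(true_idx, approx_idx)
])
print(f"mean k-NN recall: {recall:.2f}")
```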

2 - Fast adaptive multiscaled diffusion maps

dbMAP implements the Diffusor() class, which allows state-of-the-art dimensionality reduction by fast approximation of the Laplace-Beltrami operator and automatic detection of intrinsic dimensionality. The algorithm learns a local metric, which is normalized and embedded as a diffusion distance on a series of orthogonal components that capture the structured variability within the initial informational space. Standard machine-learning analyses often employ PCA even on highly non-linear data, despite the caveat that PCA is unsuitable for datasets that cannot be represented as a series of linear correlations; the main reason is PCA's low computational cost compared to non-linear dimensionality-reduction methods. Our implementation scales to extremely large datasets (on the order of 10^9 samples) and ought to provide more reliable information than PCA on real-world, non-linear data. As with our fast nearest-neighbor implementation, we provide utility functions to obtain results in different formats.

# Load some libraries:
from sklearn.datasets import load_digits
from scipy.sparse import csr_matrix
import dbmap as dm

# Load some data and convert to CSR for speed:
digits = load_digits()
data = csr_matrix(digits.data)
 
# Initialize the diffusor object and fit data:
diff = dm.diffusion.Diffusor().fit(data)
 
# Return low dimensional representation of data:
res = diff.transform(data)
 
# Return the diffusion indices, distances, diffusion gradient and diffusion graph:
ind, dist, grad, graph = diff.ind_dist_grad(data)
 

A key feature of dbMAP's diffusion approach is its ability to indirectly estimate the data's intrinsic dimensionality by looking for all positive-eigenvalued components. The algorithm searches for an optimal number of final components for the eigendecomposition, so as to find an eigengap that maximizes the information each component carries. In other words, we want to compute the minimal number of components needed before negative-valued components appear. The resulting spectrum can be visualized as follows:

import matplotlib.pyplot as plt

res = diff.return_dict()

plt.plot(range(0, len(res['EigenValues'])), res['EigenValues'], marker='o')
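As a rough programmatic counterpart to the visual inspection above, the intrinsic dimensionality estimate can be read off as the number of leading positive eigenvalues. A sketch over a synthetic spectrum (the values below are hypothetical; `res['EigenValues']` from `return_dict()` is assumed to be a 1-D sequence):

```python
import numpy as np

# Synthetic stand-in for res['EigenValues'] (hypothetical values)
eigenvalues = np.array([0.9, 0.7, 0.4, 0.1, -0.05, -0.2])

# Count the leading positive eigenvalues; if all are positive, more
# components would be needed to reveal the eigengap
negative = eigenvalues <= 0
n_intrinsic = int(np.argmax(negative)) if negative.any() else len(eigenvalues)
print(n_intrinsic)  # → 4 for this synthetic spectrum
```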

3 - Fast mapping layout:

For scalable big-data visualization, we provide a fast mapping layout of the adaptive multiscale diffusion component space. We adapted UMAP to construct fast approximate simplicial complexes which normalize the data structure, rendering a comprehensive layout. We also provide fast graph layout of the resulting components with fa2, which implements scalable, iterative force-directed layouts within networkx. A vanilla UMAP implementation is also provided.

# Load some libraries:
from sklearn.datasets import load_digits
from scipy.sparse import csr_matrix
import matplotlib.pyplot as plt
import dbmap as dm
import umap

# Load some data and convert to CSR for speed:
digits = load_digits()
data = csr_matrix(digits.data)
 
# Fit the diffusion model
digits_diff = dm.diffusion.Diffusor(n_neighbors=30, n_components=120,
                                      kernel_use='simple_adaptive', # The diffusion tool is highly
                                      norm=True,          # customizable!
                                      transitions=False).fit(digits.data)
# Decompose the diffusion basis
db = digits_diff.transform(digits.data)
  
# Visualize data intrinsic dimensionality
res = digits_diff.return_dict()
plt.plot(range(0, len(res['EigenValues'])), res['EigenValues'])
plt.show()  

# We're currently improving the non-uniform simplicial embedding and introducing an adaptive optimization layout procedure (AdapMAP). 
# Please use vanilla UMAP or our implementation of PaCMAP in the meanwhile:

# Embed diffusion graph with vanilla UMAP:
db_umap_emb = dm.map.UMAP(n_epochs=100).fit_transform(db.to_numpy(dtype='float32'))

plt.scatter(db_umap_emb[:, 0], db_umap_emb[:, 1], s=1)
plt.show()
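The graph-layout route mentioned above can also be sketched with an iterative force-directed layout. Here networkx's `spring_layout` (Fruchterman-Reingold) stands in for fa2's ForceAtlas2, over a small k-nearest-neighbor graph built from the raw data; dbMAP would instead lay out its diffusion graph:

```python
import numpy as np
import networkx as nx
from sklearn.datasets import load_digits
from sklearn.neighbors import kneighbors_graph

digits = load_digits()

# A k-nearest-neighbor graph over a subset of the raw data (illustrative;
# dbMAP builds its graph from the multiscale diffusion basis instead)
adj = kneighbors_graph(digits.data[:200], n_neighbors=10, mode='connectivity')
G = nx.from_scipy_sparse_array(adj)

# Iterative force-directed layout (stand-in for fa2's ForceAtlas2)
pos = nx.spring_layout(G, iterations=50, seed=42)
coords = np.array([pos[i] for i in range(G.number_of_nodes())])
print(coords.shape)  # → (200, 2)
```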