.. -- mode: rst --

.. image:: doc/logo_large.png :width: 600 :alt: UMAP logo :align: center

|pypi_version|_ |pypi_downloads|_

|conda_version|_ |conda_downloads|_

|License|_ |build_status|_ |Coverage|_

|Docs|_ |joss_paper|_

.. |pypi_version| image:: https://img.shields.io/pypi/v/umap-learn.svg .. _pypi_version: https://pypi.python.org/pypi/umap-learn/

.. |pypi_downloads| image:: https://pepy.tech/badge/umap-learn/month .. _pypi_downloads: https://pepy.tech/project/umap-learn

.. |conda_version| image:: https://anaconda.org/conda-forge/umap-learn/badges/version.svg .. _conda_version: https://anaconda.org/conda-forge/umap-learn

.. |conda_downloads| image:: https://anaconda.org/conda-forge/umap-learn/badges/downloads.svg .. _conda_downloads: https://anaconda.org/conda-forge/umap-learn

.. |License| image:: https://img.shields.io/pypi/l/umap-learn.svg .. _License: https://github.com/lmcinnes/umap/blob/master/LICENSE.txt

.. |build_status| image:: https://dev.azure.com/TutteInstitute/build-pipelines/_apis/build/status/lmcinnes.umap?branchName=master .. _build_status: https://dev.azure.com/TutteInstitute/build-pipelines/_build/latest?definitionId=2&branchName=master

.. |Coverage| image:: https://coveralls.io/repos/github/lmcinnes/umap/badge.svg .. _Coverage: https://coveralls.io/github/lmcinnes/umap

.. |Docs| image:: https://readthedocs.org/projects/umap-learn/badge/?version=latest .. _Docs: https://umap-learn.readthedocs.io/en/latest/?badge=latest

.. |joss_paper| image:: http://joss.theoj.org/papers/10.21105/joss.00861/status.svg .. _joss_paper: https://doi.org/10.21105/joss.00861

==== UMAP

Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualisation similarly to t-SNE, but also for general non-linear dimension reduction. The algorithm is founded on three assumptions about the data:

The data is uniformly distributed on a Riemannian manifold;
The Riemannian metric is locally constant (or can be approximated as such);
The manifold is locally connected.

From these assumptions it is possible to model the manifold with a fuzzy topological structure. The embedding is found by searching for a low dimensional projection of the data that has the closest possible equivalent fuzzy topological structure.

The details for the underlying mathematics can be found in our paper on ArXiv <https://arxiv.org/abs/1802.03426>_:

McInnes, L, Healy, J, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, ArXiv e-prints 1802.03426, 2018

A broader introduction to UMAP targetted the scientific community can be found in our paper published in Nature Review Methods Primers <https://doi.org/10.1038/s43586-024-00363-x>_:

Healy, J., McInnes, L. Uniform manifold approximation and projection. Nat Rev Methods Primers 4, 82 (2024).

A read only version of this paper can accessed via link <https://rdcu.be/d0YZT>_

The important thing is that you don't need to worry about that—you can use UMAP right now for dimension reduction and visualisation as easily as a drop in replacement for scikit-learn's t-SNE.

Documentation is available via Read the Docs <https://umap-learn.readthedocs.io/>_.

New: this package now also provides support for densMAP. The densMAP algorithm augments UMAP to preserve local density information in addition to the topological structure of the data. Details of this method are described in the following paper <https://doi.org/10.1038/s41587-020-00801-7>_:

Narayan, A, Berger, B, Cho, H, Assessing Single-Cell Transcriptomic Variability through Density-Preserving Data Visualization, Nature Biotechnology, 2021

Installing

UMAP depends upon scikit-learn, and thus scikit-learn's dependencies such as numpy and scipy. UMAP adds a requirement for numba for performance reasons. The original version used Cython, but the improved code clarity, simplicity and performance of Numba made the transition necessary.

Requirements:

Python 3.6 or greater
numpy
scipy
scikit-learn
numba
tqdm
pynndescent <https://github.com/lmcinnes/pynndescent>_

Recommended packages:

For plotting
- matplotlib
- datashader
- holoviews
for Parametric UMAP
- tensorflow > 2.0.0

Install Options

Conda install, via the excellent work of the conda-forge team:

.. code:: bash

conda install -c conda-forge umap-learn

The conda-forge packages are available for Linux, OS X, and Windows 64 bit.

PyPI install, presuming you have numba and sklearn and all its requirements (numpy and scipy) installed:

.. code:: bash

pip install umap-learn

If you wish to use the plotting functionality you can use

.. code:: bash

pip install umap-learn[plot]

to install all the plotting dependencies.

If you wish to use Parametric UMAP, you need to install Tensorflow, which can be installed either using the instructions at https://www.tensorflow.org/install (recommended) or using

.. code:: bash

pip install umap-learn[parametric_umap]

for a CPU-only version of Tensorflow.

If you're on an x86 processor, you can also optionally install tbb, which will provide additional CPU optimizations:

.. code:: bash

pip install umap-learn[tbb]

If pip is having difficulties pulling the dependencies then we'd suggest installing the dependencies manually using anaconda followed by pulling umap from pip:

.. code:: bash

conda install numpy scipy
conda install scikit-learn
conda install numba
pip install umap-learn

For a manual install get this package:

.. code:: bash

wget https://github.com/lmcinnes/umap/archive/master.zip
unzip master.zip
rm master.zip
cd umap-master

Optionally, install the requirements through Conda:

.. code:: bash

conda install scikit-learn numba

Then install the package

.. code:: bash

python -m pip install -e .

How to use UMAP

The umap package inherits from sklearn classes, and thus drops in neatly next to other sklearn transformers with an identical calling API.

.. code:: python

import umap
from sklearn.datasets import load_digits

digits = load_digits()

embedding = umap.UMAP().fit_transform(digits.data)

There are a number of parameters that can be set for the UMAP class; the major ones are as follows:

n_neighbors: This determines the number of neighboring points used in local approximations of manifold structure. Larger values will result in more global structure being preserved at the loss of detailed local structure. In general this parameter should often be in the range 5 to 50, with a choice of 10 to 15 being a sensible default.
min_dist: This controls how tightly the embedding is allowed compress points together. Larger values ensure embedded points are more evenly distributed, while smaller values allow the algorithm to optimise more accurately with regard to local structure. Sensible values are in the range 0.001 to 0.5, with 0.1 being a reasonable default.
metric: This determines the choice of metric used to measure distance in the input space. A wide variety of metrics are already coded, and a user defined function can be passed as long as it has been JITd by numba.

An example of making use of these options:

.. code:: python

import umap
from sklearn.datasets import load_digits

digits = load_digits()

embedding = umap.UMAP(n_neighbors=5,
                      min_dist=0.3,
                      metric='correlation').fit_transform(digits.data)

UMAP also supports fitting to sparse matrix data. For more details please see the UMAP documentation <https://umap-learn.readthedocs.io/>_

Benefits of UMAP

UMAP has a few signficant wins in its current incarnation.

First of all UMAP is fast. It can handle large datasets and high dimensional data without too much difficulty, scaling beyond what most t-SNE packages can manage. This includes very high dimensional sparse datasets. UMAP has successfully been used directly on data with over a million dimensions.

Second, UMAP scales well in embedding dimension—it isn't just for visualisation! You can use UMAP as a general purpose dimension reduction technique as a preliminary step to other machine learning tasks. With a little care it partners well with the hdbscan <https://github.com/scikit-learn-contrib/hdbscan>_ clustering library (for more details please see Using UMAP for Clustering <https://umap-learn.readthedocs.io/en/latest/clustering.html>_).

Third, UMAP often performs better at preserving some aspects of global structure of the data than most implementations of t-SNE. This means that it can often provide a better "big picture" view of your data as well as preserving local neighbor relations.

Fourth, UMAP supports a wide variety of distance functions, including non-metric distance functions such as cosine distance and correlation distance. You can finally embed word vectors properly using cosine distance!

Fifth, UMAP supports adding new points to an existing embedding via the standard sklearn transform method. This means that UMAP can be used as a preprocessing transformer in sklearn pipelines.

Sixth, UMAP supports supervised and semi-supervised dimension reduction. This means that if you have label information that you wish to use as extra information for dimension reduction (even if it is just partial labelling) you can do that—as simply as providing it as the y parameter in the fit method.

Seventh, UMAP supports a variety of additional experimental features including: an "inverse transform" that can approximate a high dimensional sample that would map to a given position in the embedding space; the ability to embed into non-euclidean spaces including hyperbolic embeddings, and embeddings with uncertainty; very preliminary support f

Umap

Install / Use

README

==== UMAP

Installing

How to use UMAP

Benefits of UMAP