Sequencer

An algorithm that detects one-dimensional sequences in complex datasets

Generate Convert Improve

Install / Use

/learn @dalya/Sequencer

About this skill

Quality Score

0/100

README

The Sequencer

An algorithm that detects one-dimensional trends (=sequences) in complex datasets.

Overview

The Sequencer is an algorithm that attempts to reveal the main sequence in a dataset, if it exists. To do so, it reorders objects within a set to produce the most elongated manifold describing their similarities which are measured in a multi-scale manner and using a collection of metrics. To be generic, it combines information from four different metrics: the Euclidean Distance, the Kullback-Leibler Divergence, the Monge-Wasserstein or Earth Mover Distance, and the Energy Distance. It considers different scales of the data by dividing each object in the input data into separate parts (chunks), and estimating pair-wise similarities between the chunks. It then aggregates the information in each of the chunks into a single estimator for each metric+scale.

The Sequencer uses the shape of the graphs describing the multi-scale similarities. In particular, it uses the fact that continuous trends (sequences) in a dataset lead to more elongated graphs. By defining the elongation of the graphs as a figure of merit, the Sequencer can optimize over its hyper-parameters (the distance metrics and scales) and select the set of metric+scale that are most sensitive to the presence of a sequence in the data. The elongation of the graph is measured using the graph's elongation. Thus, the final output of the Sequencer is the detected sequence and its associated elongation. Small elongation (~1) suggests no clear sequence in the data, and large elongation (~N, where N is the number of objects in the sample) suggests that the Sequencer detected a significant sequence in the data.

The Sequencer is essentially an Unsupervised Dimensionality Reduction algorithm, since a sequence is a one-dimensional embedding of the input dataset. There are several differences between the Sequencer and other dimensionality reduction techniques like tSNE and UMAP. First, the Sequencer can only embed the input dataset into a single dimension, while algorithms like tSNE and UMAP can embed the data into higher dimensions as well (2D or 3D). The main advantage of the Sequencer is that it uses a figure of merit to optimize over its hyper-parameters, while other dimensionality reduction algorithms depend on a set of hyper-parameters and lack a clear figure of merit with which these can be optimized. As a result, the output of other dimensionality reduction algorithms depends on a set of chosen hyper-parameters, which are often set manually. In our work, we show that in some cases the Sequencer outperforms tSNE and UMAP in finding one-dimensional trends in the data, especially in scientific datasets. We also show that the elongation of a graph can be used to define a figure of merit for algorithms like tSNE and UMAP as well. This figure of merit can be used to select the hyper-parameters that will give rise to the "best" tSNE and UMAP embedding (see our paper and the jupyter notebooks in the examples directory).

Online Interface

For those who are not very familiar with python: there is an online interface where one can upload their dataset and the Sequencer will be applied to it: http://sequencer.org/.

Authors

Dalya Baron (Tel Aviv University; dalyabaron@gmail.com)
Brice Ménard (Johns Hopkins University)

Requirements

The Sequencer is implemented in python and requires the following:

python 3.7.1
numpy 1.18.1
scipy 1.3.1
networkx 2.4

To run the Jupyter notebooks in the examples directory, and to compare the Sequencer to tSNE and UMAP, you will also need:

matplotlib 3.1.2
scikit-learn 0.22.1 (to run tSNE)
umap 0.3.9 (to run UMAP; https://github.com/lmcinnes/umap)

How to install the Sequencer

Manual installation through GitHub:

For a manual installation:

wget https://github.com/dalya/Sequencer/archive/master.zip
unzip master.zip
rm master.zip
cd Sequencer-master/

Install the requirements:

sudo pip install -r requirements.txt

Install the package:

python setup.py install

Using PyPI:

The PyPI version of the code is not up to date and includes several deprecated functions, resulting in running errors when installed from pip. Please install from GitHub for now, and hopefully I will find time to update the PyPI soon :)

How to use the Sequencer

The examples directory consists of several Jupyter notebooks showing how to use the Sequencer. New users are encoraged to start with the notebook basic_sequencer_functionalities.ipynb, which explains the basic functionalities of the Sequencer.

Below there is an example of a simple Sequencer run with default parameters:

import sequencer

# define the Sequencer object
estimator_list = ['EMD', 'energy', 'KL', 'L2']
seq = sequencer.Sequencer(grid, objects_list, estimator_list)

# execute the Sequencer
output_directory_path = "sequencer_output_directory"
final_elongation, final_sequence = seq.execute(output_directory_path)

The definition of the Sequencer object required a few input parameters: grid: the x-axis of the objects and objects_list: a list of the objects. This is the input dataset within which we want to find a sequence. The estimator_list is a list of strings containing the distance metrics to be considered during the algorithm run. It contains four distance metrics at the moment: 'EMD'=Earth Mover Distance, 'energy'=Energy Distance, 'KL'=Kullback-Leibler Divergence, and 'L2'=Euclidean Distance. The Sequnecer defines a list of scales it will consider automatically, where the number and values of the scales will be determined by the size of the data. However, users can set the scales by themselves. For example, if we wish to consider only the largest scales, and not to divide the objects into chunks, then:

# define the Sequencer object
estimator_list = ['EMD', 'energy', 'KL', 'L2']
scale_list = [[1], [1], [1], [1]]
seq = sequencer.Sequencer(grid, objects_list, estimator_list, scale_list)

If instead we are intrested in small-scale information, where we want to split each object into 10 separate parts and examine them separately, then:

# define the Sequencer object
estimator_list = ['EMD', 'energy', 'KL', 'L2']
scale_list = [[10], [10], [10], [10]]
seq = sequencer.Sequencer(grid, objects_list, estimator_list, scale_list)

Finally, if we do not know a-priori on which scale the relevant information is, we might want to examine several scales:

# define the Sequencer object
estimator_list = ['EMD', 'energy', 'KL', 'L2']
scale_list = [[1, 2, 4, 8], [1, 2, 4, 8], [1, 2, 4, 8], [1, 2, 4, 8]]
seq = sequencer.Sequencer(grid, objects_list, estimator_list, scale_list)

This means that for each metric, the Sequencer will examine four different scales: 1, 2, 4, ans 8. A scale=2 means that the Sequencer will split the objects into two parts and will search for a sequence in each of the parts separately. It will then aggregate the information from both of the parts into a single estimator. Finally, it will aggregate the information from all the combinations of metrics+scales into a single estimator, where metrics+scales that result in larger elongations will get a higher weight in the final product.

To execute the Sequencer obejct, we needed to define output_directory_path, which is a path of a directory to which different Sequencer products will be saved. The final output of the function consists of: (1) final_elongation: the elongation of the resulting sequence. An elongation that is close to 1 suggests no clear sequence in the data. An elongation close to N, where N is the number of objects in the sample, suggests that the Sequencer detected a significant sequence in the data. (2) final_sequence: the resulting sequence. This is an array that contains the relative order of the different objects in the sample, such that they form the detected sequence.

Examples directory

The examples directory contains several Jupyter notebooks that illustrate different aspects of the Sequencer algorithm. Users who are not familiar with the Sequencer are encouraged to go through the examples in the following order:

basic_sequencer_functionalities.ipynb: this notebook shows the basic functionalities of the algorithm. It shows how to apply the Sequencer to a simple simulated dataset with default settings. It shows how to extract various interesting properties of the algorithm, such as the intermediate elongations obtained during the calculation. It then shows some non-default settings that the user should be aware of (parallelization, varying scales, output options).
comparison_with_tsne_and_umap.ipynb: this notebook compares the Sequencer output to the one-dimensional embedding by tSNE and UMAP for a simulated dataset. Importantly, this notebook presents a method to define a general figure of merit for Dimensionality Reduction algorithms that can be used to optimize their hyper-parameters. Using this figure of merit, we can optimize the hyper-parameters of tSNE and UMAP and compare their resulting sequence to the one obtained with the Sequencer.
visualizing_the_MST.ipynb: this notebook shows how to extract the resulting final Minimum Spanning Tree and how to visualize it with the networkx package.
`examples_with_n

Related Skills

node-connect

351.4k

Diagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps

frontend-design

110.7k

Create distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.

openai-whisper-api

351.4k

Transcribe audio via OpenAI Audio Transcriptions API (Whisper).

qqbot-media

351.4k

QQBot 富媒体收发能力。使用 <qqmedia> 标签，系统根据文件扩展名自动识别类型（图片/语音/视频/文件）。