Toeplitzlda
Implementation of ToeplitzLDA for (primarily) spatiotemporal stationary time series data.
Install / Use
/learn @jsosulski/ToeplitzldaREADME
ToeplitzLDA
Code for the ToeplitzLDA classifier proposed in here. The classifier conforms sklearn and can be used as a drop-in replacement for other LDA classifiers. For in-depth usage refer to the learning from label proportions (LLP) example or the example script.
Note we used Ubuntu 20.04 with python 3.8.10 to generate our results.
Getting Started / User Setup
If you only want to use this library, you can use the following setup. Note that this setup is based on a fresh Ubuntu 20.04 installation.
Getting fresh ubuntu ready
apt install python3-pip python3-venv
Python package installation
In this setup, we assume you want to run the examples that actually make use of real EEG
data or the actual unsupervised speller replay. If you only want to employ ToeplitzLDA
in your own spatiotemporal data / without mne and moabb then you can remove the
package extra neuro, i.e. pip install toeplitzlda or pip install toeplitzlda[solver]
- (Optional) Install fortran Compiler. On ubuntu:
apt install gfortran - Create virtual environment:
python3 -m venv toeplitzlda_venv - Activate virtual environment:
source toeplitzlda_venv/bin/activate - Install toeplitzlda:
pip install toeplitzlda[neuro,solver], if you dont have a fortran compiler:pip install toeplitzlda[neuro]
Check if everything works
Either clone this repo or just download the scripts/example_toeplitz_lda_bci_data.py
file and run it: python example_toeplitz_lda_bci_data.py. Note that this will
automatically download EEG data with a size of around 650MB.
Alternatively, you can use the scripts/example_toeplitz_lda_generated_data.py where
artificial data is generated. Note however, that only stationary background noise is
modeled and no interfering artifacts as is the case in, e.g., real EEG data. As a result,
the overfitting effect of traditional slda on these artifacts is reduced.
Using ToeplitzLDA in place of traditional shrinkage LDA from sklearn
If you have already your own pipeline, you can simply add toeplitzlda as a dependency in
your project and then replace sklearns LDA, i.e., instead of:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
clf = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto") # or eigen solver
use
from toeplitzlda.classification import ToeplitzLDA
clf = ToeplitzLDA(n_channels=your_n_channels)
where your_n_channels is the number of channels of your signal and needs to be provided
for this method to work.
If you prefer using sklearn, you can only replace the covariance estimation part, note however, that this in practice (on our data) yields worse performance, as sklearn estimates the class-wise covariance matrices and averages them afterwards, whereas we remove the class-wise means and the estimate one covariance matrix from the pooled data.
So instead of:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
clf = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto") # or eigen solver
you would use
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from toeplitzlda.classification.covariance import ToepTapLW
toep_cov = ToepTapLW(n_channels=your_n_channels)
clf = LinearDiscriminantAnalysis(solver="lsqr", covariance_estimator=toep_cov) # or eigen solver
Important prerequisites for the data
To ensure ToeplitzLDA performs optimally, the following conditions should be met:
- Baseline correction should be disabled.
- If time intervals are used for the time dimension, they should be equally sized.
- The data should be in channel-prime order. For more details, see the FAQ section all the way down in this file.
Development Setup
We use a fortran compiler to provide speedups for solving block-Toeplitz linear equation
systems. If you are on ubuntu you can install gfortran.
We use poetry for dependency management. If you have it installed you can simply use
poetry install to set up the virtual environment with all dependencies. All extra
features can be installed with poetry install -E solver -E neuro.
If setup does not work for you, please open an issue. We cannot provide in-depth support for many different platforms, but could provide a singularity image.
Using the Apptainer / singularity image
Meanwhile, Singularity has been renamed to Apptainer. We added an apptainer.def
definition file which sets up an ubuntu 20.04 image which can be used to run the example
scripts.
Create container
If you have apptainer installed:
sudo apptainer build toep.sif apptainer.def
Run python script
For all intents and purposes, you can use the created image file in place of python, e.g.,
if you are in folder scripts/, instead of:
python example_toeplitz_lda_bci_data.py
You can use
../toep.sif example_toeplitz_lda_bci_data.py
Learning from label proportions
Use the run_llp.py script to apply ToeplitzLDA in the LLP scenario and create the
results file for the different preprocessing parameters. These can then be visualized
using visualize_llp.py to create the plots shown in our publication. Note that running
LLP takes a while and the two datasets will be downloaded automatically and are
approximately 16GB in size. Alternatively, you can use the results provided by us that are
stored in scripts/usup_replay/provided_results by moving/copying them to the location
that visualize_llp.py looks for.
ERP benchmark
This is not yet available.
Note this benchmark will take quite a long time if you do not have access to a computing cluster. The public datasets (including the LLP datasets) total a size of approximately 120GB.
BLOCKING TODO: How should we handle the private datasets?
- [ ] Split benchmark into public and private/closed classes
- [ ] Can we provide the code for private datasets without the data? Or is that too sensitive?
FAQ
Why is my classification performance for my stationary spatiotemporal data really bad?
Check if your data is in channel-prime order, i.e., in the flattened feature vector, you
first enumerate over all channels (or some other spatially distributed sensors) for the
first time point and then for the second time point and so on. If this is not the case,
tell the classifier: e.g. ToeplitzLDA(n_channels=16, data_is_channel_prime=False)
I dont know if my data is stationary. How can I find out?
We do not provide any statistical testing or other facilities to check for stationarity.
However, we use the blockmatrix package (disclaimer: also provided by us), which can
visualize your covariance matrix in a way that you can see if stationarity is a reasonable
assumption or not. Note however, sometimes your data will look non-stationary due to,
e.g., artifacts, even though your underlying process is stationary. This often happens if
the number of data samples to estimate the covariance is small. However, in our data it
then is often better to enforce stationarity anyhow, as you can avoid overfitting on the
presumably non-stationary observed data.
Further Work / Todos
- [ ] Example how to check data for stationarity. Maybe better in
blockmatrixpackage.
