SFA
Scalable Time Series Data Analytics
Time Series Data Analytics
Working with time series is difficult due to the high dimensionality of the data, erroneous or extraneous data, and large datasets. At the core of time series data analytics there are (a) a time series representation and (b) a similarity measure to compare two time series. There are many desirable properties of similarity measures. Common similarity measures in the context of time series are Dynamic Time Warping (DTW) and the Euclidean Distance (ED). However, these are decades old and do not meet today's requirements. The over-dependence of research on the UCR time series classification benchmark has led to two pitfalls: (a) a focus mostly on accuracy and (b) an assumption of pre-processed datasets. Beyond accuracy, there are additional desirable properties: (a) alignment-free structural similarity, (b) noise robustness, and (c) scalability.
This repository contains a symbolic time series representation (SFA), three univariate time series models (WEASEL, BOSS, and BOSS VS) and one multivariate model (WEASEL+MUSE) for alignment-free, noise-robust, and scalable time series data analytics. Finally, the early time series classification framework TEASER is provided.
The implemented algorithms are in the context of:
- Dimensionality Reduction: SFA performs significantly better than many other dimensionality reduction techniques, including those based on mean values such as SAX, PLA, PAA, or APCA. This is due to the fact that SFA builds upon the DFT, which is significantly more accurate than the other dimensionality reduction techniques [1].
- Classification and Accuracy: WEASEL and the BOSS ensemble classifier offer state-of-the-art classification accuracy [2], [3], [4].
- Classification and Scalability: WEASEL follows the bag-of-patterns approach, which achieves highly competitive classification accuracy and is very fast, making it applicable in domains with high runtime and quality constraints. The novelty of WEASEL is its carefully engineered feature space, using statistical feature selection, word co-occurrences, and a supervised symbolic representation to generate discriminative words. Thereby, WEASEL assigns high weights to characteristic, variable-length substructures of a time series. In our evaluation, WEASEL is consistently among the best and fastest methods; competitors are either at the same level of quality but much slower, or faster but much worse in accuracy [4]. The BOSS VS classifier is one to four orders of magnitude faster than the state of the art and significantly more accurate than the 1-NN DTW classifier, which serves as the benchmark to compare to. That is, a classification problem for which 1-NN DTW CV runs on a cluster of 4000 cores for one day can be solved with the BOSS VS classifier on commodity hardware with a 4-core CPU within one to two days, with similar or better classification accuracy [5].
- Multivariate Classification: WEASEL+MUSE is a multivariate time series classifier that offers state-of-the-art classification accuracy [6].
- Early and Accurate Classification: TEASER is a framework for early and accurate time series classification. The early classification task arises when data is collected over time and it is desirable, or even required, to predict the class label of a time series as early as possible; the earlier a decision can be made, the more rewarding it is. Compared to the state of the art, TEASER decides two to three times earlier while keeping the same (or even a higher) level of accuracy (see the sketch below).
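To make the early-classification setting concrete, the following is a minimal sketch of the decision loop, not TEASER's actual implementation: the Slave/Master types and their methods are hypothetical placeholders for TEASER's two-stage design, in which a slave classifier produces a candidate prediction and a master classifier decides whether it is reliable enough to act on.
import java.util.Arrays;

// Minimal sketch of early classification (hypothetical types, not TEASER's code).
interface Slave  { int predict(double[] prefix); }                   // candidate label
interface Master { boolean isReliable(double[] prefix, int label); } // accept or wait

class EarlySketch {
  // Classify ever-longer prefixes; stop once the decision is deemed reliable.
  static int classifyEarly(double[] ts, Slave slave, Master master,
                           int minLength, int interval) {
    for (int len = minLength; len < ts.length; len += interval) {
      double[] prefix = Arrays.copyOf(ts, len);
      int label = slave.predict(prefix);
      if (master.isReliable(prefix, label)) {
        return label; // the earlier the decision, the better
      }
    }
    return slave.predict(ts); // fall back to the full series
  }
}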

Figure (second from left) shows the BOSS model as a histogram over SFA words. It first extracts subsequences (patterns) from a time series. Next, it applies low-pass filtering and quantization to the subsequences using SFA, which reduces noise and allows string-matching algorithms to be applied. Two time series are then compared based on the differences between their histograms of SFA words.
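As an illustration, the histogram comparison can be sketched as below. This is a simplified sketch, not the repository's API: SFA words are represented as plain strings and histograms as hash maps.
import java.util.Map;

// Sketch of the BOSS distance between two bag-of-SFA-words histograms.
// Note the asymmetry: only words that occur in histogram a contribute.
class BossDistanceSketch {
  static long distance(Map<String, Integer> a, Map<String, Integer> b) {
    long dist = 0;
    for (Map.Entry<String, Integer> e : a.entrySet()) {
      long diff = e.getValue() - b.getOrDefault(e.getKey(), 0);
      dist += diff * diff;
    }
    return dist;
  }
}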
Figure (second from right) illustrates the BOSS VS model. The BOSS VS model extends the BOSS model with a compact representation of classes instead of individual time series, using the term frequency-inverse document frequency (tf-idf) weight for each class. This significantly reduces the computational complexity and highlights characteristic SFA words; the tf-idf weight matrix also provides an additional noise-reducing effect.
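The weighting can be sketched as follows (a simplified sketch; see [5] for the exact formulation): a word that is frequent within one class but appears in few classes receives a high weight.
// Sketch: tf-idf weight of an SFA word for one class.
// tf: frequency of the word within the class; df: number of classes
// containing the word; numClasses: total number of classes.
class TfIdfSketch {
  static double tfIdf(double tf, int df, int numClasses) {
    if (tf <= 0 || df <= 0) return 0;
    return (1 + Math.log(tf)) * Math.log(1.0 + (double) numClasses / df);
  }
}
A query histogram is then classified by its similarity against each class's tf-idf vector.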
Figure (right) illustrates the WEASEL model. WEASEL conceptually builds on the bag-of-patterns model. It derives discriminative features based on the dataset labels. WEASEL extracts windows at multiple lengths and also considers the order of windows (using word co-occurrences as features) instead of treating each fixed-length window as an independent feature (as in BOSS or BOSS VS). It then builds a single model from the concatenation of the feature vectors. It finally applies an aggressive statistical feature selection to remove irrelevant features from each class. The resulting feature set is highly discriminative, which allows the use of fast logistic regression.
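The co-occurrence features can be sketched like this. It is a simplified sketch, not the repository's code: plain strings stand in for SFA words, and the supervised symbolic representation and feature selection are omitted.
import java.util.HashMap;
import java.util.Map;

// Sketch: unigram and bigram (word co-occurrence) counts for the SFA words
// extracted at one window length; keys are prefixed with the window length
// so that features from different lengths remain distinct.
class BigramSketch {
  static Map<String, Integer> bagOfBigrams(String[] words, int windowLength) {
    Map<String, Integer> bag = new HashMap<>();
    for (int i = 0; i < words.length; i++) {
      bag.merge(windowLength + "_" + words[i], 1, Integer::sum);         // unigram
      if (i > 0) {
        bag.merge(windowLength + "_" + words[i - 1] + "_" + words[i],
                  1, Integer::sum);                                      // bigram
      }
    }
    return bag;
  }
}
The bags from all window lengths are concatenated into one feature vector per time series before the statistical feature selection is applied.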
Accuracy and Scalability

The figure shows, for the state-of-the-art classifiers, the total prediction runtime on the x-axis (log scale) versus the average rank on the y-axis. Runtimes include all preprocessing steps such as feature extraction or selection.
There are fast time series classifiers (BOSS VS, TSBF, LS, DTW CV) that require a few ms per prediction but have a low average rank, and there are accurate methods (ST, BOSS, EE, COTE) that require hundreds of ms to seconds per prediction. The two ensemble methods in our comparison, EE PROP and COTE, show the highest prediction times.
There is always a trade-off between accuracy and prediction time. However, WEASEL is consistently among the best and fastest-predicting methods; competitors are either (a) at the same level of quality (COTE) but much slower or (b) faster but much worse in accuracy (LS, DTW CV, TSBF, or BOSS VS).
How to include this project as a library
Step 1. Add the JitPack repository to your gradle build file:
allprojects {
    repositories {
        ...
        maven { url 'https://jitpack.io' }
    }
}
Step 2. Add the dependency:
dependencies {
    compile 'com.github.patrickzib:SFA:v0.1'
}
See the JitPack documentation for further instructions on other build systems such as Maven; a sketch for Maven follows.
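For Maven, the standard JitPack setup looks like this (an untested sketch following JitPack's generic instructions):
<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>

<dependency>
    <groupId>com.github.patrickzib</groupId>
    <artifactId>SFA</artifactId>
    <version>v0.1</version>
</dependency>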
How to import this project into your favorite IDE
You can import this project into your favorite IDE using gradle. This project has been tested with (minor versions might also work):
- Gradle >= 3.5. Please refer to the Gradle documentation for instructions on how to install Gradle.
- Java >= 1.8
The project has two gradle build targets, one for IntelliJ IDEA and one for Eclipse.
IntelliJ IDEA:
> gradle idea
:ideaModule
:ideaProject
:ideaWorkspace
:idea
BUILD SUCCESSFUL
Eclipse:
> gradle eclipse
:eclipseClasspath
:eclipseJdt
:eclipseProject
:eclipse
BUILD SUCCESSFUL
This will create an IntelliJ IDEA or Eclipse project.
SFA: Symbolic Fourier Approximation
The symbolic time series representation Symbolic Fourier Approximation (SFA) represents each real-valued time series by a string. SFA is composed of approximation using the Fourier transform and quantization using a technique called Multiple Coefficient Binning (MCB). Among its properties, the most notable are: (a) noise removal due to low-pass filtering and quantization, (b) the string representation due to quantization, and (c) the frequency domain nature of the Fourier transform. The frequency domain nature makes SFA unique among the symbolic time series representations. Dynamically adding or removing Fourier coefficients to adapt the degree of approximation is at the core of the implemented algorithms.

The figure illustrates the SFA transformation. The time series is first Fourier transformed, low-pass filtered, and then quantized to its SFA word CBBCCDCBBCBCBEBED. Higher-frequency components of a signal represent rapid changes, which are often associated with noise or dropouts. By keeping only the first Fourier coefficients, the signal is smoothed, which is equivalent to applying a low-pass filter. Quantization builds an envelope around the Fourier transform of the time series. Since symbolic representations are essentially character strings, they can be used with string algorithms and data structures such as prefix tries, bag-of-words, Markov models, or string matching.
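To make the quantization step concrete, here is a simplified sketch of the MCB lookup (not the repository's implementation; the per-coefficient bin edges are assumed to have been learned from training data, as fitTransform does below):
// Sketch of MCB quantization: map each Fourier coefficient to a symbol via
// bin edges learned independently for each coefficient position.
class McbSketch {
  // binEdges[i] holds the (symbols - 1) ascending breakpoints for coefficient i.
  static short[] quantize(double[] dft, double[][] binEdges, int symbols) {
    short[] word = new short[dft.length];
    for (int i = 0; i < dft.length; i++) {
      short symbol = 0;
      while (symbol < symbols - 1 && dft[i] > binEdges[i][symbol]) {
        symbol++;
      }
      word[i] = symbol;
    }
    return word;
  }
}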
Usage:
First, train the SFA quantization using a set of samples.
int wordLength = 4; // length of the resulting SFA words. typically between 4 and 16
int symbols = 4; // size of the discretization alphabet. 4 is the default value
boolean normMean = true; // set to true if the mean of each window should be normalized to 0

// Load datasets
TimeSeries[] train = TimeSeriesLoader.loadDataset(new File("./datasets/CBF/CBF_TEST"));

// Train the SFA representation (EQUI_DEPTH: equi-depth binning for MCB)
SFA sfa = new SFA(HistogramType.EQUI_DEPTH);
short[][] wordsTrain = sfa.fitTransform(train, wordLength, symbols, normMean);
Next, transform a time series using the trained quantization bins.
// Transform a time series
TimeSeries ts = ...;
// DFT approximation of the time series
double[] dftTs = sfa.transformation.transform(ts, ts.getLength(), wordLength);
// SFA quantization to an SFA word
short[] wordTs = sfa.quantization(dftTs);
Similarity search using the SFA distance, which lower-bounds the Euclidean distance on the raw time series.
// Compare two SFA words; the last argument is the best-so-far distance,
// used for early abandoning
SFADistance sfaDistance = new SFADistance(sfa);
double minDistance = Double.MAX_VALUE;
double distance = sfaDistance.getDistance(wordsTrain[0], wordTs, dftTs, normMean, minDistance);
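Because the SFA distance lower-bounds the Euclidean distance, it can prune candidates in a nearest-neighbor search. A minimal sketch, continuing the snippet above (the euclideanDistance helper is an assumption, not part of this repository's API):
// Sketch: 1-NN search, computing the exact distance only when the
// lower bound cannot rule a candidate out.
double bestSoFar = Double.MAX_VALUE;
int bestIndex = -1;
for (int t = 0; t < train.length; t++) {
  double lowerBound = sfaDistance.getDistance(wordsTrain[t], wordTs, dftTs, normMean, bestSoFar);
  if (lowerBound < bestSoFar) {
    double exact = euclideanDistance(train[t], ts); // assumed helper
    if (exact < bestSoFar) {
      bestSoFar = exact;
      bestIndex = t;
    }
  }
}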