pymssa Readme and User Guide


The pymssa package implements Multivariate Singular Spectrum Analysis in Python. As of this writing, I am not aware of any other Python implementation of multivariate SSA, though there are packages and implementations of univariate SSA. R, on the other hand, has the extremely comprehensive Rssa package for performing MSSA and its variants.

This is definitely not as comprehensive as the Rssa package (which I believe is maintained by the creator of SSA), but it implements the fundamental components of MSSA. As of the time of this writing, the MSSA class features:

  • Uni- or Multi-variate decomposition of timeseries using Singular Spectrum Analysis
  • Automated options for selection of number of components:
    • Variance explained thresholding
    • Parallel analysis
    • Singular Value Hard Thresholding
  • Calculation of the w-correlation matrix of component correlations.
  • Easy interface to assign and retrieve component groupings.
  • Easy access via attributes to:
    • MSSA component matrix (or "reconstructions")
    • Left singular vectors and singular values
    • Explained variance scores and percent variance explained for components
    • Per-timeseries component rankings by variance explained/reconstruction error.
  • Recurrent forecasting function. You specify the number of timepoints out and, optionally, which timeseries and which components to forecast with.
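To give a sense of the w-correlation mentioned above, here is a minimal, self-contained sketch of the weighted correlation between two reconstructed components (a hypothetical helper written for illustration, not pymssa's own API):

```python
import numpy as np

def w_correlation(x, y, L):
    # Weighted correlation between two reconstructed series of length N.
    # The weight of index i is the number of times that index appears on
    # the anti-diagonals of the L-lagged trajectory matrix.
    N = len(x)
    K = N - L + 1
    w = np.array([min(i + 1, L, K, N - i) for i in range(N)], dtype=float)
    norm_x = np.sqrt(w @ (x * x))
    norm_y = np.sqrt(w @ (y * y))
    return (w @ (x * y)) / (norm_x * norm_y)
```

Components that carry the same underlying signal have w-correlation near 1 (any component is perfectly w-correlated with itself), which is what makes the w-correlation matrix useful for deciding which components to group together.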

pymssa Uses Numba-Optimized Functions (Where Possible)

While the MSSA class provides the main interface to the MSSA algorithm and tools, you will notice if you look through the code that the majority of the actual functions and math are imported from the optimized submodule. MSSA can be pretty slow and memory intensive, especially if you are dealing with large timeseries data. Where possible, I've converted the operations to numba-compiled functions that significantly speed up the algorithm. I've also tried to organize the loops, initializations, and intermediary steps in a way that minimizes the memory required.
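As an illustration of the kind of loop-heavy step that benefits from numba compilation, here is a diagonal-averaging (Hankelization) routine, the operation used when turning an elementary matrix back into a timeseries. This is a sketch in the spirit of the optimized submodule, not its actual code:

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fall back to a no-op decorator so the sketch still runs without numba.
    def njit(func):
        return func

@njit
def diagonal_average(X):
    # Average the anti-diagonals of an L x K matrix to recover a
    # series of length L + K - 1 (the Hankelization step in SSA).
    L, K = X.shape
    N = L + K - 1
    totals = np.zeros(N)
    counts = np.zeros(N)
    for i in range(L):
        for j in range(K):
            totals[i + j] += X[i, j]
            counts[i + j] += 1.0
    return totals / counts
```

The nested Python loop is exactly the sort of code that is slow in pure Python but compiles to fast machine code under numba's @njit.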

With a large enough dataset you will still run into problems, particularly for the SVD steps and reconstruction of components steps, despite the optimizations. At a certain point I'm not sure if there are ways around this or if you just need a bigger computer. I am not a code optimization expert, so any contributions that help with this are more than welcome! See the Contributing section for more ideas if you're interested in adding to this package.


Demo of MSSA on Australian Wine Dataset

To demonstrate the features of the MSSA class, and provide a general walkthrough of the steps involved in a standard multivariate singular spectrum analysis, I will load an example dataset that comes packaged with the Rssa R package.

This data has 7 timeseries and 187 observations (some of which are null values). It is monthly data spanning 1980 to 1995, and tracks sales of Australian wine. The columns are types of wine, including:

  • Total
  • Drywhite
  • Fortified
  • Red
  • Rose
  • Sparkling
  • Sweetwhite

You can see the header of this dataset below.

import pandas as pd
import numpy as np
import scipy

import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

from sklearn.metrics import r2_score
wine_raw = pd.read_csv('wine.csv').iloc[:,1:]
wine = wine_raw[['Total','Drywhite','Fortified','Red','Rose','Sparkling','Sweetwhite']]
date_index = pd.date_range(start='1/1/1980', periods=wine.shape[0], freq='M')
wine.index = date_index
wine.head()
|            | Total   | Drywhite | Fortified | Red  | Rose  | Sparkling | Sweetwhite |
|------------|---------|----------|-----------|------|-------|-----------|------------|
| 1980-01-31 | 15136.0 | 1954     | 2585      | 464  | 112.0 | 1686      | 85         |
| 1980-02-29 | 16733.0 | 2302     | 3368      | 675  | 118.0 | 1591      | 89         |
| 1980-03-31 | 20016.0 | 3054     | 3210      | 703  | 129.0 | 2304      | 109        |
| 1980-04-30 | 17708.0 | 2414     | 3111      | 887  | 99.0  | 1712      | 95         |
| 1980-05-31 | 18019.0 | 2226     | 3756      | 1139 | 116.0 | 1471      | 91         |

A Brief Note on the Math and Algorithms in MSSA

I've chosen not to cover the math behind MSSA in this demo. There are many resources online and I would prefer this user guide to focus on the usage and implementation of MSSA in this package. However, for those in need of a reference there are two resources in particular that I think are very informative:

  • For an overview of the math and walkthrough of the code behind singular spectrum analysis I highly recommend this blog post by Jordan D'Arcy: Introducing SSA for Time Series Decomposition. It is probably the best walkthrough of SSA that I have seen so far. It is for univariate SSA rather than multivariate SSA, but the concepts and math are essentially the same for both.
  • For the multivariate case, a fairly detailed overview of the math, trajectory matrix construction, and forecasting algorithm can be found in this paper available on Researchgate: Multivariate singular spectrum analysis: A general view and new vector forecasting approach. The implementation here corresponds to the vertical formulation of MSSA (V-MSSA), which the authors in that paper argue is superior in performance to the horizontal formulation. The forecasting in this package follows the recurrent forecasting formula for V-MSSA. I would like to eventually have the vector forecasting method implemented as well, but have not coded it yet. See the Contributing section for this and more ideas if you'd like to add it.
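To make the vertical formulation concrete, here is a rough sketch of how a V-MSSA trajectory matrix can be assembled: each series gets its own Hankel trajectory matrix, and the per-series matrices are stacked vertically. This is for illustration only; pymssa's internal construction may differ in detail.

```python
import numpy as np

def vmssa_trajectory_matrix(ts, L):
    # ts: (N, P) array of P timeseries with N observations each.
    # Each series contributes an L x K Hankel block (K = N - L + 1);
    # stacking the blocks vertically gives a (P*L) x K matrix, which
    # is what the vertical (V-MSSA) formulation decomposes via SVD.
    N, P = ts.shape
    K = N - L + 1
    blocks = []
    for p in range(P):
        s = ts[:, p]
        # Column j of the Hankel block holds the lagged window s[j : j+L].
        blocks.append(np.column_stack([s[j:j + L] for j in range(K)]))
    return np.vstack(blocks)
```

The horizontal formulation would instead stack the blocks side by side into an L x (P*K) matrix; the choice changes the shape (and interpretation) of the singular vectors.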

Centering and Splitting the Data

In order to validate the forecasting we will do at the end, I am going to split the wine data into training and testing. I've chosen to leave off 48 months, or 4 years of wine sales data, to serve as my holdout test set for validation.

There are some null values in the dataset, but they all happen to occur in the most recent 48 months, which we are leaving off for testing. This is convenient, since we only need to worry about null values in the training set. They are fine in the testing set; we will just have fewer values in some cases when calculating our performance metric.

tp = 48

wine_tr = wine.iloc[:-tp]
wine_te = wine.iloc[-tp:]

I'm also going to center the data. If you do not center the data, the first component will just end up being the offset that centers the reconstruction anyway, so it is simpler to deal with it prior to the decomposition.

Note: you may also choose to standardize the data by also dividing by the standard deviation. I've chosen not to do this here just to keep things on their original scale, but standardization is a good preprocessing step to do prior to decomposition to ensure that the contribution of variance by each timeseries is on equal ground.

tr_means = np.nanmean(wine_tr, axis=0)
wine_tr = wine_tr - tr_means
wine_te = wine_te - tr_means
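If you do choose to standardize instead, one way to sketch it (a hypothetical helper, computing the statistics on the training split only and then applying the same transform to both splits) is:

```python
import numpy as np

def standardize_train_test(train, test):
    # Fit the centering/scaling statistics on the training data only
    # (nan-aware), then apply the same transform to the holdout set,
    # so no information from the test period leaks into the fit.
    means = np.nanmean(train, axis=0)
    stds = np.nanstd(train, axis=0)
    return (train - means) / stds, (test - means) / stds
```

Using training statistics for both splits mirrors the centering step above and keeps the forecast validation honest.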

Fitting with MSSA

Now we can instantiate the MSSA object and fit to the training data. There are a handful of instantiation arguments that we will cover incrementally over the course of this demo. They are:

  • window_size
  • n_components
  • variance_explained_threshold
  • pa_percentile_threshold
  • svd_method
  • varimax (experimental)
  • verbose

The only option that I will not be covering in the demo is varimax, which is designed to perform a structured varimax rotation on the left singular vectors after decomposition to "sparsify" the components. This is experimental and I'm not totally confident it's doing what it's supposed to yet, so I am going to leave it out of this demo for now.

from pymssa import MSSA

I'll instantiate the MSSA object with n_components=None and window_size=None. This will set the number of components to be the maximum number of components, and the window size to be the maximum window size.

n_components works the same as this option in scikit-learn's decomposition functions when the values are either None or an integer. As None, the maximum number will be selected, and as an integer only that number of components will be selected. There are also some other options for this argument to automate the selection of components that I will cover later in the demo.

mssa = MSSA(n_components=None,
            window_size=None,
            verbose=True)

What does window_size do?

The window_size argument controls how many lagged copies of each timeseries are stacked to build the trajectory matrix. A window of size L turns a timeseries of length N into an L x K matrix of lagged vectors, where K = N - L + 1; in the vertical MSSA formulation, these per-series matrices are stacked vertically before decomposition. Larger windows can capture longer-period structure, but leave fewer lagged vectors to decompose. When window_size=None, pymssa uses the maximum window size.
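For intuition about the shapes involved, here are illustrative numbers for this demo's training split, using the common rule of thumb of a window around half the series length (the exact default maximum pymssa picks may differ):

```python
# Training split: 187 observations minus the 48-month holdout.
N = 139
L = N // 2       # rule-of-thumb window size: about half the series length
K = N - L + 1    # number of lagged column vectors per series
print(L, K)      # 69 71
```

With 7 timeseries, the vertical trajectory matrix would then have 7 * L rows and K columns.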
