URT

Fast Unit Root Tests and OLS regression in C++ with wrappers for R and Python.

Description

URT is a library designed for speed while keeping a high level of flexibility for the user when testing for a unit root in a time series.

URT core code is in C++ and is based on three of the most widely used C++ linear algebra libraries: Armadillo, Blaze and Eigen. The user can switch from one library to another and compare performance. While some are faster than others depending on array dimensions, all of them have been included, as they are under active development and future updates might improve their respective performance. They can all be compiled either by linking to external high-speed BLAS/LAPACK replacements such as Intel MKL and OpenBLAS for better performance, or by using their own BLAS/LAPACK routines.

URT can also be used under R and Python. The wrapper for R, called RcppURT, currently uses Armadillo and is developed under Rcpp and R6 using the R package RcppArmadillo. The wrapper for Python, called CyURT, currently uses Blaze and is developed under Cython for C++.

URT contains an Ordinary Least Squares regression (OLS) and four of the most famous unit root tests: the Augmented Dickey-Fuller test (ADF), the Dickey-Fuller Generalized Least Squares test (DF-GLS), the Phillips-Perron test and the Kwiatkowski–Phillips–Schmidt–Shin test (KPSS). ADF and DF-GLS allow for lag length optimization through different methods such as information criterion minimization and t-statistic. Test p-values can be computed via an extension of the method proposed by Cheung and Lai back in 1995 or by bootstrap.
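To make the link between OLS and a unit root test concrete, here is a minimal NumPy sketch of an ADF t-statistic with a constant and a fixed number of lags. It is an illustration under stated assumptions, not URT's code: the function name `adf_statistic` and its signature are hypothetical, and lag length optimization and p-values are left out.

```python
import numpy as np


def adf_statistic(y, lags=0):
    """t-statistic of rho in  dy_t = c + rho * y_{t-1} + sum_i g_i * dy_{t-i} + e_t.

    Hypothetical helper for illustration only; this is not URT's API.
    """
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)                       # dy[t] = y[t+1] - y[t]
    n_obs = dy.size - lags                # observations left after lagging
    # Regressors: constant, lagged level, then `lags` lagged differences.
    columns = [np.ones(n_obs), y[lags:-1]]
    columns += [dy[lags - i:-i] for i in range(1, lags + 1)]
    X = np.column_stack(columns)
    target = dy[lags:]
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)    # OLS coefficients
    resid = target - X @ beta
    sigma2 = resid @ resid / (n_obs - X.shape[1])        # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)                # coefficient covariance
    return beta[1] / np.sqrt(cov[1, 1])


rng = np.random.default_rng(0)
noise = rng.standard_normal(500)              # stationary series
walk = np.cumsum(rng.standard_normal(500))    # series with a unit root
print(adf_statistic(noise, lags=2), adf_statistic(walk, lags=2))
```

The statistic for the stationary series should come out far more negative than the one for the random walk, which is what the test relies on to reject the unit root hypothesis.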

URT is single-threaded for most unit root tests but runs in parallel, using OpenMP, for the most time-consuming ones: ADF and DF-GLS with lag length optimization by information criterion minimization.

URT was written under Linux but can easily be adapted to run on Windows or OSX, as all the libraries used in this project exist on these platforms.

Why such a project and how can you contribute?

I have been developing algorithmic trading tools for a while, and it is no secret that unit root tests are widely used in this domain to decide whether a time series is (weakly) stationary or not and to build a profitable mean-reversion strategy on that idea. Nowadays you often have to look at smaller and smaller time frames, such as minute data, to find such trading opportunities, which on the back-testing side means using more and more historical data to test whether the strategy can be profitable in the long term. I found it frustrating that the available libraries under R and Python, interpreted languages commonly used in the first steps of building a trading algorithm, were too slow or did not offer enough flexibility. I therefore wanted to develop a library that could be used under higher-level languages to get a first idea of the profitability of a strategy, and also under a lower-level language such as C++ when developing a more serious back-tester on a larger amount of historical data.

In algorithmic trading we have to find the right sample size to test for stationarity. If we use too short a sample of historical data on a rolling window, the back-testing will be faster but the test will be less precise and the results less reliable. Conversely, if we use too large a sample, the back-testing will be slower but the test will be more precise and the results more reliable. Hence, when testing for stationarity we always have to keep this trade-off in mind. Sample sizes typically range from 100 to 5000, leading to relatively small arrays. I therefore decided not to use parallelism for matrix and vector operations, as it would not bring any speed improvement and would on the contrary slow the code down on such small dimensions. Armadillo does not allow for parallelism yet, but Blaze and Eigen do, so I made sure to turn this ability off. However, parallelism is used, by enabling OpenMP, to speed up the lag length optimization by information criterion minimization in the ADF and DF-GLS tests. All of these libraries now use vectorization (from SSE to AVX), and activating this feature greatly improves general performance.

During my experiments I tried to find the best setup for each C++ linear algebra library (Armadillo, Blaze and Eigen compiled with either Intel MKL or OpenBLAS) in order to get the fastest results on a standard sample size of 1000. If anyone can find a faster configuration for any of them or, more generally, has anything to propose that could make the C++ code or the Cython and Rcpp wrappers faster, they are more than welcome to contribute to this project.

What is inside this repository?

Ordinary Least Squares regression
Augmented Dickey-Fuller test
Dickey-Fuller Generalized Least Squares test
Phillips-Perron test
Kwiatkowski–Phillips–Schmidt–Shin test
Lag dependent unit root tests critical values and p-values
CyURT: wrapper to run URT under Python
RcppURT: wrapper to run URT under R
Benchmarks

Innovation

Unit root tests use lags in order to reduce autocorrelation as much as possible in the time series being tested. The test p-value is lag dependent, as the critical values differ with the number of lags; several studies have shown this dependency and it can easily be verified by Monte-Carlo simulations. However, very few unit root test libraries take this phenomenon into account, and those that do not return wrong p-values for a large number of lags. The method used in this project is the one explained by Cheung and Lai in "Lag Order and Critical Values of the Augmented Dickey-Fuller Test" (1995). This method has been pushed further and adapted to other unit root tests.

The method is simple. Starting from a chosen set of sample sizes and a chosen set of numbers of lags, it consists of 3 steps:

step 1: generate a non-stationary random sample (Wiener process) of a given size for ADF, DF-GLS and Phillips-Perron tests and a stationary random sample (Gaussian noise) of a given size for the KPSS test
step 2: compute the corresponding test statistic for a given number of lags
repeat steps 1 and 2 many times to get the test statistics for a given (sample size, number of lags) pair
step 3: sort the statistics obtained to get their distribution and record the critical value for all required significance levels
repeat steps 1 to 3 for all possible pairs of sample size and number of lags, then fit these critical values by OLS regression, for each required significance level, to the equation proposed by Cheung and Lai:

CR(N,k) = Tau(0) + sum_{i=1..2} Tau(i) * (1/T)^i + sum_{j=1..2} Phi(j) * (k/T)^j + Epsilon(N,k)

where CR(N,k) is the critical value estimate for a sample of size N and a number of lags k (and for a given significance level), T = N - k is the effective number of observations and Epsilon(N,k) denotes the model residuals.
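Steps 1 to 3 can be sketched for a single (sample size, number of lags) pair as follows. This is a hedged NumPy illustration for the ADF case with a constant only: the names `df_tstat` and `simulate_critical_values` are hypothetical, and the number of simulations is kept small so the example runs quickly.

```python
import numpy as np


def df_tstat(y, lags):
    """ADF t-statistic with a constant and `lags` lagged differences (illustrative)."""
    dy = np.diff(y)
    n_obs = dy.size - lags
    X = np.column_stack(
        [np.ones(n_obs), y[lags:-1]]
        + [dy[lags - i:-i] for i in range(1, lags + 1)]
    )
    target = dy[lags:]
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    sigma2 = resid @ resid / (n_obs - X.shape[1])
    return beta[1] / np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])


def simulate_critical_values(size, lags, levels=(0.01, 0.05, 0.10),
                             n_sim=2000, seed=0):
    """Monte-Carlo critical values for one (size, lags) pair."""
    rng = np.random.default_rng(seed)
    stats = np.empty(n_sim)
    for s in range(n_sim):
        # Step 1: non-stationary sample (discretized Wiener process).
        walk = np.cumsum(rng.standard_normal(size))
        # Step 2: test statistic for the given number of lags.
        stats[s] = df_tstat(walk, lags)
    # Step 3: empirical quantiles of the simulated statistics.
    return dict(zip(levels, np.quantile(stats, levels)))


cv = simulate_critical_values(size=100, lags=1)
print(cv)
```

Repeating this over a grid of sizes and lags gives the critical values that are then fitted by OLS. For the KPSS test the simulated sample would be Gaussian noise instead of a random walk, and the upper quantiles would be recorded.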

In order to increase the precision of the method, some terms have been added, going beyond degree 2 for the first sum and/or the second sum, while trying to keep the heteroskedasticity-consistent t-statistics of the regression coefficients significant. Both the sample size and number of lags sets proposed by Cheung and Lai have been expanded. For the most important critical values, that is those at the 1%, 5% and 10% significance levels for the ADF, DF-GLS and Phillips-Perron tests and at the 99%, 95% and 90% levels for the KPSS test, Monte-Carlo critical values have been computed using a high number of simulations on reduced sets of sizes and lags. These served as a reference to compare and improve the precision of the estimated critical values, by modifying the initial set of sizes and lags and by adding or removing terms in the original equation proposed by Cheung and Lai.
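The OLS fit of critical values to a response surface in powers of 1/T and k/T, with T = N - k, can be sketched as below. This is an assumption-laden illustration: `design_matrix` and `fit_response_surface` are hypothetical names, the degrees of both sums are exposed as parameters to mirror the added terms, and the "critical values" are synthetic numbers generated from placeholder coefficients, not values from URT.

```python
import numpy as np


def design_matrix(sizes, lags, deg_t=2, deg_k=2):
    """Columns: 1, (1/T)^1..deg_t, (k/T)^1..deg_k, with T = N - k."""
    n = np.asarray(sizes, dtype=float)
    k = np.asarray(lags, dtype=float)
    t = n - k                                  # effective number of observations
    cols = [np.ones_like(t)]
    cols += [(1.0 / t) ** i for i in range(1, deg_t + 1)]
    cols += [(k / t) ** j for j in range(1, deg_k + 1)]
    return np.column_stack(cols)


def fit_response_surface(sizes, lags, crit_values, deg_t=2, deg_k=2):
    """OLS fit; returns (tau, phi) with tau[0] the asymptotic estimate."""
    X = design_matrix(sizes, lags, deg_t, deg_k)
    y = np.asarray(crit_values, dtype=float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[:deg_t + 1], coef[deg_t + 1:]


# Synthetic check: build noise-free "critical values" from placeholder
# coefficients (NOT URT's coefficients) and recover them by OLS.
grid = [(n, k) for n in (50, 100, 200, 500, 1000) for k in range(9)]
sizes, lags = zip(*grid)
true_tau = np.array([-2.9, -4.0, -20.0])
true_phi = np.array([0.7, -0.5])
crit = design_matrix(sizes, lags) @ np.concatenate([true_tau, true_phi])
tau, phi = fit_response_surface(sizes, lags, crit)
print(tau, phi)
```

With simulated rather than synthetic critical values the fit would leave non-zero residuals, and the significance of each added term could be judged from heteroskedasticity-consistent standard errors, which this sketch does not compute.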

The coefficients obtained by OLS regression for each unit root test and each significance level are reported in the header files in ./URT/include:

Coeff_adf.hpp for the ADF test
Coeff_dfgls.hpp for the DF-GLS test
Coeff_pp.hpp for the Phillips-Perron tests (t-statistic and normalized statistic)
Coeff_kpss.hpp for the KPSS test

NB: in these headers, arrays indexed by 0 contain the asymptotic estimate of the critical value for the corresponding significance level, Tau(0), and the coefficients Tau(i) of the first sum of the equation; arrays indexed by 1 contain the coefficients Phi(j) of the second sum.
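As a rough illustration of how such coefficient arrays turn into a critical value, the sketch below evaluates the response surface for a given sample size and number of lags. The function `critical_value`, the assumption that Tau(0) is stored first in the array indexed by 0, and the coefficient values are all hypothetical placeholders; the real numbers live in the header files listed above.

```python
def critical_value(size, lags, coeffs):
    """Evaluate Tau(0) + sum_i Tau(i) * (1/T)^i + sum_j Phi(j) * (k/T)^j.

    coeffs[0] = [Tau(0), Tau(1), ...] and coeffs[1] = [Phi(1), Phi(2), ...];
    this ordering is an assumption made for the example.
    """
    tau, phi = coeffs
    t = size - lags                      # effective number of observations T
    value = tau[0]                       # asymptotic critical value
    value += sum(c * (1.0 / t) ** i for i, c in enumerate(tau[1:], start=1))
    value += sum(c * (lags / t) ** j for j, c in enumerate(phi, start=1))
    return value


# Placeholder coefficients chosen for easy arithmetic, NOT taken from URT.
example = ([-2.0, 4.0, 0.0], [1.0, 0.0])
print(critical_value(110, 10, example))
```

With these placeholders and T = 100, the two correction terms are 4 * 0.01 and 1 * 0.1, so the result sits slightly above Tau(0) = -2; as the sample size grows with the lags fixed, the value tends to Tau(0).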

Requirements

To use the C++ version of URT you will need to install at least one of these three free C++ linear algebra libraries:

Armadillo version >= 7.600.1
Blaze version >= 3.0
Eigen version >= 3.3.1

You will also need the Boost C++ libraries (version >= 1.61.0) installed.

The Blaze library requires at least LAPACK to run, whereas Armadillo and Eigen have their own internal BLAS/LAPACK routines. For better performance, however, I recommend installing one of the following high-speed BLAS/LAPACK replacement libraries:

Intel MKL version >= 2017.
