RollingWindow

Intro

The purpose of this package is to calculate rolling window and expanding window statistics fast. It is aimed at any users who need to calculate rolling statistics on large data sets, and should be particularly useful for the types of analysis done in the field of quantitative finance, even though the functions implemented are primarily general-purpose.

Installing in [R]

# install.packages("devtools") # if not installed

library(devtools)
install_github("andrewuhl/RollingWindow")

Rolling Window Statistics

This package implements functions which efficiently calculate rolling window statistics using mostly single-pass algorithms. Statistics implemented include:

beta
compounding
covariance
correlation
kurtosis
mean
mean absolute error
mean of squares
mean squared error
median
min
max
product
root mean squared error
sharpe ratio
skewness
standard deviation
sum
sum product
sum of squares
sum of squared errors
variance
z score.

Note: median and min/max are exceptions as they are not calculated as a single pass algorithm, although they are still calculated quickly in O(n log window) using efficient algorithms.

Calling functions on rolling windows calculated within loops in [R] can be quite slow, even when proper attempts are made to vectorize operations within the loop. Due to performance issues, several other [R] packages implement commonly used rolling window functions -- e.g. rolling mean or rolling standard deviation -- using the languages C, Fortran or C++. Both in pure [R] and in compiled language [R] packages, the implementation is usually the naive algorithm; each time a moving window moves forward by one increment, the function is recalulated over all values in the current window.

In this package, rolling window calculations are performed in a single pass, rather than using the naive algorithm. As a rolling window is moved forward, only the oldest value has to be eliminated from the window to calculate the new rolling window statistic. Various known single pass algorithms are used, including a version of Welford's algorithm for calculating the variance. For minimum and maximum, a double-ended queue (dequeue)-based algorithm is employed. For median, for the rolling window case a circular (ring) buffer data structure is used in its algorithm which achieves O(n log window); for the expanding window case, two balanced priority queues are utilized.

This package's underlying functions are written in C++ and are built on top of the Rcpp package. In a simple set of benchmarks comparing the performance of this package's rolling functions vs. a similar package (also using Rcpp) that implements a subset of the same functions using the naive algorithm, this package's functions typically complete significantly faster, depending on the window size and function called. For large vector and/or window sizes, it is common to see a 100X+ performance improvement.

Further improvement in performance can be gained by parallelizing the rolling functions, and is left for future work.

Usage

# General form for single input (vector, matrix, list, etc.)
# result <- RollingFunction(x, window = 10)

# window size
w <- 10

# input vector
x <- rnorm(50)

# only input vector and window size required
avg <- RollingMean(x, window = w)

# std. deviation for population (n)
stddev_sample <- RollingStd(x, window = w, pop = FALSE)

# std. deviation for sample (n - 1)
stddev_pop <- RollingStd(x, window = w, pop = TRUE)

#-------------------------------------------------------------

# General form for functions with two input vectors
# result <- RollingFunction(x, y, window, ...)

# window size
w <- 10

# input vectors
x <- rnorm(50)
y <- rnorm(length(x))

corr    <- RollingCorr(x, y, window = w)
covar   <- RollingCov(x, y, window = w)
beta    <- RollingBeta(x, y, window = w)
sumprod <- RollingSumprod(x, y, window = w)

Benchmarks - Setup

To get a sense of the performance improvement by calculating rolling window statistics in a single pass, we can run the following in [R].


library(RollingWindow)
library(RcppRoll)
library(microbenchmark)
library(rbenchmark)

# setup
set.seed(10)
n    <- 10000
x    <- rnorm(n); y = rnorm(n)
win  <- 1000
reps <- 10

# relative performance
cols <- c("test", "replications", "elapsed", "relative", "user.self")

Benchmarks - Relative Performance vs. other C++ Implementation

For the first benchmark, let's compare the relative performance of this package's rolling window functions (written in C++) vs. another package's non-single pass implementation (also written in C++).

benchmark(RollingMax(x, win),    roll_max(x, win), replications = reps, columns = cols)

benchmark(RollingMean(x, win),   roll_mean(x, win), replications = reps, columns = cols)

benchmark(RollingMedian(x, win), roll_median(x, win), replications = reps, columns = cols)

benchmark(RollingMin(x, win),    roll_min(x, win), replications = reps, columns = cols)

benchmark(RollingProd(x, win),   roll_prod(x, win), replications = reps, columns = cols)

benchmark(RollingVar(x, win),    roll_var(x, win), replications  = reps, columns = cols)

benchmark(RollingStd(x, win),    roll_sd(x, win), replications    = reps, columns = cols)

In the following table, the "relative" column shows relative performance. The row with 1.0 is the faster of the two functions. For example, in the standard deviation benchmark, RollingWindow's RollingStd() function runs 121X faster than RcppRoll's roll_sd() function.

Max

| test | replications | elapsed | relative | user.self | | ---- | ------------: | -------: | --------: | ---------: | | RollingMax | 10 | 0.005 | 1.0 | 0.005 | | roll_max | 10 | 0.546 | 109.2 | 0.547 |

Mean

| test | replications | elapsed | relative | user.self | | ---- | ------------: | -------: | --------: | ---------: | | RollingMean | 10 | 0.003 | 1.0 | 0.004 | | roll_mean | 10 | 0.102 | 34 | 0.101 |

Median

| test | replications | elapsed | relative | user.self | | ---- | ------------: | -------: | --------: | ---------: | | RollingMedian | 10 | 0.018 | 1.0 | 0.019 | | roll_median | 10 | 6.362 | 353.4 | 6.363 |

Minimum

| test | replications | elapsed | relative | user.self | | ---- | ------------: | -------: | --------: | ---------: | | RollingMin | 10 | 0.005 | 1.0 | 0.005 | | roll_min | 10 | 0.688 | 137.6 | 0.686 |

Product

| test | replications | elapsed | relative | user.self | | ---- | ------------: | -------: | --------: | ---------: | | RollingProd | 10 | 0.004 | 1.0 | 0.003 | | roll_prod | 10 | 0.170 | 42.5 | 0.165 |

Standard Deviation

| test | replications | elapsed | relative | user.self | | ---- | ------------: | -------: | --------: | ---------: | | RollingStd | 10 | 0.005 | 1.0 | 0.005 | | roll_sd | 10 | 0.606 | 121.2 | 0.605 |

Variance

| test | replications | elapsed | relative | user.self | | ---- | ------------: | -------: | --------: | ---------: | | RollingVar | 10 | 0.004 | 1.0 | 0.004 | | roll_var | 10 | 0.535 | 133.8 | 0.534 |

In all cases, RollingWindow's functions are much faster, between 34X and 353X.

Benchmarks - Absolute Performance vs. other C++ Implementation

The columns below show the minimum, lower quartile, mean, median, upper quartile and max runtimes, over 10 trials.

Unit: microseconds
          expr        min         lq        mean      median         uq        max neval
    RollingMin    463.788    473.049    714.1995    553.1895    580.332   2350.546    10
      roll_min  68121.511  68135.456  68453.4996  68219.4405  68661.142  69307.602    10
    RollingMax    468.912    485.468    517.9764    525.0100    547.139    572.501    10
      roll_max  54536.383  54557.621  54764.4425  54572.1355  54655.649  56366.473    10
    RollingVar    411.392    428.191    645.7962    490.8550    511.573   2210.802    10
      roll_var  54394.805  54520.076  62992.3953  56243.7960  56426.747 130175.704    10
    RollingStd    515.981    520.592    534.3961    523.9755    530.948    581.469    10
       roll_sd  54169.392  54438.932  55226.7215  54532.8320  56237.989  56771.980    10
    RollingSum    272.602    279.169    480.0387    295.4400    322.925   2136.182    10
      roll_sum  10102.570  10110.305  10163.8197  10150.5820  10188.679  10259.007    10
   RollingMean    325.108    329.620    367.5140    337.5415    425.707    477.601    10
     roll_mean  10102.594  10129.014  10153.1129  10149.5655  10185.238  10200.878    10
   RollingProd    343.545    347.867    397.7944    408.7895    443.519    448.813    10
     roll_prod  16671.556  16675.133  16697.4066  16692.6145  16712.746  16755.501    10
 RollingMedian   1783.405   1795.463   1845.6817   1851.4000   1885.285   1935.361    10
   roll_median 634023.656 635203.964 637219.3584 636503.7700 640487.120 641341.506    10

Note that the time unit of the result is microseconds. If we look at the median, RollingMedian's mean runtime is 1.8 milliseconds, vs roll_median's mean runtime of 637 milliseconds. While both are fast in the absolute, it is easy to imagine having to run the analysis on several thousand time series with, e.g. daily data. Suppose in a financial application, 2,000 stocks with decades of daily data had to be analyzed. RollingMedian would complete in less than 4 seconds, while roll_median would take 21 minutes. Trying to calculate the same using pure [R] code would be much slower still.

Benchmarks - All RollingWindow functions

Continuing with the same benchmark setup, the runtimes for all of RollingWindow packages function are seen below.

Unit: microseconds
     expr      min       lq      mean    median       uq       max neval
     Beta  564.435  565.453  740.8940  567.8535  570.908  2297.228    10
 Compound  871.097  879.136 3212.4075  914.9325 1098.205 23471.614    10
     Corr

RollingWindow

Install / Use

README

RollingWindow

Intro

Installing in [R]

Rolling Window Statistics

Usage

Benchmarks - Setup

Benchmarks - Relative Performance vs. other C++ Implementation

Max

Mean

Median

Minimum

Product

Standard Deviation

Variance

Benchmarks - Absolute Performance vs. other C++ Implementation

Benchmarks - All RollingWindow functions