High Frequency Trading using Support Vector Machines
This project implements a high frequency trading strategy that utilizes Support Vector Machines to capture statistical arbitrage in the pricing of Class A and Class C Google stocks.
We demonstrate a trading algorithm that earns strong profits (~53% returns over 2 years) with a Sharpe ratio of 11.
This idea is heavily inspired by the following paper by Jiayu Wu: http://cs229.stanford.edu/proj2015/028_report.pdf. This project modifies that approach into a trading strategy that periodically rebalances the beta (hedge) ratio.
TODO: Further Model Tuning, Model Explanations, Diagnostic Graphs on labelling, Labelled profit vs. Unlabelled profit
Abstract
D.S. Ehrman defines Pairs Trading as a nondirectional, relative-value investment strategy that seeks to identify two companies with similar trading characteristics whose equity securities are currently trading at a range outside their historical range. This investment strategy entails buying the undervalued security while short-selling the overvalued security, thereby maintaining market neutrality. The position should be closed once the instruments return to statistical norms, earning the trader a profit. A good pair should share as many of the same intrinsic characteristics as possible.
In our project, we will be searching for pairs trading opportunities between Class A and Class C Google stocks. Since all of the underlying fundamentals of both instruments are similar with the exception of voting rights, this pair makes a very good candidate to explore.
However, since this pair of instruments is so closely related, many other players in the market are ready to profit from any mispricing within the pair, so no mispricing is expected to persist for long. As such, we need to work on as fast a timeframe as possible. That is why we will use a high frequency pairs trading strategy to capture statistical arbitrage in the pricing of GOOGL and GOOG as soon as it occurs. In our project, we will create features from the ticker data and feed them into a machine learning model to predict profitable pairs trading opportunities.
Dataset
Our dataset contains snapshots of GOOG and GOOGL over the span of roughly 2 years (10/2016 - 11/2018) at minute-level resolution. Our data was gathered from QuantQuote.com, a reputable dealer of fine-resolution ticker datasets. Our dataset had some missing values for both tickers. From QuantQuote's website: "Missing data for certain minutes generally means that no trades occurred during that minute." We handled this by removing entries from both datasets in which at least one of the tickers had a missing entry. The reasoning behind this is that pairs trading is impossible in such instances. This only occurred for about 0.1% of our dataset.
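The missing-minute handling above amounts to an inner join on the timestamp index. A minimal pandas sketch (the column name and toy values are illustrative, not from the actual dataset):

```python
import pandas as pd

# Two toy minute-bar frames; in practice these would be read from the
# QuantQuote CSVs. 09:31 is missing from GOOG, so it must be dropped
# from GOOGL as well -- pairs trading needs both legs of the pair.
googl = pd.DataFrame(
    {"close": [1010.0, 1011.5, 1012.0]},
    index=pd.to_datetime(["2017-01-03 09:30", "2017-01-03 09:31", "2017-01-03 09:32"]),
)
goog = pd.DataFrame(
    {"close": [1005.0, 1006.5]},
    index=pd.to_datetime(["2017-01-03 09:30", "2017-01-03 09:32"]),
)

# Keep only the minutes present in BOTH tickers (inner join on the index).
common = googl.index.intersection(goog.index)
googl, goog = googl.loc[common], goog.loc[common]
```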

Pairs Trading Model
The canonical pairs trading spread model looks like:

$$\frac{dP_t^A}{P_t^A} = \alpha\,dt + \beta\,\frac{dP_t^B}{P_t^B} + dX_t$$

where $dP_t^A / P_t^A$ represents the returns of instrument A at time $t$ and $dP_t^B / P_t^B$ represents the returns of instrument B at time $t$. $X_t$ represents the spread of the returns at time $t$. One of the assumptions of this model is that this residual term is mean-reverting. We can assume this especially since the intrinsic characteristics of both securities in this instance are very similar.

$\alpha$ represents the drift term. Sometimes the spread begins to trend instead of reverting to the original mean. The drift term is one of the biggest factors of risk in pairs trading. For our problem, we assume that the drift term is negligible compared to the returns of either instrument.

$\beta$ represents the hedge ratio, which serves to normalize the volatility between the instruments. It tells us how much of instrument B to long/short for every 1 unit of instrument A to long/short, creating a risk-neutral position. We will use the close prices to calculate percent returns and the other features. Past work has assumed that $\beta$ remains constant over the duration of the dataset. For our dataset, however, different behavior in the spread is apparent in 2017 and 2018. This might be due to some change in the intrinsic characteristics of the instruments. For our solution, we will treat $\beta$ as variable and recalculate it periodically.
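The hedge ratio can be estimated by an ordinary least-squares regression of instrument A's returns on instrument B's returns; the residual series then gives the spread. A sketch with numpy on synthetic data (not the actual GOOG/GOOGL series):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic minute returns: r_a = beta * r_b + residual, with beta = 0.9.
r_b = rng.normal(0.0, 1e-3, size=5000)
resid = rng.normal(0.0, 1e-4, size=5000)
r_a = 0.9 * r_b + resid

# OLS with intercept: stack [r_b, 1] and solve the least-squares problem.
X = np.column_stack([r_b, np.ones_like(r_b)])
(beta, alpha), *_ = np.linalg.lstsq(X, r_a, rcond=None)

# Residual series epsilon_t; its cumulative sum gives the spread X_t.
eps = r_a - (alpha + beta * r_b)
spread = np.cumsum(eps)
```

Re-running this regression on a rolling or periodic window is one way to realize the "recalculate $\beta$ periodically" idea above.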
Ornstein-Uhlenbeck (OU) Stochastic Process
The Ornstein-Uhlenbeck Stochastic Process is used in finance to model the volatility of the underlying asset price process. The process can be considered a modification of the random walk (Wiener process) in which the properties of the process have been changed so that there is a tendency to walk back towards a central location. The tendency to move back towards the central location is greater when the process is further away from the mean. Thus, this process is called "mean-reverting", and it has many direct applications in pairs trading.
The OU process satisfies the following stochastic differential equation:

$$dX_t = \kappa(\mu - X_t)\,dt + \sigma\,dW_t$$

where $\kappa > 0$, $\mu$, and $\sigma$ represent parameters, and $W_t$ denotes the Wiener process (standard Brownian motion). $X_t$ is the spread of the two instruments at time $t$. $\kappa$ measures the speed of $X_t$ returning to its mean level, denoted by $\mu$. $\sigma$ represents the volatility of the spread.

In this project, we will start from the difference of returns at time $t$. Then we will integrate this process and use a linear regression to estimate the parameters $\kappa$, $\mu$, and $\sigma$. These parameters are used later for feature generation.
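Discretized over a fixed step $\Delta t$, the OU process becomes an AR(1) model $X_{t+1} = a + b X_t + \zeta_t$ with $b = e^{-\kappa \Delta t}$, $a = \mu(1-b)$, and $\mathrm{Var}(\zeta) = \sigma^2 (1-b^2) / (2\kappa)$, so a lag-1 regression recovers the parameters. A sketch on a simulated spread (one-bar steps; the true parameter values are arbitrary, chosen only to verify the recovery):

```python
import numpy as np

rng = np.random.default_rng(1)
dt = 1.0  # one bar per step

# Simulate an OU spread with known parameters so the fit can be checked.
kappa_true, mu_true, sigma_true = 0.05, 0.2, 0.1
b_true = np.exp(-kappa_true * dt)
noise_sd = sigma_true * np.sqrt((1 - b_true**2) / (2 * kappa_true))
x = np.empty(20000)
x[0] = mu_true
for t in range(1, len(x)):
    x[t] = mu_true * (1 - b_true) + b_true * x[t - 1] + noise_sd * rng.normal()

# Lag-1 auto-regression: x[t] = a + b * x[t-1] + zeta.
A = np.column_stack([x[:-1], np.ones(len(x) - 1)])
(b, a), *_ = np.linalg.lstsq(A, x[1:], rcond=None)
zeta = x[1:] - (a + b * x[:-1])

# Map the AR(1) coefficients back to the OU parameters.
kappa = -np.log(b) / dt
mu = a / (1 - b)
sigma = np.sqrt(zeta.var() * 2 * kappa / (1 - b**2))
```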
Features
Typically, trading strategies will only apply the spread model to the price of the instruments. In this project, however, we will extend it to also include the spread of some technical indicators, namely the Simple Moving Average (SMA), Exponentially Weighted Moving Average (EWMA), Money Flow Index (MFI), and Relative Strength Index (RSI).
The calculation for each of these features (over a lookback window of $n$ bars) is detailed here:

$$\text{SMA}_t = \frac{1}{n} \sum_{i=0}^{n-1} P_{t-i}$$

$$\text{EWMA}_t = \lambda P_t + (1 - \lambda)\,\text{EWMA}_{t-1}$$

$$\text{RSI}_t = 100 - \frac{100}{1 + \dfrac{\text{average gain over } n}{\text{average loss over } n}}$$

$$\text{MFI}_t = 100 - \frac{100}{1 + \dfrac{\text{positive money flow over } n}{\text{negative money flow over } n}}, \qquad \text{money flow}_t = \frac{H_t + L_t + C_t}{3} \cdot V_t$$
We can extend the spread model to include these technical indicators in addition to price because they exhibit similar behaviors for both instruments of the pair. Furthermore, it will provide more context to the classifier in addition to the price.
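The four indicators can be computed from the minute bars with pandas. The window length below is illustrative, not necessarily the one used in the project:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 300
close = pd.Series(1000 + np.cumsum(rng.normal(0, 1, n)))
high, low = close + 1.0, close - 1.0
volume = pd.Series(rng.integers(100, 1000, n).astype(float))

window = 14  # illustrative lookback

sma = close.rolling(window).mean()
ewma = close.ewm(span=window).mean()

# RSI: 100 - 100 / (1 + average gain / average loss) over the window.
delta = close.diff()
gain = delta.clip(lower=0).rolling(window).mean()
loss = (-delta.clip(upper=0)).rolling(window).mean()
rsi = 100 - 100 / (1 + gain / loss)

# MFI: like RSI, but flows weight the typical price by traded volume.
typical = (high + low + close) / 3
flow = typical * volume
pos_flow = flow.where(typical.diff() > 0, 0.0).rolling(window).sum()
neg_flow = flow.where(typical.diff() < 0, 0.0).rolling(window).sum()
mfi = 100 - 100 / (1 + pos_flow / neg_flow)
```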
Feature Generation Steps
These are the steps used to process the data for the Support Vector Machine algorithm:
- Split the dataset into train and test partitions.
- Calculate the features (price, sma, ewma, mfi, rsi) for each instrument.
- Calculate the percent change for each feature.
- Run a linear regression on the features of the training partition of instruments A and B. This serves to estimate $\beta$ according to the pairs trading spread model mentioned above. It also calculates the residual terms $\epsilon_t$.
- Construct the spread terms $X_t$ with the following equation: $X_t = \sum_{i=1}^{t} \epsilon_i$.
- Run a lag-1 auto-regression on the spread terms to get the parameters of the OU model.
- Calculate T-scores for each individual feature according to the following equation: $T_t = \frac{X_t - \mu}{\sigma}$.
- Transform the testing dataset by finding the spread residuals $\epsilon_t$ using the $\beta$ obtained from the training set.
- Construct the spread terms $X_t$ of the testing dataset using the same equation: $X_t = \sum_{i=1}^{t} \epsilon_i$.
- Calculate the T-scores for each individual feature in the testing partition by using the $\mu$ and $\sigma$ calculated on the training partition for each feature.
At the end of these steps, we have fitted the pairs trading spread model parameters and the OU process parameters to the training partition. Furthermore, we have transformed both the training and testing partitions according to those parameters and calculated the T-scores for them.
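The final transform, applying the training-set OU mean and volatility to normalize a spread series into T-scores, can be sketched as:

```python
import numpy as np

def t_scores(spread: np.ndarray, mu: float, sigma: float) -> np.ndarray:
    """Normalize a spread series by the OU mean/volatility fitted on training data."""
    return (spread - mu) / sigma

# Training-set OU fit (illustrative values), applied to a test-set spread.
mu_train, sigma_train = 0.0, 0.5
test_spread = np.array([0.5, -0.25, 0.0, 1.0])
scores = t_scores(test_spread, mu_train, sigma_train)
```

Crucially, only the training-set $\mu$ and $\sigma$ are used on the test partition, so no test-set information leaks into the features.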
Support Vector Machine
Support Vector Machines (SVM) are an effective machine learning algorithm for classification. The goal is to find a hyperplane that separates two classes of data while maximizing the margin between the classes.
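A minimal scikit-learn sketch of fitting an SVM classifier on T-score-like features (the data is synthetic and the kernel/hyperparameters are illustrative, not necessarily those used in this project):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)

# Synthetic 5-dimensional feature vectors: class 1 sits at higher scores.
X0 = rng.normal(-1.0, 0.5, size=(200, 5))
X1 = rng.normal(1.0, 0.5, size=(200, 5))
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

# RBF-kernel SVM; C and gamma would normally be tuned by cross-validation.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)
acc = clf.score(X, y)
```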
Label Generation
Though SVMs can be used for multi-class classification, we will only use them for binary classification here for simplicity's sake. We define the labels for the two classes as follows:

$$y_t = \begin{cases} 1 & \text{if } X_{t+k} - X_t \le -\text{threshold for some } 1 \le k \le \text{window} \\ 0 & \text{otherwise} \end{cases}$$
If the residual of the spread drops, there is profit to be made by shorting instrument A and longing instrument B before the residual drops. We want to label the instances leading up to the shift in residual as 1, which indicates a profitable trading entry position. Otherwise, we label the instance as 0, indicating that we should stay calm and do nothing. We hold the position for $window minutes until either the residual actually drops, or we run out of time and offload our position. $threshold controls the profitability per trade, with a higher threshold corresponding to higher per-trade profit but fewer opportunities.
In our implementation, we set the $threshold parameter to 0.0005, leading to around one third of our dataset having a label of 1.
Furthermore, we set our $window parameter to 5. This means that an instance is labelled 1 if the residual drops by the $threshold amount within 5 minutes of that instance.