High Frequency Trading using Support Vector Machines
This project implements a high frequency trading strategy that utilizes Support Vector Machines to capture statistical arbitrage in the pricing of Class A and Class C Google stocks.
We demonstrate a trading algorithm that earns strong profits (~53% returns over 2 years) with a Sharpe ratio of 11.
This idea is heavily inspired by the following paper by Jiayu Wu: http://cs229.stanford.edu/proj2015/028_report.pdf. This project modifies that approach into a trading strategy that periodically rebalances the beta (hedge) ratio.
TODO: Further Model Tuning, Model Explanations, Diagnostic Graphs on labelling, Labelled profit vs. Unlabelled profit
Abstract
D.S. Ehrman defines Pairs Trading as a nondirectional, relative-value investment strategy that seeks to identify two companies with similar trading characteristics whose equity securities are currently trading at a range outside their historical range. This investment strategy entails buying the undervalued security while short-selling the overvalued security, thereby maintaining market neutrality. The position should be closed once the instruments return to statistical norms, earning the trader a profit. A good pair should share as many of the same intrinsic characteristics as possible.
In our project, we will be searching for pairs trading opportunities between Class A and Class C Google stocks. Since all of the underlying fundamentals of both instruments are similar with the exception of voting rights, this pair makes a very good candidate to explore.
However, since this pair of instruments is so closely related, many other players in the market are ready to profit from any mispricing within the pair, so no mispricing is expected to persist for long. As such, we need to work on as fast a timeframe as possible. That is why we will use a high frequency pairs trading strategy to capture statistical arbitrage in the pricing of GOOGL and GOOG as soon as it occurs. In our project, we will create features from the ticker data and feed them into a machine learning model to predict profitable pairs trading opportunities.
Dataset
Our dataset contains snapshots of GOOG and GOOGL over the span of roughly 2 years (10/2016 - 11/2018) at minute-level resolution. Our data was gathered from QuantQuote.com, a reputable dealer of fine-resolution ticker datasets. Our dataset had some missing values for both tickers. From QuantQuote's website: "Missing data for certain minutes generally means that no trades occurred during that minute." We handled this by removing entries from both datasets in which at least one of the tickers had a missing entry. The reasoning behind this is that pairs trading is impossible in such instances. This only occurred for about 0.1% of our dataset.
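The missing-minute handling above amounts to an inner join on the timestamp index. A minimal pandas sketch (the column name and toy values are illustrative, not from the actual dataset):

```python
import pandas as pd

# Two toy minute-bar frames; in practice these would be read from the
# QuantQuote CSVs. 09:31 is missing from GOOG, so it must be dropped
# from GOOGL as well -- pairs trading needs both legs of the pair.
googl = pd.DataFrame(
    {"close": [1010.0, 1011.5, 1012.0]},
    index=pd.to_datetime(["2017-01-03 09:30", "2017-01-03 09:31", "2017-01-03 09:32"]),
)
goog = pd.DataFrame(
    {"close": [1005.0, 1006.5]},
    index=pd.to_datetime(["2017-01-03 09:30", "2017-01-03 09:32"]),
)

# Keep only the minutes present in BOTH tickers (inner join on the index).
common = googl.index.intersection(goog.index)
googl, goog = googl.loc[common], goog.loc[common]
```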

Pairs Trading Model
The canonical pairs trading spread model looks like:

$$\frac{dP_t^A}{P_t^A} = \alpha\,dt + \beta\,\frac{dP_t^B}{P_t^B} + dX_t$$

where $dP_t^A / P_t^A$ represents the returns of instrument A at time $t$ and $dP_t^B / P_t^B$ represents the returns of instrument B at time $t$. $X_t$ represents the spread of the returns at time $t$. One of the assumptions of this model is that this residual term is mean-reverting. We can assume this especially since the intrinsic characteristics of both securities in this instance are very similar.

$\alpha$ represents the drift term. Sometimes the spread begins to trend instead of reverting to the original mean. The drift term is one of the biggest factors of risk in pairs trading. For our problem, we assume that the drift term is negligible compared to the returns of either instrument.

$\beta$ represents the hedge ratio, which serves to normalize the volatility between the instruments. It tells us how much of instrument B to long/short for every 1 unit of instrument A to long/short, creating a risk-neutral position. We will use the close prices to calculate percent returns and the other features. Past work has assumed that $\beta$ remains constant over the duration of the dataset. For our dataset, however, different behavior in the spread is apparent in 2017 and 2018. This might be due to some change in the intrinsic characteristics of the instruments. For our solution, we will treat $\beta$ as variable and recalculate it periodically.
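The hedge ratio can be estimated by an ordinary least-squares regression of instrument A's returns on instrument B's returns; the residual series then gives the spread. A sketch with numpy on synthetic data (not the actual GOOG/GOOGL series):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic minute returns: r_a = beta * r_b + residual, with beta = 0.9.
r_b = rng.normal(0.0, 1e-3, size=5000)
resid = rng.normal(0.0, 1e-4, size=5000)
r_a = 0.9 * r_b + resid

# OLS with intercept: stack [r_b, 1] and solve the least-squares problem.
X = np.column_stack([r_b, np.ones_like(r_b)])
(beta, alpha), *_ = np.linalg.lstsq(X, r_a, rcond=None)

# Residual series epsilon_t; its cumulative sum gives the spread X_t.
eps = r_a - (alpha + beta * r_b)
spread = np.cumsum(eps)
```

Re-running this regression on a rolling or periodic window is one way to realize the "recalculate $\beta$ periodically" idea above.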
Ornstein-Uhlenbeck (OU) Stochastic Process
The Ornstein-Uhlenbeck Stochastic Process is used in finance to model the volatility of the underlying asset price process. The process can be considered a modification of the random walk (Wiener process) in which the properties of the process have been changed so that there is a tendency to walk back towards a central location. The tendency to move back towards the central location is greater when the process is further away from the mean. Thus, this process is called "mean-reverting", and it has many direct applications in pairs trading.
The OU process satisfies the following stochastic differential equation:

$$dX_t = \kappa(\mu - X_t)\,dt + \sigma\,dW_t$$

where $\kappa > 0$, $\mu$, and $\sigma$ represent parameters, and $W_t$ denotes the Wiener process (standard Brownian motion). $X_t$ is the spread of the two instruments at time $t$. $\kappa$ measures the speed of $X_t$ returning to its mean level, denoted by $\mu$. $\sigma$ represents the volatility of the spread.

In this project, we will start from the difference of returns at time $t$. Then we will integrate this process and use a linear regression to estimate the parameters $\kappa$, $\mu$, and $\sigma$. These parameters are used later for feature generation.
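Discretized over a fixed step $\Delta t$, the OU process becomes an AR(1) model $X_{t+1} = a + b X_t + \zeta_t$ with $b = e^{-\kappa \Delta t}$, $a = \mu(1-b)$, and $\mathrm{Var}(\zeta) = \sigma^2 (1-b^2) / (2\kappa)$, so a lag-1 regression recovers the parameters. A sketch on a simulated spread (one-bar steps; the true parameter values are arbitrary, chosen only to verify the recovery):

```python
import numpy as np

rng = np.random.default_rng(1)
dt = 1.0  # one bar per step

# Simulate an OU spread with known parameters so the fit can be checked.
kappa_true, mu_true, sigma_true = 0.05, 0.2, 0.1
b_true = np.exp(-kappa_true * dt)
noise_sd = sigma_true * np.sqrt((1 - b_true**2) / (2 * kappa_true))
x = np.empty(20000)
x[0] = mu_true
for t in range(1, len(x)):
    x[t] = mu_true * (1 - b_true) + b_true * x[t - 1] + noise_sd * rng.normal()

# Lag-1 auto-regression: x[t] = a + b * x[t-1] + zeta.
A = np.column_stack([x[:-1], np.ones(len(x) - 1)])
(b, a), *_ = np.linalg.lstsq(A, x[1:], rcond=None)
zeta = x[1:] - (a + b * x[:-1])

# Map the AR(1) coefficients back to the OU parameters.
kappa = -np.log(b) / dt
mu = a / (1 - b)
sigma = np.sqrt(zeta.var() * 2 * kappa / (1 - b**2))
```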
Features
Typically, trading strategies will only apply the spread model to the price of the instruments. In this project, however, we will extend it to also include the spread of some technical indicators, namely the Simple Moving Average (SMA), Exponentially Weighted Moving Average (EWMA), Money Flow Index (MFI), and Relative Strength Index (RSI).
The calculation for each of these features (over a lookback window of $n$ bars) is detailed here:

$$\text{SMA}_t = \frac{1}{n} \sum_{i=0}^{n-1} P_{t-i}$$

$$\text{EWMA}_t = \lambda P_t + (1 - \lambda)\,\text{EWMA}_{t-1}$$

$$\text{RSI}_t = 100 - \frac{100}{1 + \dfrac{\text{average gain over } n}{\text{average loss over } n}}$$

$$\text{MFI}_t = 100 - \frac{100}{1 + \dfrac{\text{positive money flow over } n}{\text{negative money flow over } n}}, \qquad \text{money flow}_t = \frac{H_t + L_t + C_t}{3} \cdot V_t$$
We can extend the spread model to include these technical indicators in addition to price because they exhibit similar behaviors for both instruments of the pair. Furthermore, it will provide more context to the classifier in addition to the price.
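The four indicators can be computed from the minute bars with pandas. The window length below is illustrative, not necessarily the one used in the project:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 300
close = pd.Series(1000 + np.cumsum(rng.normal(0, 1, n)))
high, low = close + 1.0, close - 1.0
volume = pd.Series(rng.integers(100, 1000, n).astype(float))

window = 14  # illustrative lookback

sma = close.rolling(window).mean()
ewma = close.ewm(span=window).mean()

# RSI: 100 - 100 / (1 + average gain / average loss) over the window.
delta = close.diff()
gain = delta.clip(lower=0).rolling(window).mean()
loss = (-delta.clip(upper=0)).rolling(window).mean()
rsi = 100 - 100 / (1 + gain / loss)

# MFI: like RSI, but flows weight the typical price by traded volume.
typical = (high + low + close) / 3
flow = typical * volume
pos_flow = flow.where(typical.diff() > 0, 0.0).rolling(window).sum()
neg_flow = flow.where(typical.diff() < 0, 0.0).rolling(window).sum()
mfi = 100 - 100 / (1 + pos_flow / neg_flow)
```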
Feature Generation Steps
These are the steps used to process the data for the Support Vector Machine algorithm:
- Split the dataset into train and test partitions.
- Calculate the features (price, sma, ewma, mfi, rsi) for each instrument.
- Calculate the percent change for each feature.
- Run a linear regression on the features of the training partition of instruments A and B. This serves to estimate $\beta$ according to the pairs trading spread model mentioned above. It also calculates the residual terms $\epsilon_t$.
- Construct the spread terms $X_t$ with the following equation: $X_t = \sum_{i=1}^{t} \epsilon_i$.
- Run a lag-1 auto-regression on the spread terms to get the parameters of the OU model.
- Calculate T-scores for each individual feature according to the following equation: $T_t = \frac{X_t - \mu}{\sigma}$.
- Transform the testing dataset by finding the spread residuals $\epsilon_t$ using the $\beta$ obtained from the training set.
- Construct the spread terms $X_t$ of the testing dataset using the same equation: $X_t = \sum_{i=1}^{t} \epsilon_i$.
- Calculate the T-scores for each individual feature in the testing partition by using the $\mu$ and $\sigma$ calculated on the training partition for each feature.
At the end of these steps, we have fitted the pairs trading spread model parameters and the OU process parameters to the training partition. Furthermore, we have transformed both the training and testing partitions according to those parameters and calculated the T-scores for them.
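The final transform, applying the training-set OU mean and volatility to normalize a spread series into T-scores, can be sketched as:

```python
import numpy as np

def t_scores(spread: np.ndarray, mu: float, sigma: float) -> np.ndarray:
    """Normalize a spread series by the OU mean/volatility fitted on training data."""
    return (spread - mu) / sigma

# Training-set OU fit (illustrative values), applied to a test-set spread.
mu_train, sigma_train = 0.0, 0.5
test_spread = np.array([0.5, -0.25, 0.0, 1.0])
scores = t_scores(test_spread, mu_train, sigma_train)
```

Crucially, only the training-set $\mu$ and $\sigma$ are used on the test partition, so no test-set information leaks into the features.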
Support Vector Machine
Support Vector Machines (SVM) are an effective machine learning algorithm for classification. The goal is to find a hyperplane that separates two classes of data while maximizing the margin between the classes.
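A minimal scikit-learn sketch of fitting an SVM classifier on T-score-like features (the data is synthetic and the kernel/hyperparameters are illustrative, not necessarily those used in this project):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)

# Synthetic 5-dimensional feature vectors: class 1 sits at higher scores.
X0 = rng.normal(-1.0, 0.5, size=(200, 5))
X1 = rng.normal(1.0, 0.5, size=(200, 5))
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

# RBF-kernel SVM; C and gamma would normally be tuned by cross-validation.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)
acc = clf.score(X, y)
```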
Label Generation
Though SVMs can be used for multi-class classification, we will only use them for binary classification here for simplicity's sake. We define the labels for the two classes as follows:

$$y_t = \begin{cases} 1 & \text{if } X_{t+k} - X_t \le -\text{threshold for some } 1 \le k \le \text{window} \\ 0 & \text{otherwise} \end{cases}$$
If the residual of the spread drops, there is profit to be made by shorting instrument A and longing instrument B before the residual drops. We want to label the instances leading up to the shift in residual as 1, which indicates a profitable trading entry position. Otherwise, we label the instance as 0, indicating that we should stay calm and do nothing. We hold the position for $window minutes until either the residual actually drops, or we run out of time and offload our position. $threshold controls the profitability per trade, with a higher threshold corresponding to higher per-trade profit but fewer opportunities.
In our implementation, we set the $threshold parameter to 0.0005, leading to around one third of our dataset having a label of 1.
Furthermore, we set our $window parameter to 5. This means that an instance is labelled 1 if the residual drops by the $threshold amount within 5 minutes of that instance.