Multi-Factor Models

Author: Jerry Xia

Date: 2018/07/27

Note: Advanced Markdown features such as math expressions may not render correctly on GitHub; see README.pdf for the full details.

Project Introduction

This is a research survey of alpha trading. In this project, I built an alpha trading pipeline that includes:

factor pretest
factor screening
factor combination (modeling)

The models involved are APT models, Barra's risk models, and a dynamic factor model using the Kalman filter.

Files

rqdata_utils.py: Utils dealing with the rice quant platform data
Step1_FactorPretest.ipynb: Factor returns profile visualization
Step2_FactorsScreening.ipynb: Factor returns turnover visualization and correlation coefficients
Step3_FactorCombination_AdaBoost_Quantopian.ipynb: A Quantopian notebook file to combine alpha factors using AdaBoost
Step3_FactorCombination_BarraKalmanFilter.ipynb: Barra's risk model with three calibration schemes:
- Scheme 1: Cross-sectional regression and weighted average
- Scheme 2: Optimization problem: minimize the exponential weighted average of squared error
- Scheme 3: Dynamic linear model using Kalman filter
KalmanFilterIntro.ipynb: An introduction to the dynamic multi-factor model
APT_FammaBeth.ipynb: Using Fama-MacBeth regression to calibrate the APT model.

Dataset

The dataset is too large to host on GitHub. Step3_FactorCombination_AdaBoost_Quantopian.ipynb uses US stock data from Quantopian; all other files use Chinese A-share data downloaded from RiceQuant, because free US equity data is hard to obtain.

The data frame is multi-indexed, similar to Quantopian's format (see the Alphalens code on GitHub and rqdata_utils.py). Feel free to adapt and apply your own dataset.
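As a rough illustration of that layout, the sketch below builds a small frame indexed by (date, asset) with one column per factor. The tickers, dates, level names, and random values are hypothetical and are not taken from the project's data; check rqdata_utils.py for the exact conventions used here.

```python
import numpy as np
import pandas as pd

# Hypothetical dates and tickers, for illustration only.
dates = pd.date_range("2018-01-02", periods=3, freq="B")
assets = ["000001.XSHE", "600000.XSHG"]
index = pd.MultiIndex.from_product([dates, assets], names=["date", "asset"])

# One row per (date, asset) pair, one column per factor (random placeholder values).
rng = np.random.default_rng(0)
factors = pd.DataFrame(
    rng.normal(size=(len(index), 2)),
    index=index,
    columns=["ps_ratio", "sales_yield"],
)

# A single cross-section: all assets on one date.
one_day = factors.xs(dates[0], level="date")

# A single factor reshaped into a date x asset panel.
panel = factors["ps_ratio"].unstack("asset")
```

Keeping date as the outer level makes per-date (cross-sectional) operations a plain `groupby(level="date")`.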

TODO

Add more effective factors, drawing on advice from practitioners and industry reports
Add technical analysis factors: many market participants follow them, which makes them useful sentiment indicators
Find well-known metrics to express the results

Workflow

$\checkmark$ stands for finished and $\vartriangle$ stands for TODO

Universe definition
Factors collection and preprocessing
- $\vartriangle$Factors collection
  - Sources
    - balance sheet
    - cash flow statement
    - income statement
    - earnings report
  - Econometric Classifications
    - value
    - growth
    - profitability
    - market size
    - liquidity
    - volatility
    - momentum
    - financial leverage (debt-to-equity ratio)
- Factors preprocessing
  - $\vartriangle$daily, quarterly, annually
  - continuous: rescale, outliers
  - $\checkmark$discrete: rank
Factors screening and combination
- Factors screening
  - $\checkmark$Factors' correlation
  - $\checkmark$Factors' foreseeability
  - Fama-MacBeth regression
- $\vartriangle$Factors combination
  - PCA, FA
  - Technical Analysis
  - Financial Modeling
    - $\checkmark$APT model
    - $\checkmark$Barra's risk model
    - $\checkmark$Dynamic multi-factor model
  - Linear combination to maximize Sharpe ratio
  - Non-linear learning algorithms
    - $\checkmark$AdaBoost
    - Reinforcement learning
Portfolio allocation

Factors' Correlations

Here, I use the correlation matrix as the measure. The second result differs from the first in that its correlation matrix is calculated from the rank data rather than the raw data.

Two ICs comparison

Pearson's IC: measures the linear relationship between components
Spearman's IC: measures the monotonic relationship between components. Since we only care about monotonic relationships, Spearman's IC wins.

Regular IC (Pearson's correlation coefficient) for each factor

Spearman's rank correlation coefficient for each factor

How to rule out redundant factors and why Spearman's rank correlation coefficients?

From the correlation coefficients, we can again conclude that Spearman's rank IC is far more robust. Take ps_ratio and sales_yield as an example. $$\mbox{ps\_ratio} = \frac{\mbox{adjusted close price}}{\mbox{sales per share}}$$ whereas $$\mbox{sales\_yield} = \frac{\mbox{sales per share}}{\mbox{price}}$$ Although the definition of price in the sales_yield formula is vague in our data source, these two variables should be roughly the inverse of each other. The Spearman's rank correlation coefficient of -0.98 confirms this, so we should avoid using both factors together, which would exaggerate the impact of this particular factor. Pearson's correlation coefficient, however, does not reveal this relationship and is therefore misleading here. That is why we choose Spearman's rank IC.
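The effect can be reproduced on synthetic data. The sketch below is only an illustration: the prices and sales figures are random numbers, not the project's data, and Spearman's coefficient is computed as the Pearson correlation of the ranks. Because one variable is an exact inverse of the other, the rank correlation is essentially -1, while the Pearson correlation is much weaker because the relationship is monotonic but not linear.

```python
import numpy as np
import pandas as pd

# Synthetic inputs (assumption: uniform random values, for illustration only).
rng = np.random.default_rng(42)
price = rng.uniform(5.0, 100.0, size=500)
sales_per_share = rng.uniform(0.5, 50.0, size=500)

ps_ratio = pd.Series(price / sales_per_share)
sales_yield = pd.Series(sales_per_share / price)

# Pearson: linear association on the raw values.
pearson = ps_ratio.corr(sales_yield)

# Spearman: Pearson correlation computed on the ranks.
spearman = ps_ratio.rank().corr(sales_yield.rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

In the real data the rank correlation is -0.98 rather than exactly -1, presumably because the two price definitions differ slightly.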

Factors' Foreseeability

Methods

Spearman's rank correlation coefficients
Fama-MacBeth regression: considers not only the foreseeability of each factor itself but also the co-variation among factors, which means ruling out a factor if the returns it explains can already be explained by the existing factors.
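A minimal sketch of the Fama-MacBeth idea follows, under stated assumptions: the factor exposures are observed directly (so the first-pass time-series regression is skipped), the data are simulated, and the function name `fama_macbeth` is hypothetical rather than taken from APT_FammaBeth.ipynb. On each date, forward returns are regressed cross-sectionally on the exposures; the per-date coefficients are then averaged and tested with a t-statistic.

```python
import numpy as np

def fama_macbeth(exposures, fwd_returns):
    """Cross-sectional OLS per date, then time-series average and t-stat.

    exposures:   array of shape (n_dates, n_assets, n_factors)
    fwd_returns: array of shape (n_dates, n_assets)
    Returns (mean premia, t-stats), each of length n_factors + 1,
    with the intercept in position 0.
    """
    n_dates, n_assets, n_factors = exposures.shape
    premia = np.empty((n_dates, n_factors + 1))
    for t in range(n_dates):
        X = np.column_stack([np.ones(n_assets), exposures[t]])
        premia[t] = np.linalg.lstsq(X, fwd_returns[t], rcond=None)[0]
    mean = premia.mean(axis=0)
    t_stat = mean / (premia.std(axis=0, ddof=1) / np.sqrt(n_dates))
    return mean, t_stat

# Simulated example: factor 1 carries a premium, factor 2 does not.
rng = np.random.default_rng(0)
T, N = 250, 100
exposures = rng.normal(size=(T, N, 2))
true_premia = np.array([0.02, 0.0])
fwd_returns = exposures @ true_premia + rng.normal(0, 0.05, size=(T, N))

mean_premia, t_stats = fama_macbeth(exposures, fwd_returns)
print("mean premia:", mean_premia)
print("t-stats:    ", t_stats)
```

Because all factors enter the same regression, a factor whose explanatory power is already captured by the others ends up with an insignificant t-statistic, which is the screening criterion described above.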

Spearman's rank IC for factors vs. forward returns

Spearman's rank IC (absolute value) for factors vs. forward returns

Rank of the Spearman's rank IC (absolute value) for factors vs. forward returns
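The three quantities named in the captions above could be computed roughly as follows. This is a sketch on simulated data: the helper `rank_ic`, the factor names `signal` and `noise`, and the choice to take the absolute value of the time-averaged IC are assumptions, not the notebook's actual code.

```python
import numpy as np
import pandas as pd

def rank_ic(factor, fwd_returns):
    """Per-date Spearman correlation between a factor and forward returns.

    Both inputs are Series indexed by (date, asset).
    """
    df = pd.DataFrame({"factor": factor, "fwd": fwd_returns}).dropna()
    # Rank within each date, then take the Pearson correlation of the ranks.
    ranks = df.groupby(level="date").rank()
    return ranks.groupby(level="date").apply(lambda g: g["factor"].corr(g["fwd"]))

# Simulated data: "signal" predicts forward returns, "noise" does not.
dates = pd.date_range("2018-01-01", periods=60, freq="B")
assets = [f"S{i:03d}" for i in range(50)]
idx = pd.MultiIndex.from_product([dates, assets], names=["date", "asset"])
rng = np.random.default_rng(1)
factors = pd.DataFrame(
    {"signal": rng.normal(size=len(idx)), "noise": rng.normal(size=len(idx))},
    index=idx,
)
fwd = 0.5 * factors["signal"] + pd.Series(rng.normal(size=len(idx)), index=idx)

# One column of daily ICs per factor.
ic_table = pd.DataFrame({name: rank_ic(factors[name], fwd) for name in factors.columns})

mean_abs_ic = ic_table.mean().abs()           # absolute value of the average IC
ic_rank = mean_abs_ic.rank(ascending=False)   # 1 = most predictive factor
print(mean_abs_ic)
print(ic_rank)
```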

Factors Preprocessing

Get ranked data
Obtain the valid stocks set
Reshape the data: only valid stocks set
Fill null: using daily average
Rescale the data: MinMaxScaler
Variable reduction: PCA analysis
Sanity check
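The steps above can be sketched as a single function. This is an illustrative version only: "valid stocks" is assumed to mean stocks present on every date, the min-max rescaling and the PCA are written directly in NumPy as stand-ins for scikit-learn's MinMaxScaler and PCA, and the function name `preprocess` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def preprocess(factors, n_components=2):
    """factors: DataFrame indexed by (date, asset), one column per factor."""
    # 1. Ranked data: cross-sectional percentile rank on each date.
    ranked = factors.groupby(level="date").rank(pct=True)

    # 2-3. Valid stock set (assumed: present on every date) and reshape.
    counts = ranked.groupby(level="asset").size()
    valid = counts[counts == counts.max()].index
    ranked = ranked[ranked.index.get_level_values("asset").isin(valid)]

    # 4. Fill nulls with the same-day cross-sectional average.
    filled = ranked.fillna(ranked.groupby(level="date").transform("mean"))

    # 5. Rescale each column to [0, 1] (min-max scaling).
    lo, hi = filled.min(), filled.max()
    scaled = (filled - lo) / (hi - lo)

    # 6. PCA via SVD of the centered data.
    centered = (scaled - scaled.mean()).to_numpy()
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:n_components].T
    out = pd.DataFrame(
        scores,
        index=scaled.index,
        columns=[f"pc{i + 1}" for i in range(n_components)],
    )

    # 7. Sanity check: no missing values may remain.
    if out.isna().any().any():
        raise ValueError("preprocessed data still contains nulls")
    return out

# Toy data with one missing value and one stock missing on the last date.
dates = pd.date_range("2018-01-01", periods=20, freq="B")
assets = [f"S{i:02d}" for i in range(30)]
idx = pd.MultiIndex.from_product([dates, assets], names=["date", "asset"])
rng = np.random.default_rng(7)
raw = pd.DataFrame(rng.normal(size=(len(idx), 4)), index=idx, columns=list("abcd"))
raw.iloc[5, 0] = np.nan
raw = raw.iloc[:-1]

features = preprocess(raw)
```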

Here, I use principal component analysis because it brings two benefits to our data: orthogonality and dimensionality reduction. Orthogonality makes the features uncorrelated, and lower dimensionality concentrates the information. Both are important for machine learning algorithms.

In the next part, I used this preprocessed data as the input to obtain a "mega alpha".

Mega Alpha

Construct an aggregate alpha factor whose return distribution is profitable. "Profitable" here means a concentrated distribution, low turnover, and a significantly positive return.

Methods

linear methods

normalize factors and try a linear combination
rank each factor and then sum up
Financial modeling: See the appendix and Step3_FactorCombination_BarraKalmanFilter.ipynb
linear combination to maximize Sharpe ratio
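Two of these linear methods can be sketched briefly. The code below is an illustration under assumptions: the function names are hypothetical, the factor returns are simulated, and the Sharpe-maximizing weights use the standard unconstrained result that the optimal weights are proportional to the inverse covariance matrix times the mean return vector (in-sample, with no constraints or transaction costs).

```python
import numpy as np
import pandas as pd

def rank_sum_alpha(factors):
    """Rank each factor within each date, then sum the ranks per row.

    factors: DataFrame indexed by (date, asset), one column per factor.
    """
    return factors.groupby(level="date").rank(pct=True).sum(axis=1)

def max_sharpe_weights(factor_returns):
    """Weights proportional to inv(cov) @ mean, scaled to unit gross weight.

    factor_returns: array of shape (n_periods, n_factors).
    """
    mu = factor_returns.mean(axis=0)
    cov = np.cov(factor_returns, rowvar=False)
    w = np.linalg.solve(cov, mu)
    return w / np.abs(w).sum()

def sharpe(returns):
    """Per-period Sharpe ratio (not annualized, zero risk-free rate)."""
    return returns.mean() / returns.std(ddof=1)

# Simulated factor returns with different average returns (illustration only).
rng = np.random.default_rng(5)
factor_returns = rng.normal([0.001, 0.0005, 0.0002], 0.01, size=(500, 3))

weights = max_sharpe_weights(factor_returns)
combined = factor_returns @ weights
print("weights:", weights)
print("combined Sharpe:", sharpe(combined))
print("single-factor Sharpes:", [sharpe(factor_returns[:, k]) for k in range(3)])
```

The in-sample Sharpe ratio of the combination is at least as high as that of any single factor by construction; out-of-sample performance is a separate question.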

Non-linear methods

AdaBoost: See Step3_FactorCombination_AdaBoost_Quantopian.ipynb
Reinforcement learning

Here we only introduce the AdaBoost algorithm. For more details about the linear models, please see the appendix and Step3_FactorCombination_BarraKalmanFilter.ipynb.

AdaBoost

Description

The algorithm sequentially applies a weak classifier to reweighted versions of the data. By increasing the weights of the misclassified observations, each weak learner focuses on the errors of the previous one. The predictions are aggregated through a weighted majority vote.
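To make the description concrete, here is a from-scratch sketch of discrete AdaBoost with decision stumps as the weak classifiers. It is not the code used in the Quantopian notebook; the function names, the synthetic two-feature dataset, and the choice of stumps are assumptions made for illustration. Labels are coded as +1 and -1.

```python
import numpy as np

def stump_predict(X, feature, threshold, polarity):
    """A decision stump: predicts +1 or -1 from one feature and one threshold."""
    return np.where(polarity * (X[:, feature] - threshold) >= 0, 1, -1)

def fit_stump(X, y, w):
    """Exhaustively search for the stump with the lowest weighted error."""
    best = (np.inf, 0, 0.0, 1)
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            for polarity in (1, -1):
                pred = stump_predict(X, feature, threshold, polarity)
                err = w[pred != y].sum()
                if err < best[0]:
                    best = (err, feature, threshold, polarity)
    return best

def adaboost_fit(X, y, n_rounds=50):
    n = len(y)
    w = np.full(n, 1.0 / n)          # start with uniform observation weights
    model = []
    for _ in range(n_rounds):
        err, feature, threshold, polarity = fit_stump(X, y, w)
        err = np.clip(err, 1e-12, 1 - 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)   # vote weight of this learner
        pred = stump_predict(X, feature, threshold, polarity)
        w *= np.exp(-alpha * y * pred)          # up-weight misclassified points
        w /= w.sum()
        model.append((alpha, feature, threshold, polarity))
    return model

def adaboost_score(model, X):
    """Weighted vote of all stumps; the sign is the predicted class."""
    return sum(a * stump_predict(X, f, t, p) for a, f, t, p in model)

# Synthetic data with a diagonal boundary that no single stump can fit.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

model = adaboost_fit(X, y)
scores = adaboost_score(model, X)
train_accuracy = np.mean(np.where(scores >= 0, 1, -1) == y)
print("train accuracy:", train_accuracy)
```

The continuous `scores` value, rather than its sign, is what plays the role of the mega alpha in the sections that follow.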

Algorithm

Train set

The AdaBoost classifier was applied to our fundamental dataset. The objective is to train a classifier that assigns a score to the set of factors, in other words, the mega alpha. In the histogram, pink marks the observations with positive forward returns and blue marks the observations with negative forward returns. A good scoring system separates the two classes. On the train set, the AdaBoost classifier separates them very well. The next plot shows the precision in each quantile of scores: in the top and bottom quantiles, the precision is nearly 100%.
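The quantile precision plot can be reproduced along the following lines. This is a sketch on simulated scores and returns: the function name `quantile_precision` is hypothetical, and "precision" is assumed to mean the share of observations in each score quantile whose forward-return sign matches the sign of the score.

```python
import numpy as np
import pandas as pd

def quantile_precision(scores, fwd_returns, n_quantiles=5):
    """Share of correct direction calls within each score quantile."""
    df = pd.DataFrame({"score": np.asarray(scores), "fwd": np.asarray(fwd_returns)})
    # Quantile 1 holds the lowest scores, quantile n_quantiles the highest.
    df["quantile"] = pd.qcut(df["score"], n_quantiles, labels=False) + 1
    # A call is correct when the score and the forward return share a sign.
    df["hit"] = (df["score"] > 0) == (df["fwd"] > 0)
    return df.groupby("quantile")["hit"].mean()

# Simulated scores that carry some information about forward returns.
rng = np.random.default_rng(11)
scores = rng.normal(size=5000)
fwd_returns = 0.02 * scores + rng.normal(0, 0.02, size=5000)

precision = quantile_precision(scores, fwd_returns)
print(precision)
```

With an informative score, precision is highest in the extreme quantiles and closest to 50% in the middle, which is the pattern the plots are meant to reveal.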

Test set

Figures: alpha values histogram; quantile precision bar plot.

The precision in the top and bottom quantiles is only slightly higher than 50%, which is far from good once transaction costs are considered.

So, I added some technical analysis factors to see whether they could address this problem. Surprisingly, the average accuracy on the test set rises to about 67%, and if we trade only the extreme quantiles, the accuracy is around 80%. This suggests that technical factors are important in the US stock market and may be used to find arbitrage opportunities.

References

Jonathan Larkin, A Professional Quant Equity Workflow. August 31, 2016
A Practitioner's Guide to Factor Models. The Research Foundation of The Institute of Chartered Financial Analysts
Thomas Wiecki, Machine Learning on Quantopian
Inigo Fraser Jenkins, Using factors with different alpha decay times: The case for non-linear combination
PNC, Factor Analysis: What Drives Performance?
O’Shaughnessy, Alpha or Assets? — Factor Alpha vs. Smart Beta. April 2016
O’Shaughnessy Quarterly Investor Letter Q1 2018
Jiantao Zhu, Orient Securities, Alpha Forecasting - Factor-Based Strategy Research Series 13
Yang Song, Bohai Securities, Multi-Factor Models Research: Single Factor Testing, 2017/10/11

Appendix: Notes on Factor Models

CAPM

Author: Sharpe (1964), building on Markowitz (1959)
single-factor
explain: security returns

APT

Author: Stephen A. Ross (1976)
multi-factor
explain: security returns

Postulates:

The linear model $$r_i(t) - \alpha_i = \sum_{k=1}^K \beta_{ik} \cdot f_k(t) + \epsilon_i(t)$$

where $f_k(t)$ is the realization (value) of the $k$-th risk factor at time $t$

No pure arbitrage profit
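As a small illustration of the linear model in the first postulate, the sketch below simulates returns from known $\alpha_i$ and $\beta_{ik}$ and recovers them by a time-series least-squares regression of each security's returns on the factor realizations. All numbers are simulated, and the variable names are chosen here for illustration.

```python
import numpy as np

# Simulate r_i(t) = alpha_i + sum_k beta_ik * f_k(t) + eps_i(t).
rng = np.random.default_rng(3)
T, N, K = 500, 4, 2                          # periods, securities, factors
f = rng.normal(0, 0.01, size=(T, K))         # factor realizations f_k(t)
alpha = rng.normal(0, 0.001, size=N)
beta = rng.normal(1, 0.5, size=(N, K))       # exposures beta_ik
eps = rng.normal(0, 0.005, size=(T, N))      # idiosyncratic noise
r = alpha + f @ beta.T + eps                 # returns, shape (T, N)

# Regress every security's return series on [1, f_1(t), ..., f_K(t)].
X = np.column_stack([np.ones(T), f])
coef, *_ = np.linalg.lstsq(X, r, rcond=None)   # shape (K + 1, N)
alpha_hat, beta_hat = coef[0], coef[1:].T

print("max beta error:", np.abs(beta_hat - beta).max())
```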

Conclusion

Exposure of each security on each factor
Risk premium on each factor $$(\mathrm{Mean}[r_i(t)])_i = P_0 + \sum_{k=1}^K \beta_{ik} \cdot P_k$$ or, setting $\bar{\beta}_{i,0}$ equal to 1 for each $i$, $$(\mathrm{Mean}[r_i(t)])_i = \sum_{k=0}^K \bar{\beta}_{i,k} \cdot P_k$$ where $P_0$ is the risk-free return
Portfolio exposure to each factor $$Portfolio_{it} = \beta_0
