Alphatools
Quantitative finance research tools in Python
Install / Use
This package aims to provide environments within which best-in-class open-source tools across both financial research (e.g., zipline, alphalens, and pyfolio) and machine learning (e.g., scikit-learn, LightGBM, PyMC3, PyTorch, and fastai) operate together. The "stable" environment is on Python 3.5 and does not include fastai. The "latest" environment is on Python 3.6 and relies on the backwards-compatibility PEP for packages which state only Python 3.5 support (e.g., zipline); it includes the pre-release of PyTorch 1.0 and fastai 1.0.x. The PyTorch build in both environments is currently CPU-only (i.e., no GPU/CUDA for now). The "tests" currently only verify that the environments build without conflicts.
Additionally, this package provides functions to make the equity alpha factor research process more accessible and productive. Convenience functions sit on top of zipline and, specifically, the `Pipeline` cross-sectional classes and functions in that package. alphatools allows you to

- `run_pipeline` in a Jupyter notebook (or from any arbitrary Python code) in your local environment,
- create `Pipeline` factors at runtime on arbitrary data sources (just expose the endpoint for data sitting somewhere, specify the schema, and... it's available for use in `Pipeline`!),
- parse and compile "expression"-style alphas, as described in the paper "101 Formulaic Alphas", into `Pipeline` factors, and
- work with and plot ingested pricing data from an arbitrary bundle with a `get_pricing(...)` function call.
For example, with alphatools you can, within a Jupyter notebook:
```python
from alphatools.research import run_pipeline
from alphatools.ics import Sector
from alphatools.data import Factory
from alphatools.expression import ExpressionAlpha
from zipline.pipeline.data import USEquityPricing as USEP
from zipline.pipeline.factors import Returns, AverageDollarVolume
from zipline.pipeline import Pipeline

n_assets = 500
universe = AverageDollarVolume(window_length=120).top(n_assets)

my_factor = (
    (-Returns(mask=universe, window_length=5)
     .demean(groupby=Sector()))
    .rank() / n_assets
)

expr_factor = (
    ExpressionAlpha(
        'rank(indneutralize(-log(close/delay(close, 4)), IndClass.sector))'
    ).make_pipeline_factor().pipeline_factor(mask=universe)
)

p = Pipeline(screen=universe)
p.add(my_factor, '5d_MR_Sector_Neutral_Rank')
p.add(expr_factor, '5d_MR_Expression_Alpha')
p.add(Factory['my_special_data'].value.latest.zscore(), 'Special_Factor')

start_date = '2017-01-04'
end_date = '2017-12-28'
df = run_pipeline(p, start_date, end_date)
```
Bring Your Own Data
To "Bring Your Own Data", you simply point the `Factory` object to an endpoint and specify the schema. This is done by adding an entry to the JSON file `data_sources.json`. For example, if you have a CSV file on disk, `data.csv`, and a PostgreSQL table somewhere else, you would create `data_sources.json` as:
```json
{
    "my_special_data": {
        "url": "/full/path/to/data/data.csv",
        "schema": "var * {asof_date: datetime, sid: int64, value: float64}"
    },
    "my_database_data": {
        "url": "postgresql://$USER:$PASS@hostname::my-table-name",
        "schema": "var * {asof_date: datetime, sid: int64, price_to_book: float64}"
    }
}
```
In the case of the example PostgreSQL url, note that the text `$USER` will be substituted with the text in the environment variable `USER`, and the text `$PASS` will be substituted with the text in the environment variable `PASS`. In general, any text token in the url which is preceded by `$` will be substituted with the contents of the environment variable of that name. Hence, you do not need to expose actual credentials in this file.
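The substitution described above can be sketched in a few lines. This is an illustration of the behavior, not alphatools' actual implementation; the `DB_USER`/`DB_PASS` variable names are hypothetical.

```python
import os
import re

def expand_env(url):
    # Replace each $TOKEN in the url with os.environ["TOKEN"].
    # Sketch only -- raises KeyError if the variable is unset.
    return re.sub(
        r"\$([A-Za-z_][A-Za-z0-9_]*)",
        lambda m: os.environ[m.group(1)],
        url,
    )

# Hypothetical credentials, set only for this demonstration.
os.environ["DB_USER"] = "alice"
os.environ["DB_PASS"] = "s3cret"
print(expand_env("postgresql://$DB_USER:$DB_PASS@hostname::my-table-name"))
# -> postgresql://alice:s3cret@hostname::my-table-name
```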
The schema is specified as a dshape from the package datashape (docs here). The magic happens via the blaze/datashape/odo stack. You can specify a url for a huge variety of source formats, including JSON, CSV, PostgreSQL tables, MongoDB collections, bcolz, Microsoft Excel(!?), .gz compressed files, collections of files (e.g., `myfiles_*.csv`), and remote locations like Amazon S3 and the Hadoop Distributed File System. To me, the odo documentation on URI strings is the clearest explanation of this.
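To make the dshape string's structure concrete, here is a deliberately naive parser for the flat record schemas used above. The real datashape package handles far more (nesting, option types, units); this sketch only illustrates what the `var * {name: type, ...}` notation encodes.

```python
def parse_dshape(ds):
    # Split a flat "var * {name: type, ...}" datashape string into a
    # dict of column name -> type string. Illustration only; use the
    # datashape package for real parsing.
    body = ds.split("*", 1)[1].strip().strip("{}")
    fields = {}
    for part in body.split(","):
        name, typ = part.split(":")
        fields[name.strip()] = typ.strip()
    return fields

print(parse_dshape("var * {asof_date: datetime, sid: int64, value: float64}"))
# -> {'asof_date': 'datetime', 'sid': 'int64', 'value': 'float64'}
```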
Note that this data must be mapped to the sid as mapped by `zipline ingest`. Also, each row's date must be in a column titled `asof_date`. You can then access this data like:
```python
from alphatools.data import Factory

...

my_factor = Factory['my_database_data'].price_to_book.latest.rank()
p.add(my_factor, 'price_to_book_rank')  # Pipeline.add requires a column name
```
This functionality should allow you to use new data in research very quickly, with minimal data engineering and/or munging. For example, commercial risk model providers often provide a single file per day for factor loadings (e.g., `data_yyyymmdd_fac.csv`). After sid mapping and converting the date column name to `asof_date`, this data can be immediately available in `Pipeline` by putting a url in `data_sources.json` like `"url": "/path/to/dir/data_*_fac.csv"`, and a schema like `"var * {asof_date: datetime, sid: int64, MKT_BETA: float64, VALUE: float64, MOMENTUM: float64, ST_REVERSAL: float64 ..."`.
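The per-day-file munging step described above can be sketched as follows. This is not alphatools code: the filename layout (date in the second underscore-separated token) and the `MKT_BETA` column are assumptions for illustration.

```python
import csv
import glob
import os
import tempfile

def load_daily_factor_files(pattern):
    # Combine per-day files like data_20170104_fac.csv into one list of
    # row dicts, adding an asof_date column derived from the filename.
    rows = []
    for path in sorted(glob.glob(pattern)):
        ymd = os.path.basename(path).split("_")[1]  # 'yyyymmdd'
        asof = f"{ymd[:4]}-{ymd[4:6]}-{ymd[6:]}"
        with open(path, newline="") as f:
            for rec in csv.DictReader(f):
                rec["asof_date"] = asof
                rows.append(rec)
    return rows

# Demo with a hypothetical one-day file.
tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, "data_20170104_fac.csv"), "w") as f:
    f.write("sid,MKT_BETA\n1,1.05\n")
rows = load_daily_factor_files(os.path.join(tmpdir, "data_*_fac.csv"))
print(rows[0]["asof_date"])  # -> 2017-01-04
```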
Expression Alphas
The ability to parse "expression" alphas is meant to speed the research process and/or allow financial professionals with minimal Python experience to test alpha ideas. See "101 Formulaic Alphas" for details on this DSL. The (EBNF) grammar is fully specified here. We use the Lark Python parsing library (great name, no relation). Currently, the data for open, high, low, close, and volume are accessible; the following calculations and operators are implemented:
- `vwap`: the daily vwap (by default, this is approximated with `(close + (open + high + low)/3)/2`).
- `returns`: daily close-to-close returns.
- `+`, `-`, `*`, `/`, `^`: as expected, though only for two terms (i.e., only `<expr> <op> <expr>`); `^` is exponentiation, not bitwise or.
- `-x`: unary minus on `x` (i.e., negation).
- `abs(x)`, `log(x)`, `sign(x)`: elementwise standard math operations.
- `>`, `<`, `==`, `||`: elementwise comparator operations returning 1 or 0.
- `x ? y : z`: C-style ternary operator; `if x: y; else: z`.
- `rank(x)`: scaled ranks, per day, across all assets (i.e., the cross-sectional rank); ranks are ordered such that the rank of the maximum raw value in the vector is 1.0 and the smallest rank is 1/N. The rescaling of the ranks to the interval [1/N, 1] is implied by Alpha #1, in which 0.50 is subtracted from the final ranked value. The ordinal method is used to match the `Pipeline` method `.rank()`.
- `delay(x, days)`: `x` lagged by `days`. Note that the `days` parameter in `delay` and `delta` differs from the `window_length` parameter you may be familiar with in `Pipeline`. The `window_length` refers to the number of data points in the (row axis of the) data matrix, not the number of days of lag. For example, in `Pipeline`, if you want daily returns you specify a `window_length` of 2, since you need two data points (today and the day prior) to get a daily return. In an expression alpha, `days` is the lag from today. Concretely, the `Pipeline` factor `Returns(window_length=2)` is precisely equal to the expression alpha `delta(close, 1)/delay(close, 1)`.
- `correlation(x, y, days)`: the Pearson correlation of the values for assets in `x` to the corresponding values for the same assets in `y` over `days`; note this is very slow in the current implementation.
- `covariance(x, y, days)`: the covariance of the values for assets in `x` to the corresponding values for the same assets in `y` over `days`; note this is also very slow currently.
- `delta(x, days)`: the diff on `x` per `days` timestep.
- `signedpower(x, a)`: elementwise `sign(x)*(abs(x)^a)`.
- `decay_linear(x, days)`: weighted sum of `x` over the past `days` with linearly decaying weights (the weights sum to 1; the maximum weight is on the most recent day).
- `indneutralize(x, g)`: `x`, cross-sectionally "neutralized" (i.e., demeaned) against the group membership classifier `g`. `g` must be in the set {`IndClass.sector`, `IndClass.industry`, `IndClass.subindustry`}. The set `g` maps to the `Pipeline` classifiers `Sector()` and `SubIndustry()` in `alphatools.ics`. Concretely, the `Pipeline` factor `Returns().demean(groupby=Sector())` is equivalent (save a corner case in NaN treatment) to the expression `indneutralize(returns, IndClass.sector)`. If you do not specifically pass a token for `g`, the default of `IndClass.industry` is applied.
- `ts_max(x, days)`: the per-asset time-series max of `x` over the trailing `days` (also `ts_min(...)`).
- `max(a, b)`: the paper says that `max` is an alias for `ts_max(a, b)`; I think this is an error. Alphas 71, 73, 76, 87, and 96 do not parse with `max` as an alias for `ts_max`. Rather, I believe that `max` means the elementwise maximum of `a` and `b`.
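To make a few of the semantics above concrete, here is a plain-Python sketch of `rank`, `signedpower`, and `decay_linear` on a single day's vector (or trailing series). These are illustrations of the definitions, not the library's implementation, and they ignore NaN handling and tie-breaking subtleties.

```python
def rank(xs):
    # Cross-sectional ordinal rank, scaled to [1/N, 1]: the largest raw
    # value gets 1.0, the smallest gets 1/N.
    n = len(xs)
    order = sorted(range(n), key=lambda i: xs[i])
    out = [0.0] * n
    for r, i in enumerate(order, start=1):
        out[i] = r / n
    return out

def signedpower(xs, a):
    # Elementwise sign(x) * (abs(x) ** a).
    return [(1 if x > 0 else -1 if x < 0 else 0) * abs(x) ** a for x in xs]

def decay_linear(series, days):
    # Weighted average over the trailing `days` values with linearly
    # decaying weights; weights sum to 1, largest on the most recent day.
    window = series[-days:]
    weights = list(range(1, days + 1))  # oldest -> newest
    return sum(w * x for w, x in zip(weights, window)) / sum(weights)

print(rank([3.0, 1.0, 2.0]))
# -> [1.0, 0.3333333333333333, 0.6666666666666666]
```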