Alphatools
Quantitative finance research tools in Python
Install / Use
This package aims to provide environments within which best-in-class open-source tools across both financial research (e.g., zipline, alphalens, and pyfolio) and machine learning (e.g., scikit-learn, LightGBM, PyMC3, PyTorch, and fastai) operate together. The "stable" environment is on Python 3.5 and does not include fastai. The "latest" environment is on Python 3.6 and relies on the backwards-compatibility PEP for packages which state only Python 3.5 support (e.g., zipline); it includes the pre-release of PyTorch 1.0 and fastai 1.0.x. The PyTorch build in both environments is currently CPU-only (i.e., no GPU/CUDA for now). The "tests" currently only verify that the environments build without conflicts.
Additionally, this package provides functions to make the equity alpha factor research process more accessible and productive. Convenience functions sit on top of zipline and, specifically, the `Pipeline` cross-sectional classes and functions in that package. alphatools allows you to

- `run_pipeline` in a Jupyter notebook (or from any arbitrary Python code) in your local environment,
- create `Pipeline` factors at runtime on arbitrary data sources (just expose the endpoint for data sitting somewhere, specify the schema, and... it's available for use in `Pipeline`!),
- parse and compile "expression"-style alphas, as described in the paper "101 Formulaic Alphas", into `Pipeline` factors, and
- work with and plot ingested pricing data from an arbitrary bundle with a `get_pricing(...)` function call.
For example, with alphatools you can, within a Jupyter notebook:
```python
from alphatools.research import run_pipeline
from alphatools.ics import Sector
from alphatools.data import Factory
from alphatools.expression import ExpressionAlpha
from zipline.pipeline.data import USEquityPricing as USEP
from zipline.pipeline.factors import Returns, AverageDollarVolume
from zipline.pipeline import Pipeline

n_assets = 500
universe = AverageDollarVolume(window_length=120).top(n_assets)

my_factor = (
    (-Returns(mask=universe, window_length=5)
     .demean(groupby=Sector()))
    .rank() / n_assets
)

expr_factor = (
    ExpressionAlpha(
        'rank(indneutralize(-log(close/delay(close, 4)), IndClass.sector))'
    ).make_pipeline_factor().pipeline_factor(mask=universe)
)

p = Pipeline(screen=universe)
p.add(my_factor, '5d_MR_Sector_Neutral_Rank')
p.add(expr_factor, '5d_MR_Expression_Alpha')
p.add(Factory['my_special_data'].value.latest.zscore(), 'Special_Factor')

start_date = '2017-01-04'
end_date = '2017-12-28'
df = run_pipeline(p, start_date, end_date)
```
Bring Your Own Data
To "Bring Your Own Data", you simply point the `Factory` object to an endpoint and specify the schema. This is done by adding an entry to the JSON file `data_sources.json`. For example, if you have a CSV file on disk, `data.csv`, and a PostgreSQL table somewhere else, you would create `data_sources.json` as:
```json
{
    "my_special_data": {
        "url": "/full/path/to/data/data.csv",
        "schema": "var * {asof_date: datetime, sid: int64, value: float64}"
    },
    "my_database_data": {
        "url": "postgresql://$USER:$PASS@hostname::my-table-name",
        "schema": "var * {asof_date: datetime, sid: int64, price_to_book: float64}"
    }
}
```
In the case of the example PostgreSQL url, note that the text `$USER` will be substituted with the text in the environment variable `USER`, and the text `$PASS` will be substituted with the text in the environment variable `PASS`. In general, any text token in the url which is preceded by `$` will be substituted with the contents of the environment variable of that name. Hence, you do not need to expose actual credentials in this file.
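The substitution described above can be sketched in a few lines. This is an illustration of the behavior, not alphatools' actual implementation; the `DB_USER`/`DB_PASS` variable names are hypothetical.

```python
import os
import re

def expand_env(url):
    # Replace each $TOKEN in the url with os.environ["TOKEN"].
    # Sketch only -- raises KeyError if the variable is unset.
    return re.sub(
        r"\$([A-Za-z_][A-Za-z0-9_]*)",
        lambda m: os.environ[m.group(1)],
        url,
    )

# Hypothetical credentials, set only for this demonstration.
os.environ["DB_USER"] = "alice"
os.environ["DB_PASS"] = "s3cret"
print(expand_env("postgresql://$DB_USER:$DB_PASS@hostname::my-table-name"))
# -> postgresql://alice:s3cret@hostname::my-table-name
```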
The schema is specified as a dshape from the package datashape (docs here). The magic happens via the blaze/datashape/odo stack. You can specify a url for a huge variety of source formats, including JSON, CSV, PostgreSQL tables, MongoDB collections, bcolz, Microsoft Excel(!?), .gz compressed files, collections of files (e.g., `myfiles_*.csv`), and remote locations like Amazon S3 and the Hadoop Distributed File System. To me, the odo documentation on URI strings is the clearest explanation of this.
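To make the dshape string's structure concrete, here is a deliberately naive parser for the flat record schemas used above. The real datashape package handles far more (nesting, option types, units); this sketch only illustrates what the `var * {name: type, ...}` notation encodes.

```python
def parse_dshape(ds):
    # Split a flat "var * {name: type, ...}" datashape string into a
    # dict of column name -> type string. Illustration only; use the
    # datashape package for real parsing.
    body = ds.split("*", 1)[1].strip().strip("{}")
    fields = {}
    for part in body.split(","):
        name, typ = part.split(":")
        fields[name.strip()] = typ.strip()
    return fields

print(parse_dshape("var * {asof_date: datetime, sid: int64, value: float64}"))
# -> {'asof_date': 'datetime', 'sid': 'int64', 'value': 'float64'}
```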
Note that this data must be mapped to the sid as mapped by `zipline ingest`. Also, each row's date must be in a column titled `asof_date`. You can then access this data like:
```python
from alphatools.data import Factory

...

my_factor = Factory['my_database_data'].price_to_book.latest.rank()
p.add(my_factor, 'price_to_book_rank')  # Pipeline.add requires a column name
```
This functionality should allow you to use new data in research very quickly, with minimal data engineering and/or munging. For example, commercial risk model providers often provide a single file per day for factor loadings (e.g., `data_yyyymmdd_fac.csv`). After sid mapping and converting the date column name to `asof_date`, this data can be immediately available in `Pipeline` by putting a url in `data_sources.json` like `"url": "/path/to/dir/data_*_fac.csv"`, and a schema like `"var * {asof_date: datetime, sid: int64, MKT_BETA: float64, VALUE: float64, MOMENTUM: float64, ST_REVERSAL: float64 ..."`.
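The per-day-file munging step described above can be sketched as follows. This is not alphatools code: the filename layout (date in the second underscore-separated token) and the `MKT_BETA` column are assumptions for illustration.

```python
import csv
import glob
import os
import tempfile

def load_daily_factor_files(pattern):
    # Combine per-day files like data_20170104_fac.csv into one list of
    # row dicts, adding an asof_date column derived from the filename.
    rows = []
    for path in sorted(glob.glob(pattern)):
        ymd = os.path.basename(path).split("_")[1]  # 'yyyymmdd'
        asof = f"{ymd[:4]}-{ymd[4:6]}-{ymd[6:]}"
        with open(path, newline="") as f:
            for rec in csv.DictReader(f):
                rec["asof_date"] = asof
                rows.append(rec)
    return rows

# Demo with a hypothetical one-day file.
tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, "data_20170104_fac.csv"), "w") as f:
    f.write("sid,MKT_BETA\n1,1.05\n")
rows = load_daily_factor_files(os.path.join(tmpdir, "data_*_fac.csv"))
print(rows[0]["asof_date"])  # -> 2017-01-04
```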
Expression Alphas
The ability to parse "expression" alphas is meant to speed the research process and/or allow financial professionals with minimal Python experience to test alpha ideas. See "101 Formulaic Alphas" for details on this DSL. The (EBNF) grammar is fully specified here. We use the Lark Python parsing library (great name, no relation). Currently, the data for open, high, low, close, and volume are accessible; the following calculations and operators are implemented:
- `vwap`: the daily vwap (by default, this is approximated with `(close + (open + high + low)/3)/2`).
- `returns`: daily close-to-close returns.
- `+`, `-`, `*`, `/`, `^`: as expected, though only for two terms (i.e., only `<expr> <op> <expr>`); `^` is exponentiation, not bitwise or.
- `-x`: unary minus on `x` (i.e., negation).
- `abs(x)`, `log(x)`, `sign(x)`: elementwise standard math operations.
- `>`, `<`, `==`, `||`: elementwise comparator operations returning 1 or 0.
- `x ? y : z`: C-style ternary operator; `if x: y; else: z`.
- `rank(x)`: scaled ranks, per day, across all assets (i.e., the cross-sectional rank); ranks are ordered such that the rank of the maximum raw value in the vector is 1.0 and the smallest rank is 1/N. The rescaling of the ranks to the interval [1/N, 1] is implied by Alpha #1, in which 0.50 is subtracted from the final ranked value. The ordinal method is used to match the `Pipeline` method `.rank()`.
- `delay(x, days)`: `x` lagged by `days`. Note that the `days` parameter in `delay` and `delta` differs from the `window_length` parameter you may be familiar with in `Pipeline`. The `window_length` refers to the number of data points in the (row axis of the) data matrix, not the number of days of lag. For example, in `Pipeline`, if you want daily returns you specify a `window_length` of 2, since you need two data points (today and the day prior) to get a daily return. In an expression alpha, `days` is the lag from today. Concretely, the `Pipeline` factor `Returns(window_length=2)` is precisely equal to the expression alpha `delta(close, 1)/delay(close, 1)`.
- `correlation(x, y, days)`: the Pearson correlation of the values for assets in `x` to the corresponding values for the same assets in `y` over `days`; note this is very slow in the current implementation.
- `covariance(x, y, days)`: the covariance of the values for assets in `x` to the corresponding values for the same assets in `y` over `days`; note this is also very slow currently.
- `delta(x, days)`: the diff on `x` per `days` timestep.
- `signedpower(x, a)`: elementwise `sign(x)*(abs(x)^a)`.
- `decay_linear(x, days)`: weighted sum of `x` over the past `days` with linearly decaying weights (the weights sum to 1; the maximum weight is on the most recent day).
- `indneutralize(x, g)`: `x`, cross-sectionally "neutralized" (i.e., demeaned) against the group membership classifier `g`. `g` must be in the set {`IndClass.sector`, `IndClass.industry`, `IndClass.subindustry`}. The set `g` maps to the `Pipeline` classifiers `Sector()` and `SubIndustry()` in `alphatools.ics`. Concretely, the `Pipeline` factor `Returns().demean(groupby=Sector())` is equivalent (save a corner case in NaN treatment) to the expression `indneutralize(returns, IndClass.sector)`. If you do not specifically pass a token for `g`, the default of `IndClass.industry` is applied.
- `ts_max(x, days)`: the per-asset time-series max of `x` over the trailing `days` (also `ts_min(...)`).
- `max(a, b)`: the paper says that `max` is an alias for `ts_max(a, b)`; I think this is an error. Alphas 71, 73, 76, 87, and 96 do not parse with `max` as an alias for `ts_max`. Rather, I believe that `max` means the elementwise maximum of `a` and `b`.
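To make a few of the semantics above concrete, here is a plain-Python sketch of `rank`, `signedpower`, and `decay_linear` on a single day's vector (or trailing series). These are illustrations of the definitions, not the library's implementation, and they ignore NaN handling and tie-breaking subtleties.

```python
def rank(xs):
    # Cross-sectional ordinal rank, scaled to [1/N, 1]: the largest raw
    # value gets 1.0, the smallest gets 1/N.
    n = len(xs)
    order = sorted(range(n), key=lambda i: xs[i])
    out = [0.0] * n
    for r, i in enumerate(order, start=1):
        out[i] = r / n
    return out

def signedpower(xs, a):
    # Elementwise sign(x) * (abs(x) ** a).
    return [(1 if x > 0 else -1 if x < 0 else 0) * abs(x) ** a for x in xs]

def decay_linear(series, days):
    # Weighted average over the trailing `days` values with linearly
    # decaying weights; weights sum to 1, largest on the most recent day.
    window = series[-days:]
    weights = list(range(1, days + 1))  # oldest -> newest
    return sum(w * x for w, x in zip(weights, window)) / sum(weights)

print(rank([3.0, 1.0, 2.0]))
# -> [1.0, 0.3333333333333333, 0.6666666666666666]
```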