CloudForest
Fast, flexible, multi-threaded ensembles of decision trees for machine learning in pure Go (golang).
CloudForest provides a number of related algorithms for classification, regression, feature selection and structure analysis on heterogeneous numerical/categorical data with missing values. These include:
- Breiman and Cutler's Random Forest for Classification and Regression
- Adaptive Boosting (AdaBoost) Classification
- Gradient Boosting Tree Regression and Two Class Classification
- Hellinger Distance Trees for Classification
- Entropy, Cost driven and Class Weighted classification
- L1/Absolute Deviance Decision Tree regression
- Improved Feature Selection via artificial contrasts with ensembles (ACE)
- Roughly Balanced Bagging for Unbalanced Data
- Improved robustness using out of bag cases and artificial contrasts.
- Support for missing values via bias correction or three way splitting.
- Proximity/Affinity Analysis suitable for manifold learning
- A number of experimental splitting criteria
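As a sketch of the idea behind the Hellinger distance criterion listed above (illustrative only, not CloudForest's internal code), the binary-class Hellinger distance between the positive-capture rates of a split's two branches can be computed as:

```go
package main

import (
	"fmt"
	"math"
)

// hellinger computes the Hellinger distance between two binary
// distributions given the rate of positive cases (tpr) and negative
// cases (fpr) sent to a branch. Larger distances indicate splits that
// better separate the minority class, which is why the criterion is
// robust on unbalanced data.
func hellinger(tpr, fpr float64) float64 {
	a := math.Sqrt(tpr) - math.Sqrt(fpr)
	b := math.Sqrt(1-tpr) - math.Sqrt(1-fpr)
	return math.Sqrt(a*a + b*b)
}

func main() {
	fmt.Println(hellinger(0.9, 0.1)) // strong split
	fmt.Println(hellinger(0.5, 0.5)) // uninformative split -> 0
}
```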
The design prioritizes:
- Training speed
- Performance on high-dimensional, heterogeneous datasets (e.g. genetic and clinical data).
- An optimized set of core functionality.
- The flexibility to quickly implement new impurities and algorithms using the common core.
- The ability to natively handle non-numerical data types and missing values.
- Use in multi-core or multi-machine environments.
It can achieve quicker training times than many other popular implementations on some datasets. This is the result of CPU-cache-friendly memory utilization well suited to modern processors and separate, optimized paths for learning splits from binary, numerical and categorical data.

CloudForest offers good general accuracy, and the alternative and augmented algorithms it implements can reduce error rates in specific use cases: in particular, recovering a signal from noisy, high-dimensional data prone to over-fitting, and predicting rare events and unbalanced classes (both typical in genetic studies of disease). These methods should be included in parameter sweeps to maximize accuracy.

(Work on benchmarks and optimization is ongoing, if you find a slow use case please raise an issue.)
Command line utilities to grow, apply and analyze forests and to perform cross validation are provided, or CloudForest can be used as a library in Go programs.
This document covers command line usage, file formats and some algorithmic background.
Documentation for coding against CloudForest has been generated with godoc and can be viewed live at: http://godoc.org/github.com/ryanbressler/CloudForest
Pull requests, spelling corrections and bug reports are welcome; Code Repo and Issue tracker can be found at: https://github.com/ryanbressler/CloudForest
A google discussion group can be found at: https://groups.google.com/forum/#!forum/cloudforest-dev
CloudForest was created in the Shmulevich Lab at the Institute for Systems Biology.
(Build status includes accuracy tests on iris and Boston housing price datasets and multiple go versions.)
Installation
With go installed:
go get github.com/ryanbressler/CloudForest
go install github.com/ryanbressler/CloudForest/growforest
go install github.com/ryanbressler/CloudForest/applyforest
#optional utilities
go install github.com/ryanbressler/CloudForest/leafcount
go install github.com/ryanbressler/CloudForest/utils/nfold
go install github.com/ryanbressler/CloudForest/utils/toafm
To update to the latest version, use the -u flag with go get and then reinstall:
go get -u github.com/ryanbressler/CloudForest
go install github.com/ryanbressler/CloudForest/growforest
go install github.com/ryanbressler/CloudForest/applyforest
#optional utilities
go install github.com/ryanbressler/CloudForest/leafcount
go install github.com/ryanbressler/CloudForest/utils/nfold
go install github.com/ryanbressler/CloudForest/utils/toafm
Quick Start
Data can be provided in a tsv-based annotated feature matrix or in arff or libsvm formats with ".arff" or ".libsvm" extensions. Details are discussed in the Data File Formats section below and a few example data sets are included in the "data" directory.
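By way of illustration, an annotated feature matrix is a tab-separated file in which each row is a feature, the first row labels the cases, and each feature name carries a type prefix: N: for numerical, C: for categorical and B: for boolean. A small hypothetical example (the header label and feature names are made up; see Data File Formats below for the authoritative description):

```
featureName	case1	case2	case3
N:Height	1.5	2.1	1.8
C:Color	red	blue	red
B:Target	true	false	true
```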
#grow a predictor forest with default parameters and save it to forest.sf
growforest -train train.fm -rfpred forest.sf -target B:FeatureName
#grow a 1000 tree forest using 16 cores and report out of bag error
#with minimum leafSize 8
growforest -train train.fm -rfpred forest.sf -target B:FeatureName -oob \
-nCores 16 -nTrees 1000 -leafSize 8
#grow a 1000 tree forest evaluating half the features as candidates at each
#split and reporting out of bag error after each tree to watch for convergence
growforest -train train.fm -rfpred forest.sf -target B:FeatureName -mTry .5 -progress
#growforest with weighted random forest
growforest -train train.fm -rfpred forest.sf -target B:FeatureName \
-rfweights '{"true":2,"false":0.5}'
#report all growforest options
growforest -h
#Print the (balanced for classification, least squares for regression) error
#rate on test data to standard out
applyforest -fm test.fm -rfpred forest.sf
#Apply the forest, report error rate and save predictions
#Predictions are output in a tsv as:
#CaseLabel Predicted Actual
applyforest -fm test.fm -rfpred forest.sf -preds predictions.tsv
#Calculate counts of case vs case (leaves) and case vs feature (branches) proximity.
#Leaves are reported as:
#Case1 Case2 Count
#Branches Are Reported as:
#Case Feature Count
leafcount -train train.fm -rfpred forest.sf -leaves leaves.tsv -branches branches.tsv
#Generate training and testing folds
nfold -fm data.fm
#growforest with internal training and testing
growforest -train train_0.fm -target N:FeatureName -test test_0.fm
#growforest with internal training and testing, 10 ace feature selection permutations and
#testing performed only using significant features
growforest -train train_0.fm -target N:FeatureName -test test_0.fm -ace 10 -cutoff .05
Growforest Utility
growforest trains a forest using the following parameters, which can be listed with -h.
Parameters are implemented using Go's flag package, so boolean parameters can be set to true with a simple flag:
#the following are equivalent
growforest -oob
growforest -oob=true
And equals signs and quotes are optional for other parameters:
#the following are equivalent
growforest -train featurematrix.afm
growforest -train="featurematrix.afm"
Basic options
-target="": The row header of the target in the feature matrix.
-train="featurematrix.afm": AFM formatted feature matrix containing training data.
-rfpred="rface.sf": File name to output predictor forest in sf format.
-leafSize="0": The minimum number of cases on a leaf node. If <=0, inferred as 1 for classification and 4 for regression.
-maxDepth=0: Maximum tree depth. Ignored if 0.
-mTry="0": Number of candidate features for each split as a count (ex: 10) or portion of total (ex: .5). Ceil(sqrt(nFeatures)) if <=0.
-nSamples="0": The number of cases to sample (with replacement) for each tree as a count (ex: 10) or portion of total (ex: .5). If <=0 set to total number of cases.
-nTrees=100: Number of trees to grow in the predictor.
-importance="": File name to output importance.
-oob=false: Calculate and report oob error.
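Options like -mTry and -nSamples accept either a count or a portion of the total, as described above. A sketch of how such a specification resolves for mTry (an illustrative helper under the stated defaults, not CloudForest's code):

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// resolveMTry interprets an mTry spec as the README describes:
// a count (ex: "10"), a portion of the total (ex: ".5"), or
// Ceil(sqrt(nFeatures)) when the value is <=0 or unparsable.
func resolveMTry(spec string, nFeatures int) int {
	v, err := strconv.ParseFloat(spec, 64)
	if err != nil || v <= 0 {
		return int(math.Ceil(math.Sqrt(float64(nFeatures))))
	}
	if v < 1 {
		return int(v * float64(nFeatures))
	}
	return int(v)
}

func main() {
	fmt.Println(resolveMTry("0", 100))  // default: ceil(sqrt(100)) = 10
	fmt.Println(resolveMTry(".5", 100)) // portion: 50
	fmt.Println(resolveMTry("10", 100)) // count: 10
}
```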
Advanced Options
-blacklist="": A list of feature ids to exclude from the set of predictors.
-includeRE="": Filter features that DON'T match this RE.
-blockRE="": A regular expression to identify features that should be filtered out.
-force=false: Force at least one non constant feature to be tested for each split as in scikit-learn.
-impute=false: Impute missing values to feature mean/mode before growth.
-nCores=1: The number of cores to use.
-progress=false: Report tree number and running oob error.
-oobpreds="": Calculate and report oob predictions in the file specified.
-cpuprofile="": Write cpu profile to the specified file.
-multiboost=false: Allow multi-threaded boosting which may have unexpected results. (highly experimental)
-nobag=false: Don't bag samples for each tree.
-evaloob=false: Evaluate potential splitting features on OOB cases after finding split value in bag.
-selftest=false: Test the forest on the data and report accuracy.
-splitmissing=false: Split missing values onto a third branch at each node (experimental).
-test="": Data to test the model on after training.
Regression Options
-gbt=0: Use gradient boosting with the specified learning rate.
-l1=false: Use l1 norm regression (target must be numeric).
-ordinal=false: Use ordinal regression (target must be numeric).
Classification Options
-adaboost=false: Use Adaptive boosting for classification.
-balanceby="": Roughly balanced bag the target within each class of this feature.
-balance=false: Balance bagging of samples by target class for unbalanced classification.
-cost="": For categorical targets, a json string to float map of the cost of falsely identifying each category.
-entropy=false: Use entropy minimizing classification (target must be categorical).
-hellinger=false: Use Hellinger distance minimizing classification (target must be categorical).
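As background for the -entropy option above: entropy-driven classification chooses the split that most reduces the Shannon entropy of the resulting class distributions. The impurity itself, as an illustrative sketch:

```go
package main

import (
	"fmt"
	"math"
)

// entropy computes the Shannon entropy (in bits) of a class-count
// distribution; 0 means a pure node, and a uniform two-class node
// scores 1 bit.
func entropy(counts []int) float64 {
	total := 0
	for _, c := range counts {
		total += c
	}
	h := 0.0
	for _, c := range counts {
		if c == 0 {
			continue
		}
		p := float64(c) / float64(total)
		h -= p * math.Log2(p)
	}
	return h
}

func main() {
	fmt.Println(entropy([]int{10, 0})) // pure node: 0
	fmt.Println(entropy([]int{5, 5}))  // maximally mixed: 1
}
```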

