# Laurae

Advanced High Performance Data Science Toolbox for R by Laurae
## Install / Use
```r
devtools::install_github("Laurae2/Laurae")
```
## Latest News (DD/MM/YYYY)
24/03/2017: Added Xgboard, an interactive dashboard for visualizing xgboost training whether you are on a computer, a phone, or a tablet, by setting up a server accessible from a web browser (Google Chrome, Firefox...). Currently supports only Accuracy and Timing; more to come soon!

04/03/2017: Added a Deep Forest implementation in R using xgboost, which may deliver performance similar to very simple Convolutional Neural Networks (CNNs) and slightly better results than boosted models. You can find the paper here. Supported: Complete-Random Tree Forest, Cascade Forest, Multi-Grained Scanning, Deep Forest. You can use Gradient Boosting to get a sort of "Deep Boosting" model.
Benchmark on MNIST: 2,000 samples for training, 10,000 samples for testing, i7-4600U, 3-fold cross-validation (Cascade Forest and Multi-Grained Scanning both with poor parameters for speed):

| Model | Features | Accuracy | Training Time | Model Size |
| --- | ---: | ---: | ---: | --- |
| Cascade Forest (xgboost) | 784 | 89.91%<br>6th iteration | 637.264s<br>11th iteration | Forest: 274,951,008 bytes |
| Boosted Trees (xgboost) | 784 | 90.53%<br>250th iteration | 267.884s<br>300 iterations | Boost: NA |
| "Deep Forest" (xgboost)<br>=> Multi-Grained Scanning<br>=> Cascade Forest | Scan: 28x28<br>Forest: 2404 | 91.46%<br>5 iterations | Scan: 449.593s<br>Forest (8): 1135.937s | Scan: 256,419,396 bytes<br>Forest: 273,624,912 bytes |
| "Deep Boosting" (xgboost)<br>=> Multi-Grained Scanning<br>=> Boosted Trees | Scan: 28x28<br>Boost: 2404 | 92.41%<br>215 iterations | Scan: 449.593s<br>Boost (265): 852.360s | Scan: 256,419,396 bytes<br>Boost: NA |
| LeNet (MXNet + R w/ Intel MKL) | 28x28 | 94.74%<br>50 epochs | 647.638s<br>50 epochs | CNN: NA |

10/02/2017: Added Partial Dependence Analysis. It is currently a skeleton, but I will build more on it. It fully works for analyzing single observations against the features you specify. The multiple-observation version does not yet support statistical analysis of the results.
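The single-observation idea can be sketched in a few lines of base R. Here `lm` on the built-in `mtcars` data stands in for any black-box model; this is an illustrative sketch, not the package's actual interface:

```r
# Single-observation partial dependence sketch: vary one feature over a grid
# while holding the observation's other features fixed, record the prediction.
model <- lm(mpg ~ wt + hp, data = mtcars)      # stand-in for a black box
obs   <- mtcars[1, ]                           # the observation to explain
grid  <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 5)
preds <- sapply(grid, function(w) { obs$wt <- w; predict(model, newdata = obs) })
data.frame(wt = grid, prediction = preds)      # prediction falls as wt rises
```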
30/01/2017: Added "Lextravagenza", a machine learning model based on xgboost that ignores past gradients/hessians during optimization but uses dynamic trees to outperform small boosted trees.
09/01/2017: My LightGBM PR for easy installation in R has been merged into the official LightGBM repository. When I get time to work more on it (harvesting metrics, harvesting feature importance, saving/loading models), I will update this package and get rid of the old LightGBM wrapper. This way, you will be able to use the latest versions of LightGBM instead of being stuck with the (old) PR 33 of LightGBM.
08/01/2017: I'm starting to work on an automated machine learning model / stacker.
## What is Data Science

## What can I do with it?
Mostly... in a nutshell:
| What? | Can you do? |
| --- | --- |
| Supervised Learning | Deep Forest implementation: Complete-Random Tree Forest, Cascade Forest, Multi-Grained Scanning, Deep Forest <br> Automated machine learning (feature selection + hyperparameter tuning) <br> xgboost <br> LightGBM (training from binary, feature importance, prediction) <br> Rule-based model on outliers (univariate, bivariate) <br> Feature engineering assistant <br> Interactive xgboost feature importance <br> Repeated cross-validation <br> Symbolic loss function derivation <br> Interactive split feature engineering assistant <br> Laurae's Lextravagenza (dynamic boosted trees) <br> Partial dependency analysis on single observations for finding insights |
| Unsupervised Learning | Automated t-SNE |
| Automated Reporting for Machine Learning | Linear regression <br> Unbiased xgboost regression/classification |
| Interactive Analysis | Interactive loss function symbolic derivation <br> Interactive "I'm Feeling Lucky" ggplot <br> Interactive D3.js/Plotly <br> Interactive Brewer's Palettes, Xgboard |
| Optimization | Cross-Entropy optimization combined with Elite optimization |
| data.table improvements | Up to 3X memory efficiency without even a minor cost in CPU time |
| Plot massive amounts of data without being slow | tableplots tableplots tableplots |
| SVMLight I/O (external package) | C++ implementation of SVMLight reading/saving for dgCMatrix (sparse column-compressed format) |
### Supervised Learning
- Deep Forest implementation: the first R implementation of Complete-Random Tree Forest, Cascade Forest, Multi-Grained Scanning, and Deep Forest. Read more in the paper.
- (Soon Deprecated) Use LightGBM in R (the first LightGBM wrapper available in R), tuned for maximum I/O without in-memory dataset moves (which is both a good and a bad thing - 10GB of data takes 4 minutes to travel on an HDD), and get feature importance with smart, readable plots. I recommend using the official LightGBM R package, which I contribute to.
- Automated Machine Learning from a set of features and hyperparameters (provide algorithm functions, features, and hyperparameters, and a stochastic optimizer does the job for you, with full logging if required)
- Use a repeated cross-validated xgboost (Extreme Gradient Boosting)
- Get pretty interactive feature importance tables of xgboost ready-to-use for markdown documents
- Throw supervised rules using outliers anywhere you feel it appropriate (univariate, bivariate)
- Create cross-validated and repeated cross-validated folds for supervised learning, with more options for creating them (like batch creation - these folds can be fed into my LightGBM R wrapper for extensive analysis of feature behavior)
- Feature Engineering Assistant (mostly non-linear version) using automated decision trees
- Dictionary of loss functions ready to plug into xgboost (currently: Absolute Error, Squared Error, Cubic Error, Loglikelihood Error, Poisson Error, Kullback-Leibler Error)
- Symbolic Derivation for custom loss functions (finding the gradient/hessian painlessly)
- Lextravagenza model (dynamic boosted trees), which is good for small numbers of boosting iterations and bad for large ones (good for diversity)
- Partial dependency analysis for a single observation: the way to get insights on why a black box made a specific decision!
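The symbolic derivation item above can be sketched with base R's `D()`; the package's own function names differ, so this only illustrates the idea of deriving a gradient/hessian mechanically instead of by hand:

```r
# Symbolically derive the gradient and hessian of a squared-error loss with
# respect to the prediction p (illustrative sketch, not the package's API).
loss <- expression((y - p)^2)
grad <- D(loss, "p")        # first derivative w.r.t. p
hess <- D(grad, "p")        # second derivative w.r.t. p

y <- 1; p <- 0.3
eval(grad)                  # -1.4
eval(hess)                  # 2
```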
### Unsupervised Learning
- Auto-tune t-SNE (t-Distributed Stochastic Neighbor Embedding); it also comes with premade hyperparameters tuned for minimal reproduction loss!
### Automated Reporting for Machine Learning
- Generate an in-depth automated report for linear regression with interactive elements.
- Generate an in-depth automated report for xgboost regression/classification with interactive elements and unbiased feature importance computations.
### Interactive Analysis
- Discover and optimize gradient and hessian functions interactively in real-time
- Plot up to 1 dependent variable, 2 independent variables, 2 conditioning variables, and 1 weighting variable for Exploratory Data Analysis using ggplot, in real-time
- Plot up to three variables for Exploratory Data Analysis using D3.js via NVD3, in real-time
- Plot several variables for Exploratory Data Analysis using D3.js via Plotly/ggplot, in real-time
- Discover rule-based (from decision trees) non-linear relationships between variables, with rules ready to be copied and pasted for data.tables
- Visualize Color Brewer palettes interactively with unlimited colors (unlike the original palettes), with ready-to-copy color codes as vectors
- Monitor xgboost training in real time
### Optimization
- Do feature selection & hyperparameter optimization using Cross-Entropy optimization & Elite optimization
- Do the same optimization with any variable type (continuous, ordinal, discrete) for any function, using fully personalized callbacks (which is both a great thing and a hassle for the user) and a personalized training backend (by default it uses xgboost as the predictor for next steps; you can swap in another (un)supervised machine learning model!)
- Symbolic Derivation for custom loss functions (finding the gradient/hessian painlessly)
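The core of Cross-Entropy optimization with elitism fits in a few lines of base R. This toy sketch (not the package's interface) does feature selection on a made-up score that rewards picking features 1-3 and penalizes any others:

```r
# Cross-Entropy optimization with elite selection, toy feature-selection
# example (illustrative only; the package's frontend and defaults differ).
set.seed(1)
score  <- function(mask) sum(mask[1:3]) - 0.5 * sum(mask[-(1:3)])
n_feat <- 10
probs  <- rep(0.5, n_feat)              # inclusion probability per feature
for (iter in 1:30) {
  # Sample 50 candidate feature masks from the current probabilities
  masks  <- matrix(rbinom(50 * n_feat, 1, probs), nrow = 50, byrow = TRUE)
  scores <- apply(masks, 1, score)
  # Keep the elite (top 10 masks) and move probabilities toward them
  elite  <- masks[order(scores, decreasing = TRUE)[1:10], , drop = FALSE]
  probs  <- 0.7 * colMeans(elite) + 0.3 * probs   # smoothed update
}
round(probs, 2)  # probabilities concentrate on features 1-3
```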
### Improvements & Extras
- Improve data.table memory efficiency by up to 3X while keeping most of its performance (best of both worlds, isn't that insane?)
- Improve Cross-Entropy optimization by providing a more powerful frontend (at the expense of required user knowledge), converging better on feature selection but slower on hyperparameter optimization of black boxes
- Load sparse data directly as a dgCMatrix (sparse matrix)
- Plot massive amount of data in an easily readable picture
- Add unlimited colors to the Color Brewer palettes
- Add the ability to display linear equation coefficients on ggplot facets
- Add multiplot support for ggplot
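The "unlimited colors" idea can be sketched with base R's `grDevices::colorRampPalette`, which interpolates a fixed palette to any size (the package's actual implementation may differ; the `extend_palette` helper and the hard-coded 3-color "Blues" palette below are illustrative):

```r
# Extend a fixed Brewer-style palette to an arbitrary number of colors by
# interpolation (sketch; extend_palette is a hypothetical helper).
blues <- c("#DEEBF7", "#9ECAE1", "#3182BD")   # a 3-color "Blues" palette
extend_palette <- function(palette, n) colorRampPalette(palette)(n)
pal <- extend_palette(blues, 10)
pal  # 10 hex color codes, ready to copy & paste as a vector
```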
### Sparsity SVMLight converter benchmark
- Benchmark: converting a dgCMatrix with 2,500,000 rows and 8,500 columns (1.1GB in memory) takes about 5 minutes
- Other existing converters would likely need hours, if not days, at that size.
- Currently not merged on this repository: see https://github.com/Laurae2/sparsity !
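For reference, the SVMLight text format the converter reads and writes is one row per line, as `label index:value` pairs with 1-based indices and zeros omitted. A minimal base-R writer sketches the format (the sparsity package itself uses a C++ implementation on dgCMatrix, not this):

```r
# Illustrative base-R SVMLight writer on a tiny dense matrix: each line is
# "label index:value ..." with zero entries dropped.
m <- matrix(c(1, 0, 3,
              0, 2, 0), nrow = 2, byrow = TRUE)
labels <- c(1, 0)
svmlight_lines <- vapply(seq_len(nrow(m)), function(i) {
  nz <- which(m[i, ] != 0)                              # non-zero columns
  paste(labels[i], paste0(nz, ":", m[i, nz], collapse = " "))
}, character(1))
svmlight_lines
# "1 1:1 3:3"  "0 2:2"
```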
### Nice pictures
- Partial Dependence for single observation analysis (5-variate example):
