# Laurae

Advanced High Performance Data Science Toolbox for R by Laurae
## Install / Use
```r
devtools::install_github("Laurae2/Laurae")
```
## Latest News (DD/MM/YYYY)
24/03/2017: Added Xgboard, an interactive dashboard for visualizing xgboost training whether you are on a computer, a phone, or a tablet, by setting up a server accessible from a web browser (Google Chrome, Firefox...). Currently supports only Accuracy and Timing; more to come soon!

04/03/2017: Added a Deep Forest implementation in R using xgboost, which may deliver performance similar to very simple Convolutional Neural Networks (CNNs) and slightly better results than boosted models. You can find the paper here. Supported: Complete-Random Tree Forest, Cascade Forest, Multi-Grained Scanning, Deep Forest. You can use Gradient Boosting to get a sort of "Deep Boosting" model.
Benchmark on MNIST: 2,000 samples for training, 10,000 samples for testing, i7-4600U, 3-fold cross-validation (Cascade Forest and Multi-Grained Scanning both with poor parameters for speed):

| Model | Features | Accuracy | Training Time | Model Size |
| --- | ---: | ---: | ---: | --- |
| Cascade Forest (xgboost) | 784 | 89.91%<br>6th iteration | 637.264s<br>11th iteration | Forest: 274,951,008 bytes |
| Boosted Trees (xgboost) | 784 | 90.53%<br>250th iteration | 267.884s<br>300 iterations | Boost: NA |
| "Deep Forest" (xgboost)<br>=> Multi-Grained Scanning<br>=> Cascade Forest | Scan: 28x28<br>Forest: 2404 | 91.46%<br>5 iterations | Scan: 449.593s<br>Forest (8): 1135.937s | Scan: 256,419,396 bytes<br>Forest: 273,624,912 bytes |
| "Deep Boosting" (xgboost)<br>=> Multi-Grained Scanning<br>=> Boosted Trees | Scan: 28x28<br>Boost: 2404 | 92.41%<br>215 iterations | Scan: 449.593s<br>Boost (265): 852.360s | Scan: 256,419,396 bytes<br>Boost: NA |
| LeNet (MXNet + R w/ Intel MKL) | 28x28 | 94.74%<br>50 epochs | 647.638s<br>50 epochs | CNN: NA |

10/02/2017: Added Partial Dependence Analysis. It is currently a skeleton, but I will build more on it. It fully works for analyzing single observations against the features you specify. The multiple-observation version does not yet support statistical analysis of the results.
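The single-observation idea can be sketched in a few lines of base R. Here `lm` on the built-in `mtcars` data stands in for any black-box model; this is an illustrative sketch, not the package's actual interface:

```r
# Single-observation partial dependence sketch: vary one feature over a grid
# while holding the observation's other features fixed, record the prediction.
model <- lm(mpg ~ wt + hp, data = mtcars)      # stand-in for a black box
obs   <- mtcars[1, ]                           # the observation to explain
grid  <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 5)
preds <- sapply(grid, function(w) { obs$wt <- w; predict(model, newdata = obs) })
data.frame(wt = grid, prediction = preds)      # prediction falls as wt rises
```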
30/01/2017: Added "Lextravagenza", a machine learning model based on xgboost that ignores past gradients/hessians during optimization but uses dynamic trees to outperform small boosted trees.
09/01/2017: My LightGBM PR for easy installation in R has been merged into the official LightGBM repository. When I get time to work more on it (harvesting metrics, harvesting feature importance, saving/loading models), I will update this package and get rid of the old LightGBM wrapper. This way, you will be able to use the latest versions of LightGBM instead of being stuck with the (old) PR 33 of LightGBM.
08/01/2017: I'm starting to work on an automated machine learning model / stacker.
## What is Data Science

## What can I do with it?
Mostly... in a nutshell:
| What? | Can you do? |
| --- | --- |
| Supervised Learning | Deep Forest implementation: Complete-Random Tree Forest, Cascade Forest, Multi-Grained Scanning, Deep Forest <br> Automated machine learning (feature selection + hyperparameter tuning) <br> xgboost <br> LightGBM (training from binary, feature importance, prediction) <br> Rule-based model on outliers (univariate, bivariate) <br> Feature engineering assistant <br> Interactive xgboost feature importance <br> Repeated cross-validation <br> Symbolic loss function derivation <br> Interactive split feature engineering assistant <br> Laurae's Lextravagenza (dynamic boosted trees) <br> Partial dependency analysis on single observations for finding insights |
| Unsupervised Learning | Automated t-SNE |
| Automated Reporting for Machine Learning | Linear regression <br> Unbiased xgboost regression/classification |
| Interactive Analysis | Interactive loss function symbolic derivation <br> Interactive "I'm Feeling Lucky" ggplot <br> Interactive D3.js/Plotly <br> Interactive Brewer's Palettes, Xgboard |
| Optimization | Cross-Entropy optimization combined with Elite optimization |
| data.table improvements | Up to 3X memory efficiency without even a minor cost in CPU time |
| Plot massive amounts of data without being slow | tableplots tableplots tableplots |
| SVMLight I/O (external package) | C++ implementation of SVMLight reading/saving for dgCMatrix (sparse column-compressed format) |
### Supervised Learning
- Deep Forest implementation: the first R implementation of Complete-Random Tree Forest, Cascade Forest, Multi-Grained Scanning, and Deep Forest. Read more in the paper.
- (Soon Deprecated) Use LightGBM in R (the first LightGBM wrapper available in R), tuned for maximum I/O without in-memory dataset moves (which is both a good and a bad thing - 10GB of data takes 4 minutes to travel on an HDD), and get feature importance with smart, readable plots. I recommend using the official LightGBM R package, which I contribute to.
- Automated Machine Learning from a set of features and hyperparameters (provide algorithm functions, features, and hyperparameters, and a stochastic optimizer does the job for you, with full logging if required)
- Use a repeated cross-validated xgboost (Extreme Gradient Boosting)
- Get pretty interactive feature importance tables of xgboost ready-to-use for markdown documents
- Throw supervised rules using outliers anywhere you feel it appropriate (univariate, bivariate)
- Create cross-validated and repeated cross-validated folds for supervised learning, with more options for creating them (like batch creation - these folds can be fed into my LightGBM R wrapper for extensive analysis of feature behavior)
- Feature Engineering Assistant (mostly non-linear version) using automated decision trees
- Dictionary of loss functions ready to plug into xgboost (currently: Absolute Error, Squared Error, Cubic Error, Loglikelihood Error, Poisson Error, Kullback-Leibler Error)
- Symbolic Derivation for custom loss functions (finding the gradient/hessian painlessly)
- Lextravagenza model (dynamic boosted trees), which is good for small numbers of boosting iterations and bad for large ones (good for diversity)
- Partial dependency analysis for a single observation: the way to get insights on why a black box made a specific decision!
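The symbolic derivation item above can be sketched with base R's `D()`; the package's own function names differ, so this only illustrates the idea of deriving a gradient/hessian mechanically instead of by hand:

```r
# Symbolically derive the gradient and hessian of a squared-error loss with
# respect to the prediction p (illustrative sketch, not the package's API).
loss <- expression((y - p)^2)
grad <- D(loss, "p")        # first derivative w.r.t. p
hess <- D(grad, "p")        # second derivative w.r.t. p

y <- 1; p <- 0.3
eval(grad)                  # -1.4
eval(hess)                  # 2
```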
### Unsupervised Learning
- Auto-tune t-SNE (t-Distributed Stochastic Neighbor Embedding); it also comes with premade hyperparameters tuned for minimal reproduction loss!
### Automated Reporting for Machine Learning
- Generate an in-depth automated report for linear regression with interactive elements.
- Generate an in-depth automated report for xgboost regression/classification with interactive elements and unbiased feature importance computations.
### Interactive Analysis
- Discover and optimize gradient and hessian functions interactively in real-time
- Plot up to 1 dependent variable, 2 independent variables, 2 conditioning variables, and 1 weighting variable for Exploratory Data Analysis using ggplot, in real-time
- Plot up to three variables for Exploratory Data Analysis using D3.js via NVD3, in real-time
- Plot several variables for Exploratory Data Analysis using D3.js via Plotly/ggplot, in real-time
- Discover rule-based (from decision trees) non-linear relationships between variables, with rules ready to be copied and pasted for data.tables
- Visualize Color Brewer palettes interactively with unlimited colors (unlike the original palettes), with ready-to-copy color codes as vectors
- Monitor xgboost training in real time
### Optimization
- Do feature selection & hyperparameter optimization using Cross-Entropy optimization & Elite optimization
- Do the same optimization with any variable type (continuous, ordinal, discrete) for any function, using fully personalized callbacks (which is both a great thing and a hassle for the user) and a personalized training backend (by default it uses xgboost as the predictor for next steps; you can swap in another (un)supervised machine learning model!)
- Symbolic Derivation for custom loss functions (finding the gradient/hessian painlessly)
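The core of Cross-Entropy optimization with elitism fits in a few lines of base R. This toy sketch (not the package's interface) does feature selection on a made-up score that rewards picking features 1-3 and penalizes any others:

```r
# Cross-Entropy optimization with elite selection, toy feature-selection
# example (illustrative only; the package's frontend and defaults differ).
set.seed(1)
score  <- function(mask) sum(mask[1:3]) - 0.5 * sum(mask[-(1:3)])
n_feat <- 10
probs  <- rep(0.5, n_feat)              # inclusion probability per feature
for (iter in 1:30) {
  # Sample 50 candidate feature masks from the current probabilities
  masks  <- matrix(rbinom(50 * n_feat, 1, probs), nrow = 50, byrow = TRUE)
  scores <- apply(masks, 1, score)
  # Keep the elite (top 10 masks) and move probabilities toward them
  elite  <- masks[order(scores, decreasing = TRUE)[1:10], , drop = FALSE]
  probs  <- 0.7 * colMeans(elite) + 0.3 * probs   # smoothed update
}
round(probs, 2)  # probabilities concentrate on features 1-3
```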
### Improvements & Extras
- Improve data.table memory efficiency by up to 3X while keeping most of its performance (best of both worlds, isn't that insane?)
- Improve Cross-Entropy optimization by providing a more powerful frontend (at the expense of required user knowledge), converging better on feature selection but slower on hyperparameter optimization of black boxes
- Load sparse data directly as a dgCMatrix (sparse matrix)
- Plot massive amount of data in an easily readable picture
- Add unlimited colors to the Color Brewer palettes
- Add the ability to display linear equation coefficients on ggplot facets
- Add multiplot support for ggplot
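The "unlimited colors" idea can be sketched with base R's `grDevices::colorRampPalette`, which interpolates a fixed palette to any size (the package's actual implementation may differ; the `extend_palette` helper and the hard-coded 3-color "Blues" palette below are illustrative):

```r
# Extend a fixed Brewer-style palette to an arbitrary number of colors by
# interpolation (sketch; extend_palette is a hypothetical helper).
blues <- c("#DEEBF7", "#9ECAE1", "#3182BD")   # a 3-color "Blues" palette
extend_palette <- function(palette, n) colorRampPalette(palette)(n)
pal <- extend_palette(blues, 10)
pal  # 10 hex color codes, ready to copy & paste as a vector
```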
### Sparsity SVMLight converter benchmark
- Benchmark: converting a dgCMatrix with 2,500,000 rows and 8,500 columns (1.1GB in memory) takes about 5 minutes
- Other existing converters would likely need hours, if not days, at that size.
- Currently not merged on this repository: see https://github.com/Laurae2/sparsity !
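For reference, the SVMLight text format the converter reads and writes is one row per line, as `label index:value` pairs with 1-based indices and zeros omitted. A minimal base-R writer sketches the format (the sparsity package itself uses a C++ implementation on dgCMatrix, not this):

```r
# Illustrative base-R SVMLight writer on a tiny dense matrix: each line is
# "label index:value ..." with zero entries dropped.
m <- matrix(c(1, 0, 3,
              0, 2, 0), nrow = 2, byrow = TRUE)
labels <- c(1, 0)
svmlight_lines <- vapply(seq_len(nrow(m)), function(i) {
  nz <- which(m[i, ] != 0)                              # non-zero columns
  paste(labels[i], paste0(nz, ":", m[i, nz], collapse = " "))
}, character(1))
svmlight_lines
# "1 1:1 3:3"  "0 2:2"
```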
### Nice pictures
- Partial Dependence for single observation analysis (5-variate example):
