regtools

Novel tools for regression/classification and machine learning

<div style="text-align: right"> *A model should be as simple as possible, but no simpler* -- Albert Einstein </div>

Various tools for Prediction and/or Description goals in regression, classification and machine learning.

Some of them are associated with my book, <i>From Linear Models to Machine Learning: Statistical Regression and Classification</i>, N. Matloff, CRC, 2017 (recipient of the Eric Ziegel Award for Best Book Reviewed in 2017, Technometrics), and with my forthcoming book, <i>The Art of Machine Learning: Algorithms+Data+R</i>, NSP, 2020. But the tools are useful in general, independently of the books.

<font size="3"> <span style="color:red"> CLICK [HERE](#quickstart) for a **regtools** Quick Start! </span> </font>

OVERVIEW: FUNCTION CATEGORIES

See full function list by typing

> ?regtools

Here are the main categories:

  • Parametric modeling, including novel diagnostic plots
  • Classification, including novel methods for probability calibration
  • Machine learning, including advanced grid search
  • Dummies and R factors -- many utilities, e.g. conversion among types
  • Time series, image and text processing utilities
  • Recommender systems
  • Interesting datasets
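As a taste of the dummies/factors category, here is a minimal base-R sketch of converting between a factor and dummy columns, the kind of task the regtools utilities automate (the regtools function names and behavior may differ; this uses only base R):

```r
# A factor with two levels, as in the mlb PosCategory column
f <- factor(c("Catcher", "Infielder", "Catcher"))

# One dummy column per level (drop the intercept), via model.matrix()
dumms <- model.matrix(~ f - 1)
colnames(dumms) <- levels(f)

# And back again: the position of the 1 in each row recovers the level
recovered <- factor(levels(f)[max.col(dumms)], levels = levels(f))
identical(recovered, f)  # TRUE
```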

<a name="quickstart">REGTOOLS QUICK START </a>

Here we will take a quick tour of a subset of regtools features, using datasets mlb and prgeng that are included in the package.

The Data

The mlb dataset consists of data on Major League baseball players. We'll use only a few features, to keep things simple:

> data(mlb)
> head(mlb)
             Name Team       Position Height
1   Adam_Donachie  BAL        Catcher     74
2       Paul_Bako  BAL        Catcher     74
3 Ramon_Hernandez  BAL        Catcher     72
4    Kevin_Millar  BAL  First_Baseman     72
5     Chris_Gomez  BAL  First_Baseman     73
6   Brian_Roberts  BAL Second_Baseman     69
  Weight   Age PosCategory
1    180 22.99     Catcher
2    215 34.69     Catcher
3    210 30.78     Catcher
4    210 35.43   Infielder
5    188 35.71   Infielder
6    176 29.39   Infielder

We'll predict player weight (in pounds) and position.

The prgeng dataset consists of data on Silicon Valley programmers and engineers in the 2000 US Census. It is available in several forms. We'll use the data frame version, and again use only a few features to keep things simple:

> data(peFactors)
> names(peFactors)
 [1] "age"      "cit"      "educ"     "engl"     "occ"      "birth"   
 [7] "sex"      "wageinc"  "wkswrkd"  "yrentry"  "powspuma"
> pef <- peFactors[,c(1,3,5,7:9)]
> head(pef)
       age educ occ sex wageinc wkswrkd
1 50.30082   13 102   2   75000      52
2 41.10139    9 101   1   12300      20
3 24.67374    9 102   2   15400      52
4 50.19951   11 100   1       0      52
5 51.18112   11 100   2     160       1
6 57.70413   11 100   1       0       0

The various education and occupation codes may be obtained from the reference in the help page for this dataset.

We'll predict wage income. One cannot get really good accuracy with the given features, but this dataset will serve as a good introduction to these particular features of the package.

NOTE: The qe-series functions have been spun off to a separate package, <b>qeML</b>.

The qe*() functions' output is ready for making predictions on new cases. We'll use this example:

> newx <- data.frame(age=32, educ='13', occ='106', sex='1', wkswrkd=30)

The qe*() series of regtools wrappers for machine learning functions

One of the features of regtools is its qe*() functions, a set of wrappers. Here 'qe' stands for "quick and easy." These functions provide convenient access to more sophisticated functions, with a very simple, uniform interface.

Note that the simplicity of the interface is just as important as the uniformity. A single call is all that is needed; there are no preparatory calls, such as defining a model object.

The idea is that, given a new dataset, the analyst can quickly and easily try fitting a number of models in succession, say first k-Nearest Neighbors, then random forests:

# fit models
> knnout <- qeKNN(mlb,'Weight',k=25)
> rfout <- qeRF(mlb,'Weight')

# mean abs. pred. error on holdout set, in pounds
> knnout$testAcc
[1] 11.75644
> rfout$testAcc
[1] 12.6787

# predict a new case
> newx <- data.frame(Position='Catcher',Height=73.5,Age=26)
> predict(knnout,newx)
       [,1]
[1,] 204.04
> predict(rfout,newx)
      11
199.1714

How about some other ML methods?


> lassout <- qeLASSO(mlb,'Weight')
> lassout$testAcc
[1] 14.23122

# poly regression, degree 3
> polyout <- qePolyLin(mlb,'Weight',3)
> polyout$testAcc
[1] 13.55613

> nnout <- qeNeural(mlb,'Weight')
# ...
> nnout$testAcc
[1] 12.2537
# try some nondefault hyperparams
> nnout <- qeNeural(mlb,'Weight',hidden=c(200,200),nEpoch=50)
> nnout$testAcc
[1] 15.17982

More about the series

  • They automatically assess the model on a holdout set, using as loss Mean Absolute Prediction Error or Overall Misclassification Rate. (Setting holdout = NULL turns off this option.)

  • They handle R factors correctly in prediction, which some of the wrapped functions do not do by themselves (more on this point below).
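The holdout assessment these functions perform can be sketched in base R. This illustrative version (using the built-in trees dataset; the actual qe* internals differ in details such as holdout size) fits on a training split and reports Mean Absolute Prediction Error on the held-out rows:

```r
set.seed(1)
n <- nrow(trees)
holdout <- sample(n, round(0.2 * n))   # hold out ~20% of the rows

# fit on the non-holdout rows only
fit <- lm(Volume ~ Girth + Height, data = trees[-holdout, ])
preds <- predict(fit, trees[holdout, ])

# Mean Absolute Prediction Error, the loss reported as testAcc
mape <- mean(abs(preds - trees$Volume[holdout]))
mape
```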

Call form for all qe*() functions:

qe*(data, yName, options, including method-specific ones)

Currently available:

  • qeLin() linear model, wrapper for lm()

  • qeLogit() logistic model, wrapper for glm(family=binomial)

  • qeKNN() k-Nearest Neighbors, wrapper for the regtools function kNN()

  • qeRF() random forests, wrapper for randomForest package

  • qeGBoost() gradient boosting on trees, wrapper for gbm package

  • qeSVM() SVM, wrapper for e1071 package

  • qeNeural() neural networks, wrapper for regtools function krsFit(), in turn wrapping keras package

  • qeLASSO() LASSO/ridge, wrapper for glmnet package

  • qePolyLin() polynomial regression, wrapper for the polyreg package, providing full polynomial models (powers and cross products), and correctly handling dummy variables (powers are not formed)

  • qePolyLog() polynomial logistic regression

The classification case is signaled by the column named in the second argument (yName) being an R factor.
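That dispatch rule can be sketched as follows. This is illustrative only, using built-in datasets; the hypothetical qeSketch() below is not part of regtools, and the real qe* internals differ:

```r
# A qe*-style wrapper can branch on whether the Y column is an R factor
qeSketch <- function(data, yName) {
   classif <- is.factor(data[[yName]])
   if (classif) "classification" else "regression"
}

qeSketch(iris, "Species")   # "classification" -- Species is a factor
qeSketch(mtcars, "mpg")     # "regression" -- mpg is numeric
```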

Other related functions

  • qeCompare()

Quick and easy comparison of several ML methods on the same data, e.g.:

qeCompare(mlb,'Weight',
   c('qeLin','qePolyLin','qeKNN','qeRF','qeLASSO','qeNeural'),25)
#       qeFtn  meanAcc
# 1     qeLin 13.30490
# 2 qePolyLin 13.33584
# 3     qeKNN 13.72708
# 4      qeRF 13.46515
# 5   qeLASSO 13.27564
# 6  qeNeural 14.01487

  • pcaQE()

Seamless incorporation of PCA dimension reduction into qe methods, e.g.

z <- pcaQE(0.6,d2,'tot','qeKNN',k=25,holdout=NULL)
newx <- d2[8,-13]
predict(z,newx)
#         [,1]
# [1,] 1440.44
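The idea behind pcaQE() can be sketched in base R: reduce the predictors with PCA, keep enough components to reach the requested proportion of variance, then fit on those components. This version uses prcomp() and lm() on the built-in mtcars data (pcaQE() itself wires this into the qe* interface, so the details differ):

```r
x <- as.matrix(mtcars[, c("disp", "hp", "drat", "wt", "qsec")])
pca <- prcomp(x, scale. = TRUE)

# keep enough components to explain 60% of the variance (cf. the 0.6 above)
cumvar <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k <- which(cumvar >= 0.6)[1]

pcs <- as.data.frame(pca$x[, 1:k, drop = FALSE])
pcs$mpg <- mtcars$mpg
fit <- lm(mpg ~ ., data = pcs)

# to predict a new case, rotate it into PC space first
newx <- predict(pca, x[8, , drop = FALSE])[, 1:k, drop = FALSE]
predict(fit, as.data.frame(newx))
```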

What the qe-series functions wrap

The actual calls made by the qe*-series functions:

# qeKNN()
regtools::kNN(xm, y, newx = NULL, k, scaleX = scaleX, classif = classif)
# qeRF()
randomForest::randomForest(frml, 
   data = data, ntree = nTree, nodesize = minNodeSize)
# qeSVM()  
e1071::svm(frml, data = data, cost = cost, gamma = gamma, decision.values = TRUE)
# qeLASSO()
glmnet::cv.glmnet(x = xm, y = ym, alpha = alpha, family = fam)
# qeGBoost()
gbm::gbm(yDumm ~ .,data=tmpDF,distribution='bernoulli',  # used as OVA
         n.trees=nTree,n.minobsinnode=minNodeSize,shrinkage=learnRate)
# qeNeural()  
regtools::krsFit(x,y,hidden,classif=classif,nClass=length(classNames),
      nEpoch=nEpoch)  # regtools wrapper to keras package
# qeLogit()
glm(yDumm ~ .,data=tmpDF,family=binomial)  # used with OVA
# qePolyLin()
regtools::penrosePoly(d=data,yName=yName,deg=deg,maxInteractDeg)
# qePolyLog()
polyreg::polyFit(data,deg,use="glm")

Linear model analysis in regtools

So, let's try that programmer and engineer dataset.

> lmout <- qeLin(pef,'wageinc') 
> lmout$testAcc
[1] 25520.6  # Mean Absolute Prediction Error on holdout set
> predict(lmout,newx)
      11 
35034.63   

The fit assessment techniques in regtools gauge the fit of parametric models by comparing to nonparametric ones. Since the latter are free of model bias, they are very useful in assessing the parametric models. One of them plots parametric vs. k-NN fit:

> parvsnonparplot(lmout,qeKNN(pef,'wageinc',25))

We specified k = 25 nearest neighbors. Here is the plot:

[Plot: parametric (lm) fit vs. k-NN fit, with red 45-degree line]

Glancing at the red 45-degree line, we see some suggestion here that the linear model tends to underpredict at low and high wage values. If the analyst wished to use a linear model, she would investigate further (always a good idea before resorting to machine learning algorithms), possibly adding quadratic terms to the model.
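The comparison that parvsnonparplot() draws can be sketched in base R with a hand-rolled k-NN smoother on synthetic data. This is an illustrative stand-in, not the regtools implementation; the curved truth guarantees the linear model misfits in exactly the way a parametric-vs-nonparametric plot would expose:

```r
set.seed(2)
n <- 200
x <- runif(n, 0, 10)
y <- 50 + x^2 + rnorm(n, sd = 5)   # curved truth: lm will misfit

lmfit <- fitted(lm(y ~ x))         # parametric fitted values

# simple k-NN regression estimate at each observed x
k <- 25
knnfit <- sapply(x, function(x0) {
   idx <- order(abs(x - x0))[1:k]  # indices of the k nearest neighbors
   mean(y[idx])
})

# plotting lmfit against knnfit, with a 45-degree reference line, would
# reveal the systematic under/overprediction pattern discussed above
cor(lmfit, knnfit)
```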

We saw above an example of one such function, parvsnonparplot(). Another is nonparvarplot(). Here is the data prep:

data(peDumms)  # dummy variables version of prgeng
pe1 <- peDumms[c('age','educ.14','educ.16','sex.1','wageinc','wkswrkd')]

We will check the classical assumption of homoscedasticity, meaning that the conditional variance of Y given X is constant. The function <b>nonparvarplot()</b> plots the estimated conditional variance against the estimated conditional mean, both computed nonparametrically:

[Plot: estimated conditional variance vs. estimated conditional mean]

Though we ran the plot with the homoscedasticity assumption in mind, and we do indeed see larger variance at larger mean values, the more striking feature is the clustering, which suggests interesting subpopulations within this data. Since there appear to be 6 clusters, and there are 6 occupations, the observed pattern may reflect occupation.
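The nonparametric estimate behind such a plot can be sketched in base R: within each point's k-NN neighborhood, the sample mean and variance of Y approximate the conditional mean and conditional variance at that X value. This illustrative version uses synthetic heteroscedastic data rather than the regtools internals:

```r
set.seed(3)
n <- 300
x <- runif(n)
y <- rnorm(n, mean = 10 * x, sd = 1 + 3 * x)  # heteroscedastic by design

k <- 30
stats <- t(sapply(x, function(x0) {
   idx <- order(abs(x - x0))[1:k]             # k-NN neighborhood of x0
   c(condMean = mean(y[idx]), condVar = var(y[idx]))
}))

# plotting stats[, "condVar"] against stats[, "condMean"] would show the
# variance rising with the mean, as in the wage data above
cor(stats[, "condMean"], stats[, "condVar"])
```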

The package includes various other graphical diagnostic functions, such as nonparvsxplot().

By the way, violation of the homoscedasticity assumption won't invalidate the linear model's coefficient estimates, which remain statistically consistent, but it does distort their standard errors and thus any inference based on them.
