Rodeo
RODEO: R Optimized Data Engineering Operations
Note: see vignette for examples and parameter definitions
Install Rodeo
install.packages('bit64')
install.packages('data.table')
install.packages('collapse')
install.packages('timeDate')
install.packages('h2o')
install.packages('Rfast')
install.packages('combinat')
install.packages('nortest')
install.packages('lubridate')
install.packages('fBasics')
devtools::install_github("AdrianAntico/AutoNLP", upgrade = FALSE)
devtools::install_github("AdrianAntico/Rodeo", upgrade = FALSE)
Automated feature engineering using data.table and collapse
Character Type Data
CategoricalEncoding
- Nested random effects
- Actuarial Bühlmann credibility
- Target encoding
- Weight of Evidence
- m-estimator
- poly encode
- backward_difference
- helmert
DummifyDT
- All levels
- Partial set of levels
Numeric Type Data
- Numeric transformations
- Interactions
Calendar Type Data
- Calendar variables
- Holiday variables
Cross Row Operations
- Lags and Rolling stats for numeric variables
- Differencing for numeric, date, and categorical variables
- Rolling modes for categorical variables
Data sets
- Partitioning
- Type conversion for modeling
Model Based Features
- Dimensionality reduction
- Clustering
- Word2Vec
- Anomaly detection
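To make one of the encodings named in the list above concrete, here is a minimal sketch of the m-estimator form of target encoding, which shrinks each level's target mean toward the global mean. It uses plain data.table calls and is not how CategoricalEncoding is implemented; the table `dt`, the columns `city`, `y`, and `city_mest`, and the choice `m = 2` are all hypothetical.

```r
library(data.table)

# Hypothetical toy data: a categorical column and a binary target
dt <- data.table(
  city = c("a", "a", "a", "b", "b", "c"),
  y    = c(1, 0, 1, 0, 0, 1))

m     <- 2               # smoothing weight (assumed value for illustration)
prior <- dt[, mean(y)]   # global target mean

# Per-level count and target mean
enc <- dt[, .(n = .N, level_mean = mean(y)), by = city]

# m-estimator: weighted blend of the level mean and the global mean
enc[, city_mest := (n * level_mean + m * prior) / (n + m)]

# Attach the encoded value to every row with an update join
dt[enc, city_mest := i.city_mest, on = "city"]
```

Levels with few rows (such as `c`) are pulled more strongly toward the global mean than levels with many rows. In a modeling workflow the encoding would normally be learned on training data only and then applied to validation and scoring data.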
AutoLagRollStats() and AutoLagRollStatsScoring()
<details><summary>Code Example</summary>
<p>
# Create fake Panel Data----
Count <- 1L
for(Level in LETTERS) {
datatemp <- AutoQuant::FakeDataGenerator(
Correlation = 0.75,
N = 25000L,
ID = 0L,
ZIP = 0L,
FactorCount = 0L,
AddDate = TRUE,
Classification = FALSE,
MultiClass = FALSE)
datatemp[, Factor1 := eval(Level)]
if(Count == 1L) {
data <- data.table::copy(datatemp)
} else {
data <- data.table::rbindlist(list(data, data.table::copy(datatemp)))
}
Count <- Count + 1L
}
# Build lags and rolling stats by group
data <- Rodeo::AutoLagRollStats(
# Data
data = data,
DateColumn = "DateTime",
Targets = "Adrian",
HierarchyGroups = NULL,
IndependentGroups = c("Factor1"),
TimeUnitAgg = "days",
TimeGroups = c("days", "weeks", "months", "quarters"),
TimeBetween = NULL,
TimeUnit = "days",
# Services
RollOnLag1 = TRUE,
Type = "Lag",
SimpleImpute = TRUE,
# Calculated Columns
Lags = list("days" = c(seq(1,5,1)), "weeks" = c(seq(1,3,1)), "months" = c(seq(1,2,1)), "quarters" = c(seq(1,2,1))),
MA_RollWindows = list("days" = c(seq(1,5,1)), "weeks" = c(seq(1,3,1)), "months" = c(seq(1,2,1)), "quarters" = c(seq(1,2,1))),
SD_RollWindows = NULL,
Skew_RollWindows = NULL,
Kurt_RollWindows = NULL,
Quantile_RollWindows = NULL,
Quantiles_Selected = NULL,
Debug = FALSE)
</p>
</details>
<details><summary>Code Example</summary>
<p>
# Create fake Panel Data----
Count <- 1L
for(Level in LETTERS) {
datatemp <- AutoQuant::FakeDataGenerator(
Correlation = 0.75,
N = 25000L,
ID = 0L,
ZIP = 0L,
FactorCount = 0L,
AddDate = TRUE,
Classification = FALSE,
MultiClass = FALSE)
datatemp[, Factor1 := eval(Level)]
if(Count == 1L) {
data <- data.table::copy(datatemp)
} else {
data <- data.table::rbindlist(list(data, data.table::copy(datatemp)))
}
Count <- Count + 1L
}
# Create ID columns to know which records to score
data[, ID := .N:1L, by = "Factor1"]
data.table::set(data, i = which(data[["ID"]] == 2L), j = "ID", value = 1L)
# Score records
data <- Rodeo::AutoLagRollStatsScoring(
# Data
data = data,
RowNumsID = "ID",
RowNumsKeep = 1,
DateColumn = "DateTime",
Targets = "Adrian",
HierarchyGroups = NULL,
IndependentGroups = c("Factor1"),
# Services
TimeBetween = NULL,
TimeGroups = c("days", "weeks", "months"),
TimeUnit = "day",
TimeUnitAgg = "day",
RollOnLag1 = TRUE,
Type = "Lag",
SimpleImpute = TRUE,
# Calculated Columns
Lags = list("days" = c(seq(1,5,1)), "weeks" = c(seq(1,3,1)), "months" = c(seq(1,2,1))),
MA_RollWindows = list("days" = c(seq(1,5,1)), "weeks" = c(seq(1,3,1)), "months" = c(seq(1,2,1))),
SD_RollWindows = list("days" = c(seq(1,5,1)), "weeks" = c(seq(1,3,1)), "months" = c(seq(1,2,1))),
Skew_RollWindows = list("days" = c(seq(1,5,1)), "weeks" = c(seq(1,3,1)), "months" = c(seq(1,2,1))),
Kurt_RollWindows = list("days" = c(seq(1,5,1)), "weeks" = c(seq(1,3,1)), "months" = c(seq(1,2,1))),
Quantile_RollWindows = list("days" = c(seq(1,5,1)), "weeks" = c(seq(1,3,1)), "months" = c(seq(1,2,1))),
Quantiles_Selected = c("q5","q10","q95"),
Debug = FALSE)
</p>
</details>
<details><summary>Function Description</summary>
<p>
<code>AutoLagRollStats()</code> builds lags and rolling statistics by grouping variables and their interactions, optionally across several time aggregations. Rolling stats include the mean, standard deviation, skewness, kurtosis, and the 5th through 95th percentiles. The function was inspired by the distributed lag modeling framework, but I wanted to use it for time series analysis as well and generalize it as much as possible. The motivating example is predicting whether a baseball player will get at least one base hit in his next at bat. One easy way to get a better idea of the likelihood is to look at his batting average and his career batting average. However, players go through hot streaks and slumps. How do we account for that? That is where these functions come in. You look at the batting average over the last N to N+x at bats, for various N and x. I keep going though: I want the same windows for the player's standard deviation, skewness, kurtosis, and various quantiles. I also want all of those measures computed on weekly data, as in, over the last N weeks, pull in those stats too.
<code>AutoLagRollStatsScoring()</code> builds the above features for a partial set of records in a data set. It computes these feature vectors significantly faster than the non-scoring version, which comes in handy when scoring ML models. If you can find a way to make it faster, let me know.
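
The hot-streak idea can be sketched with plain data.table calls. This is a minimal illustration, not Rodeo's implementation: the `at_bats` table and the column names `hit_lag1` and `hit_avg3` are hypothetical, and rolling over the lagged column is an assumption about what an option such as `RollOnLag1` is meant to achieve.

```r
library(data.table)

# Hypothetical toy data: one row per at bat, hit = 1 for a base hit
at_bats <- data.table(
  player = rep(c("A", "B"), each = 6L),
  at_bat = rep(1:6, times = 2L),
  hit    = c(1, 0, 0, 1, 1, 1,
             0, 0, 1, 0, 1, 0))

# Sort within player so that lags look backward in time
setorder(at_bats, player, at_bat)

# Lag 1: outcome of the previous at bat, computed separately per player
at_bats[, hit_lag1 := shift(hit, n = 1L, type = "lag"), by = player]

# Rolling batting average over the 3 at bats before the current one;
# rolling on the lagged column keeps the current outcome out of the feature
at_bats[, hit_avg3 := frollmean(hit_lag1, n = 3L), by = player]
```

The first three rows of each player are `NA` for `hit_avg3` because a full window of lagged values does not exist yet. The same pattern extends to other window lengths and to data aggregated to weeks or months before lagging.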
</p>
</details>
AutoLagRollMode()
<details><summary>Code Example</summary>
<p>
# NO GROUPING CASE: Create fake Panel Data----
Count <- 1L
for(Level in LETTERS) {
datatemp <- AutoQuant::FakeDataGenerator(
Correlation = 0.75,
N = 25000L,
ID = 0L,
ZIP = 0L,
FactorCount = 2L,
AddDate = TRUE,
Classification = FALSE,
MultiClass = FALSE)
datatemp[, Factor1 := eval(Level)]
if(Count == 1L) {
data <- data.table::copy(datatemp)
} else {
data <- data.table::rbindlist(
list(data, data.table::copy(datatemp)))
}
Count <- Count + 1L
}
# NO GROUPING CASE: Create rolling modes for categorical features
data <- Rodeo::AutoLagRollMode(
data,
Lags = seq(1,5,1),
ModePeriods = seq(2,5,1),
Targets = c("Factor_1"),
GroupingVars = NULL,
SortDateName = "DateTime",
WindowingLag = 1,
Type = "Lag",
SimpleImpute = TRUE)
# GROUPING CASE: Create fake Panel Data----
Count <- 1L
for(Level in LETTERS) {
datatemp <- AutoQuant::FakeDataGenerator(
Correlation = 0.75,
N = 25000L,
ID = 0L,
ZIP = 0L,
FactorCount = 2L,
AddDate = TRUE,
Classification = FALSE,
MultiClass = FALSE)
datatemp[, Factor1 := eval(Level)]
if(Count == 1L) {
data <- data.table::copy(datatemp)
} else {
data <- data.table::rbindlist(
list(data, data.table::copy(datatemp)))
}
Count <- Count + 1L
}
# GROUPING CASE: Create rolling modes for categorical features
data <- Rodeo::AutoLagRollMode(
data,
Lags = seq(1,5,1),
ModePeriods = seq(2,5,1),
Targets = c("Factor_1"),
GroupingVars = "Factor_2",
SortDateName = "DateTime",
WindowingLag = 1,
Type = "Lag",
SimpleImpute = TRUE)
</p>
</details>
<details><summary>Function Description</summary>
<p>
<code>AutoLagRollMode()</code> generates lags and rolling modes for categorical variables.
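
A rolling mode is the most frequent category within a trailing window. The sketch below shows the idea with data.table and base R only; it is not Rodeo's implementation, and the helpers `stat_mode()` and `roll_mode()`, the table `dt`, and its column names are hypothetical.

```r
library(data.table)

# Most frequent value in a character vector, ignoring NA;
# ties go to the alphabetically first value
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0L) return(NA_character_)
  counts <- table(x)
  names(counts)[which.max(counts)]
}

# Rolling mode over the trailing n values of x (NA until a full window exists)
roll_mode <- function(x, n) {
  vapply(seq_along(x), function(i) {
    if (i < n) NA_character_ else stat_mode(x[(i - n + 1L):i])
  }, character(1L))
}

# Hypothetical toy data: a categorical column observed daily per store
dt <- data.table(
  store = rep(c("S1", "S2"), each = 5L),
  day   = rep(1:5, times = 2L),
  promo = c("none", "bogo", "bogo", "none", "none",
            "pct",  "pct",  "none", "pct",  "none"))
setorder(dt, store, day)

# Lag the category by one period, then take the mode of the last 3 lagged values
dt[, promo_lag1  := shift(promo, n = 1L, type = "lag"), by = store]
dt[, promo_mode3 := roll_mode(promo_lag1, n = 3L), by = store]
```

Dropping the `by = store` argument gives the ungrouped case, where the window runs over the whole table in date order.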
</p>
</details>
AutoDiffLagN()
<details><summary>Code Example</summary>
<p>
##############################
# Current minus lag1
##############################
# Create fake data
data <- AutoQuant::FakeDataGenerator(
Correlation = 0.70,
N = 50000,
ID = 2L,
FactorCount = 3L,
AddDate = TRUE,
ZIP = 0L,
TimeSeries = FALSE,
ChainLadderData = FALSE,
Classification = FALSE,
MultiClass = FALSE)
# Store Cols to diff
Cols <- names(data)[which(unlist(data[, lapply(.SD, is.numeric)]))]
# Clean data before running AutoDiffLagN
data <- Rodeo::ModelDataPrep(
data = data,
Impute = FALSE,
CharToFactor = FALSE,
FactorToChar = TRUE)
# Run function
data <- Rodeo::AutoDiffLagN(
data,
DateVariable = "DateTime",
GroupVariables = c("Factor_2"),
DiffVariables = Cols,
DiffDateVariables = "DateTime",
DiffGroupVariables = "Factor_1",
NLag1 = 0,
NLag2 = 1,
Sort = TRUE,
RemoveNA = TRUE)
##############################
# lag1 minus lag3
##############################
# Create fake data
data <- Au
