Chun Su
April 30, 2020
As a DS beginner, I first came across ML in R by studying the book Hands-On Machine Learning with R. The book mainly focuses on the package caret, with general introductions to packages like recipes and h2o. Most examples use a workflow in which feature engineering is performed by recipes and the modeling/learning part is done using caret.
It was a great pleasure to take the tidymodels workshop hosted by Dr. Alison Hill last week. tidymodels was recently launched as a collection of packages for ML using tidyverse principles. It is built on recipes for feature engineering and parsnip as the major modeling package, and links the ML steps together with workflow.
In this post, I am going to present the general ML frameworks using caret and tidymodels independently. The data used as an example is the “Watson churn data” from modeldata. Since tidymodels is a collection of packages like tidyverse, we can just use library(tidymodels) to load all the required libraries for the tidymodels pipeline.
For the caret pipeline, additional helper packages, like recipes and rsample, are needed to process the data. Most of those packages are already included in the tidymodels collection.
For the data, I used the “Watson churn data” from modeldata, which is also part of tidymodels.
library(modeldata)
data(wa_churn)
# quick view and summarize data
glimpse(wa_churn)
# visdat::vis_miss(wa_churn)
Hmisc::describe(wa_churn)
# relevel factors
wa_churn = wa_churn %>%
mutate(churn=relevel(churn, ref="No")) %>%
mutate(multiple_lines=relevel(multiple_lines, ref="No phone service")) %>%
mutate(internet_service=relevel(internet_service, ref="No")) %>%
mutate(online_security=relevel(online_security, ref="No internet service")) %>%
mutate(online_backup=relevel(online_backup, ref="No internet service")) %>%
mutate(device_protection=relevel(device_protection, ref="No internet service")) %>%
mutate(tech_support=relevel(tech_support, ref="No internet service")) %>%
mutate(streaming_tv=relevel(streaming_tv, ref="No internet service")) %>%
mutate(streaming_movies=relevel(streaming_movies, ref="No internet service")) %>%
mutate(contract=relevel(contract, ref="Month-to-month"))
# to simplify the case here, we are going to remove rows with missing values
wa_churn = wa_churn %>%
na.omit()
Both frameworks use rsample::initial_split to split the data into training and testing sets. Here, we choose the standard 7:3 split between training and testing, with stratification on the target variable “churn”. To stratify on a numeric variable, we can add the breaks parameter.
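A minimal sketch of this splitting step (the object names data_split, data_train, and data_test are my own, chosen to match the names used later in the post):
set.seed(123)
data_split <- initial_split(wa_churn, prop = 0.7, strata = churn)
data_train <- training(data_split)
data_test <- testing(data_split)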
General feature engineering steps include:
- step_zv, step_nzv
- step_other
- step_scale, step_center, step_normalize
- step_novel + step_dummy, step_integer
- step_log, step_YeoJohnson, step_BoxCox
- step_pca
- step_medianimpute, step_knnimpute, step_bagimpute
Feature engineering is done by recipes in both tidymodels and caret. The functions in recipes start with step_* and create a blueprint for feature engineering. The complete list of steps is at https://tidymodels.github.io/recipes/reference/index.html. The blueprint will not change the data until it is fit in the modeling step. We can use prep and bake to see the “transformed” data as a data frame.
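As a minimal sketch (the object name rec matches the recipe used in the workflows below, but the exact steps are my own assumption):
rec <- recipe(churn ~ ., data = data_train) %>%
  step_novel(all_nominal(), -all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_zv(all_predictors())
# prep() estimates the required statistics from the training data,
# bake() applies the blueprint and returns the transformed data frame
rec %>% prep() %>% bake(new_data = data_train) %>% glimpse()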
One reason to use recipes is to avoid data leakage. Data leakage happens when information from outside the training data set is used to create the model, for example when predictors are centered and scaled with statistics computed on the full data set rather than on the training set alone.
Resampling methods split the training data into further sets, generating an analysis (train) set and an assessment (validation) set for each resample. Typical resampling methods include cross-validation (CV), repeated CV, leave-one-out, and bootstrapping (sampling with replacement). We can use rsample::vfold_cv for both the caret and tidymodels pipelines.
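For example, 10-fold CV stratified on churn (cv_folds is the object name used in the fits below; the choice of v = 10 is my own):
set.seed(123)
cv_folds <- vfold_cv(data_train, v = 10, strata = churn)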
However, to make the cv_folds object above compatible with caret, we need to use rsample2caret to convert it into the index lists expected by trainControl. Alternatively, we can simply use the caret function trainControl to generate the splits, although no stratification option is available there.
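A minimal sketch of both options (the ctrl object is reused in the caret fits below; the classProbs/summaryFunction settings are my own assumptions for a two-class ROC metric):
# convert the rsample folds into the index/indexOut lists that caret expects
cv_index <- rsample2caret(cv_folds)
ctrl <- caret::trainControl(
  method = "cv",
  index = cv_index$index,
  indexOut = cv_index$indexOut,
  classProbs = TRUE,
  summaryFunction = caret::twoClassSummary,
  savePredictions = "final"
)
# or let caret create the folds itself (no stratification option here)
ctrl2 <- caret::trainControl(method = "cv", number = 10, classProbs = TRUE,
  summaryFunction = caret::twoClassSummary, savePredictions = "final")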
A hyperparameter is a parameter whose value is set before the learning process begins. It is distinguished from other parameters by the fact that it is not learned from the training data during fitting. Different models have different numbers of hyperparameters you can tune. Here I choose random forest to model the data. The hyperparameters for random forest from ranger include:
- num.trees (trees in parsnip)
- max.depth
- mtry
- min.node.size (min_n in parsnip)
- replace
- sample.fraction (specifying sample.fraction requires replace to be set to TRUE)
A rule of thumb to start:
- num.trees: start with 10x p (p is the number of features)
- max.depth
- mtry: sqrt(p) for classification and p/3 for regression
- min.node.size: defaults of 1 for classification and 5 for regression
- replace and sample.fraction: the default is 1 for sampling with replacement and 0.632 for sampling without replacement
The hyperparameters of the parsnip model can be checked with args(rand_forest).
The rf method (from the randomForest package) in caret has only one tunable hyperparameter (mtry) by default.
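The grid used with tune_grid below, hyp_grid, is a data frame with one column per tuned parameter; a minimal sketch (the specific candidate values are my own):
hyp_grid <- expand.grid(
  mtry = c(2, 5, 10),
  trees = c(500, 1000),
  min_n = c(1, 5, 10)
)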
Here is the step where tidymodels and caret start to diverge in syntax. Typically, tidymodels builds a model using a workflow pipe, which specifies the formula/recipe and the model, while caret uses train to fit the model.
tidymodels
default version of model fit: fit_resamples
# without a tuning grid -> use fit_resamples() at training
rf_tm <- rand_forest() %>%
set_engine("ranger", importance="permutation") %>%
set_mode("classification")
rf_tm_wf <- workflow() %>%
add_model(rf_tm) %>%
add_recipe(rec)
set.seed(123)
default_tm_fit= rf_tm_wf %>%
fit_resamples(
resamples = cv_folds,
control = control_resamples(save_pred = TRUE)
)
grid version of model fit: tune_grid
# with a tuning grid -> set tune() in the model spec, use tune_grid() at training
rf_tm <- rand_forest(
mtry=tune(),
trees=tune(),
min_n=tune()
) %>%
set_engine("ranger", importance="impurity") %>%
set_mode("classification")
rf_tm_wf <- workflow() %>%
add_model(rf_tm) %>%
add_recipe(rec)
set.seed(123)
grid_tm_fit = rf_tm_wf %>%
tune_grid(resamples = cv_folds,
grid = hyp_grid,
control = control_grid(save_pred = TRUE)
)
Notes: 1. The control specification becomes control_grid() when using tune_grid(). 2. The grid argument can also be an integer, in which case that many candidate parameter combinations are tested.
Follow the thread https://github.com/tidymodels/parsnip/issues/235 to see how to print out the default hyperparameters.
caret
default version of model fit
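A minimal sketch of the default caret fit, assuming the rec and ctrl objects sketched above (the method and metric choices are my own):
set.seed(123)
default_cr_fit <- caret::train(
  rec,                  # the recipe provides the feature engineering blueprint
  data = data_train,
  method = "rf",
  metric = "ROC",
  trControl = ctrl
)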
grid version of model fit
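And a sketch of the grid version, passing a tuneGrid; since the rf method only tunes mtry, the grid has a single column (the candidate values are my own):
set.seed(123)
grid_cr_fit <- caret::train(
  rec,
  data = data_train,
  method = "rf",
  metric = "ROC",
  trControl = ctrl,
  tuneGrid = data.frame(mtry = c(2, 5, 8, 11))
)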
Metrics are used to determine how well the model fits. For classification problems, accuracy and ROC/AUC are commonly used. For regression problems, RMSE is the most commonly used metric.
We use collect_metrics in tidymodels, while the results element of the fitted train object stores the metrics for caret. From the results, we can tell that train tried three hyperparameter values by default.
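For example, with the fit objects above:
# tidymodels: metrics averaged over the resamples
default_tm_fit %>% collect_metrics()
grid_tm_fit %>% collect_metrics()
# caret: resampled metrics are stored in the results element
default_cr_fit$results
grid_cr_fit$results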
Besides model metrics, we also care about the predicted values of the target variable in the training data.
tidymodels
To see the predicted target values for data_train, we can use collect_predictions.
Notes: collect_predictions() only works when save_pred = TRUE is specified in the control.
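For example:
# held-out predictions from the resamples, one row per assessment-set observation
default_tm_fit %>% collect_predictions() %>% head()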
caret
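For caret, a rough equivalent is the pred element of the train object, populated because savePredictions was set in trainControl above:
head(default_cr_fit$pred)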
For both caret and tidymodels, each row of the original data may be represented multiple times per tuning parameter if bootstrapping or repeated CV is used.
For the default fit, only one set of hyperparameters is specified, so we can just apply the fitted model to data_test. However, for the grid fit, we end up with multiple sets of hyperparameters. Thus, before fitting the final model, we need to pick the best set of hyperparameters based on the metrics on the training data (which are summarized using the specified rsample method), and then apply the best model to data_test.
tidymodels
last_fit is a function that takes a workflow, fits it on the training data, and evaluates it on the test data. By default, it generates predictions that can be retrieved with collect_predictions (no need to specify control in the fit). We can also use collect_metrics to check the metrics on the testing data.
To select the best set of hyperparameters from the tune_grid result, we use select_best, specifying which metric to use. Then we apply this set of hyperparameters to the original workflow with finalize_workflow. Finally, as in the default case, we apply last_fit to the finalized workflow and get predictions and metrics for the testing data.
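A minimal sketch of both cases, assuming the data_split object from the splitting step above:
# default: fit on the training data, evaluate on the test data
# (rf_tm_wf here refers to the workflow with the non-tuned model spec from the default fit)
default_tm_last <- rf_tm_wf %>% last_fit(split = data_split)
default_tm_last %>% collect_metrics()
default_tm_last %>% collect_predictions() %>% head()
# grid: pick the best hyperparameters, finalize the workflow, then last_fit
best_hyp <- grid_tm_fit %>% select_best(metric = "roc_auc")
final_wf <- rf_tm_wf %>% finalize_workflow(best_hyp)
final_tm_last <- final_wf %>% last_fit(split = data_split)
final_tm_last %>% collect_metrics()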
caret
The predict function can be applied directly to the fitted model and the test data. For the grid fit, it automatically picks the best hyperparameters (here mtry = 5) and applies them to the testing data.
# default
test_prediction <- predict(
default_cr_fit,
newdata = data_test,
type = "prob") %>%
as_tibble() %>%
transmute(estimate=Yes) %>% # for a binary outcome we can pick either class probability; the roc_auc is the same
add_column(churn = data_test$churn)
## auc
roc_auc(test_prediction, churn, estimate)$.estimate
## accuracy
test_prediction %>%
mutate(.pred=ifelse(estimate > 0.5, "Yes","No")) %>%
summarise(accuracy=mean(.pred==churn)) %>%
pull(accuracy)
# grid
# best hyp
grid_cr_fit$results %>%
slice(which.max(ROC))
test_prediction2 <- predict(
grid_cr_fit,
newdata = data_test,
type = "prob") %>%
as_tibble() %>%
transmute(estimate=Yes) %>% # for a binary outcome we can pick either class probability; the roc_auc is the same
add_column(churn = data_test$churn)
## auc
roc_auc(test_prediction2, churn, estimate)$.estimate
## accuracy
test_prediction2 %>%
mutate(.pred=ifelse(estimate > 0.5, "Yes","No")) %>%
summarise(accuracy=mean(.pred==churn)) %>%
pull(accuracy)
Lastly, we can use the fit result to find the most important variables with the vip package or the caret function varImp. Be aware that, for tidymodels, different importance options specified in the engine will result in different rankings.
tidymodels
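A minimal sketch, refitting the finalized workflow on the training data and pulling out the underlying ranger fit (final_wf and data_train come from the sketches above):
final_rf_fit <- final_wf %>% fit(data = data_train)
ranger_obj <- pull_workflow_fit(final_rf_fit)$fit  # the underlying ranger model
vip::vip(ranger_obj, num_features = 10)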
caret
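For caret, varImp works directly on the train object:
caret::varImp(grid_cr_fit)
# or plot the top variables
plot(caret::varImp(grid_cr_fit), top = 10)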
The following table summarizes the ML workflow using caret versus tidymodels:
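step | caret | tidymodels
--- | --- | ---
train/test split | rsample::initial_split | rsample::initial_split
feature engineering | recipes (step_*) | recipes (step_*)
resampling | trainControl / rsample2caret | rsample::vfold_cv
model specification | method argument of train | rand_forest + set_engine + set_mode
model fitting / tuning | train (with tuneGrid) | workflow + fit_resamples / tune_grid
metrics | results element of the train object | collect_metrics
prediction on test data | predict | last_fit + collect_predictions
variable importance | varImp | vip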
Special thanks to Amy Goodwin Davies, who helped with editing and proofreading this post!