This post covers the workflows
package, which focuses on bundling together many of the steps previously covered in this series.
Setup
Packages
The following packages are required:
Data
I utilized the penguins data set from the modeldata package for these examples.
Note that the outcome variable is forced to be binary by assigning all the odd-numbered rows as “Chinstrap” and the even numbered rows as “Adelie”. This is done to (1) ensure that the outcome is binary so certain functions and techniques can be used and (2) make the outcome a little more difficult to predict to better demonstrate the difference in the construction of the two modeling algorithms used.
Additionally, referencing a previous post on the rsample
package, a 70/30 train/test initial_split()
on both data sets is taken. It is then passed to the training()
and testing()
functions to extract the training and testing data sets, respectively.
Code
data("penguins", package = "modeldata")
penguins_tbl <- penguins %>%
mutate(
#> Need a binary outcome
species = ifelse(row_number() %% 2 == 0, "Adelie", "Chinstrap"),
species = factor(species, levels = c("Adelie", "Chinstrap"))
)
set.seed(1914)
penguins_split_obj <- initial_split(penguins_tbl, prop = 0.7)
penguins_train_tbl <- training(penguins_split_obj)
penguins_train_tbl
# A tibble: 240 × 7
species island bill_length_mm bill_depth_mm flipper_length_mm body_…¹ sex
<fct> <fct> <dbl> <dbl> <int> <int> <fct>
1 Chinstrap Biscoe 36.5 16.6 181 2850 fema…
2 Adelie Biscoe 52.5 15.6 221 5450 male
3 Chinstrap Dream 49.7 18.6 195 3600 male
4 Chinstrap Biscoe 40.6 18.6 183 3550 male
5 Chinstrap Biscoe 44.9 13.8 212 4750 fema…
6 Chinstrap Biscoe 39.6 17.7 186 3500 fema…
7 Chinstrap Biscoe 45.8 14.2 219 4700 fema…
8 Chinstrap Dream 45.9 17.1 190 3575 fema…
9 Chinstrap Biscoe 46.1 13.2 211 4500 fema…
10 Chinstrap Biscoe 37.7 16 183 3075 fema…
# … with 230 more rows, and abbreviated variable name ¹body_mass_g
penguins_test_tbl <- testing(penguins_split_obj)
penguins_test_tbl
# A tibble: 104 × 7
species island bill_length_mm bill_depth_mm flipper_leng…¹ body_…² sex
<fct> <fct> <dbl> <dbl> <int> <int> <fct>
1 Chinstrap Torgersen 40.3 18 195 3250 fema…
2 Chinstrap Torgersen 38.9 17.8 181 3625 fema…
3 Adelie Torgersen 37.8 17.3 180 3700 <NA>
4 Chinstrap Torgersen 34.6 21.1 198 4400 male
5 Adelie Torgersen 36.6 17.8 185 3700 fema…
6 Chinstrap Torgersen 38.7 19 195 3450 fema…
7 Chinstrap Torgersen 34.4 18.4 184 3325 fema…
8 Adelie Biscoe 37.7 18.7 180 3600 male
9 Adelie Biscoe 40.5 18.9 180 3950 male
10 Chinstrap Dream 39.5 17.8 188 3300 fema…
# … with 94 more rows, and abbreviated variable names ¹flipper_length_mm,
# ²body_mass_g
Workflow
The workflows
package bundles many of the model development steps in the tidymodels
ecosystem together into a single object called a workflow
. Start by creating an empty workflow
using the workflow()
function.
wflw <- workflow()
wflw
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: None
Model: None
A workflow
can be used to bundle several steps together making it easier to keep track of intermediate steps and the final model object, as well as providing a quick and convenient way to experiment with different model algorithms, pre-processing steps, etc. The workflows
package is still under development with the following model development steps (called “stages”) currently being supported (additional steps may be developed in the future):
- Pre-Processing
- Model Fitting
Pre-Processing
The first stage in a workflow
is often to pre-process the training data. This can be achieved in several ways:
- Adding a formula
- Adding variables
- Adding a pre-processing
recipe()
Formula
If no pre-processing is desired, the add_formula()
function can be used to add the desired formula()
for the model. The result is a workflow
with the formula listed under “Preprocessor”.
wflw %>%
add_formula(species ~ .)
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Formula
Model: None
── Preprocessor ────────────────────────────────────────────────────────────────
species ~ .
In addition to adding a formula, an existing formula can be removed from a workflow
using the remove_formula()
function or updated to a different formula using the update_formula()
function.
Variables
An alternative method is to specify the outcome and predictor variables using the add_variables()
function. Helpers from the tidyselect
package can be used, if desired. Note that in using the everything()
function below for predictors
, the variable listed in outcomes
is removed from the data set first. This results in the predictors
argument being every other variable excluding the variable in the outcomes
argument. Note that including variables in a workflow
only specifies the names of the variables. No data for those variables is added yet.
wflw %>%
add_variables(outcomes = species, predictors = everything())
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Variables
Model: None
── Preprocessor ────────────────────────────────────────────────────────────────
Outcomes: species
Predictors: everything()
In addition to adding variables, existing variables can be removed from a workflow
using the remove_variables()
function or updated to different variables using the update_variables()
function. Again, this just removes or updates the names of the variables, it does not add any data on the variables.
Recipe
If pre-processing steps are desired before modeling, they can be added using a recipe()
from the recipes
package. More information can be found in a previous post
Two simple recipe
s are included below. The first one removes missing values from all predictors and normalizes all numeric variables that start with “bill”. The second one removes missing values from all predictors, normalizes all numeric predictors, and one-hot-encodes all nominal predictors.
recipe1_obj <- recipe(species ~ ., data = penguins_train_tbl) %>%
step_naomit(all_predictors(), skip = FALSE) %>%
step_normalize(starts_with("bill"))
recipe1_obj
Recipe
Inputs:
role #variables
outcome 1
predictor 6
Operations:
Removing rows with NA values in all_predictors()
Centering and scaling for starts_with("bill")
recipe2_obj <- recipe(species ~ ., data = penguins_train_tbl) %>%
step_naomit(all_predictors(), skip = FALSE) %>%
step_normalize(all_numeric_predictors()) %>%
step_dummy(all_nominal_predictors(), one_hot = TRUE)
recipe2_obj
Recipe
Inputs:
role #variables
outcome 1
predictor 6
Operations:
Removing rows with NA values in all_predictors()
Centering and scaling for all_numeric_predictors()
Dummy variables from all_nominal_predictors()
A recipe
can be added to a workflow
using the add_recipe()
function. The result is a workflow
with the recipe
listed under “Preprocessor”.
recipe1_wflw <- wflw %>%
add_recipe(recipe1_obj)
recipe1_wflw
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: None
── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps
• step_naomit()
• step_normalize()
In addition to adding a recipe
, an existing recipe
can be removed from a workflow
using the remove_recipe()
function or updated to different recipe
using the update_recipe()
function.
#> Method #1
recipe2_wflw <- recipe1_wflw %>%
remove_recipe() %>%
add_recipe(recipe2_obj)
#> Method #2
recipe2_wflw <- recipe1_wflw %>%
update_recipe(recipe2_obj)
recipe2_wflw
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: None
── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps
• step_naomit()
• step_normalize()
• step_dummy()
Models
After adding a formula
or a recipe
to a workflow
, the next step is to add a model. Specifically, it should be a model specification (that is, a specification not fit on the actual data) created with the parsnip
package. More information can be found in a previous post.
Below, two model specifications are created: a logistic regression, using logistic_reg()
, and a random forest, using rand_forest()
. Note that the hyperparameters of the random forest are set to tune()
. This will be explored more later.
mod_logit_spec <- logistic_reg() %>%
set_engine("glm")
mod_logit_spec
Logistic Regression Model Specification (classification)
Computational engine: glm
mod_rf_spec <- rand_forest() %>%
set_mode("classification") %>%
set_engine("ranger") %>%
set_args(
mtry = tune(),
trees = tune(),
min_n = tune()
)
mod_rf_spec
Random Forest Model Specification (classification)
Main Arguments:
mtry = tune()
trees = tune()
min_n = tune()
Computational engine: ranger
A model specification can be added to a workflow
using the add_model()
function. The result is a workflow
with the recipe
, formula
, or variables listed under “Preprocessor” and the model specification under “Model”.
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps
• step_naomit()
• step_normalize()
── Model ───────────────────────────────────────────────────────────────────────
Logistic Regression Model Specification (classification)
Computational engine: glm
Once a model specification has been added to a workflow
, it can be passed to the parsnip::fit()
function with the training data set. A finalized model object will not be listed under “Model”.
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps
• step_naomit()
• step_normalize()
── Model ───────────────────────────────────────────────────────────────────────
Call: stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
Coefficients:
(Intercept) islandDream islandTorgersen bill_length_mm
7.136341 -0.595247 -0.868933 -0.167308
bill_depth_mm flipper_length_mm body_mass_g sexmale
-0.988114 0.013772 -0.002021 -2.001235
Degrees of Freedom: 230 Total (i.e. Null); 223 Residual
Null Deviance: 320
Residual Deviance: 169.4 AIC: 185.4
In addition to adding a model specification, an existing model specification can be removed from a workflow
using the remove_model()
function or updated to different model specification using the update_model()
function.
#> Method #1
rf_spec_wflw <- logit_fit_wflw %>%
remove_model() %>%
add_model(mod_rf_spec)
#> Method #2
rf_spec_wflw <- logit_fit_wflw %>%
update_model(mod_rf_spec)
rf_spec_wflw
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()
── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps
• step_naomit()
• step_normalize()
── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (classification)
Main Arguments:
mtry = tune()
trees = tune()
min_n = tune()
Computational engine: ranger
Note that a model specification with hyperparameters marked with tune()
was added to the workflow
. Attempting to run parsnip::fit()
on this model results in an error as the hyperparameters do not have actual values.
Error in ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~tune(), : Error: Invalid value for num.trees.
Tune Hyperparameters
At the time of writing, the workflows
package does not include a step for tuning hyperparameters. Therefore, hyperparameters must be tuned using the tune
and dials
packages. Details of using these packages can be found in a previous post. A brief summary is provided below.
For a random forest, the three hyperparameters are mtry()
, trees()
, and min_n()
. They are passed to the parameters()
function (after mtry()
is passed to finalize()
with the training data) to create a parameter collection.
params <- parameters(
finalize(mtry(), penguins_train_tbl[ ,-1]),
trees(),
min_n()
)
params
Collection of 3 parameters for tuning
identifier type object
mtry mtry nparam[+]
trees trees nparam[+]
min_n min_n nparam[+]
A grid of the hyperparameters is created using the grid_regular()
function.
grid_regular_tbl <- grid_regular(params)
grid_regular_tbl
# A tibble: 27 × 3
mtry trees min_n
<int> <int> <int>
1 1 1 2
2 3 1 2
3 6 1 2
4 1 1000 2
5 3 1000 2
6 6 1000 2
7 1 2000 2
8 3 2000 2
9 6 2000 2
10 1 1 21
# … with 17 more rows
In order to get the folds needed for tuning the hyperparameters, the processed training data needs to be extracted by the recipe
. Therefore, the recipe
needs to be removed from the workflow
and replaced with the variables.
rf_spec_vars_wflw <- rf_spec_wflw %>%
remove_recipe() %>%
add_variables(outcomes = species, predictors = everything())
rf_spec_vars_wflw
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Variables
Model: rand_forest()
── Preprocessor ────────────────────────────────────────────────────────────────
Outcomes: species
Predictors: everything()
── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (classification)
Main Arguments:
mtry = tune()
trees = tune()
min_n = tune()
Computational engine: ranger
From there, the recipe
is passed to the juice()
function to get the processed training data.
# A tibble: 231 × 10
bill_length…¹ bill_…² flipp…³ body_…⁴ species islan…⁵ islan…⁶ islan…⁷ sex_f…⁸
<dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
1 -1.35 -0.225 -1.41 -1.67 Chinst… 1 0 0 1
2 1.54 -0.726 1.36 1.52 Adelie 1 0 0 0
3 1.03 0.779 -0.440 -0.753 Chinst… 0 1 0 0
4 -0.612 0.779 -1.27 -0.814 Chinst… 1 0 0 0
5 0.166 -1.63 0.740 0.658 Chinst… 1 0 0 1
6 -0.793 0.327 -1.06 -0.876 Chinst… 1 0 0 1
7 0.328 -1.43 1.23 0.597 Chinst… 1 0 0 1
8 0.347 0.0263 -0.787 -0.784 Chinst… 0 1 0 1
9 0.383 -1.93 0.670 0.351 Chinst… 1 0 0 1
10 -1.14 -0.526 -1.27 -1.40 Chinst… 1 0 0 1
# … with 221 more rows, 1 more variable: sex_male <dbl>, and abbreviated
# variable names ¹bill_length_mm, ²bill_depth_mm, ³flipper_length_mm,
# ⁴body_mass_g, ⁵island_Biscoe, ⁶island_Dream, ⁷island_Torgersen, ⁸sex_female
Finally, the processed training data is passed to vfold_cv()
with v = 5
to create five folds.
# 5-fold cross-validation
# A tibble: 5 × 2
splits id
<list> <chr>
1 <split [184/47]> Fold1
2 <split [185/46]> Fold2
3 <split [185/46]> Fold3
4 <split [185/46]> Fold4
5 <split [185/46]> Fold5
The tune_grid()
function is then used to tune the hyperparameters for the random forest. The metric_set()
function is used to specify roc_auc()
as the optimization metric. While the tune
package cannot currently be used as an official step in a workflow
, a workflow
can be passed to functions from the the package.
mod_rf_tuned_regular_tbl <- tune_grid(
rf_spec_vars_wflw,
resamples = folds_tbl,
grid = grid_regular_tbl,
metrics = metric_set(roc_auc)
)
mod_rf_tuned_regular_tbl
# Tuning results
# 5-fold cross-validation
# A tibble: 5 × 4
splits id .metrics .notes
<list> <chr> <list> <list>
1 <split [184/47]> Fold1 <tibble [27 × 7]> <tibble [0 × 3]>
2 <split [185/46]> Fold2 <tibble [27 × 7]> <tibble [0 × 3]>
3 <split [185/46]> Fold3 <tibble [27 × 7]> <tibble [0 × 3]>
4 <split [185/46]> Fold4 <tibble [27 × 7]> <tibble [0 × 3]>
5 <split [185/46]> Fold5 <tibble [27 × 7]> <tibble [0 × 3]>
The “optimal” parameter combinations are shown below.
show_best(mod_rf_tuned_regular_tbl)
# A tibble: 5 × 9
mtry trees min_n .metric .estimator mean n std_err .config
<int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
1 3 1000 21 roc_auc binary 0.887 5 0.0245 Preprocessor1_Model14
2 6 2000 40 roc_auc binary 0.886 5 0.0249 Preprocessor1_Model27
3 3 2000 40 roc_auc binary 0.886 5 0.0231 Preprocessor1_Model26
4 3 2000 21 roc_auc binary 0.886 5 0.0252 Preprocessor1_Model17
5 1 2000 21 roc_auc binary 0.886 5 0.0234 Preprocessor1_Model16
The top parameter combination is extracted and then the tune()
placeholders are replaced with them using the finalize_workflow()
function.
mod_rf_best_params_tbl <- select_best(mod_rf_tuned_regular_tbl)
rf_spec_tuned_wflw <- rf_spec_wflw %>%
finalize_workflow(mod_rf_best_params_tbl)
rf_spec_tuned_wflw
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()
── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps
• step_naomit()
• step_normalize()
── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (classification)
Main Arguments:
mtry = 3
trees = 1000
min_n = 21
Computational engine: ranger
With actual hyperparameter values now in place, the workflow
is then passed to the parsnip::fit()
function with the training data set.
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()
── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps
• step_naomit()
• step_normalize()
── Model ───────────────────────────────────────────────────────────────────────
Ranger result
Call:
ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~3L, x), num.trees = ~1000L, min.node.size = min_rows(~21L, x), num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
Type: Probability estimation
Number of trees: 1000
Sample size: 231
Number of independent variables: 6
Mtry: 3
Target node size: 21
Variable importance mode: none
Splitrule: gini
OOB prediction error (Brier s.): 0.1294809
While this is not the most convenient of solutions, it still works. Perhaps in the future, there will be a tune_*
function added to the workflows
package that allows for seamless hyperparameter tuning integration with a workflow
.
Comparisons
With two models now being constructed, comparisons between them can be made. Note that, for simplicity, all missing values are removed from the test data.
penguins_test_no_missing_tbl <- tidyr::drop_na(penguins_test_tbl)
Specifics on performance metrics and how to calculate them using the yardstick package can be found in a previous post.
Logistic Regression
Below are the out-of-sample confusion matrix, performance metrics, and Area Under the Curve (AUC) for the Logistic Regression model. Focusing specifically on the AUC value of ~0.89, it would suggest that the Logistic Regression does a good job at modeling this data set.
logit_fit_wflw %>%
predict(penguins_test_tbl) %>%
bind_cols(penguins_test_no_missing_tbl) %>%
conf_mat(truth = species, estimate = .pred_class) %>%
summary()
# A tibble: 13 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.843
2 kap binary 0.684
3 sens binary 0.855
4 spec binary 0.830
5 ppv binary 0.855
6 npv binary 0.830
7 mcc binary 0.684
8 j_index binary 0.684
9 bal_accuracy binary 0.842
10 detection_prevalence binary 0.539
11 precision binary 0.855
12 recall binary 0.855
13 f_meas binary 0.855
Plotting the ROC Curves also supports the assertion that the Logistic Regression does a good job at plotting this data set.
Random Forest
Next are the out-of-sample confusion matrix, performance metrics, and Area Under the Curve (AUC) for the Random Forest model. Focusing specifically on the AUC value of ~0.93, it would suggest that the not only does the Random Forest do a good job at modeling this data set, it does a better job than the Logistic Regression.
rf_fit_wflw %>%
predict(penguins_test_tbl) %>%
bind_cols(penguins_test_no_missing_tbl) %>%
conf_mat(truth = species, estimate = .pred_class) %>%
summary()
# A tibble: 13 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.824
2 kap binary 0.646
3 sens binary 0.818
4 spec binary 0.830
5 ppv binary 0.849
6 npv binary 0.796
7 mcc binary 0.646
8 j_index binary 0.648
9 bal_accuracy binary 0.824
10 detection_prevalence binary 0.520
11 precision binary 0.849
12 recall binary 0.818
13 f_meas binary 0.833
Comparing the two ROC Curves also supports the conclusion that the Random Forest outperforms the Logistic Regression. The AUC is clearly larger for the Random Forest than for the Logistic Regression.s
Notes
This post is based on a presentation that was given on the date listed. It may be updated from time to time to fix errors, detail new functions, and/or remove deprecated functions so the packages and R version will likely be newer than what was available at the time.
The R session information used for this post:
R version 4.2.1 (2022-06-23)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 14.1.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] yardstick_1.1.0 tune_1.0.1 dials_1.1.0 scales_1.2.1
[5] parsnip_1.0.3 recipes_1.0.4 dplyr_1.0.10 rsample_1.1.1
[9] workflows_1.1.2
loaded via a namespace (and not attached):
[1] tidyr_1.2.1 jsonlite_1.8.0 splines_4.2.1
[4] foreach_1.5.2 prodlim_2019.11.13 GPfit_1.0-8
[7] renv_0.16.0 yaml_2.3.5 globals_0.16.2
[10] ipred_0.9-13 pillar_1.8.1 lattice_0.20-45
[13] glue_1.6.2 digest_0.6.29 hardhat_1.2.0
[16] colorspace_2.0-3 htmltools_0.5.3 Matrix_1.4-1
[19] timeDate_4022.108 pkgconfig_2.0.3 lhs_1.1.6
[22] DiceDesign_1.9 listenv_0.8.0 purrr_0.3.5
[25] ranger_0.14.1 gower_1.0.1 lava_1.7.1
[28] timechange_0.1.1 tibble_3.1.8 farver_2.1.1
[31] generics_0.1.3 ggplot2_3.4.0 ellipsis_0.3.2
[34] withr_2.5.0 furrr_0.3.1 nnet_7.3-17
[37] cli_3.6.1 survival_3.3-1 magrittr_2.0.3
[40] evaluate_0.16 future_1.29.0 fansi_1.0.3
[43] parallelly_1.32.1 MASS_7.3-57 class_7.3-20
[46] tools_4.2.1 lifecycle_1.0.3 stringr_1.5.0
[49] munsell_0.5.0 compiler_4.2.1 rlang_1.1.1
[52] grid_4.2.1 iterators_1.0.14 rstudioapi_0.14
[55] labeling_0.4.2 rmarkdown_2.16 gtable_0.3.1
[58] codetools_0.2-18 R6_2.5.1 lubridate_1.9.0
[61] knitr_1.40 fastmap_1.1.0 future.apply_1.10.0
[64] utf8_1.2.2 stringi_1.7.12 parallel_4.2.1
[67] Rcpp_1.0.9 vctrs_0.6.3 rpart_4.1.16
[70] tidyselect_1.2.0 xfun_0.40