The tune and dials Packages

Data Science
R
Modeling
Author

Robert Lankford

Published

May 23, 2023

This post covers the tune and dials packages, which focus on defining and optimizing model hyperparameters.

Setup

Packages

The following packages are required:
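A minimal sketch of the setup, assuming the packages meant here are the ones shown as attached in the session information at the end of this post. The ranger package is a further assumption: it is not attached directly, but it is the engine used for the Random Forest models below, so it needs to be installed.

```r
# Assumed package list, taken from the "other attached packages"
# entry of the session information at the end of this post
library(dplyr)      # as_tibble(), bind_cols(), %>%
library(ggplot2)
library(scales)
library(rsample)    # initial_split(), training(), testing()
library(recipes)    # recipe(), step_*(), prep(), juice(), bake()
library(parsnip)    # rand_forest(), set_engine(), fit()
library(yardstick)  # metrics()
library(dials)      # trees(), min_n(), mtry(), finalize(), parameters()
library(tune)       # tune(), tune_args()
```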

Data

The examples in this post use the car_prices data set from the modeldata package.

data("car_prices", package = "modeldata")

car_prices_tbl <- as_tibble(car_prices)

car_prices_tbl
# A tibble: 804 × 18
    Price Mileage Cylin…¹ Doors Cruise Sound Leather Buick Cadil…² Chevy Pontiac
    <dbl>   <int>   <int> <int>  <int> <int>   <int> <int>   <int> <int>   <int>
 1 22661.   20105       6     4      1     0       0     1       0     0       0
 2 21725.   13457       6     2      1     1       0     0       0     1       0
 3 29143.   31655       4     2      1     1       1     0       0     0       0
 4 30732.   22479       4     2      1     0       0     0       0     0       0
 5 33359.   17590       4     2      1     1       1     0       0     0       0
 6 30315.   23635       4     2      1     0       0     0       0     0       0
 7 33382.   17381       4     2      1     1       1     0       0     0       0
 8 30251.   27558       4     2      1     0       1     0       0     0       0
 9 30167.   25049       4     2      1     0       0     0       0     0       0
10 27060.   17319       4     4      1     0       1     0       0     0       0
# … with 794 more rows, 7 more variables: Saab <int>, Saturn <int>,
#   convertible <int>, coupe <int>, hatchback <int>, sedan <int>, wagon <int>,
#   and abbreviated variable names ¹​Cylinder, ²​Cadillac

Following previous entries in this series on the rsample and recipes packages, a 70/30 train/test split of the data is taken with initial_split(), and a few pre-processing steps are defined on the training data to create a recipe() object. That object is passed to prep() and juice() to produce the processed training data set, and to prep() and bake() to apply the same pre-processing to the test data set.

set.seed(1914)
car_prices_split_obj <- initial_split(car_prices, prop = 0.7)

car_prices_recipe_obj <- recipe(Price ~ ., training(car_prices_split_obj)) %>% 
  step_mutate_at(Cylinder:wagon, fn = as.factor) %>% 
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>% 
  step_naomit(all_predictors(), skip = FALSE)

car_prices_train_tbl <- car_prices_recipe_obj %>% 
  prep() %>% 
  juice()

car_prices_train_tbl
# A tibble: 562 × 35
   Mileage  Price Cylinder_X4 Cylinder…¹ Cylin…² Doors…³ Doors…⁴ Cruis…⁵ Cruis…⁶
     <int>  <dbl>       <dbl>      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1    3828 37089.           1          0       0       1       0       0       1
 2   21778 36210.           0          1       0       0       1       0       1
 3   28408 12230.           1          0       0       1       0       1       0
 4    4463 17418.           1          0       0       0       1       0       1
 5   21020 13991.           1          0       0       1       0       1       0
 6   25218 23329.           1          0       0       0       1       0       1
 7   32914  8871.           1          0       0       0       1       0       1
 8   18419 20127.           0          1       0       0       1       0       1
 9   21128 14305.           1          0       0       0       1       1       0
10    1169 15636.           1          0       0       1       0       1       0
# … with 552 more rows, 26 more variables: Sound_X0 <dbl>, Sound_X1 <dbl>,
#   Leather_X0 <dbl>, Leather_X1 <dbl>, Buick_X0 <dbl>, Buick_X1 <dbl>,
#   Cadillac_X0 <dbl>, Cadillac_X1 <dbl>, Chevy_X0 <dbl>, Chevy_X1 <dbl>,
#   Pontiac_X0 <dbl>, Pontiac_X1 <dbl>, Saab_X0 <dbl>, Saab_X1 <dbl>,
#   Saturn_X0 <dbl>, Saturn_X1 <dbl>, convertible_X0 <dbl>,
#   convertible_X1 <dbl>, coupe_X0 <dbl>, coupe_X1 <dbl>, hatchback_X0 <dbl>,
#   hatchback_X1 <dbl>, sedan_X0 <dbl>, sedan_X1 <dbl>, wagon_X0 <dbl>, …

car_prices_test_tbl <- car_prices_recipe_obj %>% 
  prep() %>% 
  bake(new_data = testing(car_prices_split_obj))

car_prices_test_tbl
# A tibble: 242 × 35
   Mileage  Price Cylinder_X4 Cylinder…¹ Cylin…² Doors…³ Doors…⁴ Cruis…⁵ Cruis…⁶
     <int>  <dbl>       <dbl>      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1   27558 30251.           1          0       0       1       0       0       1
 2   22814 24852.           1          0       0       0       1       0       1
 3   10014 27826.           1          0       0       0       1       0       1
 4   18464 29987.           1          0       0       0       1       0       1
 5   19830 29908.           1          0       0       0       1       0       1
 6   25357 26792.           1          0       0       0       1       0       1
 7   12090 38325.           1          0       0       1       0       0       1
 8   21167 35580.           1          0       0       1       0       0       1
 9   14568 30122.           1          0       0       0       1       0       1
10   11273 30354.           1          0       0       0       1       0       1
# … with 232 more rows, 26 more variables: Sound_X0 <dbl>, Sound_X1 <dbl>,
#   Leather_X0 <dbl>, Leather_X1 <dbl>, Buick_X0 <dbl>, Buick_X1 <dbl>,
#   Cadillac_X0 <dbl>, Cadillac_X1 <dbl>, Chevy_X0 <dbl>, Chevy_X1 <dbl>,
#   Pontiac_X0 <dbl>, Pontiac_X1 <dbl>, Saab_X0 <dbl>, Saab_X1 <dbl>,
#   Saturn_X0 <dbl>, Saturn_X1 <dbl>, convertible_X0 <dbl>,
#   convertible_X1 <dbl>, coupe_X0 <dbl>, coupe_X1 <dbl>, hatchback_X0 <dbl>,
#   hatchback_X1 <dbl>, sedan_X0 <dbl>, sedan_X1 <dbl>, wagon_X0 <dbl>, …

Models

Again referencing previous entries in this series, this time on the parsnip and yardstick packages, a basic Random Forest model is built on the training data using the rand_forest() function and “random” values for its hyperparameters. We have no guarantee that these values are any good, but we will explore that more later.

set.seed(1915)

mod_rf_fit <- rand_forest() %>% 
  set_mode("regression") %>% 
  set_engine("ranger") %>% 
  set_args(mtry = 2, trees = 10, min_n = 5) %>% 
  fit(Price ~ ., data = car_prices_train_tbl)

mod_rf_fit
parsnip model object

Ranger result

Call:
 ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~2,      x), num.trees = ~10, min.node.size = min_rows(~5, x), num.threads = 1,      verbose = FALSE, seed = sample.int(10^5, 1)) 

Type:                             Regression 
Number of trees:                  10 
Sample size:                      562 
Number of independent variables:  34 
Mtry:                             2 
Target node size:                 5 
Variable importance mode:         none 
Splitrule:                        variance 
OOB prediction error (MSE):       20428937 
R squared (OOB):                  0.7873951 

Model fit statistics are calculated on the test data with the yardstick::metrics() function, which returns the default regression metrics.

mod_rf_fit %>% 
  predict(car_prices_test_tbl) %>% 
  bind_cols(car_prices_test_tbl) %>% 
  metrics(truth = Price, estimate = .pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard    4042.   
2 rsq     standard       0.882
3 mae     standard    2853.   

These metrics are pretty good, but could they be better? The Random Forest model was built using completely arbitrary values for its hyperparameters, which directly affect how accurate the model can be. Can better performance be obtained by “tuning” or “optimizing” these hyperparameters?

Background

While not all modeling algorithms have hyperparameters, those that do can achieve impressive performance if, among many other factors, the hyperparameters are “tuned” appropriately. “Regular” model parameters, such as the coefficients in a plain-vanilla linear regression, are estimated directly from the data and are internal to the model. Hyperparameters, on the other hand, are external to the model and must be explicitly specified before the model can be built. Examples include the mtry, trees, and min_n arguments to the Random Forest previously shown.

Different values or combinations of values of hyperparameters may or may not be better for different data sets. How do we know which values or combinations of values are “best”? Also, what does “best” mean? The short answer is, at least at first, you cannot be sure of the “best” hyperparameter values. “Best” usually means “optimal” for some defined metric. For example, we may want the hyperparameter values that result in a minimized RMSE value for a regression model or a maximized Area Under the ROC Curve for a classification model.

Often, the process of finding the “optimal” set of hyperparameter values consists of the following steps:

  1. Split the training data set into n pieces, called “folds”
  2. Create a grid of possible hyperparameter value combinations
  3. For each combination, fit a model on n-1 of the folds and calculate the performance metric on the remaining, held-out fold
  4. Repeat step 3 until each fold has served as the held-out fold once
  5. For each combination, take the average of its n performance metrics
  6. Compare the averaged performance metric across combinations
  7. Select the combination with the optimized averaged performance metric
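The steps above can be sketched in base R. This is only an illustration of the procedure, not the tidymodels implementation: it assumes the built-in mtcars data, a linear model of mpg on hp, and the polynomial degree standing in as the single "hyperparameter" to tune. All object names are made up for the example.

```r
set.seed(42)

# Step 1: assign every row to one of n folds
n_folds <- 5
fold_id <- sample(rep(seq_len(n_folds), length.out = nrow(mtcars)))

# Step 2: a grid of candidate hyperparameter values
grid <- data.frame(degree = 1:4)

# Steps 3 to 5: for each candidate, hold out each fold in turn,
# fit on the remaining folds, and average the held-out RMSE
grid$mean_rmse <- sapply(grid$degree, function(d) {
  fold_rmse <- sapply(seq_len(n_folds), function(k) {
    train   <- mtcars[fold_id != k, ]
    holdout <- mtcars[fold_id == k, ]
    fit     <- lm(mpg ~ poly(hp, d), data = train)
    sqrt(mean((holdout$mpg - predict(fit, newdata = holdout))^2))
  })
  mean(fold_rmse)
})

# Steps 6 and 7: compare the averages and keep the lowest
best <- grid[which.min(grid$mean_rmse), ]

grid
best
```

RMSE is minimized here; for a metric such as the Area Under the ROC Curve, the last step would use which.max() instead.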

Hyperparameter Definitions (dials)

For each model algorithm in the parsnip package, the associated tunable hyperparameters are defined in the dials package as functions with the same name as the parameter argument. The functions below define the range of values that each hyperparameter of the Random Forest can take. Note that these default ranges are somewhat subjective, and are based on research done by the tidymodels team.

trees

The value of the trees hyperparameter in the Random Forest can range from 1 tree to 2,000 trees. This is the number of individual trees built for the forest.

trees()
# Trees (quantitative)
Range: [1, 2000]
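Because the default range is only a starting point, it can be inspected and narrowed. The sketch below assumes a narrower range of 100 to 500 trees, chosen purely for illustration, and uses the dials helpers range_get() and value_seq() to look at the bounds and to generate evenly spaced candidate values.

```r
library(dials)

# Default parameter object and its bounds
default_trees <- trees()
range_get(default_trees)

# A narrower, user-chosen range (the bounds 100 and 500 are arbitrary)
narrow_trees <- trees(range = c(100L, 500L))

# Five evenly spaced candidate values across the narrower range
tree_values <- value_seq(narrow_trees, n = 5)
tree_values
```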

min_n

The value of the min_n hyperparameter in the Random Forest can range from 2 to 40. This is the minimum number of observations a node must contain to be allowed to split further.

min_n()
# Minimal Node Size (quantitative)
Range: [2, 40]

mtry

The mtry hyperparameter is different from the previous two. While the lower bound of the range is 1, the default upper bound is unknown. mtry is the number of variables that are randomly sampled at each split in a tree. Since a data set can have a few, dozens, hundreds, or even thousands of variables, the maximum number of variables that can be sampled at each split depends on the data. Hence the ? as the upper bound.

mtry()
# Randomly Selected Predictors (quantitative)
Range: [1, ?]

Fortunately, the dials package does provide tools for determining this upper bound for a specific data set. This will be explored later.
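In the meantime, the unknown bound can be detected programmatically. The sketch below uses has_unknowns() from dials to check a parameter object, and shows that supplying a range by hand is one way to avoid the placeholder. The upper bound of 10 is an arbitrary value chosen for illustration.

```r
library(dials)

# The default mtry() object carries a placeholder for its upper bound
default_mtry <- mtry()
has_unknowns(default_mtry)

# Supplying a range up front avoids the placeholder entirely
# (the upper bound of 10 is an arbitrary, illustrative choice)
bounded_mtry <- mtry(range = c(1L, 10L))
has_unknowns(bounded_mtry)
```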

Hyperparameter Tuning (tune)

The tune package contains a suite of functions for tuning the hyperparameters housed in the dials package. The two packages are specifically designed to work together.

Typically, the process of hyperparameter tuning follows these steps:

  1. Specify which hyperparameters of a parsnip model are to be tuned
  2. Create a grid of possible values for each hyperparameter
  3. Find the “optimal” set of hyperparameters (as outlined previously)
  4. Finalize the model specification with the optimal hyperparameters
  5. Fit the model

Prepare Tuning Parameters

The first step in tuning the hyperparameters of a parsnip model is to specify which hyperparameters for a specific algorithm are to be tuned. This can be done by setting each hyperparameter argument equal to tune() within the set_args() function. The example below shows a standard Random Forest model specification as before, but this time with the mtry, trees, and min_n arguments set to tune().

mod_rf_spec <- rand_forest() %>% 
  set_mode("regression") %>% 
  set_engine("ranger") %>% 
  set_args(
    mtry  = tune(),
    trees = tune(),
    min_n = tune()
  )

mod_rf_spec
Random Forest Model Specification (regression)

Main Arguments:
  mtry = tune()
  trees = tune()
  min_n = tune()

Computational engine: ranger 

Passing this model specification into the tune_args() function confirms that the three hyperparameters are tunable and that they were specified correctly.

tune_args(mod_rf_spec)
# A tibble: 3 × 6
  name  tunable id    source     component   component_id
  <chr> <lgl>   <chr> <chr>      <chr>       <chr>       
1 mtry  TRUE    mtry  model_spec rand_forest <NA>        
2 trees TRUE    trees model_spec rand_forest <NA>        
3 min_n TRUE    min_n model_spec rand_forest <NA>        

As mentioned earlier, the mtry hyperparameter has an unknown upper bound of possible values. This upper bound can be calculated using the training data set. Since mtry is the number of predictor variables randomly sampled at each split, it is limited by the number of predictor variables in the data set. Passing mtry() and the data set of predictor variables into the finalize() function will “finalize” the hyperparameter with an upper bound so that it is ready to be tuned. In the call below, the [ ,-2] index drops the second column, Price, which is the outcome rather than a predictor.

finalize(mtry(), car_prices_train_tbl[ ,-2])
# Randomly Selected Predictors (quantitative)
Range: [1, 34]
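Dropping the outcome by position works here, but it depends on the column order. A slightly more defensive sketch drops it by name instead. The toy_train data frame below is a hypothetical stand-in with made-up values, not the actual processed training set.

```r
library(dials)

# Hypothetical stand-in for a processed training set (values are made up):
# one outcome column (Price) plus three predictors
toy_train <- data.frame(
  Mileage = c(3800, 21800, 28400),
  Price   = c(37000, 36200, 12200),
  Doors   = c(2, 4, 4),
  Cruise  = c(1, 1, 0)
)

# Drop the outcome by name so a change in column order cannot
# silently leave Price among the predictors
toy_predictors <- toy_train[, setdiff(names(toy_train), "Price"), drop = FALSE]

# The upper bound becomes the number of predictor columns
mtry_final <- finalize(mtry(), toy_predictors)
range_get(mtry_final)$upper
```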

Once finalized, the three hyperparameters are ready for the next step of tuning. Passing each of them (with mtry() being passed to the finalize() function) into the parameters() function will gather them together in a format that can be used to generate combinations of potential values to explore.

params <- parameters(
  finalize(mtry(), car_prices_train_tbl[ ,-2]),
  trees(),
  min_n()
)

params
Collection of 3 parameters for tuning

 identifier  type    object
       mtry  mtry nparam[+]
      trees trees nparam[+]
      min_n min_n nparam[+]
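As a sketch of how such a parameter collection might be turned into candidate combinations, the example below passes it to grid_regular() from dials. To stay self-contained it assumes a placeholder data frame with 34 columns in place of the real predictors, and the choice of three levels per hyperparameter is arbitrary.

```r
library(dials)

# Stand-in for the 34 predictor columns of the processed training data
predictors <- as.data.frame(matrix(0, nrow = 5, ncol = 34))

params <- parameters(
  finalize(mtry(), predictors),
  trees(),
  min_n()
)

# Three evenly spaced values per hyperparameter: 3 * 3 * 3 combinations
rf_grid <- grid_regular(params, levels = 3)

dim(rf_grid)
head(rf_grid)
```

A regular grid grows quickly as hyperparameters or levels are added, so the number of levels is worth keeping small when each model fit is expensive.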

Notes

This post is based on a presentation that was given on the date listed. It may be updated from time to time to fix errors, detail new functions, and/or remove deprecated functions, so the packages and R version shown below will likely be newer than what was available at the time.

The R session information used for this post:

R version 4.2.1 (2022-06-23)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 14.1.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] ggplot2_3.4.0   parsnip_1.0.3   recipes_1.0.4   dplyr_1.0.10   
[5] rsample_1.1.1   yardstick_1.1.0 tune_1.0.1      dials_1.1.0    
[9] scales_1.2.1   

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9          lubridate_1.9.0     lattice_0.20-45    
 [4] listenv_0.8.0       tidyr_1.2.1         class_7.3-20       
 [7] digest_0.6.29       ipred_0.9-13        foreach_1.5.2      
[10] utf8_1.2.2          parallelly_1.32.1   ranger_0.14.1      
[13] R6_2.5.1            hardhat_1.2.0       evaluate_0.16      
[16] pillar_1.8.1        rlang_1.1.1         rstudioapi_0.14    
[19] furrr_0.3.1         DiceDesign_1.9      rpart_4.1.16       
[22] Matrix_1.4-1        rmarkdown_2.16      splines_4.2.1      
[25] gower_1.0.1         stringr_1.5.0       munsell_0.5.0      
[28] compiler_4.2.1      xfun_0.40           pkgconfig_2.0.3    
[31] globals_0.16.2      htmltools_0.5.3     nnet_7.3-17        
[34] tidyselect_1.2.0    tibble_3.1.8        prodlim_2019.11.13 
[37] codetools_0.2-18    workflows_1.1.2     GPfit_1.0-8        
[40] future_1.29.0       fansi_1.0.3         withr_2.5.0        
[43] MASS_7.3-57         grid_4.2.1          jsonlite_1.8.0     
[46] gtable_0.3.1        lifecycle_1.0.3     magrittr_2.0.3     
[49] future.apply_1.10.0 cli_3.6.1           stringi_1.7.12     
[52] renv_0.16.0         timeDate_4022.108   ellipsis_0.3.2     
[55] lhs_1.1.6           generics_0.1.3      vctrs_0.6.3        
[58] lava_1.7.1          iterators_1.0.14    tools_4.2.1        
[61] glue_1.6.2          purrr_0.3.5         parallel_4.2.1     
[64] fastmap_1.1.0       survival_3.3-1      yaml_2.3.5         
[67] timechange_0.1.1    colorspace_2.0-3    knitr_1.40