The yardstick Package

Data Science
R
Modeling
Author

Robert Lankford

Published

April 25, 2023

This post covers the yardstick package, which focuses on estimating model performance.

Setup

Packages

The following packages are required:
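These are the packages attached in the session information at the end of this post; the modeldata package is also needed for the data sets, but it is only accessed via data() rather than attached.

library(rsample)   # train/test splits
library(recipes)   # pre-processing steps
library(parsnip)   # model specification and fitting
library(yardstick) # performance metrics
library(dplyr)     # general data manipulation
library(ggplot2)   # plotting (autoplot() output)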

Data

Throughout this series, I utilized the penguins data set from the modeldata package, which has a categorical outcome. Additionally, to demonstrate building models with a continuous outcome, I also utilized the car_prices data set from the modeldata package in a few examples.

Referencing previous entries in this series on the rsample and recipes packages, a 70/30 train/test initial_split() is taken on both data sets, a few pre-processing steps are applied to the training data sets to create recipe() objects, and those objects are passed to the prep() and juice() functions. This creates a processed training data set for each original data set.

data("penguins", package = "modeldata")

set.seed(1914)
penguins_split_obj <- initial_split(penguins, prop = 0.7)

penguins_recipe_obj <- recipe(species ~ ., training(penguins_split_obj)) %>% 
  step_select(species, island, bill_length_mm, body_mass_g) %>% 
  step_naomit(all_predictors()) %>% 
  step_filter(species %in% c("Adelie", "Gentoo")) %>% 
  step_mutate(
    species = factor(species, levels = c("Adelie", "Gentoo")), 
    island = as.factor(island)
  )

penguins_train_tbl <- penguins_recipe_obj %>% 
  prep() %>% 
  juice()

penguins_train_tbl
# A tibble: 194 × 4
   species island    bill_length_mm body_mass_g
   <fct>   <fct>              <dbl>       <int>
 1 Adelie  Biscoe              36.5        2850
 2 Gentoo  Biscoe              52.5        5450
 3 Adelie  Biscoe              40.6        3550
 4 Gentoo  Biscoe              44.9        4750
 5 Adelie  Biscoe              39.6        3500
 6 Gentoo  Biscoe              45.8        4700
 7 Gentoo  Biscoe              46.1        4500
 8 Adelie  Biscoe              37.7        3075
 9 Gentoo  Biscoe              45.7        4400
10 Adelie  Torgersen           37.3        3775
# … with 184 more rows
data("car_prices", package = "modeldata")

set.seed(1915)
car_prices_split_obj <- initial_split(car_prices, prop = 0.7)

car_prices_recipe_obj <- recipe(Price ~ ., training(car_prices_split_obj)) %>% 
  step_select(Price, Mileage, Doors, Leather) %>% 
  step_mutate(Doors = as.factor(Doors), Leather = as.factor(Leather))

car_prices_train_tbl <- car_prices_recipe_obj %>% 
  prep() %>% 
  juice()

car_prices_train_tbl
# A tibble: 562 × 4
    Price Mileage Doors Leather
    <dbl>   <int> <fct> <fct>  
 1 17978.   10986 4     0      
 2 20512.   16633 4     0      
 3 16507.   17451 4     1      
 4 25997.   21433 4     1      
 5 15129.   13828 4     1      
 6 12965.   29707 4     1      
 7 30575.   22298 4     1      
 8 21383.    7287 4     1      
 9 32053.    5144 4     0      
10 12208.   23512 2     1      
# … with 552 more rows

Background

In the previous post on the parsnip package, it was demonstrated how to build various models in the tidymodels ecosystem. The question that follows from that is “are the models any good?”. While a model being “good” is somewhat subjective and depends on the objective of the model, there are several commonly used statistics for estimating model performance. The yardstick package provides functions for calculating many of those statistics.

To use these functions, the following prerequisite steps (at a minimum) are required:

  1. Build a model
  2. Have another data set (such as a testing data set), or the original data set, on which to run the model and receive predictions

To demonstrate a few examples, two models were used: one with a continuous outcome and one with a categorical outcome.

set.seed(1914)
mod_cont_fit <- rand_forest() %>% 
  set_mode("regression") %>% 
  set_engine("randomForest") %>% 
  set_args(mtry = 2, trees = 250, min_n = 10) %>% 
  fit(Price ~ ., data = car_prices_train_tbl)
set.seed(1915)
mod_cat_fit <- rand_forest() %>% 
  set_mode("classification") %>% 
  set_engine("randomForest") %>% 
  set_args(mtry = 1, trees = 16, min_n = 55) %>% 
  fit(species ~ ., data = penguins_train_tbl)

Performance Statistics

The main purpose of the yardstick package is to calculate performance statistics. The documentation details the large number of metrics that are available. Below, a few of the key continuous and categorical metrics are shown.

Continuous Outcome

To calculate performance statistics on a continuous outcome, predictions of those outcomes are first needed. If a recipe() is used, first apply it to the testing data set to ensure that the data is in the correct format for the model. Then, pass the model and the testing data to the parsnip::predict.model_fit() function to get predictions. The actual values are also needed, and bind_cols() can be used to add the actual data to the predicted data. Note that predictions from a parsnip regression model are always returned in a column named .pred, which is the column referenced in the yardstick functions below.

car_prices_test_tbl <- car_prices_recipe_obj %>% 
  prep() %>% 
  bake(testing(car_prices_split_obj))

car_prices_preds_tbl <- mod_cont_fit %>% 
  predict(car_prices_test_tbl) %>% 
  bind_cols(car_prices_test_tbl)

car_prices_preds_tbl
# A tibble: 242 × 5
    .pred  Price Mileage Doors Leather
    <dbl>  <dbl>   <int> <fct> <fct>  
 1 20972. 21725.   13457 2     0      
 2 31168. 29143.   31655 2     1      
 3 20963. 30315.   23635 2     0      
 4 18766. 30251.   27558 2     1      
 5 22288. 27060.   17319 4     1      
 6 22028. 26841.   10003 4     0      
 7 20616. 24852.   22814 4     1      
 8 21478. 26698.   23055 4     1      
 9 22978. 27241.   23204 4     1      
10 22049. 28416.   14613 4     1      
# … with 232 more rows

Once predictions are generated, performance metrics can be calculated. In the yardstick package, each performance metric function has the same interface. Pass the name of the column with the actual data to the truth argument and the name of the column with the predicted data to the estimate argument. Several examples are shown below, including the root mean squared error (rmse()), the mean absolute error (mae()), the mean absolute percentage error (mape()), and R-squared (rsq()).

car_prices_preds_tbl %>% 
  rmse(truth = Price, estimate = .pred)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       9577.
car_prices_preds_tbl %>% 
  mae(truth = Price, estimate = .pred)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 mae     standard       7549.
car_prices_preds_tbl %>% 
  mape(truth = Price, estimate = .pred)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 mape    standard        41.7
car_prices_preds_tbl %>% 
  rsq(truth = Price, estimate = .pred)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rsq     standard      0.0206

Instead of calculating each metric one at a time, the yardstick::metrics() function can be used to calculate several at one time.

car_prices_preds_tbl %>% 
  metrics(truth = Price, estimate = .pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard   9577.    
2 rsq     standard      0.0206
3 mae     standard   7549.    

If additional or different metrics are desired, a custom version of the yardstick::metrics() function, containing any desired performance metrics, can be created using the metric_set() function.

cont_metrics <- metric_set(rmse, mae, mape)

car_prices_preds_tbl %>% 
  cont_metrics(truth = Price, estimate = .pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      9577. 
2 mae     standard      7549. 
3 mape    standard        41.7

Categorical Outcome

To calculate performance statistics on a categorical outcome, predictions of those outcomes are first needed. If a recipe() is used, first apply it to the testing data set to ensure that the data is in the correct format for the model. Then, pass the model and the testing data to the predict.model_fit() function to get predictions. The actual values are also needed, and bind_cols() can be used to add the actual data to the predicted data. Note that class predictions from a parsnip classification model are always returned in a column named .pred_class, which is the column referenced in the yardstick functions below.

penguins_test_tbl <- penguins_recipe_obj %>% 
  prep() %>% 
  bake(testing(penguins_split_obj))

penguins_preds_tbl <- mod_cat_fit %>% 
  predict(penguins_test_tbl) %>% 
  bind_cols(penguins_test_tbl)

penguins_preds_tbl
# A tibble: 104 × 5
   .pred_class species island    bill_length_mm body_mass_g
   <fct>       <fct>   <fct>              <dbl>       <int>
 1 Adelie      Adelie  Torgersen           40.3        3250
 2 Adelie      Adelie  Torgersen           38.9        3625
 3 Adelie      Adelie  Torgersen           37.8        3700
 4 Adelie      Adelie  Torgersen           34.6        4400
 5 Adelie      Adelie  Torgersen           36.6        3700
 6 Adelie      Adelie  Torgersen           38.7        3450
 7 Adelie      Adelie  Torgersen           34.4        3325
 8 Adelie      Adelie  Biscoe              37.7        3600
 9 Adelie      Adelie  Biscoe              40.5        3950
10 Adelie      Adelie  Dream               39.5        3300
# … with 94 more rows

A common way to visualize the performance of a model with a categorical outcome is the confusion matrix. This can quickly be created by passing the predictions and actuals to the conf_mat() function, specifying the column of actuals as the truth argument and the column of predictions as the estimate argument.

penguins_preds_tbl %>% 
  conf_mat(truth = species, estimate = .pred_class)
          Truth
Prediction Adelie Gentoo
    Adelie     45      0
    Gentoo      3     32

Once predictions are generated, performance metrics can be calculated. As with the continuous predictions, pass the name of the column with the actual data to the truth argument and the name of the column with the predicted data to the estimate argument. Several examples are shown below, including sensitivity(), specificity(), recall(), precision(), and accuracy().

penguins_preds_tbl %>% 
  sensitivity(truth = species, estimate = .pred_class)
# A tibble: 1 × 3
  .metric     .estimator .estimate
  <chr>       <chr>          <dbl>
1 sensitivity binary         0.938
penguins_preds_tbl %>% 
  specificity(truth = species, estimate = .pred_class)
# A tibble: 1 × 3
  .metric     .estimator .estimate
  <chr>       <chr>          <dbl>
1 specificity binary             1
penguins_preds_tbl %>% 
  recall(truth = species, estimate = .pred_class)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 recall  binary         0.938
penguins_preds_tbl %>% 
  precision(truth = species, estimate = .pred_class)
# A tibble: 1 × 3
  .metric   .estimator .estimate
  <chr>     <chr>          <dbl>
1 precision binary             1
penguins_preds_tbl %>% 
  accuracy(truth = species, estimate = .pred_class)
# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.962

As with the continuous metrics, instead of calculating each metric one at a time, the yardstick::metrics() function can be used to calculate several at one time.

penguins_preds_tbl %>% 
  metrics(truth = species, estimate = .pred_class)
# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.962
2 kap      binary         0.923

If additional or different metrics are desired, a custom version of the yardstick::metrics() function, containing any desired performance metrics, can be created using the metric_set() function.

cat_metrics <- metric_set(sens, spec, accuracy)

penguins_preds_tbl %>% 
  cat_metrics(truth = species, estimate = .pred_class)
# A tibble: 3 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 sens     binary         0.938
2 spec     binary         1    
3 accuracy binary         0.962

Many of the categorical metrics can be calculated directly from the confusion matrix. Pass the confusion matrix to the summary() function to quickly calculate several performance metrics.

penguins_preds_tbl %>% 
  conf_mat(truth = species, estimate = .pred_class) %>% 
  summary()
# A tibble: 13 × 3
   .metric              .estimator .estimate
   <chr>                <chr>          <dbl>
 1 accuracy             binary         0.962
 2 kap                  binary         0.923
 3 sens                 binary         0.938
 4 spec                 binary         1    
 5 ppv                  binary         1    
 6 npv                  binary         0.914
 7 mcc                  binary         0.926
 8 j_index              binary         0.938
 9 bal_accuracy         binary         0.969
10 detection_prevalence binary         0.562
11 precision            binary         1    
12 recall               binary         0.938
13 f_meas               binary         0.968

Performance Plots

The yardstick package also contains several plotting methods to quickly create nice visual representations of model performance.

Confusion Matrix

By passing the output of conf_mat() to the autoplot() function, a nicer representation of the confusion matrix will be created as a ggplot object.

penguins_preds_tbl %>% 
  conf_mat(truth = species, estimate = .pred_class) %>% 
  autoplot(type = "heatmap")
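The heatmap is not the only built-in option; the autoplot() method for conf_mat objects also accepts type = "mosaic", which draws a mosaic plot of the same counts instead.

penguins_preds_tbl %>% 
  conf_mat(truth = species, estimate = .pred_class) %>% 
  autoplot(type = "mosaic")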

Performance Curves

Another class of plots includes the Receiver Operating Characteristic (ROC) Curve and the Precision-Recall (PR) Curve. These curves help determine the “optimal” probability threshold (i.e., at what predicted probability an observation should be classified into one binary outcome class rather than the other). The predicted probability of belonging to an outcome class, rather than the predicted outcome class itself, can be found by passing the model object and a data set to the parsnip::predict.model_fit() function and specifying the argument type = "prob".

penguins_preds_probs_tbl <- mod_cat_fit %>% 
  predict(penguins_test_tbl, type = "prob") %>% 
  bind_cols(penguins_test_tbl)

The ROC Curve and PR Curve can be plotted using the roc_curve() and pr_curve() functions, respectively, and passing the result to the autoplot() function.

Additionally, the Area Under the ROC Curve (AUC) and the Area Under the PR Curve (AUPRC) can be calculated using the roc_auc() and pr_auc() functions, respectively. These two values are additional metrics used to measure model performance.

penguins_preds_probs_tbl %>% 
  roc_curve(truth = species, estimate = .pred_Adelie) %>% 
  autoplot()

penguins_preds_probs_tbl %>% 
  roc_auc(truth = species, estimate = .pred_Adelie)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.988
penguins_preds_probs_tbl %>% 
  pr_curve(truth = species, estimate = .pred_Adelie) %>% 
  autoplot()

penguins_preds_probs_tbl %>% 
  pr_auc(truth = species, estimate = .pred_Adelie)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 pr_auc  binary         0.995

Other Features

The previous sections cover many of the common uses of the yardstick package. There are a few additional features that can come in handy as well.

Metric Parameters

The first feature is changing the default parameter values of metric functions. Certain metrics are only a function of the underlying data itself, while other metrics have additional parameters that must be supplied. One example is the F-Score, which includes a weighting parameter, beta, whose default value is 1.

penguins_preds_tbl %>% 
  f_meas(truth = species, estimate = .pred_class)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 f_meas  binary         0.968

This can be changed (and a new metric function created) using the metric_tweak() function, as shown below.

f_meas3 <- metric_tweak("f_meas3", f_meas, beta = 3)

penguins_preds_tbl %>% 
  f_meas3(truth = species, estimate = .pred_class)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 f_meas3 binary         0.943

Vectors

The second feature is the ability to apply each metric function to vectors rather than a data frame. Each metric function comes with an associated *_vec() version. Whereas the “regular” metric functions require a data frame and the names of the columns to be passed as arguments, the *_vec() versions allow vectors to be passed directly, with no data frame needed.

car_prices_preds_tbl %>% 
  rmse(truth = Price, estimate = .pred)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       9577.
truth <- car_prices_preds_tbl$Price
estimate <- car_prices_preds_tbl$.pred

rmse_vec(truth = truth, estimate = estimate)
[1] 9577.075
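The same pattern applies to the classification metrics as well; for example, accuracy_vec() accepts the truth and estimate factors directly and should return the same value as the accuracy() call shown earlier (0.962).

accuracy_vec(
  truth    = penguins_preds_tbl$species, 
  estimate = penguins_preds_tbl$.pred_class
)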

Categorical Levels

The third feature is controlling which level of a binary outcome is considered the “positive” event, or the event that is of interest. Binary outcomes may be encoded as “0” and “1” where the “1” is the event of interest; they may also be encoded as any other set of two numbers or even two distinct words that have some specific meaning. In such cases, it may not be immediately clear which of the two levels should be treated as the desired outcome in a binary classification model.

By default, the yardstick package treats the first outcome level (whichever level appears first in the factor levels) as the event of interest. This can be checked by running the internal function yardstick:::yardstick_event_level().

yardstick:::yardstick_event_level()
[1] "first"

Since the first level is the default, that is what is considered the “positive” event when calculating classification metrics.

For example, in the modified penguins data set used here, the “Adelie” species is listed as the first level in the factor. Therefore, when calculating the precision using “Adelie” as the event of interest, a precision of 100% is calculated.
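As a quick check, the factor levels can be inspected directly; given the step_mutate() call in the recipe above, “Adelie” should be the first level.

levels(penguins_preds_tbl$species)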

penguins_preds_tbl %>% 
  precision(truth = species, estimate = .pred_class)
# A tibble: 1 × 3
  .metric   .estimator .estimate
  <chr>     <chr>          <dbl>
1 precision binary             1

However, when the second level of the factor, “Gentoo”, is set as the event of interest by specifying event_level = "second", a different precision of 91.4% is calculated.

penguins_preds_tbl %>% 
  precision(truth = species, estimate = .pred_class, event_level = "second")
# A tibble: 1 × 3
  .metric   .estimator .estimate
  <chr>     <chr>          <dbl>
1 precision binary         0.914

Notes

This post is based on a presentation that was given on the date listed. It may be updated from time to time to fix errors, detail new functions, and/or remove deprecated functions, so the packages and R version will likely be newer than what was available at the time.

The R session information used for this post:

R version 4.2.1 (2022-06-23)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 14.1.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] ggplot2_3.4.0   parsnip_1.0.3   recipes_1.0.4   dplyr_1.0.10   
[5] rsample_1.1.1   yardstick_1.1.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9           lubridate_1.9.0      lattice_0.20-45     
 [4] tidyr_1.2.1          listenv_0.8.0        class_7.3-20        
 [7] digest_0.6.29        ipred_0.9-13         utf8_1.2.2          
[10] parallelly_1.32.1    R6_2.5.1             hardhat_1.2.0       
[13] evaluate_0.16        pillar_1.8.1         rlang_1.1.1         
[16] rstudioapi_0.14      furrr_0.3.1          rpart_4.1.16        
[19] Matrix_1.4-1         rmarkdown_2.16       labeling_0.4.2      
[22] splines_4.2.1        gower_1.0.1          stringr_1.5.0       
[25] munsell_0.5.0        compiler_4.2.1       xfun_0.40           
[28] pkgconfig_2.0.3      globals_0.16.2       htmltools_0.5.3     
[31] nnet_7.3-17          tidyselect_1.2.0     tibble_3.1.8        
[34] prodlim_2019.11.13   codetools_0.2-18     randomForest_4.7-1.1
[37] fansi_1.0.3          future_1.29.0        withr_2.5.0         
[40] MASS_7.3-57          grid_4.2.1           jsonlite_1.8.0      
[43] gtable_0.3.1         lifecycle_1.0.3      magrittr_2.0.3      
[46] scales_1.2.1         future.apply_1.10.0  cli_3.6.1           
[49] stringi_1.7.12       farver_2.1.1         renv_0.16.0         
[52] timeDate_4022.108    ellipsis_0.3.2       generics_0.1.3      
[55] vctrs_0.6.3          lava_1.7.1           tools_4.2.1         
[58] glue_1.6.2           purrr_0.3.5          parallel_4.2.1      
[61] fastmap_1.1.0        survival_3.3-1       yaml_2.3.5          
[64] timechange_0.1.1     colorspace_2.0-3     knitr_1.40