This post covers the yardstick
package, which provides functions for estimating model performance.
Setup
Packages
The following packages are required:
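The attached packages listed in the session information at the end of this post suggest a setup along these lines (a sketch, not necessarily the exact chunk used):

library(rsample)
library(recipes)
library(parsnip)
library(yardstick)
library(dplyr)
library(ggplot2)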
Data
Throughout this series, I used the penguins data set from the modeldata package, which has a categorical outcome. To demonstrate building models with a continuous outcome, I also used the car_prices data set from the same package in a few examples.
Following the previous entries in this series on the rsample
and recipes
packages, a 70/30 train/test initial_split()
is taken on both data sets, a few pre-processing steps are applied to the training data sets to create recipe()
objects, and those objects are passed to the prep()
and juice()
functions. This creates a processed training data set for each original data set.
Code
data("penguins", package = "modeldata")
set.seed(1914)
penguins_split_obj <- initial_split(penguins, prop = 0.7)
penguins_recipe_obj <- recipe(species ~ ., training(penguins_split_obj)) %>%
step_select(species, island, bill_length_mm, body_mass_g) %>%
step_naomit(all_predictors()) %>%
step_filter(species %in% c("Adelie", "Gentoo")) %>%
step_mutate(
species = factor(species, levels = c("Adelie", "Gentoo")),
island = as.factor(island)
)
penguins_train_tbl <- penguins_recipe_obj %>%
prep() %>%
juice()
penguins_train_tbl
# A tibble: 194 × 4
species island bill_length_mm body_mass_g
<fct> <fct> <dbl> <int>
1 Adelie Biscoe 36.5 2850
2 Gentoo Biscoe 52.5 5450
3 Adelie Biscoe 40.6 3550
4 Gentoo Biscoe 44.9 4750
5 Adelie Biscoe 39.6 3500
6 Gentoo Biscoe 45.8 4700
7 Gentoo Biscoe 46.1 4500
8 Adelie Biscoe 37.7 3075
9 Gentoo Biscoe 45.7 4400
10 Adelie Torgersen 37.3 3775
# … with 184 more rows
Code
data("car_prices", package = "modeldata")
set.seed(1915)
car_prices_split_obj <- initial_split(car_prices, prop = 0.7)
car_prices_recipe_obj <- recipe(Price ~ ., training(car_prices_split_obj)) %>%
step_select(Price, Mileage, Doors, Leather) %>%
step_mutate(Doors = as.factor(Doors), Leather = as.factor(Leather))
car_prices_train_tbl <- car_prices_recipe_obj %>%
prep() %>%
juice()
car_prices_train_tbl
# A tibble: 562 × 4
Price Mileage Doors Leather
<dbl> <int> <fct> <fct>
1 17978. 10986 4 0
2 20512. 16633 4 0
3 16507. 17451 4 1
4 25997. 21433 4 1
5 15129. 13828 4 1
6 12965. 29707 4 1
7 30575. 22298 4 1
8 21383. 7287 4 1
9 32053. 5144 4 0
10 12208. 23512 2 1
# … with 552 more rows
Background
The previous post on the parsnip
package demonstrated how to build various models in the tidymodels
ecosystem. The question that follows is: are the models any good? While a model being “good” is somewhat subjective and depends on the objective of the model, there are several commonly used statistics for estimating model performance. The yardstick
package provides functions for calculating many of those statistics.
To use these functions, the following prerequisite steps (at a minimum) are required:
- Build a model
- Have another data set (such as a testing data set), or the original data set, on which to run the model and receive predictions
To demonstrate a few examples, two models were used: one with a continuous outcome and one with a categorical outcome.
set.seed(1914)
mod_cont_fit <- rand_forest() %>%
  set_mode("regression") %>%
  set_engine("randomForest") %>%
  set_args(mtry = 2, trees = 250, min_n = 10) %>%
  fit(Price ~ ., data = car_prices_train_tbl)

set.seed(1915)
mod_cat_fit <- rand_forest() %>%
  set_mode("classification") %>%
  set_engine("randomForest") %>%
  set_args(mtry = 1, trees = 16, min_n = 55) %>%
  fit(species ~ ., data = penguins_train_tbl)
Performance Statistics
The main purpose of the yardstick
package is to calculate performance statistics. The documentation details the large number of metrics that are available. Below, a few of the key continuous and categorical metrics are shown.
Continuous Outcome
To calculate performance statistics on a continuous outcome, predictions of those outcomes are first needed. If a recipe()
is used, first apply it to the testing data set to ensure that the data is in the correct format for the model. Then, pass the model and the testing data to the parsnip::predict.model_fit()
function to get predictions. The actual values are also needed, and bind_cols()
can be used to add the actual data to the predicted data. Note that, when using the yardstick
package, the predicted outcome is always named .pred
.
car_prices_test_tbl <- car_prices_recipe_obj %>%
  prep() %>%
  bake(testing(car_prices_split_obj))

car_prices_preds_tbl <- mod_cont_fit %>%
  predict(car_prices_test_tbl) %>%
  bind_cols(car_prices_test_tbl)
car_prices_preds_tbl
# A tibble: 242 × 5
.pred Price Mileage Doors Leather
<dbl> <dbl> <int> <fct> <fct>
1 20972. 21725. 13457 2 0
2 31168. 29143. 31655 2 1
3 20963. 30315. 23635 2 0
4 18766. 30251. 27558 2 1
5 22288. 27060. 17319 4 1
6 22028. 26841. 10003 4 0
7 20616. 24852. 22814 4 1
8 21478. 26698. 23055 4 1
9 22978. 27241. 23204 4 1
10 22049. 28416. 14613 4 1
# … with 232 more rows
Once predictions are generated, performance metrics can be calculated. In the yardstick
package, each performance metric function has the same interface. Pass the name of the column with the actual data to the truth
argument and the name of the column with the predicted data to the estimate
argument. Several examples are shown below, including:
- Root-Mean Square Error (RMSE)
- Mean Absolute Error (MAE)
- Mean Absolute Percentage Error (MAPE)
- R-Squared
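As a sketch, each of these can be calculated individually; rmse(), mae(), mape(), and rsq() are all yardstick functions that follow the same truth/estimate interface:

car_prices_preds_tbl %>%
  rmse(truth = Price, estimate = .pred)

car_prices_preds_tbl %>%
  mae(truth = Price, estimate = .pred)

car_prices_preds_tbl %>%
  mape(truth = Price, estimate = .pred)

car_prices_preds_tbl %>%
  rsq(truth = Price, estimate = .pred)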
Instead of calculating each metric one at a time, the yardstick::metrics()
function can be used to calculate several at one time.
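The call that produces the output below is likely along these lines:

car_prices_preds_tbl %>%
  metrics(truth = Price, estimate = .pred)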
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 9577.
2 rsq standard 0.0206
3 mae standard 7549.
If additional or different metrics are desired, a custom metric function, analogous to yardstick::metrics()
, can be created with the metric_set()
function, which bundles any desired performance metrics.
cont_metrics <- metric_set(rmse, mae, mape)

car_prices_preds_tbl %>%
  cont_metrics(truth = Price, estimate = .pred)
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 9577.
2 mae standard 7549.
3 mape standard 41.7
Categorical Outcome
To calculate performance statistics on a categorical outcome, predictions of those outcomes are first needed. If a recipe()
is used, first apply it to the testing data set to ensure that the data is in the correct format for the model. Then, pass the model and the testing data to the predict.model_fit()
function to get predictions. The actual values are also needed, and bind_cols()
can be used to add the actual data to the predicted data. Note that, when using the yardstick
package, the predicted outcome class is always named .pred_class
.
penguins_test_tbl <- penguins_recipe_obj %>%
  prep() %>%
  bake(testing(penguins_split_obj))

penguins_preds_tbl <- mod_cat_fit %>%
  predict(penguins_test_tbl) %>%
  bind_cols(penguins_test_tbl)
penguins_preds_tbl
# A tibble: 104 × 5
.pred_class species island bill_length_mm body_mass_g
<fct> <fct> <fct> <dbl> <int>
1 Adelie Adelie Torgersen 40.3 3250
2 Adelie Adelie Torgersen 38.9 3625
3 Adelie Adelie Torgersen 37.8 3700
4 Adelie Adelie Torgersen 34.6 4400
5 Adelie Adelie Torgersen 36.6 3700
6 Adelie Adelie Torgersen 38.7 3450
7 Adelie Adelie Torgersen 34.4 3325
8 Adelie Adelie Biscoe 37.7 3600
9 Adelie Adelie Biscoe 40.5 3950
10 Adelie Adelie Dream 39.5 3300
# … with 94 more rows
A common way to visualize the performance of a model with a categorical outcome is the confusion matrix. This can quickly be created by passing the predictions and actuals to the conf_mat()
function, specifying the name of the column of actuals as the truth
argument and the name of the column of predictions as the estimate
argument.
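The call producing the matrix below is likely along these lines:

penguins_preds_tbl %>%
  conf_mat(truth = species, estimate = .pred_class)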
Truth
Prediction Adelie Gentoo
Adelie 45 0
Gentoo 3 32
Once predictions are generated, performance metrics can be calculated. As with the continuous predictions, pass the name of the column with the actual data to the truth
argument and the name of the column with the predicted data to the estimate
argument. Several examples are shown below, including:
- Sensitivity
- Specificity
penguins_preds_tbl %>%
  sensitivity(truth = species, estimate = .pred_class)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 sensitivity binary 0.938
penguins_preds_tbl %>%
  specificity(truth = species, estimate = .pred_class)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 specificity binary 1
As with the continuous metrics, instead of calculating each metric one at a time, the yardstick::metrics()
function can be used to calculate several at one time.
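The call producing the output below is likely along these lines:

penguins_preds_tbl %>%
  metrics(truth = species, estimate = .pred_class)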
# A tibble: 2 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.962
2 kap binary 0.923
If additional or different metrics are desired, a custom metric function, analogous to yardstick::metrics()
, can be created with the metric_set()
function, which bundles any desired performance metrics.
cat_metrics <- metric_set(sens, spec, accuracy)

penguins_preds_tbl %>%
  cat_metrics(truth = species, estimate = .pred_class)
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 sens binary 0.938
2 spec binary 1
3 accuracy binary 0.962
Many of the categorical metrics can be calculated directly from the confusion matrix. Pass the confusion matrix to the summary()
function to quickly calculate several performance metrics at once.
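A sketch of such a call, piping the confusion matrix into summary():

penguins_preds_tbl %>%
  conf_mat(truth = species, estimate = .pred_class) %>%
  summary()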
# A tibble: 13 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.962
2 kap binary 0.923
3 sens binary 0.938
4 spec binary 1
5 ppv binary 1
6 npv binary 0.914
7 mcc binary 0.926
8 j_index binary 0.938
9 bal_accuracy binary 0.969
10 detection_prevalence binary 0.562
11 precision binary 1
12 recall binary 0.938
13 f_meas binary 0.968
Performance Plots
The yardstick
package also contains several plotting methods to quickly create nice visual representations of model performance.
Confusion Matrix
By passing the output of conf_mat()
to the autoplot()
function, a nicer representation of the confusion matrix will be created as a ggplot
object.
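A minimal sketch (the type = "heatmap" argument is one of the available options; the default is "mosaic"):

penguins_preds_tbl %>%
  conf_mat(truth = species, estimate = .pred_class) %>%
  autoplot(type = "heatmap")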
Performance Curves
Another class of plots are the Receiver Operating Characteristic (ROC) Curve and the Precision-Recall (PR) Curve. These curves help to determine the “optimal” probability threshold (i.e., at what predicted probability should an observation be classified into a specific binary outcome class relative to the other?). The predicted probability of belonging to an outcome class, rather than the predicted outcome class itself, can be found by passing the model object and a data set to the parsnip::predict.model_fit()
function and specifying the argument type = "prob"
.
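A sketch of generating those probabilities (penguins_probs_tbl is a name introduced here for illustration; the probability columns follow the parsnip convention .pred_<level>, here .pred_Adelie and .pred_Gentoo):

penguins_probs_tbl <- mod_cat_fit %>%
  predict(penguins_test_tbl, type = "prob") %>%
  bind_cols(penguins_test_tbl)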
The ROC Curve and PR Curve can be plotted using the roc_curve()
and pr_curve()
functions, respectively, and passing the result to the autoplot()
function.
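A minimal sketch, using the penguins_probs_tbl from above and the first-level probability column .pred_Adelie:

penguins_probs_tbl %>%
  roc_curve(truth = species, .pred_Adelie) %>%
  autoplot()

penguins_probs_tbl %>%
  pr_curve(truth = species, .pred_Adelie) %>%
  autoplot()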
Additionally, the Area Under the ROC Curve (AUC) and Area Under the PR Curve (AUPRC) can also be calculated using the roc_auc()
and pr_auc()
functions, respectively. These two values are additional metrics for measuring model performance.
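A sketch of those calls, again using penguins_probs_tbl:

penguins_probs_tbl %>%
  roc_auc(truth = species, .pred_Adelie)

penguins_probs_tbl %>%
  pr_auc(truth = species, .pred_Adelie)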
Other Features
The previous sections cover many of the common uses of the yardstick
package. There are a few additional features that can come in handy as well.
Metric Parameters
The first feature is changing the default parameter values of metric functions. Certain metrics are a function of the underlying data alone, while others have additional parameters. One example is the F-score, which includes a weighting parameter, beta, whose default value is 1.
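The call with the default beta that produces the output below is likely along these lines:

penguins_preds_tbl %>%
  f_meas(truth = species, estimate = .pred_class)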
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 f_meas binary 0.968
This can be changed (and a new metric function created) using the metric_tweak()
function, as shown below.
f_meas3 <- metric_tweak("f_meas3", f_meas, beta = 3)

penguins_preds_tbl %>%
  f_meas3(truth = species, estimate = .pred_class)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 f_meas3 binary 0.943
Vectors
The second feature is the ability to apply each metric function to vectors rather than a data frame. Each metric function comes with an associated *_vec
version. Whereas the “regular” metric functions require a data frame and the name of the columns to be passed as arguments, the *_vec
versions allow for vectors to be passed as arguments with no data frame needed.
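A minimal sketch using two of the *_vec versions, sens_vec() and rmse_vec():

sens_vec(
  truth = penguins_preds_tbl$species,
  estimate = penguins_preds_tbl$.pred_class
)

rmse_vec(
  truth = car_prices_preds_tbl$Price,
  estimate = car_prices_preds_tbl$.pred
)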
Categorical Levels
The third feature is controlling which level of a binary outcome is considered the “positive” event, or the event that is of interest. Binary outcomes may be encoded as “0” and “1” where the “1” is the event of interest; they may also be encoded as any other set of two numbers or even two distinct words that have some specific meaning. In such cases, it may not be immediately clear which of the two levels should be treated as the desired outcome in a binary classification model.
By default, the yardstick
package considers the first outcome level (whichever one appears first in the factor levels) to be the event of interest. This can be checked by running the internal function yardstick:::yardstick_event_level()
.
yardstick:::yardstick_event_level()
[1] "first"
Since the first level is the default, that is what is considered the “positive” event when calculating classification metrics.
For example, in the modified penguins data set used here, the “Adelie” species is listed as the first level in the factor. Therefore, when calculating the precision using “Adelie” as the event of interest, a precision of 100% is calculated.
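The call producing the output below is likely along these lines:

penguins_preds_tbl %>%
  precision(truth = species, estimate = .pred_class)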
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 precision binary 1
However, if the second level of the factor, “Gentoo”, is set as the event of interest by passing the argument event_level = "second"
, a different precision of 91.4% is calculated.
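A sketch of the same call with the event level switched:

penguins_preds_tbl %>%
  precision(truth = species, estimate = .pred_class, event_level = "second")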
Notes
This post is based on a presentation that was given on the date listed. It may be updated from time to time to fix errors, detail new functions, and/or remove deprecated functions, so the packages and R version may be newer than what was available at the time.
The R session information used for this post:
R version 4.2.1 (2022-06-23)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 14.1.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] ggplot2_3.4.0 parsnip_1.0.3 recipes_1.0.4 dplyr_1.0.10
[5] rsample_1.1.1 yardstick_1.1.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.9 lubridate_1.9.0 lattice_0.20-45
[4] tidyr_1.2.1 listenv_0.8.0 class_7.3-20
[7] digest_0.6.29 ipred_0.9-13 utf8_1.2.2
[10] parallelly_1.32.1 R6_2.5.1 hardhat_1.2.0
[13] evaluate_0.16 pillar_1.8.1 rlang_1.1.1
[16] rstudioapi_0.14 furrr_0.3.1 rpart_4.1.16
[19] Matrix_1.4-1 rmarkdown_2.16 labeling_0.4.2
[22] splines_4.2.1 gower_1.0.1 stringr_1.5.0
[25] munsell_0.5.0 compiler_4.2.1 xfun_0.40
[28] pkgconfig_2.0.3 globals_0.16.2 htmltools_0.5.3
[31] nnet_7.3-17 tidyselect_1.2.0 tibble_3.1.8
[34] prodlim_2019.11.13 codetools_0.2-18 randomForest_4.7-1.1
[37] fansi_1.0.3 future_1.29.0 withr_2.5.0
[40] MASS_7.3-57 grid_4.2.1 jsonlite_1.8.0
[43] gtable_0.3.1 lifecycle_1.0.3 magrittr_2.0.3
[46] scales_1.2.1 future.apply_1.10.0 cli_3.6.1
[49] stringi_1.7.12 farver_2.1.1 renv_0.16.0
[52] timeDate_4022.108 ellipsis_0.3.2 generics_0.1.3
[55] vctrs_0.6.3 lava_1.7.1 tools_4.2.1
[58] glue_1.6.2 purrr_0.3.5 parallel_4.2.1
[61] fastmap_1.1.0 survival_3.3-1 yaml_2.3.5
[64] timechange_0.1.1 colorspace_2.0-3 knitr_1.40