This post covers the rsample package, which is all about data resampling.
Setup
Packages
The following packages are required:
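The chunk that loaded the packages did not survive in this version of the post. Judging from the session information at the end (dplyr and rsample are the only attached packages, and modeldata is only reached through data(..., package = "modeldata")), a minimal setup probably looks like the following. Treat it as an assumption rather than the original chunk.

```r
# Assumed setup, inferred from the session information at the end of the post
library(dplyr)    # %>%, as_tibble(), filter(), select()
library(rsample)  # initial_split(), vfold_cv(), bootstraps(), ...
```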
Data
Throughout this series, I utilized the penguins data set from the modeldata package.
data("penguins", package = "modeldata")
penguins_tbl <- penguins %>%
as_tibble() %>%
filter(!if_any(everything(), is.na))
penguins_tbl
# A tibble: 333 × 7
species island bill_length_mm bill_depth_mm flipper_length…¹ body_…² sex
<fct> <fct> <dbl> <dbl> <int> <int> <fct>
1 Adelie Torgersen 39.1 18.7 181 3750 male
2 Adelie Torgersen 39.5 17.4 186 3800 fema…
3 Adelie Torgersen 40.3 18 195 3250 fema…
4 Adelie Torgersen 36.7 19.3 193 3450 fema…
5 Adelie Torgersen 39.3 20.6 190 3650 male
6 Adelie Torgersen 38.9 17.8 181 3625 fema…
7 Adelie Torgersen 39.2 19.6 195 4675 male
8 Adelie Torgersen 41.1 17.6 182 3200 fema…
9 Adelie Torgersen 38.6 21.2 191 3800 male
10 Adelie Torgersen 34.6 21.1 198 4400 male
# … with 323 more rows, and abbreviated variable names ¹flipper_length_mm,
# ²body_mass_g
Initial Data Splitting
Training and Testing
A primary use of the rsample package is to split a data set into a training data set and a testing data set. This can be done with the initial_split() function. Specify the proportion of the data you want in the training data set using the prop argument.
set.seed(1914)
initial_split_obj <- initial_split(penguins_tbl, prop = 0.7)
The initial_split() function creates an rsplit object. Printing it shows, from left to right, the number of rows in the training data set (233), the number of rows in the testing data set (100), and the total number of rows in the original data set (333).
initial_split_obj
<Training/Testing/Total>
<233/100/333>
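The counts in the printout can also be read programmatically. The sketch below uses a stand-in data frame with the same number of rows as penguins_tbl (toy_tbl and its columns are made up for illustration) and checks that the two pieces are disjoint.

```r
library(rsample)

set.seed(1914)

# Stand-in data with 333 rows; `id` lets us track which rows end up where
toy_tbl <- data.frame(id = 1:333, x = rnorm(333))

toy_split <- initial_split(toy_tbl, prop = 0.7)

n_train <- nrow(training(toy_split))  # rows assigned to the training set
n_test  <- nrow(testing(toy_split))   # rows assigned to the testing set

# Ids that appear in both sets (there should be none)
overlap <- intersect(training(toy_split)$id, testing(toy_split)$id)

c(train = n_train, test = n_test, overlap = length(overlap))
```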
To extract the training data set, pass the rsplit object to the training() function.
train_tbl <- training(initial_split_obj)
train_tbl
# A tibble: 233 × 7
species island bill_length_mm bill_depth_mm flipper_length_mm body_…¹ sex
<fct> <fct> <dbl> <dbl> <int> <int> <fct>
1 Adelie Biscoe 36.4 17.1 184 2850 fema…
2 Gentoo Biscoe 52.1 17 230 5550 male
3 Chinstrap Dream 50.8 18.5 201 4450 male
4 Adelie Dream 37.2 18.1 178 3900 male
5 Gentoo Biscoe 45.5 14.5 212 4750 fema…
6 Adelie Biscoe 39 17.5 186 3550 fema…
7 Gentoo Biscoe 48.2 15.6 221 5100 male
8 Chinstrap Dream 52 19 197 4150 male
9 Gentoo Biscoe 45.4 14.6 211 4800 fema…
10 Adelie Biscoe 38.1 17 181 3175 fema…
# … with 223 more rows, and abbreviated variable name ¹body_mass_g
To extract the testing data set, pass the rsplit object to the testing() function.
test_tbl <- testing(initial_split_obj)
test_tbl
# A tibble: 100 × 7
species island bill_length_mm bill_depth_mm flipper_length…¹ body_…² sex
<fct> <fct> <dbl> <dbl> <int> <int> <fct>
1 Adelie Torgersen 40.3 18 195 3250 fema…
2 Adelie Torgersen 39.2 19.6 195 4675 male
3 Adelie Torgersen 38.7 19 195 3450 fema…
4 Adelie Torgersen 46 21.5 194 4200 male
5 Adelie Biscoe 37.8 18.3 174 3400 fema…
6 Adelie Biscoe 37.7 18.7 180 3600 male
7 Adelie Biscoe 38.2 18.1 185 3950 male
8 Adelie Biscoe 40.6 18.6 183 3550 male
9 Adelie Dream 36.4 17 195 3325 fema…
10 Adelie Dream 42.2 18.5 180 3550 fema…
# … with 90 more rows, and abbreviated variable names ¹flipper_length_mm,
# ²body_mass_g
Cross-Validation
Another common resampling method is cross-validation. Cross-validation is the process by which a data set (often the training data) is split into smaller segments.
Detailed descriptions of cross-validation can be found here and here. The main idea is that one subset of the training data is used to build a model and the other subset is used to estimate the model’s performance on new data. This gives you an out-of-sample performance estimate without touching the testing data set, which is often useful for processes such as tuning model hyperparameters.
V-Fold
V-Fold (also called K-Fold) Cross-Validation may be the most popular form of cross-validation. The training data is randomly split into V folds of similar size. From there, a model is built on V-1 folds and its performance is estimated using the remaining fold. Additional robustness can be added to this procedure by repeating it multiple times. A good diagram of how V-Fold Cross-Validation splits data can be seen here.
Creating splits for V-Fold Cross-Validation can be done by using the vfold_cv() function. Specify the number of folds using the v argument and the number of times to repeat the splits using the repeats argument.
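The call that created cv_splits_obj is not shown here. Judging from the printed output that follows (10 folds, 3 repeats, 233 rows), it was presumably close to vfold_cv(train_tbl, v = 10, repeats = 3). A self-contained sketch of that call on a stand-in data frame (toy_train_tbl and the seed are assumptions, not taken from the original post):

```r
library(rsample)

# Stand-in for train_tbl: same number of rows, made-up columns
toy_train_tbl <- data.frame(id = 1:233, x = seq(0, 1, length.out = 233))

set.seed(1915)  # arbitrary seed; rows are assigned to folds at random
toy_cv_splits <- vfold_cv(toy_train_tbl, v = 10, repeats = 3)

# One row per fold per repeat
nrow(toy_cv_splits)

# Size of the held-out fold in each split
fold_sizes <- vapply(
  toy_cv_splits$splits,
  function(s) nrow(assessment(s)),
  integer(1)
)
table(fold_sizes)
```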
The vfold_cv() function creates an rset object. Printing it shows, from left to right, the individual rsplit objects for each fold, the id of each repeat, and the id of each fold within each repeat.
cv_splits_obj
# 10-fold cross-validation repeated 3 times
# A tibble: 30 × 3
splits id id2
<list> <chr> <chr>
1 <split [209/24]> Repeat1 Fold01
2 <split [209/24]> Repeat1 Fold02
3 <split [209/24]> Repeat1 Fold03
4 <split [210/23]> Repeat1 Fold04
5 <split [210/23]> Repeat1 Fold05
6 <split [210/23]> Repeat1 Fold06
7 <split [210/23]> Repeat1 Fold07
8 <split [210/23]> Repeat1 Fold08
9 <split [210/23]> Repeat1 Fold09
10 <split [210/23]> Repeat1 Fold10
# … with 20 more rows
New terminology is needed when extracting the data from these splits. Since we already have a training and a testing data set, the folds of the training data on which the model is built are called the analysis data, while the fold on which the performance is estimated is called the assessment data. More information can be found here.
To extract the analysis data set, pass the individual rsplit object to the analysis() function.
analysis(cv_splits_obj$splits[[1]])
# A tibble: 209 × 7
species island bill_length_mm bill_depth_mm flipper_length_mm body_…¹ sex
<fct> <fct> <dbl> <dbl> <int> <int> <fct>
1 Adelie Biscoe 36.4 17.1 184 2850 fema…
2 Gentoo Biscoe 52.1 17 230 5550 male
3 Chinstrap Dream 50.8 18.5 201 4450 male
4 Adelie Dream 37.2 18.1 178 3900 male
5 Gentoo Biscoe 45.5 14.5 212 4750 fema…
6 Adelie Biscoe 39 17.5 186 3550 fema…
7 Gentoo Biscoe 48.2 15.6 221 5100 male
8 Chinstrap Dream 52 19 197 4150 male
9 Gentoo Biscoe 45.4 14.6 211 4800 fema…
10 Adelie Biscoe 38.1 17 181 3175 fema…
# … with 199 more rows, and abbreviated variable name ¹body_mass_g
To extract the assessment data set, pass the individual rsplit object to the assessment() function.
assessment(cv_splits_obj$splits[[1]])
# A tibble: 24 × 7
species island bill_length_mm bill_depth_mm flipper_length_mm body_…¹ sex
<fct> <fct> <dbl> <dbl> <int> <int> <fct>
1 Chinstrap Dream 45.6 19.4 194 3525 fema…
2 Adelie Biscoe 39.6 20.7 191 3900 fema…
3 Chinstrap Dream 50.3 20 197 3300 male
4 Adelie Dream 38.8 20 190 3950 male
5 Gentoo Biscoe 49.4 15.8 216 4925 male
6 Gentoo Biscoe 50.4 15.7 222 5750 male
7 Gentoo Biscoe 43.6 13.9 217 4900 fema…
8 Chinstrap Dream 45.2 16.6 191 3250 fema…
9 Gentoo Biscoe 40.9 13.7 214 4650 fema…
10 Gentoo Biscoe 46.5 14.5 213 4400 fema…
# … with 14 more rows, and abbreviated variable name ¹body_mass_g
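Putting the two accessors together, the fit-on-analysis, score-on-assessment loop described earlier can be sketched as follows. The model (a linear regression on the built-in mtcars data) and the RMSE metric are illustrative choices, not part of the original post.

```r
library(rsample)

set.seed(1916)  # arbitrary seed
folds <- vfold_cv(mtcars, v = 5)

# For one split: fit on the analysis rows, score on the assessment rows
fold_rmse <- function(split) {
  fit  <- lm(mpg ~ wt + hp, data = analysis(split))
  held <- assessment(split)
  pred <- predict(fit, newdata = held)
  sqrt(mean((held$mpg - pred)^2))
}

rmse_by_fold <- vapply(folds$splits, fold_rmse, numeric(1))

rmse_by_fold        # one out-of-sample estimate per fold
mean(rmse_by_fold)  # resampled estimate of model performance
```

The testing data set is never touched in this loop, which is the point of resampling the training data.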
Leave-One-Out
Leave-One-Out Cross-Validation is a “special case” of V-Fold Cross-Validation in which the assessment data set is a single data point. Thus, the number of splits is equal to the number of data points. Since the training data set has 233 rows, there will be 233 splits, each with 232 data points in the analysis data and the remaining one data point in the assessment data.
Implementing Leave-One-Out Cross-Validation can be done using the loo_cv() function.
loo_splits_obj <- loo_cv(train_tbl)
The loo_cv() function also creates an rset object. Since the resampling was not repeated, there is only one id column indicating the id of the fold.
loo_splits_obj
# Leave-one-out cross-validation
# A tibble: 233 × 2
splits id
<list> <chr>
1 <split [232/1]> Resample1
2 <split [232/1]> Resample2
3 <split [232/1]> Resample3
4 <split [232/1]> Resample4
5 <split [232/1]> Resample5
6 <split [232/1]> Resample6
7 <split [232/1]> Resample7
8 <split [232/1]> Resample8
9 <split [232/1]> Resample9
10 <split [232/1]> Resample10
# … with 223 more rows
To extract the analysis data set, pass the individual rsplit object to the analysis() function.
analysis(loo_splits_obj$splits[[1]])
# A tibble: 232 × 7
species island bill_length_mm bill_depth_mm flipper_length_mm body_…¹ sex
<fct> <fct> <dbl> <dbl> <int> <int> <fct>
1 Adelie Biscoe 36.4 17.1 184 2850 fema…
2 Gentoo Biscoe 52.1 17 230 5550 male
3 Chinstrap Dream 50.8 18.5 201 4450 male
4 Adelie Dream 37.2 18.1 178 3900 male
5 Gentoo Biscoe 45.5 14.5 212 4750 fema…
6 Adelie Biscoe 39 17.5 186 3550 fema…
7 Gentoo Biscoe 48.2 15.6 221 5100 male
8 Chinstrap Dream 52 19 197 4150 male
9 Gentoo Biscoe 45.4 14.6 211 4800 fema…
10 Adelie Biscoe 38.1 17 181 3175 fema…
# … with 222 more rows, and abbreviated variable name ¹body_mass_g
To extract the assessment data set, pass the individual rsplit object to the assessment() function.
assessment(loo_splits_obj$splits[[1]])
# A tibble: 1 × 7
species island bill_length_mm bill_depth_mm flipper_length_mm body_mas…¹ sex
<fct> <fct> <dbl> <dbl> <int> <int> <fct>
1 Adelie Biscoe 39.7 17.7 193 3200 fema…
# … with abbreviated variable name ¹body_mass_g
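A quick way to convince yourself of the one-row-per-split structure is to run loo_cv() on a tiny stand-in data frame (toy_tbl is made up for illustration) and check that every row is held out exactly once.

```r
library(rsample)

toy_tbl <- data.frame(id = 1:6, y = c(2.1, 3.4, 1.9, 5.0, 4.2, 3.3))
toy_loo <- loo_cv(toy_tbl)

# The single held-out id from each split
held_out <- vapply(toy_loo$splits, function(s) assessment(s)$id, integer(1))

nrow(toy_loo)   # one split per row of the data
sort(held_out)  # each row serves as the assessment set exactly once
```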
Monte Carlo
Monte Carlo Cross-Validation is similar to V-Fold Cross-Validation. In V-Fold Cross-Validation, the folds are mutually exclusive and collectively exhaustive (i.e., no data point appears in more than one fold, and taking the union of all folds would result in the original data set). Monte Carlo Cross-Validation, on the other hand, draws a new random sample from all of the original data for each resample, so the assessment sets are not necessarily mutually exclusive.
Implementing Monte Carlo Cross-Validation can be done using the mc_cv() function. Specify the proportion of data to be included in the analysis data using the prop argument and the number of times to repeat the sampling (i.e., the number of resulting folds) using the times argument.
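The call that created mc_splits_obj is not shown here. From the printed output that follows (a 0.7/0.3 split with 10 resamples), it was presumably close to mc_cv(train_tbl, prop = 0.7, times = 10). The sketch below runs that call on a stand-in data frame (toy_train_tbl and the seed are assumptions) and also counts how often each row is held out, to show that assessment sets from different resamples can share rows.

```r
library(rsample)

# Stand-in for train_tbl: same number of rows, made-up columns
toy_train_tbl <- data.frame(id = 1:233, x = seq(0, 1, length.out = 233))

set.seed(1916)  # arbitrary seed
toy_mc_splits <- mc_cv(toy_train_tbl, prop = 0.7, times = 10)

# Every resample has the same analysis/assessment sizes
first_split <- toy_mc_splits$splits[[1]]
c(
  analysis   = nrow(analysis(first_split)),
  assessment = nrow(assessment(first_split))
)

# How often each row is held out across the 10 resamples;
# counts above 1 mean the assessment sets overlap
held_out_ids   <- unlist(lapply(toy_mc_splits$splits, function(s) assessment(s)$id))
times_held_out <- table(held_out_ids)
max(times_held_out)
```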
The mc_cv() function also creates an rset object. There is only one id column, which identifies each resample.
mc_splits_obj
# Monte Carlo cross-validation (0.7/0.3) with 10 resamples
# A tibble: 10 × 2
splits id
<list> <chr>
1 <split [163/70]> Resample01
2 <split [163/70]> Resample02
3 <split [163/70]> Resample03
4 <split [163/70]> Resample04
5 <split [163/70]> Resample05
6 <split [163/70]> Resample06
7 <split [163/70]> Resample07
8 <split [163/70]> Resample08
9 <split [163/70]> Resample09
10 <split [163/70]> Resample10
To extract the analysis data set, pass the individual rsplit object to the analysis() function.
analysis(mc_splits_obj$splits[[1]])
# A tibble: 163 × 7
species island bill_length_mm bill_depth_mm flipper_leng…¹ body_…² sex
<fct> <fct> <dbl> <dbl> <int> <int> <fct>
1 Gentoo Biscoe 49.9 16.1 213 5400 male
2 Adelie Dream 41.3 20.3 194 3550 male
3 Chinstrap Dream 45.7 17.3 193 3600 fema…
4 Adelie Torgersen 41.1 17.6 182 3200 fema…
5 Adelie Biscoe 39.6 20.7 191 3900 fema…
6 Adelie Biscoe 37.8 20 190 4250 male
7 Adelie Biscoe 38.6 17.2 199 3750 fema…
8 Gentoo Biscoe 44.4 17.3 219 5250 male
9 Gentoo Biscoe 55.9 17 228 5600 male
10 Gentoo Biscoe 50 15.2 218 5700 male
# … with 153 more rows, and abbreviated variable names ¹flipper_length_mm,
# ²body_mass_g
To extract the assessment data set, pass the individual rsplit object to the assessment() function.
assessment(mc_splits_obj$splits[[1]])
# A tibble: 70 × 7
species island bill_length_mm bill_depth_mm flipper_leng…¹ body_…² sex
<fct> <fct> <dbl> <dbl> <int> <int> <fct>
1 Chinstrap Dream 52 19 197 4150 male
2 Gentoo Biscoe 47.7 15 216 4750 fema…
3 Chinstrap Dream 55.8 19.8 207 4000 male
4 Adelie Dream 36 17.9 190 3450 fema…
5 Chinstrap Dream 50 19.5 196 3900 male
6 Adelie Torgersen 39.5 17.4 186 3800 fema…
7 Adelie Torgersen 38.8 17.6 191 3275 fema…
8 Gentoo Biscoe 59.6 17 230 6050 male
9 Chinstrap Dream 51 18.8 203 4100 male
10 Adelie Dream 39.5 17.8 188 3300 fema…
# … with 60 more rows, and abbreviated variable names ¹flipper_length_mm,
# ²body_mass_g
Splitting Time Series
Splitting data that have an explicit order in time requires a different approach than other types of data. While in the previous section random samples were used to split the data, doing so for time series would not make sense. For example, if we put last week’s data in the testing set and next week’s data in the training set, then we would be building a model on next week’s data to predict last week’s.
Since the penguins data set does not have a time component, I will switch to using a subset of the Chicago data set from the modeldata package.
data("Chicago", package = "modeldata")
chicago_tbl <- Chicago %>%
as_tibble() %>%
select(date, ridership) %>%
filter(date >= "2012-01-01", date <= "2012-12-31")
chicago_tbl
# A tibble: 366 × 2
date ridership
<date> <dbl>
1 2012-01-01 3.88
2 2012-01-02 4.19
3 2012-01-03 17.0
4 2012-01-04 17.5
5 2012-01-05 18.4
6 2012-01-06 18.3
7 2012-01-07 5.11
8 2012-01-08 4.24
9 2012-01-09 18.6
10 2012-01-10 18.9
# … with 356 more rows
Simple
Perhaps the most straightforward way to split a time series is to simply pick a date in time. All data on or before that date will be placed in the training data and all data after that date will be placed in the testing data set. This way, you are always using the past to predict the future.
The initial_time_split() function can be used for this procedure. Specify the proportion of the data to be placed in the training data using the prop argument.
time_split_obj <- initial_time_split(chicago_tbl, prop = 0.7)
The initial_time_split() function creates an rsplit object, like the initial_split() function, with the same interpretation. Notice above that there is no need to specify a random seed because a random sample is not used.
time_split_obj
<Training/Testing/Total>
<256/110/366>
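Because the split is positional, every training date should come before every testing date. A small check on stand-in daily data (toy_ts_tbl and its values are made up; only the 366-day span mirrors chicago_tbl):

```r
library(rsample)

dates <- seq(as.Date("2012-01-01"), as.Date("2012-12-31"), by = "day")
toy_ts_tbl <- data.frame(date = dates, value = seq_along(dates))

toy_time_split <- initial_time_split(toy_ts_tbl, prop = 0.7)

last_train_date <- max(training(toy_time_split)$date)
first_test_date <- min(testing(toy_time_split)$date)

# The training period ends the day before the testing period starts
c(last_train = format(last_train_date), first_test = format(first_test_date))
```

Note that initial_time_split() simply takes the first rows for training, so the data should already be sorted by time before splitting.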
To extract the training data set, pass the rsplit object to the training() function.
train_ts_tbl <- training(time_split_obj)
train_ts_tbl
# A tibble: 256 × 2
date ridership
<date> <dbl>
1 2012-01-01 3.88
2 2012-01-02 4.19
3 2012-01-03 17.0
4 2012-01-04 17.5
5 2012-01-05 18.4
6 2012-01-06 18.3
7 2012-01-07 5.11
8 2012-01-08 4.24
9 2012-01-09 18.6
10 2012-01-10 18.9
# … with 246 more rows
To extract the testing data set, pass the rsplit object to the testing() function.
test_ts_tbl <- testing(time_split_obj)
test_ts_tbl
# A tibble: 110 × 2
date ridership
<date> <dbl>
1 2012-09-13 20.9
2 2012-09-14 20.2
3 2012-09-15 7.29
4 2012-09-16 5.65
5 2012-09-17 19.9
6 2012-09-18 21.2
7 2012-09-19 20.9
8 2012-09-20 21.3
9 2012-09-21 20.6
10 2012-09-22 5.58
# … with 100 more rows
Rolling
Often, simply splitting the data into two groups by an arbitrary date is not good enough. There are more sophisticated methods of resampling a time series. For example, you can use a rolling origin. The previous link provides an excellent discussion of what a rolling origin is and how it is used.
There are a few ways to implement a rolling origin split. It splits a data set (e.g., the training data) into smaller chunks, like Cross-Validation. As implemented below, the first chunk contains some amount of data for the analysis data and some other amount (usually smaller, but always in chronological order) for the assessment data. The next chunk contains the same data for the analysis as the previous plus the next chronological chunk of data (which may or may not be the entirety of the previous chunk’s assessment data) and a new assessment set of the same size as the previous set. In other words, the analysis set is anchored to an initial point in time and grows with each successive chunk and the assessment data stays the same size and continues to move in time to be right after the analysis data.
A rolling origin can be implemented using the rolling_origin() function. Specify the initial size of the analysis data set using the initial argument and the size of the assessment data sets using the assess argument. The skip argument may be a bit difficult to understand at first. Reading the documentation is helpful. In the example below, skip is set to 14 (the assessment data size minus 1) so that, for each chunk, the previous assessment set is added to the new analysis set.
Note also that there is a cumulative argument that defaults to TRUE. Setting it to FALSE will not anchor the rolling origin to the first data point chronologically. The analysis data set will stay at the initial size and continue to move into the future for each successive chunk. In this way, the origin of each analysis data set truly “rolls” forward.
ro_splits_obj <- rolling_origin(
train_ts_tbl,
initial = 30,
assess = 15,
skip = 14
)
The rolling_origin() function also creates an rset object with each data split and an identifier.
ro_splits_obj
# Rolling origin forecast resampling
# A tibble: 15 × 2
splits id
<list> <chr>
1 <split [30/15]> Slice01
2 <split [45/15]> Slice02
3 <split [60/15]> Slice03
4 <split [75/15]> Slice04
5 <split [90/15]> Slice05
6 <split [105/15]> Slice06
7 <split [120/15]> Slice07
8 <split [135/15]> Slice08
9 <split [150/15]> Slice09
10 <split [165/15]> Slice10
11 <split [180/15]> Slice11
12 <split [195/15]> Slice12
13 <split [210/15]> Slice13
14 <split [225/15]> Slice14
15 <split [240/15]> Slice15
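To see what skip does, it can help to compare the number of slices with and without it. The sketch uses a stand-in series of 256 rows, the same length as train_ts_tbl (toy_ts_tbl is made up for illustration).

```r
library(rsample)

dates <- seq(as.Date("2012-01-01"), by = "day", length.out = 256)
toy_ts_tbl <- data.frame(date = dates, value = seq_along(dates))

# skip = 0 (the default): the origin advances one row at a time
ro_every_day <- rolling_origin(toy_ts_tbl, initial = 30, assess = 15)

# skip = 14: the origin advances 15 rows at a time,
# so consecutive assessment sets do not overlap
ro_skip_14 <- rolling_origin(toy_ts_tbl, initial = 30, assess = 15, skip = 14)

c(skip_0 = nrow(ro_every_day), skip_14 = nrow(ro_skip_14))

# Growth of the analysis set from one slice to the next
analysis_growth <- diff(
  vapply(ro_skip_14$splits, function(s) nrow(analysis(s)), integer(1))
)
unique(analysis_growth)
```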
To extract the analysis data set, pass the individual rsplit object to the analysis() function. The first split contains 30 data points and ends at 2012-01-30. The second split contains 45 data points and ends at 2012-02-14.
Split 1
analysis(ro_splits_obj$splits[[1]])
# A tibble: 30 × 2
date ridership
<date> <dbl>
1 2012-01-01 3.88
2 2012-01-02 4.19
3 2012-01-03 17.0
4 2012-01-04 17.5
5 2012-01-05 18.4
6 2012-01-06 18.3
7 2012-01-07 5.11
8 2012-01-08 4.24
9 2012-01-09 18.6
10 2012-01-10 18.9
# … with 20 more rows
Split 2
analysis(ro_splits_obj$splits[[2]])
# A tibble: 45 × 2
date ridership
<date> <dbl>
1 2012-01-01 3.88
2 2012-01-02 4.19
3 2012-01-03 17.0
4 2012-01-04 17.5
5 2012-01-05 18.4
6 2012-01-06 18.3
7 2012-01-07 5.11
8 2012-01-08 4.24
9 2012-01-09 18.6
10 2012-01-10 18.9
# … with 35 more rows
To extract the assessment data set, pass the individual rsplit object to the assessment() function. Both splits contain 15 data points.
Split 1
assessment(ro_splits_obj$splits[[1]])
# A tibble: 15 × 2
date ridership
<date> <dbl>
1 2012-01-31 19.6
2 2012-02-01 19.3
3 2012-02-02 19.2
4 2012-02-03 18.6
5 2012-02-04 4.77
6 2012-02-05 3.34
7 2012-02-06 18.7
8 2012-02-07 19.4
9 2012-02-08 19.4
10 2012-02-09 19.6
11 2012-02-10 18.8
12 2012-02-11 4.56
13 2012-02-12 3.74
14 2012-02-13 15.7
15 2012-02-14 19.8
Split 2
assessment(ro_splits_obj$splits[[2]])
# A tibble: 15 × 2
date ridership
<date> <dbl>
1 2012-02-15 19.9
2 2012-02-16 19.6
3 2012-02-17 19.2
4 2012-02-18 5.62
5 2012-02-19 4.30
6 2012-02-20 11.4
7 2012-02-21 20.0
8 2012-02-22 19.9
9 2012-02-23 19.9
10 2012-02-24 18.4
11 2012-02-25 5.31
12 2012-02-26 4.00
13 2012-02-27 19.3
14 2012-02-28 19.3
15 2012-02-29 19.8
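The non-cumulative variant described earlier is not run in the post. A sketch on a stand-in series (toy_ts_tbl is made up; the other arguments mirror the rolling_origin() call used for ro_splits_obj):

```r
library(rsample)

dates <- seq(as.Date("2012-01-01"), by = "day", length.out = 256)
toy_ts_tbl <- data.frame(date = dates, value = seq_along(dates))

ro_fixed_window <- rolling_origin(
  toy_ts_tbl,
  initial    = 30,
  assess     = 15,
  skip       = 14,
  cumulative = FALSE  # keep the analysis window at `initial` rows
)

# Analysis size stays fixed while the start date moves forward
analysis_sizes <- vapply(
  ro_fixed_window$splits,
  function(s) nrow(analysis(s)),
  integer(1)
)
analysis_starts <- vapply(
  ro_fixed_window$splits,
  function(s) format(min(analysis(s)$date)),
  character(1)
)

unique(analysis_sizes)
head(analysis_starts, 3)
```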
Sliding
A sliding period is similar to a rolling origin. As implemented below, the sliding period breaks the training data up by month. Therefore, the first split has the first month as the analysis data and the second month as the assessment data, the second split has the second month as the analysis data and the third month as the assessment data, and so on. While the previous rolling origin resample took a static number of days for each iteration, the sliding period below takes in a specific time period (month) that may or may not have the same number of days for each iteration.
A sliding period can be implemented using the sliding_period() function. Specify the date index using the index argument and the period by which to split using the period argument. There are more arguments for fine-tuning that are detailed in the documentation. Below is a very simple implementation of a sliding period.
sp_splits_obj <- sliding_period(
train_ts_tbl,
index = date,
period = "month"
)
The sliding_period() function also creates an rset object with each data split and an identifier.
sp_splits_obj
# Sliding period resampling
# A tibble: 8 × 2
splits id
<list> <chr>
1 <split [31/29]> Slice1
2 <split [29/31]> Slice2
3 <split [31/30]> Slice3
4 <split [30/31]> Slice4
5 <split [31/30]> Slice5
6 <split [30/31]> Slice6
7 <split [31/31]> Slice7
8 <split [31/12]> Slice8
To extract the analysis data set, pass the individual rsplit object to the analysis() function. Notice that each successive analysis set is the next full calendar month.
Split 1
analysis(sp_splits_obj$splits[[1]])
# A tibble: 31 × 2
date ridership
<date> <dbl>
1 2012-01-01 3.88
2 2012-01-02 4.19
3 2012-01-03 17.0
4 2012-01-04 17.5
5 2012-01-05 18.4
6 2012-01-06 18.3
7 2012-01-07 5.11
8 2012-01-08 4.24
9 2012-01-09 18.6
10 2012-01-10 18.9
# … with 21 more rows
Split 2
analysis(sp_splits_obj$splits[[2]])
# A tibble: 29 × 2
date ridership
<date> <dbl>
1 2012-02-01 19.3
2 2012-02-02 19.2
3 2012-02-03 18.6
4 2012-02-04 4.77
5 2012-02-05 3.34
6 2012-02-06 18.7
7 2012-02-07 19.4
8 2012-02-08 19.4
9 2012-02-09 19.6
10 2012-02-10 18.8
# … with 19 more rows
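To extract the assessment data set, pass the individual rsplit object to the assessment() function. Notice that each successive assessment set is the next full calendar month (except for the last slice, whose assessment set has only 12 rows because the training data end partway through September).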
Split 1
assessment(sp_splits_obj$splits[[1]])
# A tibble: 29 × 2
date ridership
<date> <dbl>
1 2012-02-01 19.3
2 2012-02-02 19.2
3 2012-02-03 18.6
4 2012-02-04 4.77
5 2012-02-05 3.34
6 2012-02-06 18.7
7 2012-02-07 19.4
8 2012-02-08 19.4
9 2012-02-09 19.6
10 2012-02-10 18.8
# … with 19 more rows
Split 2
assessment(sp_splits_obj$splits[[2]])
# A tibble: 31 × 2
date ridership
<date> <dbl>
1 2012-03-01 19.8
2 2012-03-02 19.1
3 2012-03-03 5.18
4 2012-03-04 4.12
5 2012-03-05 16.8
6 2012-03-06 20.6
7 2012-03-07 20.1
8 2012-03-08 20.1
9 2012-03-09 19.0
10 2012-03-10 6.00
# … with 21 more rows
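As one example of the fine-tuning arguments, lookback widens the analysis window to include earlier periods. The sketch below uses lookback = 1 on a stand-in daily series covering the same dates as train_ts_tbl (toy_ts_tbl is made up), so each analysis set should span two calendar months instead of one.

```r
library(rsample)

dates <- seq(as.Date("2012-01-01"), as.Date("2012-09-12"), by = "day")
toy_ts_tbl <- data.frame(date = dates, value = seq_along(dates))

sp_lookback <- sliding_period(
  toy_ts_tbl,
  index    = date,
  period   = "month",
  lookback = 1  # current month plus the one before it
)

first_split <- sp_lookback$splits[[1]]

range(analysis(first_split)$date)    # first and last analysis dates
range(assessment(first_split)$date)  # first and last assessment dates
```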
Bootstrapping
Bootstrapping is similar to Monte Carlo Cross-Validation, but with one key difference. While Monte Carlo, for each split, randomly samples the original data set without replacement, Bootstrapping randomly samples with replacement. For Monte Carlo, the analysis and assessment data sets will always be the same size for each fold, because of the lack of replacement. That is not true for Bootstrapping. It is possible for a single data point to be included multiple times in a single fold’s analysis data. When that is the case, the assessment data set, which contains all data points not chosen for the analysis data set, will be larger. Therefore, bootstrapped analysis data sets will always be the same size, but their assessment data sets will not necessarily be.
A bootstrapped resample can be implemented using the bootstraps() function. Specify the number of splits to make using the times argument. There are more arguments for fine-tuning that are outlined in the documentation. Below is a very simple implementation of a bootstrapped resample.
set.seed(1917)
bootstrap_splits_obj <- bootstraps(train_tbl, times = 10)
The bootstraps() function creates an rset object, interpreted in the same way as the others.
bootstrap_splits_obj
# Bootstrap sampling
# A tibble: 10 × 2
splits id
<list> <chr>
1 <split [233/79]> Bootstrap01
2 <split [233/84]> Bootstrap02
3 <split [233/85]> Bootstrap03
4 <split [233/78]> Bootstrap04
5 <split [233/87]> Bootstrap05
6 <split [233/91]> Bootstrap06
7 <split [233/83]> Bootstrap07
8 <split [233/89]> Bootstrap08
9 <split [233/86]> Bootstrap09
10 <split [233/74]> Bootstrap10
To extract the analysis data set, pass the individual rsplit object to the analysis() function.
analysis(bootstrap_splits_obj$splits[[1]])
# A tibble: 233 × 7
species island bill_length_mm bill_depth_mm flipper_leng…¹ body_…² sex
<fct> <fct> <dbl> <dbl> <int> <int> <fct>
1 Chinstrap Dream 50.2 18.8 202 3800 male
2 Adelie Torgersen 35.2 15.9 186 3050 fema…
3 Chinstrap Dream 50.5 18.4 200 3400 fema…
4 Adelie Biscoe 35.7 16.9 185 3150 fema…
5 Gentoo Biscoe 46.5 13.5 210 4550 fema…
6 Chinstrap Dream 45.7 17.3 193 3600 fema…
7 Adelie Dream 33.1 16.1 178 2900 fema…
8 Gentoo Biscoe 46.1 13.2 211 4500 fema…
9 Adelie Dream 38.8 20 190 3950 male
10 Chinstrap Dream 50.5 18.4 200 3400 fema…
# … with 223 more rows, and abbreviated variable names ¹flipper_length_mm,
# ²body_mass_g
To extract the assessment data set, pass the individual rsplit object to the assessment() function.
assessment(bootstrap_splits_obj$splits[[1]])
# A tibble: 79 × 7
species island bill_length_mm bill_depth_mm flipper_leng…¹ body_…² sex
<fct> <fct> <dbl> <dbl> <int> <int> <fct>
1 Gentoo Biscoe 52.1 17 230 5550 male
2 Gentoo Biscoe 45.5 14.5 212 4750 fema…
3 Gentoo Biscoe 49.3 15.7 217 5850 male
4 Gentoo Biscoe 45.2 13.8 215 4750 fema…
5 Gentoo Biscoe 45.3 13.7 210 4300 fema…
6 Adelie Biscoe 40.5 18.9 180 3950 male
7 Adelie Dream 36 17.9 190 3450 fema…
8 Chinstrap Dream 50 19.5 196 3900 male
9 Gentoo Biscoe 49.1 14.8 220 5150 fema…
10 Adelie Torgersen 44.1 18 210 4000 male
# … with 69 more rows, and abbreviated variable names ¹flipper_length_mm,
# ²body_mass_g
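The relationship between repeated rows in the analysis set and the size of the assessment set can be checked directly. A sketch on a stand-in data frame with 233 rows (toy_train_tbl is made up for illustration, and the seed will not reproduce the splits shown in this post):

```r
library(rsample)

# Stand-in for train_tbl: same number of rows, made-up columns
toy_train_tbl <- data.frame(id = 1:233, x = seq(0, 1, length.out = 233))

set.seed(1917)
toy_boot <- bootstraps(toy_train_tbl, times = 10)

first_boot   <- toy_boot$splits[[1]]
analysis_ids <- analysis(first_boot)$id

n_analysis <- length(analysis_ids)          # always the size of the original data
n_distinct <- length(unique(analysis_ids))  # fewer, because some rows repeat
n_assess   <- nrow(assessment(first_boot))  # the rows that were never drawn

c(analysis = n_analysis, distinct = n_distinct, assessment = n_assess)
```

On average, roughly 1/e (about 37%) of the rows are left out of a bootstrap sample, which is consistent with the assessment sizes of 74 to 91 rows printed for bootstrap_splits_obj.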
Notes
This post is based on a presentation that was given on the date listed. It may be updated from time to time to fix errors, detail new functions, and/or remove deprecated functions, so the packages and R version will likely be newer than what was available at the time.
The R session information used for this post:
R version 4.2.1 (2022-06-23)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 14.0
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] dplyr_1.0.10 rsample_1.1.1
loaded via a namespace (and not attached):
[1] pillar_1.8.1 compiler_4.2.1 tools_4.2.1 digest_0.6.29
[5] jsonlite_1.8.0 evaluate_0.16 lifecycle_1.0.3 tibble_3.1.8
[9] pkgconfig_2.0.3 rlang_1.1.1 cli_3.6.1 rstudioapi_0.14
[13] yaml_2.3.5 parallel_4.2.1 xfun_0.40 warp_0.2.0
[17] fastmap_1.1.0 furrr_0.3.1 withr_2.5.0 stringr_1.5.0
[21] knitr_1.40 generics_0.1.3 vctrs_0.6.3 globals_0.16.2
[25] tidyselect_1.2.0 glue_1.6.2 listenv_0.8.0 R6_2.5.1
[29] fansi_1.0.3 parallelly_1.32.1 rmarkdown_2.16 slider_0.3.0
[33] purrr_0.3.5 tidyr_1.2.1 magrittr_2.0.3 codetools_0.2-18
[37] htmltools_0.5.3 future_1.29.0 renv_0.16.0 utf8_1.2.2
[41] stringi_1.7.12