The rsample Package

Data Science
R
Sampling
Author

Robert Lankford

Published

January 24, 2023

This post covers the rsample package, which provides tools for splitting and resampling data.

Setup

Packages

The following packages are required:
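
Per the session information at the end of this post, that amounts to rsample itself and dplyr:

library(rsample) # data splitting and resampling
library(dplyr)   # pipes and light data manipulation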

Data

Throughout this series, I use the penguins data set from the modeldata package.

data("penguins", package = "modeldata")

penguins_tbl <- penguins %>% 
  as_tibble() %>% 
  filter(!if_any(everything(), is.na))

penguins_tbl
# A tibble: 333 × 7
   species island    bill_length_mm bill_depth_mm flipper_length…¹ body_…² sex  
   <fct>   <fct>              <dbl>         <dbl>            <int>   <int> <fct>
 1 Adelie  Torgersen           39.1          18.7              181    3750 male 
 2 Adelie  Torgersen           39.5          17.4              186    3800 fema…
 3 Adelie  Torgersen           40.3          18                195    3250 fema…
 4 Adelie  Torgersen           36.7          19.3              193    3450 fema…
 5 Adelie  Torgersen           39.3          20.6              190    3650 male 
 6 Adelie  Torgersen           38.9          17.8              181    3625 fema…
 7 Adelie  Torgersen           39.2          19.6              195    4675 male 
 8 Adelie  Torgersen           41.1          17.6              182    3200 fema…
 9 Adelie  Torgersen           38.6          21.2              191    3800 male 
10 Adelie  Torgersen           34.6          21.1              198    4400 male 
# … with 323 more rows, and abbreviated variable names ¹​flipper_length_mm,
#   ²​body_mass_g

Initial Data Splitting

Training and Testing

A primary use of the rsample package is to split a data set into a training data set and a testing data set. This can be done with the initial_split() function. Specify the proportion of the data you want in the training data set using the prop argument.

set.seed(1914)
initial_split_obj <- initial_split(penguins_tbl, prop = 0.7)

The initial_split() function creates an rsplit object. Printing it shows, from left to right, the number of rows in the training data set (233), the number of rows in the testing data set (100), and the total number of rows in the original data set (333).

initial_split_obj
<Training/Testing/Total>
<233/100/333>

To extract the training data set, pass the rsplit object to the training() function.

train_tbl <- training(initial_split_obj)

train_tbl
# A tibble: 233 × 7
   species   island bill_length_mm bill_depth_mm flipper_length_mm body_…¹ sex  
   <fct>     <fct>           <dbl>         <dbl>             <int>   <int> <fct>
 1 Adelie    Biscoe           36.4          17.1               184    2850 fema…
 2 Gentoo    Biscoe           52.1          17                 230    5550 male 
 3 Chinstrap Dream            50.8          18.5               201    4450 male 
 4 Adelie    Dream            37.2          18.1               178    3900 male 
 5 Gentoo    Biscoe           45.5          14.5               212    4750 fema…
 6 Adelie    Biscoe           39            17.5               186    3550 fema…
 7 Gentoo    Biscoe           48.2          15.6               221    5100 male 
 8 Chinstrap Dream            52            19                 197    4150 male 
 9 Gentoo    Biscoe           45.4          14.6               211    4800 fema…
10 Adelie    Biscoe           38.1          17                 181    3175 fema…
# … with 223 more rows, and abbreviated variable name ¹​body_mass_g

To extract the testing data set, pass the rsplit object to the testing() function.

test_tbl <- testing(initial_split_obj)

test_tbl
# A tibble: 100 × 7
   species island    bill_length_mm bill_depth_mm flipper_length…¹ body_…² sex  
   <fct>   <fct>              <dbl>         <dbl>            <int>   <int> <fct>
 1 Adelie  Torgersen           40.3          18                195    3250 fema…
 2 Adelie  Torgersen           39.2          19.6              195    4675 male 
 3 Adelie  Torgersen           38.7          19                195    3450 fema…
 4 Adelie  Torgersen           46            21.5              194    4200 male 
 5 Adelie  Biscoe              37.8          18.3              174    3400 fema…
 6 Adelie  Biscoe              37.7          18.7              180    3600 male 
 7 Adelie  Biscoe              38.2          18.1              185    3950 male 
 8 Adelie  Biscoe              40.6          18.6              183    3550 male 
 9 Adelie  Dream               36.4          17                195    3325 fema…
10 Adelie  Dream               42.2          18.5              180    3550 fema…
# … with 90 more rows, and abbreviated variable names ¹​flipper_length_mm,
#   ²​body_mass_g

Cross-Validation

Another common resampling method is cross-validation. Cross-validation is the process by which a data set (often the training data) is split into smaller segments.

Detailed descriptions of cross-validation can be found here and here. The main idea is that one subset of the training data is used to build a model and another subset is used to estimate the model’s performance on new data. This gives an out-of-sample performance estimate without touching the testing data set, which is often useful for processes such as tuning model hyperparameters.

V-Fold

V-Fold (also called K-Fold) Cross-Validation may be the most popular form of cross-validation. The training data is randomly split into V folds of similar size. From there, a model is built on V-1 folds and its performance is estimated using the remaining fold. Additional robustness can be added to this procedure by repeating it multiple times. A good diagram of how V-Fold Cross-Validation splits data can be seen here.

Creating splits for V-Fold Cross-Validation can be done by using the vfold_cv() function. Specify the number of folds using the v argument and the number of times to repeat the splits using the repeats argument.

set.seed(1915)
cv_splits_obj <- vfold_cv(train_tbl, v = 10, repeats = 3)

The vfold_cv() function creates an rset object. Printing it shows, from left to right, the individual rsplit objects for each fold, the id of each repeat, and the id of each fold within each repeat.

cv_splits_obj
#  10-fold cross-validation repeated 3 times 
# A tibble: 30 × 3
   splits           id      id2   
   <list>           <chr>   <chr> 
 1 <split [209/24]> Repeat1 Fold01
 2 <split [209/24]> Repeat1 Fold02
 3 <split [209/24]> Repeat1 Fold03
 4 <split [210/23]> Repeat1 Fold04
 5 <split [210/23]> Repeat1 Fold05
 6 <split [210/23]> Repeat1 Fold06
 7 <split [210/23]> Repeat1 Fold07
 8 <split [210/23]> Repeat1 Fold08
 9 <split [210/23]> Repeat1 Fold09
10 <split [210/23]> Repeat1 Fold10
# … with 20 more rows

New terminology is needed when extracting the data from these splits. Since we already have a training and a testing data set, the folds of the training data on which the model is built are called the analysis data, while the fold on which the performance is estimated is called the assessment data. More information can be found here.

To extract the analysis data set, pass the individual rsplit object to the analysis() function.

analysis(cv_splits_obj$splits[[1]])
# A tibble: 209 × 7
   species   island bill_length_mm bill_depth_mm flipper_length_mm body_…¹ sex  
   <fct>     <fct>           <dbl>         <dbl>             <int>   <int> <fct>
 1 Adelie    Biscoe           36.4          17.1               184    2850 fema…
 2 Gentoo    Biscoe           52.1          17                 230    5550 male 
 3 Chinstrap Dream            50.8          18.5               201    4450 male 
 4 Adelie    Dream            37.2          18.1               178    3900 male 
 5 Gentoo    Biscoe           45.5          14.5               212    4750 fema…
 6 Adelie    Biscoe           39            17.5               186    3550 fema…
 7 Gentoo    Biscoe           48.2          15.6               221    5100 male 
 8 Chinstrap Dream            52            19                 197    4150 male 
 9 Gentoo    Biscoe           45.4          14.6               211    4800 fema…
10 Adelie    Biscoe           38.1          17                 181    3175 fema…
# … with 199 more rows, and abbreviated variable name ¹​body_mass_g

To extract the assessment data set, pass the individual rsplit object to the assessment() function.

assessment(cv_splits_obj$splits[[1]])
# A tibble: 24 × 7
   species   island bill_length_mm bill_depth_mm flipper_length_mm body_…¹ sex  
   <fct>     <fct>           <dbl>         <dbl>             <int>   <int> <fct>
 1 Chinstrap Dream            45.6          19.4               194    3525 fema…
 2 Adelie    Biscoe           39.6          20.7               191    3900 fema…
 3 Chinstrap Dream            50.3          20                 197    3300 male 
 4 Adelie    Dream            38.8          20                 190    3950 male 
 5 Gentoo    Biscoe           49.4          15.8               216    4925 male 
 6 Gentoo    Biscoe           50.4          15.7               222    5750 male 
 7 Gentoo    Biscoe           43.6          13.9               217    4900 fema…
 8 Chinstrap Dream            45.2          16.6               191    3250 fema…
 9 Gentoo    Biscoe           40.9          13.7               214    4650 fema…
10 Gentoo    Biscoe           46.5          14.5               213    4400 fema…
# … with 14 more rows, and abbreviated variable name ¹​body_mass_g
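
To see how these folds are typically consumed, here is a minimal sketch that fits a simple linear model (body mass as a function of flipper length, chosen purely for illustration) to each analysis set and estimates the RMSE on the matching assessment set:

cv_rmse <- sapply(cv_splits_obj$splits, function(split) {
  # build the model on the analysis data for this fold
  fit <- lm(body_mass_g ~ flipper_length_mm, data = analysis(split))
  # estimate out-of-sample error on the held-out assessment data
  holdout <- assessment(split)
  sqrt(mean((holdout$body_mass_g - predict(fit, newdata = holdout))^2))
})

mean(cv_rmse) # average RMSE across all 30 resamples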

Leave-One-Out

Leave-One-Out Cross-Validation is a “special case” of V-Fold Cross-Validation in which the assessment data set is a single data point. Thus, the number of splits is equal to the number of data points. Since the training data set has 233 rows, there will be 233 splits, each with 232 data points in the analysis data and the remaining one data point in the assessment data.

Implementing Leave-One-Out Cross-Validation can be done using the loo_cv() function.

loo_splits_obj <- loo_cv(train_tbl)

The loo_cv() function also creates an rset object. Since there are no repeats, there is only a single id column identifying each split.

loo_splits_obj
# Leave-one-out cross-validation 
# A tibble: 233 × 2
   splits          id        
   <list>          <chr>     
 1 <split [232/1]> Resample1 
 2 <split [232/1]> Resample2 
 3 <split [232/1]> Resample3 
 4 <split [232/1]> Resample4 
 5 <split [232/1]> Resample5 
 6 <split [232/1]> Resample6 
 7 <split [232/1]> Resample7 
 8 <split [232/1]> Resample8 
 9 <split [232/1]> Resample9 
10 <split [232/1]> Resample10
# … with 223 more rows

To extract the analysis data set, pass the individual rsplit object to the analysis() function.

analysis(loo_splits_obj$splits[[1]])
# A tibble: 232 × 7
   species   island bill_length_mm bill_depth_mm flipper_length_mm body_…¹ sex  
   <fct>     <fct>           <dbl>         <dbl>             <int>   <int> <fct>
 1 Adelie    Biscoe           36.4          17.1               184    2850 fema…
 2 Gentoo    Biscoe           52.1          17                 230    5550 male 
 3 Chinstrap Dream            50.8          18.5               201    4450 male 
 4 Adelie    Dream            37.2          18.1               178    3900 male 
 5 Gentoo    Biscoe           45.5          14.5               212    4750 fema…
 6 Adelie    Biscoe           39            17.5               186    3550 fema…
 7 Gentoo    Biscoe           48.2          15.6               221    5100 male 
 8 Chinstrap Dream            52            19                 197    4150 male 
 9 Gentoo    Biscoe           45.4          14.6               211    4800 fema…
10 Adelie    Biscoe           38.1          17                 181    3175 fema…
# … with 222 more rows, and abbreviated variable name ¹​body_mass_g

To extract the assessment data set, pass the individual rsplit object to the assessment() function.

assessment(loo_splits_obj$splits[[1]])
# A tibble: 1 × 7
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mas…¹ sex  
  <fct>   <fct>           <dbl>         <dbl>             <int>      <int> <fct>
1 Adelie  Biscoe           39.7          17.7               193       3200 fema…
# … with abbreviated variable name ¹​body_mass_g

Monte Carlo

Monte Carlo Cross-Validation is similar to V-Fold Cross-Validation. In V-Fold Cross-Validation, the folds are mutually exclusive and exhaustive (i.e., every row appears in exactly one fold, and taking the union of all folds would result in the original data set). Monte Carlo Cross-Validation, on the other hand, draws a fresh random sample from the original data for each resample, so the resamples are not necessarily mutually exclusive.

Implementing Monte Carlo Cross-Validation can be done using the mc_cv() function. Specify the proportion of data to be included in the analysis data using the prop argument and the number of times to repeat the sampling (i.e., the number of resulting folds) using the times argument.

set.seed(1916)
mc_splits_obj <- mc_cv(train_tbl, prop = 0.7, times = 10)

The mc_cv() function also creates an rset object with a single id column identifying each resample.

mc_splits_obj
# Monte Carlo cross-validation (0.7/0.3) with 10 resamples  
# A tibble: 10 × 2
   splits           id        
   <list>           <chr>     
 1 <split [163/70]> Resample01
 2 <split [163/70]> Resample02
 3 <split [163/70]> Resample03
 4 <split [163/70]> Resample04
 5 <split [163/70]> Resample05
 6 <split [163/70]> Resample06
 7 <split [163/70]> Resample07
 8 <split [163/70]> Resample08
 9 <split [163/70]> Resample09
10 <split [163/70]> Resample10

To extract the analysis data set, pass the individual rsplit object to the analysis() function.

analysis(mc_splits_obj$splits[[1]])
# A tibble: 163 × 7
   species   island    bill_length_mm bill_depth_mm flipper_leng…¹ body_…² sex  
   <fct>     <fct>              <dbl>         <dbl>          <int>   <int> <fct>
 1 Gentoo    Biscoe              49.9          16.1            213    5400 male 
 2 Adelie    Dream               41.3          20.3            194    3550 male 
 3 Chinstrap Dream               45.7          17.3            193    3600 fema…
 4 Adelie    Torgersen           41.1          17.6            182    3200 fema…
 5 Adelie    Biscoe              39.6          20.7            191    3900 fema…
 6 Adelie    Biscoe              37.8          20              190    4250 male 
 7 Adelie    Biscoe              38.6          17.2            199    3750 fema…
 8 Gentoo    Biscoe              44.4          17.3            219    5250 male 
 9 Gentoo    Biscoe              55.9          17              228    5600 male 
10 Gentoo    Biscoe              50            15.2            218    5700 male 
# … with 153 more rows, and abbreviated variable names ¹​flipper_length_mm,
#   ²​body_mass_g

To extract the assessment data set, pass the individual rsplit object to the assessment() function.

assessment(mc_splits_obj$splits[[1]])
# A tibble: 70 × 7
   species   island    bill_length_mm bill_depth_mm flipper_leng…¹ body_…² sex  
   <fct>     <fct>              <dbl>         <dbl>          <int>   <int> <fct>
 1 Chinstrap Dream               52            19              197    4150 male 
 2 Gentoo    Biscoe              47.7          15              216    4750 fema…
 3 Chinstrap Dream               55.8          19.8            207    4000 male 
 4 Adelie    Dream               36            17.9            190    3450 fema…
 5 Chinstrap Dream               50            19.5            196    3900 male 
 6 Adelie    Torgersen           39.5          17.4            186    3800 fema…
 7 Adelie    Torgersen           38.8          17.6            191    3275 fema…
 8 Gentoo    Biscoe              59.6          17              230    6050 male 
 9 Chinstrap Dream               51            18.8            203    4100 male 
10 Adelie    Dream               39.5          17.8            188    3300 fema…
# … with 60 more rows, and abbreviated variable names ¹​flipper_length_mm,
#   ²​body_mass_g

Splitting Time Series

Splitting data that have an explicit order in time requires a different approach than splitting other types of data. While random samples were used to split the data in the previous sections, doing so for a time series would not make sense. For example, if we put last week’s data in the testing set and next week’s data in the training set, we would be building a model that uses the future to predict the past.

Since the penguins data set does not have a time component, I will switch to using a subset of the Chicago data set from the modeldata package.

data("Chicago", package = "modeldata")

chicago_tbl <- Chicago %>% 
  as_tibble() %>% 
  select(date, ridership) %>% 
  filter(date >= "2012-01-01", date <= "2012-12-31")

chicago_tbl
# A tibble: 366 × 2
   date       ridership
   <date>         <dbl>
 1 2012-01-01      3.88
 2 2012-01-02      4.19
 3 2012-01-03     17.0 
 4 2012-01-04     17.5 
 5 2012-01-05     18.4 
 6 2012-01-06     18.3 
 7 2012-01-07      5.11
 8 2012-01-08      4.24
 9 2012-01-09     18.6 
10 2012-01-10     18.9 
# … with 356 more rows

Simple

Perhaps the most straightforward way to split a time series is to simply pick a point in time. All data on or before that point are placed in the training data set and all data after it are placed in the testing data set. This way, you are always using the past to predict the future.

The initial_time_split() function can be used for this procedure. Specify the proportion of the data to be placed in the training data using the prop argument.

time_split_obj <- initial_time_split(chicago_tbl, prop = 0.7)

The initial_time_split() function creates an rsplit object, like the initial_split() function, with the same interpretation. Notice above that there is no need to specify a random seed because a random sample is not used.

time_split_obj
<Training/Testing/Total>
<256/110/366>

To extract the training data set, pass the rsplit object to the training() function.

train_ts_tbl <- training(time_split_obj)

train_ts_tbl
# A tibble: 256 × 2
   date       ridership
   <date>         <dbl>
 1 2012-01-01      3.88
 2 2012-01-02      4.19
 3 2012-01-03     17.0 
 4 2012-01-04     17.5 
 5 2012-01-05     18.4 
 6 2012-01-06     18.3 
 7 2012-01-07      5.11
 8 2012-01-08      4.24
 9 2012-01-09     18.6 
10 2012-01-10     18.9 
# … with 246 more rows

To extract the testing data set, pass the rsplit object to the testing() function.

test_ts_tbl <- testing(time_split_obj)

test_ts_tbl
# A tibble: 110 × 2
   date       ridership
   <date>         <dbl>
 1 2012-09-13     20.9 
 2 2012-09-14     20.2 
 3 2012-09-15      7.29
 4 2012-09-16      5.65
 5 2012-09-17     19.9 
 6 2012-09-18     21.2 
 7 2012-09-19     20.9 
 8 2012-09-20     21.3 
 9 2012-09-21     20.6 
10 2012-09-22      5.58
# … with 100 more rows

Rolling

Oftentimes, simply splitting the data into two groups at an arbitrary date is not good enough. There are more sophisticated methods of resampling a time series. For example, you can use a rolling origin. The previous link provides an excellent discussion of what a rolling origin is and how it is used.

There are a few ways to implement a rolling origin split. Like cross-validation, it splits a data set (e.g., the training data) into smaller chunks. As implemented below, the first chunk contains an initial block of data for the analysis set and a (usually smaller) block of chronologically later data for the assessment set. The next chunk adds the next chronological block of data to the analysis set (which may or may not be the entirety of the previous chunk’s assessment data) and uses a new assessment set of the same size as before. In other words, the analysis set is anchored to an initial point in time and grows with each successive chunk, while the assessment set stays the same size and moves forward in time so that it always sits immediately after the analysis set.

A rolling origin can be implemented using the rolling_origin() function. Specify the initial size of the analysis data set using the initial argument and the size of the assessment data sets using the assess argument. The skip argument may be a bit difficult to understand at first. Reading the documentation is helpful. In the example below, skip is set to 14 (the assessment data size minus 1) so that, for each chunk, the previous assessment set is added to the new analysis set.

Note also that there is a cumulative argument that defaults to TRUE. Setting it to FALSE does not anchor the rolling origin to the first data point chronologically. Instead, the analysis data set stays at its initial size and moves into the future with each successive chunk. In this way, the origin of each analysis data set truly “rolls” forward (a sketch of this variant appears at the end of this subsection).

ro_splits_obj <- rolling_origin(
  train_ts_tbl,
  initial = 30,
  assess  = 15,
  skip    = 14
)

The rolling_origin() function also creates an rset object with each data split and an identifier.

ro_splits_obj
# Rolling origin forecast resampling 
# A tibble: 15 × 2
   splits           id     
   <list>           <chr>  
 1 <split [30/15]>  Slice01
 2 <split [45/15]>  Slice02
 3 <split [60/15]>  Slice03
 4 <split [75/15]>  Slice04
 5 <split [90/15]>  Slice05
 6 <split [105/15]> Slice06
 7 <split [120/15]> Slice07
 8 <split [135/15]> Slice08
 9 <split [150/15]> Slice09
10 <split [165/15]> Slice10
11 <split [180/15]> Slice11
12 <split [195/15]> Slice12
13 <split [210/15]> Slice13
14 <split [225/15]> Slice14
15 <split [240/15]> Slice15

To extract the analysis data set, pass the individual rsplit object to the analysis() function. The first split contains 30 data points and ends at 2012-01-30. The second split contains 45 data points and ends at 2012-02-14.

Split 1

analysis(ro_splits_obj$splits[[1]])
# A tibble: 30 × 2
   date       ridership
   <date>         <dbl>
 1 2012-01-01      3.88
 2 2012-01-02      4.19
 3 2012-01-03     17.0 
 4 2012-01-04     17.5 
 5 2012-01-05     18.4 
 6 2012-01-06     18.3 
 7 2012-01-07      5.11
 8 2012-01-08      4.24
 9 2012-01-09     18.6 
10 2012-01-10     18.9 
# … with 20 more rows

Split 2

analysis(ro_splits_obj$splits[[2]])
# A tibble: 45 × 2
   date       ridership
   <date>         <dbl>
 1 2012-01-01      3.88
 2 2012-01-02      4.19
 3 2012-01-03     17.0 
 4 2012-01-04     17.5 
 5 2012-01-05     18.4 
 6 2012-01-06     18.3 
 7 2012-01-07      5.11
 8 2012-01-08      4.24
 9 2012-01-09     18.6 
10 2012-01-10     18.9 
# … with 35 more rows

To extract the assessment data set, pass the individual rsplit object to the assessment() function. Both splits contain 15 data points.

Split 1

assessment(ro_splits_obj$splits[[1]])
# A tibble: 15 × 2
   date       ridership
   <date>         <dbl>
 1 2012-01-31     19.6 
 2 2012-02-01     19.3 
 3 2012-02-02     19.2 
 4 2012-02-03     18.6 
 5 2012-02-04      4.77
 6 2012-02-05      3.34
 7 2012-02-06     18.7 
 8 2012-02-07     19.4 
 9 2012-02-08     19.4 
10 2012-02-09     19.6 
11 2012-02-10     18.8 
12 2012-02-11      4.56
13 2012-02-12      3.74
14 2012-02-13     15.7 
15 2012-02-14     19.8 

Split 2

assessment(ro_splits_obj$splits[[2]])
# A tibble: 15 × 2
   date       ridership
   <date>         <dbl>
 1 2012-02-15     19.9 
 2 2012-02-16     19.6 
 3 2012-02-17     19.2 
 4 2012-02-18      5.62
 5 2012-02-19      4.30
 6 2012-02-20     11.4 
 7 2012-02-21     20.0 
 8 2012-02-22     19.9 
 9 2012-02-23     19.9 
10 2012-02-24     18.4 
11 2012-02-25      5.31
12 2012-02-26      4.00
13 2012-02-27     19.3 
14 2012-02-28     19.3 
15 2012-02-29     19.8 
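
As noted above, setting cumulative = FALSE keeps each analysis set at its initial size so that the origin truly rolls forward. A minimal sketch of that variant, reusing the same initial, assess, and skip values as above (not evaluated here):

ro_rolling_splits_obj <- rolling_origin(
  train_ts_tbl,
  initial    = 30,
  assess     = 15,
  skip       = 14,
  cumulative = FALSE # fixed-size analysis window (30 rows) that moves forward in time
)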

Sliding

A sliding period is similar to a rolling origin. As implemented below, the sliding period breaks the training data up by month. The first split has the first month as the analysis data and the second month as the assessment data, the second split has the second month as the analysis data and the third month as the assessment data, and so on. While the previous rolling origin resample used a fixed number of days for each iteration, the sliding period below uses a specific time period (here, a month), which may or may not contain the same number of days in each iteration.

A sliding period can be implemented using the sliding_period() function. Specify the date index using the index argument and the period by which to split using the period argument. There are more arguments for fine-tuning that are detailed in the documentation. Below is a very simple implementation of a sliding period.

sp_splits_obj <- sliding_period(
  train_ts_tbl,
  index  = date,
  period = "month"
)

The sliding_period() function also creates an rset object with each data split and an identifier.

sp_splits_obj
# Sliding period resampling 
# A tibble: 8 × 2
  splits          id    
  <list>          <chr> 
1 <split [31/29]> Slice1
2 <split [29/31]> Slice2
3 <split [31/30]> Slice3
4 <split [30/31]> Slice4
5 <split [31/30]> Slice5
6 <split [30/31]> Slice6
7 <split [31/31]> Slice7
8 <split [31/12]> Slice8

To extract the analysis data set, pass the individual rsplit object to the analysis() function. Notice that each successive analysis set is the next full calendar month.

Split 1

analysis(sp_splits_obj$splits[[1]])
# A tibble: 31 × 2
   date       ridership
   <date>         <dbl>
 1 2012-01-01      3.88
 2 2012-01-02      4.19
 3 2012-01-03     17.0 
 4 2012-01-04     17.5 
 5 2012-01-05     18.4 
 6 2012-01-06     18.3 
 7 2012-01-07      5.11
 8 2012-01-08      4.24
 9 2012-01-09     18.6 
10 2012-01-10     18.9 
# … with 21 more rows

Split 2

analysis(sp_splits_obj$splits[[2]])
# A tibble: 29 × 2
   date       ridership
   <date>         <dbl>
 1 2012-02-01     19.3 
 2 2012-02-02     19.2 
 3 2012-02-03     18.6 
 4 2012-02-04      4.77
 5 2012-02-05      3.34
 6 2012-02-06     18.7 
 7 2012-02-07     19.4 
 8 2012-02-08     19.4 
 9 2012-02-09     19.6 
10 2012-02-10     18.8 
# … with 19 more rows

To extract the assessment data set, pass the individual rsplit object to the assessment() function. Notice that each successive assessment set is the next full calendar month.

Split 1

assessment(sp_splits_obj$splits[[1]])
# A tibble: 29 × 2
   date       ridership
   <date>         <dbl>
 1 2012-02-01     19.3 
 2 2012-02-02     19.2 
 3 2012-02-03     18.6 
 4 2012-02-04      4.77
 5 2012-02-05      3.34
 6 2012-02-06     18.7 
 7 2012-02-07     19.4 
 8 2012-02-08     19.4 
 9 2012-02-09     19.6 
10 2012-02-10     18.8 
# … with 19 more rows

Split 2

assessment(sp_splits_obj$splits[[2]])
# A tibble: 31 × 2
   date       ridership
   <date>         <dbl>
 1 2012-03-01     19.8 
 2 2012-03-02     19.1 
 3 2012-03-03      5.18
 4 2012-03-04      4.12
 5 2012-03-05     16.8 
 6 2012-03-06     20.6 
 7 2012-03-07     20.1 
 8 2012-03-08     20.1 
 9 2012-03-09     19.0 
10 2012-03-10      6.00
# … with 21 more rows
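
One of the fine-tuning arguments mentioned above is lookback, which controls how many earlier periods are pulled into each analysis set. A minimal sketch (not evaluated here) that would give each split a three-month analysis window followed by a one-month assessment window:

sp_lookback_splits_obj <- sliding_period(
  train_ts_tbl,
  index    = date,
  period   = "month",
  lookback = 2 # current month plus the two prior months in each analysis set
)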

Bootstrapping

Bootstrapping is similar to Monte Carlo Cross-Validation, but with one key difference. While Monte Carlo randomly samples the original data set without replacement for each split, Bootstrapping randomly samples with replacement. Because there is no replacement, the Monte Carlo analysis and assessment data sets are always the same size for each fold. That is not true for Bootstrapping. A single data point can be included multiple times in a single fold’s analysis data, and the more points that are duplicated, the more unique points are left over for the assessment data set (which contains all data points not chosen for the analysis data set). Therefore, bootstrapped analysis data sets are always the same size, but their assessment data sets will not necessarily be the same size.

A bootstrapped resample can be implemented using the bootstraps() function. Specify the number of splits to make using the times argument. There are more arguments for fine-tuning that are outlined in the documentation. Below is a very simple implementation of a bootstrapped resample.

set.seed(1917)
bootstrap_splits_obj <- bootstraps(train_tbl, times = 10)

The bootstraps() function creates an rset object, interpreted in the same way as the others.

bootstrap_splits_obj
# Bootstrap sampling 
# A tibble: 10 × 2
   splits           id         
   <list>           <chr>      
 1 <split [233/79]> Bootstrap01
 2 <split [233/84]> Bootstrap02
 3 <split [233/85]> Bootstrap03
 4 <split [233/78]> Bootstrap04
 5 <split [233/87]> Bootstrap05
 6 <split [233/91]> Bootstrap06
 7 <split [233/83]> Bootstrap07
 8 <split [233/89]> Bootstrap08
 9 <split [233/86]> Bootstrap09
10 <split [233/74]> Bootstrap10

To extract the analysis data set, pass the individual rsplit object to the analysis() function.

analysis(bootstrap_splits_obj$splits[[1]])
# A tibble: 233 × 7
   species   island    bill_length_mm bill_depth_mm flipper_leng…¹ body_…² sex  
   <fct>     <fct>              <dbl>         <dbl>          <int>   <int> <fct>
 1 Chinstrap Dream               50.2          18.8            202    3800 male 
 2 Adelie    Torgersen           35.2          15.9            186    3050 fema…
 3 Chinstrap Dream               50.5          18.4            200    3400 fema…
 4 Adelie    Biscoe              35.7          16.9            185    3150 fema…
 5 Gentoo    Biscoe              46.5          13.5            210    4550 fema…
 6 Chinstrap Dream               45.7          17.3            193    3600 fema…
 7 Adelie    Dream               33.1          16.1            178    2900 fema…
 8 Gentoo    Biscoe              46.1          13.2            211    4500 fema…
 9 Adelie    Dream               38.8          20              190    3950 male 
10 Chinstrap Dream               50.5          18.4            200    3400 fema…
# … with 223 more rows, and abbreviated variable names ¹​flipper_length_mm,
#   ²​body_mass_g

To extract the assessment data set, pass the individual rsplit object to the assessment() function.

assessment(bootstrap_splits_obj$splits[[1]])
# A tibble: 79 × 7
   species   island    bill_length_mm bill_depth_mm flipper_leng…¹ body_…² sex  
   <fct>     <fct>              <dbl>         <dbl>          <int>   <int> <fct>
 1 Gentoo    Biscoe              52.1          17              230    5550 male 
 2 Gentoo    Biscoe              45.5          14.5            212    4750 fema…
 3 Gentoo    Biscoe              49.3          15.7            217    5850 male 
 4 Gentoo    Biscoe              45.2          13.8            215    4750 fema…
 5 Gentoo    Biscoe              45.3          13.7            210    4300 fema…
 6 Adelie    Biscoe              40.5          18.9            180    3950 male 
 7 Adelie    Dream               36            17.9            190    3450 fema…
 8 Chinstrap Dream               50            19.5            196    3900 male 
 9 Gentoo    Biscoe              49.1          14.8            220    5150 fema…
10 Adelie    Torgersen           44.1          18              210    4000 male 
# … with 69 more rows, and abbreviated variable names ¹​flipper_length_mm,
#   ²​body_mass_g
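
Beyond model validation, a common use of bootstrap resamples is approximating the sampling distribution of a statistic. A minimal sketch (with only 10 resamples this is illustrative rather than precise) that computes the mean body mass on each analysis set and summarizes the spread:

boot_means <- sapply(bootstrap_splits_obj$splits, function(split) {
  # mean body mass within this bootstrap (analysis) sample
  mean(analysis(split)$body_mass_g)
})

quantile(boot_means, probs = c(0.025, 0.975)) # rough percentile interval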

Notes

This post is based on a presentation that was given on the date listed. It may be updated from time to time to fix errors, detail new functions, and/or remove deprecated functions, so the packages and R version will likely be newer than what was available at the time.

The R session information used for this post:

R version 4.2.1 (2022-06-23)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 14.0

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] dplyr_1.0.10  rsample_1.1.1

loaded via a namespace (and not attached):
 [1] pillar_1.8.1      compiler_4.2.1    tools_4.2.1       digest_0.6.29    
 [5] jsonlite_1.8.0    evaluate_0.16     lifecycle_1.0.3   tibble_3.1.8     
 [9] pkgconfig_2.0.3   rlang_1.1.1       cli_3.6.1         rstudioapi_0.14  
[13] yaml_2.3.5        parallel_4.2.1    xfun_0.40         warp_0.2.0       
[17] fastmap_1.1.0     furrr_0.3.1       withr_2.5.0       stringr_1.5.0    
[21] knitr_1.40        generics_0.1.3    vctrs_0.6.3       globals_0.16.2   
[25] tidyselect_1.2.0  glue_1.6.2        listenv_0.8.0     R6_2.5.1         
[29] fansi_1.0.3       parallelly_1.32.1 rmarkdown_2.16    slider_0.3.0     
[33] purrr_0.3.5       tidyr_1.2.1       magrittr_2.0.3    codetools_0.2-18 
[37] htmltools_0.5.3   future_1.29.0     renv_0.16.0       utf8_1.2.2       
[41] stringi_1.7.12