This post covers the recipes package, which is all about feature engineering.

Setup

Packages

The following packages are required:

library(recipes)
library(rsample)
library(dplyr)

Data

Throughout this series, I use the penguins data set from the modeldata package.

data("penguins", package = "modeldata")

penguins_tbl <- as_tibble(penguins)

penguins_tbl

# A tibble: 344 × 7
   species island    bill_length_mm bill_depth_mm flipper_length…¹ body_…² sex  
   <fct>   <fct>              <dbl>         <dbl>            <int>   <int> <fct>
 1 Adelie  Torgersen           39.1          18.7              181    3750 male 
 2 Adelie  Torgersen           39.5          17.4              186    3800 fema…
 3 Adelie  Torgersen           40.3          18                195    3250 fema…
 4 Adelie  Torgersen           NA            NA                 NA      NA <NA> 
 5 Adelie  Torgersen           36.7          19.3              193    3450 fema…
 6 Adelie  Torgersen           39.3          20.6              190    3650 male 
 7 Adelie  Torgersen           38.9          17.8              181    3625 fema…
 8 Adelie  Torgersen           39.2          19.6              195    4675 male 
 9 Adelie  Torgersen           34.1          18.1              193    3475 <NA> 
10 Adelie  Torgersen           42            20.2              190    4250 <NA> 
# … with 334 more rows, and abbreviated variable names ¹flipper_length_mm,
#   ²body_mass_g

As in the previous post on the rsample package, we take a 70/30 train/test split with initial_split() and extract the training and testing data sets with training() and testing().

set.seed(1914)
initial_split_obj <- initial_split(penguins_tbl, prop = 0.7)

train_tbl <- training(initial_split_obj)
test_tbl  <- testing(initial_split_obj)

train_tbl

# A tibble: 240 × 7
   species   island bill_length_mm bill_depth_mm flipper_length_mm body_…¹ sex  
   <fct>     <fct>           <dbl>         <dbl>             <int>   <int> <fct>
 1 Adelie    Biscoe           36.5          16.6               181    2850 fema…
 2 Gentoo    Biscoe           52.5          15.6               221    5450 male 
 3 Chinstrap Dream            49.7          18.6               195    3600 male 
 4 Adelie    Biscoe           40.6          18.6               183    3550 male 
 5 Gentoo    Biscoe           44.9          13.8               212    4750 fema…
 6 Adelie    Biscoe           39.6          17.7               186    3500 fema…
 7 Gentoo    Biscoe           45.8          14.2               219    4700 fema…
 8 Chinstrap Dream            45.9          17.1               190    3575 fema…
 9 Gentoo    Biscoe           46.1          13.2               211    4500 fema…
10 Adelie    Biscoe           37.7          16                 183    3075 fema…
# … with 230 more rows, and abbreviated variable name ¹body_mass_g

test_tbl

# A tibble: 104 × 7
   species island    bill_length_mm bill_depth_mm flipper_length…¹ body_…² sex  
   <fct>   <fct>              <dbl>         <dbl>            <int>   <int> <fct>
 1 Adelie  Torgersen           40.3          18                195    3250 fema…
 2 Adelie  Torgersen           38.9          17.8              181    3625 fema…
 3 Adelie  Torgersen           37.8          17.3              180    3700 <NA> 
 4 Adelie  Torgersen           34.6          21.1              198    4400 male 
 5 Adelie  Torgersen           36.6          17.8              185    3700 fema…
 6 Adelie  Torgersen           38.7          19                195    3450 fema…
 7 Adelie  Torgersen           34.4          18.4              184    3325 fema…
 8 Adelie  Biscoe              37.7          18.7              180    3600 male 
 9 Adelie  Biscoe              40.5          18.9              180    3950 male 
10 Adelie  Dream               39.5          17.8              188    3300 fema…
# … with 94 more rows, and abbreviated variable names ¹flipper_length_mm,
#   ²body_mass_g

Basic Functionality

The recipes package provides methods for chaining together data pre-processing steps using a pipe (e.g., %>%), in much the same way that the dplyr package provides data wrangling verbs that can be piped together to combine multiple steps into one coherent, easy-to-read statement.

To begin a recipe, use the recipe() function and provide as arguments the model formula (e.g., as if you were supplying a formula() to the lm() function) and the data set you want to process (typically, the training data). Printing out the recipe shows how many outcome (response) and predictor variables you have specified in your model formula. The recipes package applies “roles” to each variable based on the model formula you specify.

recipe_obj <- recipe(species ~ ., train_tbl)

recipe_obj

Recipe

Inputs:

      role #variables
   outcome          1
 predictor          6
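The roles assigned from the formula can also be inspected directly. As a minimal sketch, using a small made-up data frame (toy_tbl and its columns are illustrative and not part of the penguins data), calling summary() on a recipe returns one row per variable together with its role:

```r
library(recipes)
library(dplyr)  # provides the %>% pipe used throughout this post

# Hypothetical toy data: one outcome (y) and two predictors (x1, x2)
toy_tbl <- data.frame(
  y  = c(1.2, 3.4, 5.6, 7.8),
  x1 = c(10, 20, 30, 40),
  x2 = factor(c("a", "b", "a", "b"))
)

toy_recipe <- recipe(y ~ ., data = toy_tbl)

# One row per variable, including a `role` column
role_tbl <- summary(toy_recipe)
role_tbl

# Roles implied by the formula: y is the outcome, everything else a predictor
outcome_vars   <- role_tbl$variable[role_tbl$role == "outcome"]
predictor_vars <- role_tbl$variable[role_tbl$role == "predictor"]
```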

Once the recipe object is created, pre-processing steps can be added to it sequentially. After all steps have been added, the recipe is passed to the prep() function, which estimates any parameters the steps require (by default, from the data supplied to recipe()) so that the steps can be applied to a data set. Printing a prepared recipe also reports additional information, such as the number of data points and how many rows contain at least one missing value.

recipe_obj %>% 
  prep()

Recipe

Inputs:

      role #variables
   outcome          1
 predictor          6

Training data contained 240 data points and 9 incomplete rows.

Once a recipe has been prepared, it can be applied to a data set. To get the processed version of the data set originally supplied to the recipe() function, use the juice() function, which extracts the transformed training data. (In recent versions of recipes, juice() is superseded by bake() with new_data = NULL, which returns the same result.) To apply the pre-processing steps to any other data set, use the bake() function, supplying the data set to the new_data argument.

recipe_obj %>% 
  prep() %>% 
  juice()

# A tibble: 240 × 7
   island bill_length_mm bill_depth_mm flipper_length_mm body_ma…¹ sex   species
   <fct>           <dbl>         <dbl>             <int>     <int> <fct> <fct>  
 1 Biscoe           36.5          16.6               181      2850 fema… Adelie 
 2 Biscoe           52.5          15.6               221      5450 male  Gentoo 
 3 Dream            49.7          18.6               195      3600 male  Chinst…
 4 Biscoe           40.6          18.6               183      3550 male  Adelie 
 5 Biscoe           44.9          13.8               212      4750 fema… Gentoo 
 6 Biscoe           39.6          17.7               186      3500 fema… Adelie 
 7 Biscoe           45.8          14.2               219      4700 fema… Gentoo 
 8 Dream            45.9          17.1               190      3575 fema… Chinst…
 9 Biscoe           46.1          13.2               211      4500 fema… Gentoo 
10 Biscoe           37.7          16                 183      3075 fema… Adelie 
# … with 230 more rows, and abbreviated variable name ¹body_mass_g

recipe_obj %>% 
  prep() %>% 
  bake(new_data = test_tbl)

# A tibble: 104 × 7
   island    bill_length_mm bill_depth_mm flipper_length…¹ body_…² sex   species
   <fct>              <dbl>         <dbl>            <int>   <int> <fct> <fct>  
 1 Torgersen           40.3          18                195    3250 fema… Adelie 
 2 Torgersen           38.9          17.8              181    3625 fema… Adelie 
 3 Torgersen           37.8          17.3              180    3700 <NA>  Adelie 
 4 Torgersen           34.6          21.1              198    4400 male  Adelie 
 5 Torgersen           36.6          17.8              185    3700 fema… Adelie 
 6 Torgersen           38.7          19                195    3450 fema… Adelie 
 7 Torgersen           34.4          18.4              184    3325 fema… Adelie 
 8 Biscoe              37.7          18.7              180    3600 male  Adelie 
 9 Biscoe              40.5          18.9              180    3950 male  Adelie 
10 Dream               39.5          17.8              188    3300 fema… Adelie 
# … with 94 more rows, and abbreviated variable names ¹flipper_length_mm,
#   ²body_mass_g
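As a small check of how the two extraction functions relate, the sketch below uses made-up data (toy_train is hypothetical) and compares juice() with bake(new_data = NULL) on the same prepared recipe. It assumes a recipes version recent enough to accept new_data = NULL.

```r
library(recipes)
library(dplyr)  # provides the %>% pipe

# Hypothetical toy training data
toy_train <- data.frame(y = c(1, 2, 3, 4), x = c(10, 20, 30, 40))

prepped <- recipe(y ~ x, data = toy_train) %>%
  step_center(x) %>%
  prep()

# Both calls return the processed training data kept inside the prepared recipe
juiced <- juice(prepped)
baked  <- bake(prepped, new_data = NULL)

same_result <- isTRUE(all.equal(as.data.frame(juiced), as.data.frame(baked)))
same_result
```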

Missing Values

One of the most common data pre-processing steps is dealing with missing values. For the penguins data set, both the training and the testing data sets have missing values.

train_tbl %>% 
  filter(if_any(everything(), is.na))

# A tibble: 9 × 7
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_…¹ sex  
  <fct>   <fct>              <dbl>         <dbl>             <int>   <int> <fct>
1 Adelie  Torgersen           NA            NA                  NA      NA <NA> 
2 Gentoo  Biscoe              44.5          14.3               216    4100 <NA> 
3 Adelie  Torgersen           34.1          18.1               193    3475 <NA> 
4 Gentoo  Biscoe              NA            NA                  NA      NA <NA> 
5 Adelie  Torgersen           37.8          17.1               186    3300 <NA> 
6 Gentoo  Biscoe              44.5          15.7               217    4875 <NA> 
7 Adelie  Torgersen           42            20.2               190    4250 <NA> 
8 Adelie  Dream               37.5          18.9               179    2975 <NA> 
9 Gentoo  Biscoe              47.3          13.8               216    4725 <NA> 
# … with abbreviated variable name ¹body_mass_g

test_tbl %>% 
  filter(if_any(everything(), is.na))

# A tibble: 2 × 7
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_…¹ sex  
  <fct>   <fct>              <dbl>         <dbl>             <int>   <int> <fct>
1 Adelie  Torgersen           37.8          17.3               180    3700 <NA> 
2 Gentoo  Biscoe              46.2          14.4               214    4650 <NA> 
# … with abbreviated variable name ¹body_mass_g

The recipes package contains several steps to deal with missing values. The simplest approach is to remove rows with missing values using step_naomit(). The example below removes rows in the training data that have NA in the bill_length_mm column. Printing a recipe after adding steps shows, under “Operations:”, a running list of the pre-processing steps in the order they will be applied, along with the column(s) each one affects.

recipe_01_obj <- recipe_obj %>% 
  step_naomit(bill_length_mm)

recipe_01_obj

Recipe

Inputs:

      role #variables
   outcome          1
 predictor          6

Operations:

Removing rows with NA values in bill_length_mm

Passing this new recipe to the prep() and juice() functions shows fewer rows than were in the training data set, demonstrating that the rows with missing values were in fact removed.

recipe_01_obj %>% 
  prep() %>% 
  juice()

# A tibble: 238 × 7
   island bill_length_mm bill_depth_mm flipper_length_mm body_ma…¹ sex   species
   <fct>           <dbl>         <dbl>             <int>     <int> <fct> <fct>  
 1 Biscoe           36.5          16.6               181      2850 fema… Adelie 
 2 Biscoe           52.5          15.6               221      5450 male  Gentoo 
 3 Dream            49.7          18.6               195      3600 male  Chinst…
 4 Biscoe           40.6          18.6               183      3550 male  Adelie 
 5 Biscoe           44.9          13.8               212      4750 fema… Gentoo 
 6 Biscoe           39.6          17.7               186      3500 fema… Adelie 
 7 Biscoe           45.8          14.2               219      4700 fema… Gentoo 
 8 Dream            45.9          17.1               190      3575 fema… Chinst…
 9 Biscoe           46.1          13.2               211      4500 fema… Gentoo 
10 Biscoe           37.7          16                 183      3075 fema… Adelie 
# … with 228 more rows, and abbreviated variable name ¹body_mass_g
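One detail worth knowing when removing rows is the skip argument that every step accepts. The sketch below, on made-up data (toy_train and toy_new are hypothetical), sets skip explicitly rather than relying on the default, which I believe is TRUE for step_naomit() in recent versions. With skip = TRUE the rows are dropped from the training data only; with skip = FALSE they are also dropped when baking new data.

```r
library(recipes)
library(dplyr)  # provides the %>% pipe

# Hypothetical toy data with one missing predictor value in each set
toy_train <- data.frame(y = c(1, 2, 3, 4), x = c(10, NA, 30, 40))
toy_new   <- data.frame(y = c(5, 6),       x = c(NA, 60))

# skip = TRUE: the step runs during prep() but is skipped by bake() on new data
rec_skip <- recipe(y ~ x, data = toy_train) %>%
  step_naomit(x, skip = TRUE) %>%
  prep()

# skip = FALSE: the step also runs when baking new data
rec_noskip <- recipe(y ~ x, data = toy_train) %>%
  step_naomit(x, skip = FALSE) %>%
  prep()

n_train      <- nrow(bake(rec_skip, new_data = NULL))       # training rows kept
n_new_skip   <- nrow(bake(rec_skip, new_data = toy_new))    # new rows, step skipped
n_new_noskip <- nrow(bake(rec_noskip, new_data = toy_new))  # new rows, step applied

c(n_train, n_new_skip, n_new_noskip)
```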

While simply removing the missing values is an easy approach to dealing with them, there are times when removing data points is not desired. In those cases, the solution is often to impute the missing values; that is, estimate what those missing values would have been had they not been missing. The recipes package offers several ways to impute missing values. In the recipe below, missing values are imputed using:

step_impute_mean(): the mean of the other values for that feature (a numeric feature only)
step_impute_median(): the median of the other values for that feature (a numeric feature only)
step_impute_knn(): a K-Nearest Neighbors model (using the other features as predictors)
step_impute_linear(): a Linear Regression (using only the island feature by specifying it with the imp_vars() function in the impute_with argument)
step_impute_bag(): a Bagged Tree Model (using all other features as predictors)

Printing out the new recipe shows these steps and the features to which they are applied.

recipe_02_obj <- recipe_obj %>% 
  step_impute_mean(flipper_length_mm) %>% 
  step_impute_median(body_mass_g) %>% 
  
  step_impute_knn(bill_length_mm) %>% 
  step_impute_linear(bill_depth_mm, impute_with = imp_vars(island)) %>% 
  step_impute_bag(sex)

recipe_02_obj

Recipe

Inputs:

      role #variables
   outcome          1
 predictor          6

Operations:

Mean imputation for flipper_length_mm
Median imputation for body_mass_g
K-nearest neighbor imputation for bill_length_mm
Linear regression imputation for bill_depth_mm
Bagged tree imputation for sex

Passing this new recipe to the prep() and juice() functions, and searching for the rows in the training data that had missing values, shows that the formerly missing values have now been imputed.

train_tbl %>% 
  mutate(row_id = row_number()) %>% 
  filter(if_any(everything(), is.na))

# A tibble: 9 × 8
  species island    bill_length_mm bill_depth_mm flipper_…¹ body_…² sex   row_id
  <fct>   <fct>              <dbl>         <dbl>      <int>   <int> <fct>  <int>
1 Adelie  Torgersen           NA            NA           NA      NA <NA>      35
2 Gentoo  Biscoe              44.5          14.3        216    4100 <NA>      38
3 Adelie  Torgersen           34.1          18.1        193    3475 <NA>      43
4 Gentoo  Biscoe              NA            NA           NA      NA <NA>      76
5 Adelie  Torgersen           37.8          17.1        186    3300 <NA>     124
6 Gentoo  Biscoe              44.5          15.7        217    4875 <NA>     143
7 Adelie  Torgersen           42            20.2        190    4250 <NA>     161
8 Adelie  Dream               37.5          18.9        179    2975 <NA>     169
9 Gentoo  Biscoe              47.3          13.8        216    4725 <NA>     170
# … with abbreviated variable names ¹flipper_length_mm, ²body_mass_g

na_rows_int <- train_tbl %>% 
  mutate(row_id = row_number()) %>% 
  filter(if_any(everything(), is.na)) %>% 
  pull(row_id)

recipe_02_obj %>% 
  prep() %>% 
  juice() %>% 
  
  mutate(row_id = row_number()) %>% 
  filter(row_id %in% na_rows_int) %>% 
  select(species, everything())

# A tibble: 9 × 8
  species island    bill_length_mm bill_depth_mm flipper_…¹ body_…² sex   row_id
  <fct>   <fct>              <dbl>         <dbl>      <int>   <int> <fct>  <int>
1 Adelie  Torgersen           41.5          18.4        201    4025 male      35
2 Gentoo  Biscoe              44.5          14.3        216    4100 fema…     38
3 Adelie  Torgersen           34.1          18.1        193    3475 fema…     43
4 Gentoo  Biscoe              40.7          15.8        201    4025 fema…     76
5 Adelie  Torgersen           37.8          17.1        186    3300 fema…    124
6 Gentoo  Biscoe              44.5          15.7        217    4875 fema…    143
7 Adelie  Torgersen           42            20.2        190    4250 male     161
8 Adelie  Dream               37.5          18.9        179    2975 fema…    169
9 Gentoo  Biscoe              47.3          13.8        216    4725 fema…    170
# … with abbreviated variable names ¹flipper_length_mm, ²body_mass_g
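To make the mechanics of imputation concrete, here is a minimal sketch on made-up data (toy_train and toy_new are hypothetical). It illustrates that step_impute_mean() learns the mean from the training data during prep() and reuses that same value when baking new data.

```r
library(recipes)
library(dplyr)  # provides the %>% pipe

# Hypothetical toy data: the non-missing training values of x average to 20
toy_train <- data.frame(y = c(1, 2, 3, 4), x = c(10, 20, NA, 30))
toy_new   <- data.frame(y = 5, x = NA_real_)

impute_prepped <- recipe(y ~ x, data = toy_train) %>%
  step_impute_mean(x) %>%
  prep()

# The missing training value is replaced by the training mean
train_x <- bake(impute_prepped, new_data = NULL)$x

# The missing value in the new data is replaced by the same training mean
new_x <- bake(impute_prepped, new_data = toy_new)$x

train_x
new_x
```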

Individual Transformations

After handling missing values, another extremely common data pre-processing step is applying transformations to individual features. Some typical ones include:

Square Root: step_sqrt()
Logarithm: step_log()
Inverse: step_inverse()

These transformations are applied in the example below. Printing out this recipe shows the steps applied and to which feature. Notice that missing values are removed first, specifically missing values in all_numeric_predictors(). This selector is a quick way to apply a step to, as its name implies, all numeric predictors. While this may not save much time in a small data set like the one used here, it is quite convenient when a data set has dozens, hundreds, or thousands of features.

recipe_03_obj <- recipe_obj %>% 
  step_naomit(all_numeric_predictors()) %>% 
  
  step_sqrt(bill_length_mm) %>% 
  step_log(bill_depth_mm) %>% 
  step_log(flipper_length_mm) %>% 
  step_inverse(body_mass_g)

recipe_03_obj

Recipe

Inputs:

      role #variables
   outcome          1
 predictor          6

Operations:

Removing rows with NA values in all_numeric_predictors()
Square root transformation on bill_length_mm
Log transformation on bill_depth_mm
Log transformation on flipper_length_mm
Inverse transformation on body_mass_g

Passing this new recipe to the prep() and juice() functions shows that the selected features were transformed.

recipe_03_obj %>% 
  prep() %>% 
  juice()

# A tibble: 238 × 7
   island bill_length_mm bill_depth_mm flipper_length_mm body_ma…¹ sex   species
   <fct>           <dbl>         <dbl>             <dbl>     <dbl> <fct> <fct>  
 1 Biscoe           6.04          2.81              5.20  0.000351 fema… Adelie 
 2 Biscoe           7.25          2.75              5.40  0.000183 male  Gentoo 
 3 Dream            7.05          2.92              5.27  0.000278 male  Chinst…
 4 Biscoe           6.37          2.92              5.21  0.000282 male  Adelie 
 5 Biscoe           6.70          2.62              5.36  0.000211 fema… Gentoo 
 6 Biscoe           6.29          2.87              5.23  0.000286 fema… Adelie 
 7 Biscoe           6.77          2.65              5.39  0.000213 fema… Gentoo 
 8 Dream            6.77          2.84              5.25  0.000280 fema… Chinst…
 9 Biscoe           6.79          2.58              5.35  0.000222 fema… Gentoo 
10 Biscoe           6.14          2.77              5.21  0.000325 fema… Adelie 
# … with 228 more rows, and abbreviated variable name ¹body_mass_g

As with the step_impute_linear() step used earlier, some of the individual transformation functions provide additional arguments that change their behavior. For example, step_log() has an offset argument (a value added to each data point before taking the log, used to avoid taking the log of zero, with a default of 0) and a base argument (the base of the logarithm, with a default of exp(1), which gives the natural logarithm). Similarly, step_inverse() has an offset argument, which works the same way and is used to avoid calculating 1 / 0.

recipe_04_obj <- recipe_obj %>% 
  step_naomit(all_numeric_predictors()) %>% 
  
  step_sqrt(bill_length_mm) %>% 
  step_log(bill_depth_mm, offset = 1) %>% 
  step_log(flipper_length_mm, base = 10) %>% 
  step_inverse(body_mass_g, offset = 100)

recipe_04_obj

Recipe

Inputs:

      role #variables
   outcome          1
 predictor          6

Operations:

Removing rows with NA values in all_numeric_predictors()
Square root transformation on bill_length_mm
Log transformation on bill_depth_mm
Log transformation on flipper_length_mm
Inverse transformation on body_mass_g

Passing this new recipe to the prep() and juice() functions shows different results for bill_depth_mm, flipper_length_mm, and body_mass_g because the additional arguments were used.

recipe_04_obj %>% 
  prep() %>% 
  juice()

# A tibble: 238 × 7
   island bill_length_mm bill_depth_mm flipper_length_mm body_ma…¹ sex   species
   <fct>           <dbl>         <dbl>             <dbl>     <dbl> <fct> <fct>  
 1 Biscoe           6.04          2.87              2.26  0.000339 fema… Adelie 
 2 Biscoe           7.25          2.81              2.34  0.000180 male  Gentoo 
 3 Dream            7.05          2.98              2.29  0.000270 male  Chinst…
 4 Biscoe           6.37          2.98              2.26  0.000274 male  Adelie 
 5 Biscoe           6.70          2.69              2.33  0.000206 fema… Gentoo 
 6 Biscoe           6.29          2.93              2.27  0.000278 fema… Adelie 
 7 Biscoe           6.77          2.72              2.34  0.000208 fema… Gentoo 
 8 Dream            6.77          2.90              2.28  0.000272 fema… Chinst…
 9 Biscoe           6.79          2.65              2.32  0.000217 fema… Gentoo 
10 Biscoe           6.14          2.83              2.26  0.000315 fema… Adelie 
# … with 228 more rows, and abbreviated variable name ¹body_mass_g
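The effect of these arguments is easiest to see on values where the answer is known in advance. The sketch below uses made-up data (toy_tbl is hypothetical) chosen so that log10(a + 1) and 1 / (b + 1) come out as round numbers.

```r
library(recipes)
library(dplyr)  # provides the %>% pipe

# Hypothetical toy data; a and b both contain a zero on purpose
toy_tbl <- data.frame(
  y = c(1, 2, 3),
  a = c(0, 9, 99),
  b = c(1, 3, 0)
)

transformed_tbl <- recipe(y ~ ., data = toy_tbl) %>%
  step_log(a, offset = 1, base = 10) %>%  # log10(a + 1)
  step_inverse(b, offset = 1) %>%         # 1 / (b + 1)
  prep() %>%
  bake(new_data = NULL)

transformed_tbl$a  # log10 of 1, 10, 100
transformed_tbl$b  # 1/2, 1/4, 1/1
```

Without the offsets, the zero in a would produce -Inf and the zero in b would produce Inf.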

Another common individual transformation is the Box-Cox transformation. The Box-Cox transformation is a power transformation that requires all values to be positive and is typically used to make the distribution of a skewed variable closer to normal. It relies on a parameter lambda that dictates which power transformation will be done.

Lambda	Transformation
-3	Inverse Cube
-2	Inverse Square
-1	Inverse
-0.5	Inverse Square Root
0	Logarithm
0.5	Square Root
1	No Transformation
2	Square
3	Cube
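The table can be reproduced with a few lines of base R. The function below is a sketch of the textbook Box-Cox formula, (x^lambda - 1) / lambda with log(x) at lambda = 0; box_cox() is a hypothetical helper written for this illustration and is not claimed to be the exact code recipes runs internally.

```r
# Textbook Box-Cox transformation (illustrative helper, not the recipes internals)
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))  # Box-Cox is only defined for positive values
  if (lambda == 0) {
    log(x)
  } else {
    (x^lambda - 1) / lambda
  }
}

x <- c(1, 4, 9)

bc_log     <- box_cox(x, 0)    # the logarithm row of the table
bc_sqrt    <- box_cox(x, 0.5)  # 2 * (sqrt(x) - 1): a shifted, rescaled square root
bc_none    <- box_cox(x, 1)    # x - 1: only a shift, so the shape is unchanged
bc_inverse <- box_cox(x, -1)   # 1 - 1 / x: a shifted, reflected inverse

bc_sqrt
bc_none
```

The shifts and rescalings are why the transformed values differ from a plain sqrt() or 1 / x even though the shape of the transformation is the same.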

The recipes package provides the step_BoxCox() step to apply a Box-Cox transformation to one or more features. By default, prep() estimates an optimal lambda for each feature from the training data, searching between -5 and 5; different bounds can be supplied through the limits argument. Note that the package documentation describes the lambdas argument as a placeholder that is NULL until prep() computes it, so a value passed there (as in lambdas = 3 below) should not be relied on to fix the transformation.

recipe_05_obj <- recipe_obj %>% 
  step_naomit(flipper_length_mm) %>% 
  
  step_BoxCox(bill_length_mm) %>% 
  step_BoxCox(bill_depth_mm, lambdas = 3) %>% 
  step_BoxCox(flipper_length_mm, limits = c(-3, 0))

recipe_05_obj

Recipe

Inputs:

      role #variables
   outcome          1
 predictor          6

Operations:

Removing rows with NA values in flipper_length_mm
Box-Cox transformation on bill_length_mm
Box-Cox transformation on bill_depth_mm
Box-Cox transformation on flipper_length_mm

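Passing this new recipe to the prep() and juice() functions shows the transformed values for bill_length_mm, bill_depth_mm, and flipper_length_mm. Note that the bill_depth_mm values are far smaller than a cubic transformation would give (a bill depth of 16.6 becomes 13.7), which indicates that lambda was estimated by prep() rather than taken from lambdas = 3.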

recipe_05_obj %>% 
  prep() %>% 
  juice()

# A tibble: 238 × 7
   island bill_length_mm bill_depth_mm flipper_length_mm body_ma…¹ sex   species
   <fct>           <dbl>         <dbl>             <dbl>     <int> <fct> <fct>  
 1 Biscoe           12.4          13.7             0.661      2850 fema… Adelie 
 2 Biscoe           15.8          12.9             0.662      5450 male  Gentoo 
 3 Dream            15.2          15.4             0.661      3600 male  Chinst…
 4 Biscoe           13.3          15.4             0.661      3550 male  Adelie 
 5 Biscoe           14.2          11.4             0.661      4750 fema… Gentoo 
 6 Biscoe           13.1          14.6             0.661      3500 fema… Adelie 
 7 Biscoe           14.4          11.7             0.662      4700 fema… Gentoo 
 8 Dream            14.4          14.1             0.661      3575 fema… Chinst…
 9 Biscoe           14.5          10.9             0.661      4500 fema… Gentoo 
10 Biscoe           12.7          13.2             0.661      3075 fema… Adelie 
# … with 228 more rows, and abbreviated variable name ¹body_mass_g
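To see which lambda was actually used, the prepared recipe can be inspected with tidy(). The sketch below uses the built-in mtcars data rather than the penguins data so that it runs on its own; it assumes that tidy() on a prepared step_BoxCox() returns terms and value columns, which is how I understand the package documentation.

```r
library(recipes)
library(dplyr)  # provides the %>% pipe

# mpg and hp are strictly positive, as Box-Cox requires
bc_prepped <- recipe(~ mpg + hp, data = mtcars) %>%
  step_BoxCox(mpg, hp) %>%
  prep()

# number = 1 asks for the details of the first step in the recipe
lambda_tbl <- tidy(bc_prepped, number = 1)

# One row per transformed feature, with the estimated lambda in `value`
lambda_tbl
```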

Normalization

Another common transformation for continuous variables is normalization. Usually, this process results in the transformed variable having a mean of 0 and a standard deviation of 1, and incorporates two steps:

Subtract the average of the variable from each data point
Divide each data point by the standard deviation of the variable

Centering the variable (subtracting its mean) can be accomplished using step_center(). Scaling the variable (dividing by its standard deviation) can be accomplished using step_scale(). Note that the default value of the factor argument is 1, which divides by one standard deviation so that the scaled variable has a standard deviation of 1. A value of 2 divides by two standard deviations instead, giving the scaled variable a standard deviation of 0.5. The step_normalize() function combines centering and scaling, resulting in a mean of 0 and a standard deviation of 1.

Normalization has several definitions, all of which revolve around some form of adjusting the values to different scales. Adjusting to a custom scale can be done with step_range(). The min and max arguments define the range within which the variable will be scaled.

recipe_06_obj <- recipe_obj %>% 
  step_naomit(all_predictors()) %>% 
  
  step_center(bill_length_mm) %>% 
  step_scale(bill_depth_mm, factor = 2) %>% 
  step_normalize(flipper_length_mm) %>% 
  step_range(body_mass_g, min = -1, max = 1)

recipe_06_obj

Recipe

Inputs:

      role #variables
   outcome          1
 predictor          6

Operations:

Removing rows with NA values in all_predictors()
Centering for bill_length_mm
Scaling for bill_depth_mm
Centering and scaling for flipper_length_mm
Range scaling to [-1,1] for body_mass_g

Passing this new recipe to the prep() and juice() functions shows the various transformations for bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g. Notice that three of the variables now have negative values as a result of centering or rescaling.

recipe_06_obj %>% 
  prep() %>% 
  juice()

# A tibble: 231 × 7
   island bill_length_mm bill_depth_mm flipper_length_mm body_ma…¹ sex   species
   <fct>           <dbl>         <dbl>             <dbl>     <dbl> <fct> <fct>  
 1 Biscoe         -7.48           4.16            -1.41     -0.917 fema… Adelie 
 2 Biscoe          8.52           3.91             1.36      0.528 male  Gentoo 
 3 Dream           5.72           4.67            -0.440    -0.5   male  Chinst…
 4 Biscoe         -3.38           4.67            -1.27     -0.528 male  Adelie 
 5 Biscoe          0.916          3.46             0.740     0.139 fema… Gentoo 
 6 Biscoe         -4.38           4.44            -1.06     -0.556 fema… Adelie 
 7 Biscoe          1.82           3.56             1.23      0.111 fema… Gentoo 
 8 Dream           1.92           4.29            -0.787    -0.514 fema… Chinst…
 9 Biscoe          2.12           3.31             0.670     0     fema… Gentoo 
10 Biscoe         -6.28           4.01            -1.27     -0.792 fema… Adelie 
# … with 221 more rows, and abbreviated variable name ¹body_mass_g
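The properties each step is meant to produce can be checked numerically. The sketch below applies the four steps to identical copies of one made-up column (toy_tbl and its column names are hypothetical) and then computes the resulting mean, standard deviation, or range.

```r
library(recipes)
library(dplyr)  # provides the %>% pipe

# Hypothetical toy data: four identical predictor columns, one per step
x <- c(2, 4, 6, 8, 10)
toy_tbl <- data.frame(
  y        = c(1, 2, 3, 4, 5),
  x_center = x,
  x_scale2 = x,
  x_norm   = x,
  x_range  = x
)

scaled_tbl <- recipe(y ~ ., data = toy_tbl) %>%
  step_center(x_center) %>%
  step_scale(x_scale2, factor = 2) %>%
  step_normalize(x_norm) %>%
  step_range(x_range, min = -1, max = 1) %>%
  prep() %>%
  bake(new_data = NULL)

mean(scaled_tbl$x_center)  # centering gives a mean of 0
sd(scaled_tbl$x_scale2)    # dividing by two standard deviations gives sd 0.5
mean(scaled_tbl$x_norm)    # normalizing gives a mean of 0 ...
sd(scaled_tbl$x_norm)      # ... and a standard deviation of 1
range(scaled_tbl$x_range)  # range scaling maps the data onto [-1, 1]
```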

Discretization & Dummy Variables

The previous steps focused on transforming continuous variables. Categorical (also called nominal, discrete, etc.) variables can also be pre-processed. First, continuous variables can be transformed into categorical variables using discretization, which is usually done by “cutting” the variable into contiguous intervals. The recipes package provides the step_discretize() step for this. The num_breaks argument determines how many “bins” to cut the variable into, and the min_unique argument sets a minimum ratio of unique values to bins; if the data fall below that ratio, no discretization takes place.

Another common categorical processing step is the creation of dummy variables (also called indicator variables). Dummy variables can be created with step_dummy(). Essentially, a new binary variable is created for each value of the categorical variable. Each new variable is 1 for every row where the original categorical variable had the associated value of the new variable, and 0 otherwise (e.g., for the variable sex, sex_male is 1 for each row originally labeled male in sex and 0 otherwise). Traditionally, creating dummy variables results in one fewer new variable than levels in the original categorical variable (e.g., “dummying” sex creates one new variable, sex_male, because there are only two levels in sex and the female level can be implied from where sex_male is 0). Set the argument one_hot to TRUE to “one-hot-encode” a categorical variable, which creates a new binary variable for each level.

recipe_07_obj <- recipe_obj %>% 
  step_naomit(all_predictors()) %>% 
  step_select(body_mass_g, sex, island) %>% 
  
  step_discretize(body_mass_g, num_breaks = 7, min_unique = 3) %>% 
  step_dummy(sex, keep_original_cols = TRUE) %>% 
  step_dummy(island, one_hot = TRUE, keep_original_cols = TRUE)

recipe_07_obj

Recipe

Inputs:

      role #variables
   outcome          1
 predictor          6

Operations:

Removing rows with NA values in all_predictors()
Variables selected body_mass_g, sex, island
Discretize numeric variables from body_mass_g
Dummy variables from sex
Dummy variables from island

Passing this new recipe to the prep() and juice() functions shows that body_mass_g is now a categorical variable, and there are now binary columns for sex and island. Note that step_select() was used to select only the three variables of interest before transformations were applied so that the result would print nicely. Note also that the argument keep_original_cols in step_dummy() was set to TRUE so that the new binary variables could be compared to the original features.

recipe_07_obj %>% 
  prep() %>% 
  juice()

# A tibble: 231 × 7
   body_mass_g sex    island sex_male island_Biscoe island_Dream island_Torger…¹
   <fct>       <fct>  <fct>     <dbl>         <dbl>        <dbl>           <dbl>
 1 bin1        female Biscoe        0             1            0               0
 2 bin7        male   Biscoe        1             1            0               0
 3 bin2        male   Dream         1             0            1               0
 4 bin2        male   Biscoe        1             1            0               0
 5 bin5        female Biscoe        0             1            0               0
 6 bin2        female Biscoe        0             1            0               0
 7 bin5        female Biscoe        0             1            0               0
 8 bin2        female Dream         0             0            1               0
 9 bin5        female Biscoe        0             1            0               0
10 bin1        female Biscoe        0             1            0               0
# … with 221 more rows, and abbreviated variable name ¹island_Torgersen
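The difference between traditional dummy variables and one-hot encoding comes down to how many columns are created. The sketch below uses a made-up three-level factor (toy_tbl and color are hypothetical) and compares the column names produced by each setting.

```r
library(recipes)
library(dplyr)  # provides the %>% pipe

# Hypothetical toy data: a factor with three levels (blue, green, red)
toy_tbl <- data.frame(
  y     = c(1, 2, 3, 4),
  color = factor(c("red", "green", "blue", "red"))
)

# Traditional dummy variables: one column fewer than the number of levels
dummy_tbl <- recipe(y ~ color, data = toy_tbl) %>%
  step_dummy(color) %>%
  prep() %>%
  bake(new_data = NULL)

# One-hot encoding: one column per level
one_hot_tbl <- recipe(y ~ color, data = toy_tbl) %>%
  step_dummy(color, one_hot = TRUE) %>%
  prep() %>%
  bake(new_data = NULL)

dummy_cols   <- setdiff(names(dummy_tbl), "y")
one_hot_cols <- setdiff(names(one_hot_tbl), "y")

dummy_cols    # the first level (blue) is the implied reference level
one_hot_cols  # every level gets its own column
```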

Principal Components

All previous steps were transformations applied to a single feature. There are other transformation steps that take in and transform multiple features into one or more engineered features. Perhaps the most common method is principal components. Principal components can be implemented using the step_pca() function. The num_comp argument is used to select the number of components to return. Note that:

Principal components can only be computed from numeric predictors
- step_select() was used to select all_numeric_predictors()
Principal components cannot be computed when values are missing
- step_naomit() was used to remove missing values in all_predictors() (that is, the numeric predictors since all non-numeric predictors were removed in the previous step)
Principal components is sensitive to the scale of the variables, so continuous variables should be normalized first
- step_normalize() was used to normalize all_predictors()

recipe_08_obj <- recipe_obj %>% 
  step_select(all_numeric_predictors()) %>%   
  step_naomit(all_predictors()) %>%
  
  step_normalize(all_predictors()) %>% 
  step_pca(all_predictors(), num_comp = 3)

recipe_08_obj

Recipe

Inputs:

      role #variables
   outcome          1
 predictor          6

Operations:

Variables selected all_numeric_predictors()
Removing rows with NA values in all_predictors()
Centering and scaling for all_predictors()
PCA extraction with all_predictors()

Passing this new recipe to the prep() and juice() functions returns the requested first three principal components.

recipe_08_obj %>% 
  prep() %>% 
  juice()

# A tibble: 238 × 3
      PC1    PC2    PC3
    <dbl>  <dbl>  <dbl>
 1 -2.26   1.12  0.507 
 2  2.63  -0.451 0.0955
 3 -0.478 -1.15  0.945 
 4 -1.75  -0.213 0.0845
 5  1.51   1.15  0.176 
 6 -1.58   0.264 0.120 
 7  1.75   0.903 0.120 
 8 -0.719 -0.145 0.880 
 9  1.52   1.31  0.634 
10 -1.81   1.21  0.574 
# … with 228 more rows
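As a sanity check on what step_pca() computes, the sketch below compares it with base R's prcomp() on four numeric columns of the built-in mtcars data (chosen only so the example runs on its own). The signs of principal components are arbitrary, so the comparison is made on absolute values.

```r
library(recipes)
library(dplyr)  # provides the %>% pipe

# Four numeric columns with no missing values
num_tbl <- mtcars[, c("mpg", "disp", "hp", "wt")]

pca_tbl <- recipe(~ ., data = num_tbl) %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors(), num_comp = 2) %>%
  prep() %>%
  bake(new_data = NULL)

# Base R equivalent: center and scale, then extract the component scores
base_pca <- prcomp(num_tbl, center = TRUE, scale. = TRUE)

# Largest absolute difference between the two versions of PC1 (ignoring sign)
max_abs_diff <- max(abs(abs(pca_tbl$PC1) - abs(unname(base_pca$x[, "PC1"]))))

names(pca_tbl)
max_abs_diff
```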

General Transformations

There are many other steps available in the recipes package, far more than can be covered in one post. Further steps can be found on the package’s website under the reference section.

If a transformation is not formally included via a step_* function, it can most likely be created manually using the step_mutate() function, which wraps the mutate() function from the dplyr package. Variables are created just as they would be when calling mutate() directly. Printing out the recipe object shows the code used to create the new variable.

set.seed(1915)

recipe_09_obj <- recipe_obj %>% 
  step_mutate(random_id = sample.int(n = nrow(train_tbl)))

recipe_09_obj

Recipe

Inputs:

      role #variables
   outcome          1
 predictor          6

Operations:

Variable mutation for sample.int(n = nrow(train_tbl))

Passing this new recipe to the prep() and juice() functions returns the data set with the newly created variable.

recipe_09_obj %>% 
  prep() %>% 
  juice()

# A tibble: 240 × 8
   island bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex   species rando…³
   <fct>           <dbl>         <dbl>       <int>   <int> <fct> <fct>     <int>
 1 Biscoe           36.5          16.6         181    2850 fema… Adelie      208
 2 Biscoe           52.5          15.6         221    5450 male  Gentoo      178
 3 Dream            49.7          18.6         195    3600 male  Chinst…      60
 4 Biscoe           40.6          18.6         183    3550 male  Adelie      231
 5 Biscoe           44.9          13.8         212    4750 fema… Gentoo       73
 6 Biscoe           39.6          17.7         186    3500 fema… Adelie      197
 7 Biscoe           45.8          14.2         219    4700 fema… Gentoo       78
 8 Dream            45.9          17.1         190    3575 fema… Chinst…     132
 9 Biscoe           46.1          13.2         211    4500 fema… Gentoo       11
10 Biscoe           37.7          16           183    3075 fema… Adelie       29
# … with 230 more rows, and abbreviated variable names ¹flipper_length_mm,
#   ²body_mass_g, ³random_id
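A more typical use of step_mutate() is a feature derived row by row from existing columns. The sketch below is an illustration under assumptions rather than part of the original analysis: bill_ratio is a hypothetical feature, the recipe is built on the full penguins data, and species is assumed to be the outcome. Because the expression depends only on values within each row, the prepped recipe can also be applied to new data with bake(); an expression tied to the number of training rows, such as the random_id example above, may not carry over to a data set of a different size.

```r
library(recipes)

data("penguins", package = "modeldata")

# Hypothetical derived feature: ratio of bill length to bill depth
ratio_rec <- recipe(species ~ ., data = penguins) %>%
  step_mutate(bill_ratio = bill_length_mm / bill_depth_mm)

ratio_prep <- prep(ratio_rec)

# Apply the prepped recipe to a small set of "new" rows
new_rows <- bake(ratio_prep, new_data = head(penguins, 5))

new_rows
```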

Non-Transforming Steps

Lastly, aside from pre-processing transformations, the recipes package also provides steps that manipulate a data set without creating or transforming a feature. Three examples are implementations of the dplyr verbs select(), filter(), and arrange().

The step_select() function, as seen earlier, selects the columns that remain in the data set after passing the recipe through prep() and juice() or bake(). Selection helpers such as all_outcomes(), all_predictors(), and all_numeric_predictors() can be used to help select columns. The step_filter() function keeps only the rows that satisfy conditions on one or more columns. The step_arrange() function sorts the rows by the values of one or more columns.

recipe_10_obj <- recipe_obj %>% 
  step_select(all_outcomes(), all_numeric_predictors()) %>% 
  step_filter(flipper_length_mm < 200) %>% 
  step_arrange(species)

recipe_10_obj

Recipe

Inputs:

      role #variables
   outcome          1
 predictor          6

Operations:

Variables selected all_outcomes(), all_numeric_predictors()
Row filtering using flipper_length_mm < 200
Row arrangement using species

Passing this new recipe to the prep() and juice() functions returns the data set with the requested selection(s), filter(s), and arrangement(s).

recipe_10_obj %>% 
  prep() %>% 
  juice()

# A tibble: 130 × 5
   species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>            <dbl>         <dbl>             <int>       <int>
 1 Adelie            36.5          16.6               181        2850
 2 Adelie            40.6          18.6               183        3550
 3 Adelie            39.6          17.7               186        3500
 4 Adelie            37.7          16                 183        3075
 5 Adelie            37.3          20.5               199        3775
 6 Adelie            39.2          18.6               190        4250
 7 Adelie            37            16.9               185        3000
 8 Adelie            42.1          19.1               195        4000
 9 Adelie            38.8          17.2               180        3800
10 Adelie            36            18.5               186        3100
# … with 120 more rows
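One detail worth knowing about row-removing steps: step_filter() has a skip argument that defaults to TRUE, so the filter is applied to the training data during prep() but skipped when the recipe is baked on new data. The sketch below illustrates this under the same assumptions as before (a fresh recipe on the full penguins data with species as the outcome); the object names are made up for illustration.

```r
library(recipes)

data("penguins", package = "modeldata")

filter_rec <- recipe(species ~ ., data = penguins) %>%
  step_filter(flipper_length_mm < 200)   # skip = TRUE by default

filter_prep <- prep(filter_rec)

# Training data as processed by prep(): the filter was applied
n_train_rows <- nrow(bake(filter_prep, new_data = NULL))

# New data: the filter step is skipped, so every row is returned
n_new_rows <- nrow(bake(filter_prep, new_data = penguins))

c(train = n_train_rows, new = n_new_rows)
```

This default keeps a recipe from silently dropping rows that need a prediction; setting skip = FALSE in the step changes that behavior.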

Notes

In addition to the recipes package website, another excellent resource for feature engineering is the Feature Engineering and Selection book by Max Kuhn and Kjell Johnson.

This post is based on a presentation that was given on the date listed. It may be updated from time to time to fix errors, detail new functions, and/or remove deprecated functions, so the packages and R version will likely be newer than what was available at the time.

The R session information used for this post:

sessionInfo()

R version 4.2.1 (2022-06-23)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 14.0

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] rsample_1.1.1 recipes_1.0.4 dplyr_1.0.10 

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.0    xfun_0.40           purrr_0.3.5        
 [4] listenv_0.8.0       splines_4.2.1       lattice_0.20-45    
 [7] vctrs_0.6.3         generics_0.1.3      htmltools_0.5.3    
[10] yaml_2.3.5          utf8_1.2.2          survival_3.3-1     
[13] prodlim_2019.11.13  rlang_1.1.1         pillar_1.8.1       
[16] withr_2.5.0         glue_1.6.2          lifecycle_1.0.3    
[19] lava_1.7.1          stringr_1.5.0       timeDate_4022.108  
[22] future_1.29.0       codetools_0.2-18    evaluate_0.16      
[25] knitr_1.40          fastmap_1.1.0       parallel_4.2.1     
[28] class_7.3-20        fansi_1.0.3         furrr_0.3.1        
[31] Rcpp_1.0.9          renv_0.16.0         ipred_0.9-13       
[34] jsonlite_1.8.0      parallelly_1.32.1   digest_0.6.29      
[37] stringi_1.7.12      grid_4.2.1          hardhat_1.2.0      
[40] cli_3.6.1           tools_4.2.1         magrittr_2.0.3     
[43] tibble_3.1.8        tidyr_1.2.1         future.apply_1.10.0
[46] pkgconfig_2.0.3     ellipsis_0.3.2      MASS_7.3-57        
[49] Matrix_1.4-1        lubridate_1.9.0     timechange_0.1.1   
[52] gower_1.0.1         rmarkdown_2.16      rstudioapi_0.14    
[55] R6_2.5.1            globals_0.16.2      rpart_4.1.16       
[58] nnet_7.3-17         compiler_4.2.1