This post covers the recipes package, which is all about feature engineering.
Setup
Packages
The following packages are required:
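A minimal setup, assuming the attached packages listed in the session information at the end of this post (the modeldata package is also needed for the data, but only has to be installed, not attached):

```r
# Assumed package set, inferred from the session info at the end of this post
library(dplyr)    # data wrangling verbs and the %>% pipe
library(rsample)  # initial_split(), training(), testing()
library(recipes)  # recipe(), step_*(), prep(), juice(), bake()
```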
Data
Throughout this series, I utilized the penguins data set from the modeldata package.
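One way to load the data, assuming the modeldata package is installed. The name penguins_tbl matches the object used in the split further down, but the loading code itself is a sketch rather than the original:

```r
library(dplyr)

# Load the penguins data frame shipped with the modeldata package
data("penguins", package = "modeldata")

# Convert to a tibble; penguins_tbl is the name used for the split later on
penguins_tbl <- as_tibble(penguins)

penguins_tbl
```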
# A tibble: 344 × 7
species island bill_length_mm bill_depth_mm flipper_length…¹ body_…² sex
<fct> <fct> <dbl> <dbl> <int> <int> <fct>
1 Adelie Torgersen 39.1 18.7 181 3750 male
2 Adelie Torgersen 39.5 17.4 186 3800 fema…
3 Adelie Torgersen 40.3 18 195 3250 fema…
4 Adelie Torgersen NA NA NA NA <NA>
5 Adelie Torgersen 36.7 19.3 193 3450 fema…
6 Adelie Torgersen 39.3 20.6 190 3650 male
7 Adelie Torgersen 38.9 17.8 181 3625 fema…
8 Adelie Torgersen 39.2 19.6 195 4675 male
9 Adelie Torgersen 34.1 18.1 193 3475 <NA>
10 Adelie Torgersen 42 20.2 190 4250 <NA>
# … with 334 more rows, and abbreviated variable names ¹flipper_length_mm,
# ²body_mass_g
Following the previous post on the rsample package, we take a 70/30 train/test initial_split() and extract the training() and testing() data sets.
set.seed(1914)
initial_split_obj <- initial_split(penguins_tbl, prop = 0.7)
train_tbl <- training(initial_split_obj)
test_tbl <- testing(initial_split_obj)
train_tbl
# A tibble: 240 × 7
species island bill_length_mm bill_depth_mm flipper_length_mm body_…¹ sex
<fct> <fct> <dbl> <dbl> <int> <int> <fct>
1 Adelie Biscoe 36.5 16.6 181 2850 fema…
2 Gentoo Biscoe 52.5 15.6 221 5450 male
3 Chinstrap Dream 49.7 18.6 195 3600 male
4 Adelie Biscoe 40.6 18.6 183 3550 male
5 Gentoo Biscoe 44.9 13.8 212 4750 fema…
6 Adelie Biscoe 39.6 17.7 186 3500 fema…
7 Gentoo Biscoe 45.8 14.2 219 4700 fema…
8 Chinstrap Dream 45.9 17.1 190 3575 fema…
9 Gentoo Biscoe 46.1 13.2 211 4500 fema…
10 Adelie Biscoe 37.7 16 183 3075 fema…
# … with 230 more rows, and abbreviated variable name ¹body_mass_g
test_tbl
# A tibble: 104 × 7
species island bill_length_mm bill_depth_mm flipper_length…¹ body_…² sex
<fct> <fct> <dbl> <dbl> <int> <int> <fct>
1 Adelie Torgersen 40.3 18 195 3250 fema…
2 Adelie Torgersen 38.9 17.8 181 3625 fema…
3 Adelie Torgersen 37.8 17.3 180 3700 <NA>
4 Adelie Torgersen 34.6 21.1 198 4400 male
5 Adelie Torgersen 36.6 17.8 185 3700 fema…
6 Adelie Torgersen 38.7 19 195 3450 fema…
7 Adelie Torgersen 34.4 18.4 184 3325 fema…
8 Adelie Biscoe 37.7 18.7 180 3600 male
9 Adelie Biscoe 40.5 18.9 180 3950 male
10 Adelie Dream 39.5 17.8 188 3300 fema…
# … with 94 more rows, and abbreviated variable names ¹flipper_length_mm,
# ²body_mass_g
Basic Functionality
The recipes package provides methods for chaining together data pre-processing steps using a pipe (e.g., %>%), in much the same way that the dplyr package provides data wrangling verbs that can be piped together to combine multiple steps into one coherent, easy-to-read statement.
To begin a recipe, use the recipe() function and provide as arguments the model formula (just as you would supply a formula to the lm() function) and the data set you want to process (typically, the training data). Printing out the recipe shows how many outcome (response) and predictor variables you have specified in your model formula. The recipes package assigns a “role” to each variable based on the model formula you specify.
recipe_obj <- recipe(species ~ ., train_tbl)
recipe_obj
Recipe
Inputs:
role #variables
outcome 1
predictor 6
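To inspect the assigned roles directly, summary() on a recipe returns one row per variable with its type and role. A minimal sketch, assuming the modeldata package is installed and using the full penguins data rather than the training split so that it runs on its own (role_tbl is a name introduced here for illustration):

```r
library(recipes)

# Full penguins data used here only to keep the sketch self-contained
data("penguins", package = "modeldata")

role_tbl <- summary(recipe(species ~ ., data = penguins))

# One row per variable, including its type and role
role_tbl

# Count variables per role
table(role_tbl$role)
```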
Once the recipe object is created, pre-processing steps can be added to it one after another. After all steps have been added, the recipe is passed to the prep() function, which calculates any parameters required by the pre-processing steps so that they can be applied to a data set. Printing out a prepared recipe also shows additional information, such as the number of data points and how many rows contain at least one missing value.
Recipe
Inputs:
role #variables
outcome 1
predictor 6
Training data contained 240 data points and 9 incomplete rows.
Once a recipe has been prepared, it can be applied to a data set. To apply the pre-processing steps to the data set originally supplied to the recipe() function, use the juice() function, which applies the steps and extracts the transformed data set. To apply the pre-processing steps to any other data set, use the bake() function, supplying the data set to the new_data argument.
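The outputs that follow can be reproduced with something like the sketch below. It rebuilds the split so that it runs on its own, and the names prepped_obj, train_processed_tbl, and test_processed_tbl are introduced here for illustration. Note that juice() has been superseded in recent versions of recipes by bake() with new_data = NULL, which returns the same result.

```r
library(dplyr)
library(rsample)
library(recipes)

data("penguins", package = "modeldata")
penguins_tbl <- as_tibble(penguins)

# Same seed and proportion as the split shown earlier
set.seed(1914)
initial_split_obj <- initial_split(penguins_tbl, prop = 0.7)
train_tbl <- training(initial_split_obj)
test_tbl  <- testing(initial_split_obj)

recipe_obj <- recipe(species ~ ., train_tbl)

# Estimate whatever the steps need from the training data (no steps yet)
prepped_obj <- prep(recipe_obj)

# Processed version of the training data
train_processed_tbl <- juice(prepped_obj)

# The same prepared recipe applied to the test data
test_processed_tbl <- bake(prepped_obj, new_data = test_tbl)
```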
# A tibble: 240 × 7
island bill_length_mm bill_depth_mm flipper_length_mm body_ma…¹ sex species
<fct> <dbl> <dbl> <int> <int> <fct> <fct>
1 Biscoe 36.5 16.6 181 2850 fema… Adelie
2 Biscoe 52.5 15.6 221 5450 male Gentoo
3 Dream 49.7 18.6 195 3600 male Chinst…
4 Biscoe 40.6 18.6 183 3550 male Adelie
5 Biscoe 44.9 13.8 212 4750 fema… Gentoo
6 Biscoe 39.6 17.7 186 3500 fema… Adelie
7 Biscoe 45.8 14.2 219 4700 fema… Gentoo
8 Dream 45.9 17.1 190 3575 fema… Chinst…
9 Biscoe 46.1 13.2 211 4500 fema… Gentoo
10 Biscoe 37.7 16 183 3075 fema… Adelie
# … with 230 more rows, and abbreviated variable name ¹body_mass_g
# A tibble: 104 × 7
island bill_length_mm bill_depth_mm flipper_length…¹ body_…² sex species
<fct> <dbl> <dbl> <int> <int> <fct> <fct>
1 Torgersen 40.3 18 195 3250 fema… Adelie
2 Torgersen 38.9 17.8 181 3625 fema… Adelie
3 Torgersen 37.8 17.3 180 3700 <NA> Adelie
4 Torgersen 34.6 21.1 198 4400 male Adelie
5 Torgersen 36.6 17.8 185 3700 fema… Adelie
6 Torgersen 38.7 19 195 3450 fema… Adelie
7 Torgersen 34.4 18.4 184 3325 fema… Adelie
8 Biscoe 37.7 18.7 180 3600 male Adelie
9 Biscoe 40.5 18.9 180 3950 male Adelie
10 Dream 39.5 17.8 188 3300 fema… Adelie
# … with 94 more rows, and abbreviated variable names ¹flipper_length_mm,
# ²body_mass_g
Missing Values
One of the most common data pre-processing steps is dealing with missing values. For the penguins data set, both the training and the testing data sets have missing values.
train_tbl %>%
filter(if_any(everything(), is.na))
# A tibble: 9 × 7
species island bill_length_mm bill_depth_mm flipper_length_mm body_…¹ sex
<fct> <fct> <dbl> <dbl> <int> <int> <fct>
1 Adelie Torgersen NA NA NA NA <NA>
2 Gentoo Biscoe 44.5 14.3 216 4100 <NA>
3 Adelie Torgersen 34.1 18.1 193 3475 <NA>
4 Gentoo Biscoe NA NA NA NA <NA>
5 Adelie Torgersen 37.8 17.1 186 3300 <NA>
6 Gentoo Biscoe 44.5 15.7 217 4875 <NA>
7 Adelie Torgersen 42 20.2 190 4250 <NA>
8 Adelie Dream 37.5 18.9 179 2975 <NA>
9 Gentoo Biscoe 47.3 13.8 216 4725 <NA>
# … with abbreviated variable name ¹body_mass_g
test_tbl %>%
filter(if_any(everything(), is.na))
# A tibble: 2 × 7
species island bill_length_mm bill_depth_mm flipper_length_mm body_…¹ sex
<fct> <fct> <dbl> <dbl> <int> <int> <fct>
1 Adelie Torgersen 37.8 17.3 180 3700 <NA>
2 Gentoo Biscoe 46.2 14.4 214 4650 <NA>
# … with abbreviated variable name ¹body_mass_g
The recipes package contains several steps to deal with missing values. The easiest method is to simply remove all rows with missing values using step_naomit(). The example below removes rows in the training data that have NA in the bill_length_mm column. Adding steps to a recipe and then printing the result shows a running list of pre-processing steps under “Operations:”, in the order in which they will be applied and with the column(s) each one affects.
recipe_01_obj <- recipe_obj %>%
step_naomit(bill_length_mm)
recipe_01_obj
Recipe
Inputs:
role #variables
outcome 1
predictor 6
Operations:
Removing rows with NA values in bill_length_mm
Passing this new recipe to the prep() and juice() functions shows fewer rows than were in the training data set, demonstrating that the rows with missing values were in fact removed.
# A tibble: 238 × 7
island bill_length_mm bill_depth_mm flipper_length_mm body_ma…¹ sex species
<fct> <dbl> <dbl> <int> <int> <fct> <fct>
1 Biscoe 36.5 16.6 181 2850 fema… Adelie
2 Biscoe 52.5 15.6 221 5450 male Gentoo
3 Dream 49.7 18.6 195 3600 male Chinst…
4 Biscoe 40.6 18.6 183 3550 male Adelie
5 Biscoe 44.9 13.8 212 4750 fema… Gentoo
6 Biscoe 39.6 17.7 186 3500 fema… Adelie
7 Biscoe 45.8 14.2 219 4700 fema… Gentoo
8 Dream 45.9 17.1 190 3575 fema… Chinst…
9 Biscoe 46.1 13.2 211 4500 fema… Gentoo
10 Biscoe 37.7 16 183 3075 fema… Adelie
# … with 228 more rows, and abbreviated variable name ¹body_mass_g
While simply removing the missing values is an easy approach to dealing with them, there are times when removing data points is not desired. In those cases, the solution is often to impute the missing values; that is, estimate what those missing values would have been had they not been missing. The recipes package offers several ways to impute missing values. In the recipe below, missing values are imputed using:
- step_impute_mean(): the mean of the other values for that feature (numeric features only)
- step_impute_median(): the median of the other values for that feature (numeric features only)
- step_impute_knn(): a K-Nearest Neighbors model (using the other features as predictors)
- step_impute_linear(): a Linear Regression (using only the island feature by specifying it with the imp_vars() function in the impute_with argument)
- step_impute_bag(): a Bagged Tree Model (using all other features as predictors)
Printing out the new recipe shows these steps and the features to which they are applied.
recipe_02_obj <- recipe_obj %>%
step_impute_mean(flipper_length_mm) %>%
step_impute_median(body_mass_g) %>%
step_impute_knn(bill_length_mm) %>%
step_impute_linear(bill_depth_mm, impute_with = imp_vars(island)) %>%
step_impute_bag(sex)
recipe_02_obj
Recipe
Inputs:
role #variables
outcome 1
predictor 6
Operations:
Mean imputation for flipper_length_mm
Median imputation for body_mass_g
K-nearest neighbor imputation for bill_length_mm
Linear regression imputation for bill_depth_mm
Bagged tree imputation for sex
Passing this new recipe to the prep() and juice() functions, and searching for the rows in the training data that had missing values, shows that the formerly missing values have now been imputed.
train_tbl %>%
mutate(row_id = row_number()) %>%
filter(if_any(everything(), is.na))
# A tibble: 9 × 8
species island bill_length_mm bill_depth_mm flipper_…¹ body_…² sex row_id
<fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
1 Adelie Torgersen NA NA NA NA <NA> 35
2 Gentoo Biscoe 44.5 14.3 216 4100 <NA> 38
3 Adelie Torgersen 34.1 18.1 193 3475 <NA> 43
4 Gentoo Biscoe NA NA NA NA <NA> 76
5 Adelie Torgersen 37.8 17.1 186 3300 <NA> 124
6 Gentoo Biscoe 44.5 15.7 217 4875 <NA> 143
7 Adelie Torgersen 42 20.2 190 4250 <NA> 161
8 Adelie Dream 37.5 18.9 179 2975 <NA> 169
9 Gentoo Biscoe 47.3 13.8 216 4725 <NA> 170
# … with abbreviated variable names ¹flipper_length_mm, ²body_mass_g
na_rows_int <- train_tbl %>%
mutate(row_id = row_number()) %>%
filter(if_any(everything(), is.na)) %>%
pull(row_id)
recipe_02_obj %>%
prep() %>%
juice() %>%
mutate(row_id = row_number()) %>%
filter(row_id %in% na_rows_int) %>%
select(species, everything())
# A tibble: 9 × 8
species island bill_length_mm bill_depth_mm flipper_…¹ body_…² sex row_id
<fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
1 Adelie Torgersen 41.5 18.4 201 4025 male 35
2 Gentoo Biscoe 44.5 14.3 216 4100 fema… 38
3 Adelie Torgersen 34.1 18.1 193 3475 fema… 43
4 Gentoo Biscoe 40.7 15.8 201 4025 fema… 76
5 Adelie Torgersen 37.8 17.1 186 3300 fema… 124
6 Gentoo Biscoe 44.5 15.7 217 4875 fema… 143
7 Adelie Torgersen 42 20.2 190 4250 male 161
8 Adelie Dream 37.5 18.9 179 2975 fema… 169
9 Gentoo Biscoe 47.3 13.8 216 4725 fema… 170
# … with abbreviated variable names ¹flipper_length_mm, ²body_mass_g
Individual Transformations
After handling missing values, another extremely common data pre-processing step is applying transformations to individual features. Some typical ones include:
- Square Root: step_sqrt()
- Logarithm: step_log()
- Inverse: step_inverse()
These transformations are applied in the example below. Printing out this recipe shows the steps applied and the feature each one is applied to. Notice that rows with missing values are removed first, specifically rows with missing values in all_numeric_predictors(). This selector is a quick way to apply a step to, as its name implies, all numeric predictors. While this may not save much time in a small data set like the one used here, it is quite convenient when a data set has dozens, hundreds, or thousands of features.
recipe_03_obj <- recipe_obj %>%
step_naomit(all_numeric_predictors()) %>%
step_sqrt(bill_length_mm) %>%
step_log(bill_depth_mm) %>%
step_log(flipper_length_mm) %>%
step_inverse(body_mass_g)
recipe_03_obj
Recipe
Inputs:
role #variables
outcome 1
predictor 6
Operations:
Removing rows with NA values in all_numeric_predictors()
Square root transformation on bill_length_mm
Log transformation on bill_depth_mm
Log transformation on flipper_length_mm
Inverse transformation on body_mass_g
Passing this new recipe to the prep() and juice() functions shows that the selected features were transformed.
# A tibble: 238 × 7
island bill_length_mm bill_depth_mm flipper_length_mm body_ma…¹ sex species
<fct> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
1 Biscoe 6.04 2.81 5.20 0.000351 fema… Adelie
2 Biscoe 7.25 2.75 5.40 0.000183 male Gentoo
3 Dream 7.05 2.92 5.27 0.000278 male Chinst…
4 Biscoe 6.37 2.92 5.21 0.000282 male Adelie
5 Biscoe 6.70 2.62 5.36 0.000211 fema… Gentoo
6 Biscoe 6.29 2.87 5.23 0.000286 fema… Adelie
7 Biscoe 6.77 2.65 5.39 0.000213 fema… Gentoo
8 Dream 6.77 2.84 5.25 0.000280 fema… Chinst…
9 Biscoe 6.79 2.58 5.35 0.000222 fema… Gentoo
10 Biscoe 6.14 2.77 5.21 0.000325 fema… Adelie
# … with 228 more rows, and abbreviated variable name ¹body_mass_g
As with the step_impute_linear() step used earlier, some of the individual transformation functions also provide additional arguments to change their behavior. For example, the step_log() step has arguments for offset (a value added to each data point before taking the log, used to avoid taking the log of zero; defaults to 0) and base (which controls the base of the logarithm and defaults to exp(1), the natural logarithm). Additionally, step_inverse() has an offset argument, which operates in the same way and is used to avoid trying to calculate 1 / 0.
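As a quick check of what these arguments do, the arithmetic can be reproduced in base R for the first training row (bill_depth_mm = 16.6, flipper_length_mm = 181, body_mass_g = 2850). This sketch assumes the steps compute log(x + offset, base) and 1 / (x + offset), which is consistent with the transformed values printed in this section; the variable names on the left are introduced here for illustration.

```r
# Values from the first row of the training data
bill_depth_mm     <- 16.6
flipper_length_mm <- 181
body_mass_g       <- 2850

# Defaults: natural log, no offset
log_default_depth   <- log(bill_depth_mm)             # about 2.81
log_default_flipper <- log(flipper_length_mm)         # about 5.20
inv_default_mass    <- 1 / body_mass_g                # about 0.000351

# With the extra arguments
log_offset_depth   <- log(bill_depth_mm + 1)              # offset = 1, about 2.87
log_base10_flipper <- log(flipper_length_mm, base = 10)   # base = 10, about 2.26
inv_offset_mass    <- 1 / (body_mass_g + 100)             # offset = 100, about 0.000339
```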
recipe_04_obj <- recipe_obj %>%
step_naomit(all_numeric_predictors()) %>%
step_sqrt(bill_length_mm) %>%
step_log(bill_depth_mm, offset = 1) %>%
step_log(flipper_length_mm, base = 10) %>%
step_inverse(body_mass_g, offset = 100)
recipe_04_obj
Recipe
Inputs:
role #variables
outcome 1
predictor 6
Operations:
Removing rows with NA values in all_numeric_predictors()
Square root transformation on bill_length_mm
Log transformation on bill_depth_mm
Log transformation on flipper_length_mm
Inverse transformation on body_mass_g
Passing this new recipe to the prep() and juice() functions shows different results for bill_depth_mm, flipper_length_mm, and body_mass_g because the additional arguments were used.
# A tibble: 238 × 7
island bill_length_mm bill_depth_mm flipper_length_mm body_ma…¹ sex species
<fct> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
1 Biscoe 6.04 2.87 2.26 0.000339 fema… Adelie
2 Biscoe 7.25 2.81 2.34 0.000180 male Gentoo
3 Dream 7.05 2.98 2.29 0.000270 male Chinst…
4 Biscoe 6.37 2.98 2.26 0.000274 male Adelie
5 Biscoe 6.70 2.69 2.33 0.000206 fema… Gentoo
6 Biscoe 6.29 2.93 2.27 0.000278 fema… Adelie
7 Biscoe 6.77 2.72 2.34 0.000208 fema… Gentoo
8 Dream 6.77 2.90 2.28 0.000272 fema… Chinst…
9 Biscoe 6.79 2.65 2.32 0.000217 fema… Gentoo
10 Biscoe 6.14 2.83 2.26 0.000315 fema… Adelie
# … with 228 more rows, and abbreviated variable name ¹body_mass_g
Another common individual transformation is the Box-Cox transformation. The Box-Cox transformation is a power transformation that requires all values to be positive and is typically used to make the distribution of a skewed variable closer to normal. It relies on a parameter lambda that dictates which power transformation will be done.
Lambda | Transformation |
---|---|
-3 | Inverse Cube |
-2 | Inverse Square |
-1 | Inverse |
-0.5 | Inverse Square Root |
0 | Logarithm |
0.5 | Square Root |
1 | No Transformation |
2 | Square |
3 | Cube |
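For reference, the usual one-parameter Box-Cox transformation is (x^lambda - 1) / lambda when lambda is not 0, and log(x) when lambda is 0. The names in the table describe the shape of each transformation; the actual values are shifted and rescaled versions of the plain power. A minimal base R sketch (box_cox() is a helper written for this illustration, not a function from recipes):

```r
# Hypothetical helper implementing the standard Box-Cox formula
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))            # Box-Cox requires positive values
  if (abs(lambda) < 1e-8) {
    log(x)                         # limiting case as lambda approaches 0
  } else {
    (x^lambda - 1) / lambda
  }
}

x <- c(1, 4, 9)

box_cox(x, 0)     # same as log(x)
box_cox(x, 0.5)   # 2 * (sqrt(x) - 1): a shifted, rescaled square root
box_cox(x, 1)     # x - 1: a shift only, so the shape is unchanged
box_cox(x, -1)    # 1 - 1 / x: a shifted, sign-flipped inverse
```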
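The recipes package provides the step_BoxCox() step to apply a Box-Cox transformation to one or more features. The optimal lambda for each feature is estimated from the training data when the recipe is passed to prep(). The lambdas argument holds these estimated values once the recipe is prepared; a value supplied up front, such as the lambdas = 3 used for bill_depth_mm in the recipe below, appears to be overwritten by the estimate, since the transformed bill_depth_mm values in the output are not on the scale a cube transformation would produce. By default, the function searches for lambdas between -5 and 5, but different limits can be provided to the limits argument.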
recipe_05_obj <- recipe_obj %>%
step_naomit(flipper_length_mm) %>%
step_BoxCox(bill_length_mm) %>%
step_BoxCox(bill_depth_mm, lambdas = 3) %>%
step_BoxCox(flipper_length_mm, limits = c(-3, 0))
recipe_05_obj
Recipe
Inputs:
role #variables
outcome 1
predictor 6
Operations:
Removing rows with NA values in flipper_length_mm
Box-Cox transformation on bill_length_mm
Box-Cox transformation on bill_depth_mm
Box-Cox transformation on flipper_length_mm
Passing this new recipe to the prep() and juice() functions shows the transformed values for bill_length_mm, bill_depth_mm, and flipper_length_mm. Each feature receives its own estimated lambda, and the search for flipper_length_mm is restricted to the range given in limits.
# A tibble: 238 × 7
island bill_length_mm bill_depth_mm flipper_length_mm body_ma…¹ sex species
<fct> <dbl> <dbl> <dbl> <int> <fct> <fct>
1 Biscoe 12.4 13.7 0.661 2850 fema… Adelie
2 Biscoe 15.8 12.9 0.662 5450 male Gentoo
3 Dream 15.2 15.4 0.661 3600 male Chinst…
4 Biscoe 13.3 15.4 0.661 3550 male Adelie
5 Biscoe 14.2 11.4 0.661 4750 fema… Gentoo
6 Biscoe 13.1 14.6 0.661 3500 fema… Adelie
7 Biscoe 14.4 11.7 0.662 4700 fema… Gentoo
8 Dream 14.4 14.1 0.661 3575 fema… Chinst…
9 Biscoe 14.5 10.9 0.661 4500 fema… Gentoo
10 Biscoe 12.7 13.2 0.661 3075 fema… Adelie
# … with 228 more rows, and abbreviated variable name ¹body_mass_g
Normalization
Another common transformation for continuous variables is normalization. Usually, this process results in the transformed variable having a mean of 0 and a standard deviation of 1, and incorporates two steps:
- Subtract the average of the variable from each data point
- Divide each data point by the standard deviation of the variable
Centering the variable (subtracting its mean) can be accomplished using step_center(). Scaling the variable (dividing by its standard deviation) can be accomplished using step_scale(). Note that the default value of the factor argument is 1, which divides by one standard deviation so that the scaled variable has a standard deviation of 1. A value of 2 divides by two standard deviations instead, so the scaled variable has a standard deviation of 0.5. The step_normalize() function can be used to combine the steps, resulting in a mean of 0 and a standard deviation of 1.
Normalization has several definitions, all of which revolve around some form of adjusting the values to different scales. Adjusting to a custom scale can be done with step_range(). The min and max arguments define the range to which the variable will be rescaled.
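The arithmetic behind these steps can be sketched in base R. The vector x is made up for illustration, and the sketch assumes that step_scale() with factor = 2 divides by two standard deviations (which is how the recipes documentation describes that argument) and that step_range() is a linear rescale onto the requested interval:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # made-up values

centered   <- x - mean(x)               # like step_center(): mean of 0
scaled     <- x / sd(x)                 # like step_scale(): sd of 1
scaled_2sd <- x / (2 * sd(x))           # like step_scale(factor = 2): sd of 0.5
normalized <- (x - mean(x)) / sd(x)     # like step_normalize(): mean 0, sd 1

# Like step_range(min = -1, max = 1): linear rescale onto [-1, 1]
new_min <- -1
new_max <- 1
ranged <- new_min + (x - min(x)) * (new_max - new_min) / (max(x) - min(x))

c(mean(centered), sd(scaled), sd(scaled_2sd))
range(ranged)
```

Applying the same range formula to the training data agrees with the output further down: with a minimum of 2700 and a maximum of 6300 (inferred from that output, not stated in the post), a body mass of 4500 maps to 0.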
recipe_06_obj <- recipe_obj %>%
step_naomit(all_predictors()) %>%
step_center(bill_length_mm) %>%
step_scale(bill_depth_mm, factor = 2) %>%
step_normalize(flipper_length_mm) %>%
step_range(body_mass_g, min = -1, max = 1)
recipe_06_obj
Recipe
Inputs:
role #variables
outcome 1
predictor 6
Operations:
Removing rows with NA values in all_predictors()
Centering for bill_length_mm
Scaling for bill_depth_mm
Centering and scaling for flipper_length_mm
Range scaling to [-1,1] for body_mass_g
Passing this new recipe to the prep() and juice() functions shows the various transformations for bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g. Notice that three of the variables now have negative values as a result of the centering or range scaling.
# A tibble: 231 × 7
island bill_length_mm bill_depth_mm flipper_length_mm body_ma…¹ sex species
<fct> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
1 Biscoe -7.48 4.16 -1.41 -0.917 fema… Adelie
2 Biscoe 8.52 3.91 1.36 0.528 male Gentoo
3 Dream 5.72 4.67 -0.440 -0.5 male Chinst…
4 Biscoe -3.38 4.67 -1.27 -0.528 male Adelie
5 Biscoe 0.916 3.46 0.740 0.139 fema… Gentoo
6 Biscoe -4.38 4.44 -1.06 -0.556 fema… Adelie
7 Biscoe 1.82 3.56 1.23 0.111 fema… Gentoo
8 Dream 1.92 4.29 -0.787 -0.514 fema… Chinst…
9 Biscoe 2.12 3.31 0.670 0 fema… Gentoo
10 Biscoe -6.28 4.01 -1.27 -0.792 fema… Adelie
# … with 221 more rows, and abbreviated variable name ¹body_mass_g
Discretization & Dummy Variables
The previous steps focused on transforming continuous variables. Categorical (also called nominal, discrete, etc.) variables can also be pre-processed. First, continuous variables can be transformed into categorical variables using discretization. This is usually done by “cutting” the variable into contiguous intervals. The recipes package provides the step_discretize() step to do discretization. The num_breaks argument determines how many “bins” to cut the variable into, and the min_unique argument acts as a safeguard against sparse bins: if there are too few unique values relative to the number of bins, as governed by min_unique, no discretization takes place.
Another common categorical processing step is the creation of dummy variables (also called indicator variables). Dummy variables can be created with step_dummy(). Essentially, a new binary variable is created for each value of the categorical variable. Each new variable is 1 for every row where the original categorical variable had the associated value of the new variable, and 0 otherwise (e.g., for the variable sex, sex_male is 1 for each row originally labeled male in sex and 0 otherwise). Traditionally, creating dummy variables results in one fewer new variable than levels in the original categorical variable (e.g., “dummying” sex creates one new variable, sex_male, because there are only two levels in sex and the female level can be implied from where sex_male is 0). Set the argument one_hot to TRUE to “one-hot-encode” a categorical variable, which creates a new binary variable for each level.
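The difference between the two encodings is easiest to see on a tiny example. The sketch below uses a made-up four-row data frame (toy_tbl and its y column are hypothetical) with a three-level island factor:

```r
library(dplyr)
library(recipes)

# Made-up toy data for illustration
toy_tbl <- data.frame(
  y      = c(1, 2, 3, 4),
  island = factor(c("Biscoe", "Dream", "Torgersen", "Biscoe"))
)

# Traditional dummy coding: the first level (Biscoe) is the reference level
dummy_tbl <- recipe(y ~ ., data = toy_tbl) %>%
  step_dummy(island) %>%
  prep() %>%
  bake(new_data = NULL)

# One-hot encoding: one binary column per level
one_hot_tbl <- recipe(y ~ ., data = toy_tbl) %>%
  step_dummy(island, one_hot = TRUE) %>%
  prep() %>%
  bake(new_data = NULL)

names(dummy_tbl)
names(one_hot_tbl)
```

With traditional coding, a row from Biscoe is identified by both island columns being 0; with one-hot encoding it has its own island_Biscoe column.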
recipe_07_obj <- recipe_obj %>%
step_naomit(all_predictors()) %>%
step_select(body_mass_g, sex, island) %>%
step_discretize(body_mass_g, num_breaks = 7, min_unique = 3) %>%
step_dummy(sex, keep_original_cols = TRUE) %>%
step_dummy(island, one_hot = TRUE, keep_original_cols = TRUE)
recipe_07_obj
Recipe
Inputs:
role #variables
outcome 1
predictor 6
Operations:
Removing rows with NA values in all_predictors()
Variables selected body_mass_g, sex, island
Discretize numeric variables from body_mass_g
Dummy variables from sex
Dummy variables from island
Passing this new recipe to the prep() and juice() functions shows that body_mass_g is now a categorical variable, and there are now binary columns for sex and island. Note that step_select() was used to select only the three variables of interest before transformations were applied so that the result would print nicely. Note also that the argument keep_original_cols in step_dummy() was set to TRUE so that the new binary variables could be compared to the original features.
# A tibble: 231 × 7
body_mass_g sex island sex_male island_Biscoe island_Dream island_Torger…¹
<fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
1 bin1 female Biscoe 0 1 0 0
2 bin7 male Biscoe 1 1 0 0
3 bin2 male Dream 1 0 1 0
4 bin2 male Biscoe 1 1 0 0
5 bin5 female Biscoe 0 1 0 0
6 bin2 female Biscoe 0 1 0 0
7 bin5 female Biscoe 0 1 0 0
8 bin2 female Dream 0 0 1 0
9 bin5 female Biscoe 0 1 0 0
10 bin1 female Biscoe 0 1 0 0
# … with 221 more rows, and abbreviated variable name ¹island_Torgersen
Principal Components
All previous steps were transformations applied to a single feature. There are other transformation steps that take in and transform multiple features into one or more engineered features. Perhaps the most common method is principal components. Principal components can be implemented using the step_pca() function. The num_comp argument is used to select the number of components to return. Note that:
- Principal components are only valid for numeric predictors
  - step_select() was used to select all_numeric_predictors()
- Principal components does not work with missing values
  - step_naomit() was used to remove missing values in all_predictors() (that is, the numeric predictors, since all non-numeric predictors were removed in the previous step)
- Principal components requires normalized continuous variables
  - step_normalize() was used to normalize all_predictors()
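Outside of recipes, the same sequence of operations can be sketched with base R's prcomp(). This uses the full penguins data from modeldata rather than the training split, so the scores will not match the output further down, and the sign of each component is arbitrary. The object names are introduced here for illustration.

```r
data("penguins", package = "modeldata")

num_cols <- c("bill_length_mm", "bill_depth_mm",
              "flipper_length_mm", "body_mass_g")

# Keep numeric predictors only, then drop rows with missing values
num_df <- na.omit(as.data.frame(penguins[, num_cols]))

# Center and scale each column, then extract principal components
pca_fit <- prcomp(num_df, center = TRUE, scale. = TRUE)

# First three components, analogous to num_comp = 3
pc_scores <- pca_fit$x[, 1:3]
head(pc_scores)

# Proportion of variance explained by each component
summary(pca_fit)$importance["Proportion of Variance", ]
```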
recipe_08_obj <- recipe_obj %>%
step_select(all_numeric_predictors()) %>%
step_naomit(all_predictors()) %>%
step_normalize(all_predictors()) %>%
step_pca(all_predictors(), num_comp = 3)
recipe_08_obj
Recipe
Inputs:
role #variables
outcome 1
predictor 6
Operations:
Variables selected all_numeric_predictors()
Removing rows with NA values in all_predictors()
Centering and scaling for all_predictors()
PCA extraction with all_predictors()
Passing this new recipe to the prep() and juice() functions returns the requested first three principal components.
# A tibble: 238 × 3
PC1 PC2 PC3
<dbl> <dbl> <dbl>
1 -2.26 1.12 0.507
2 2.63 -0.451 0.0955
3 -0.478 -1.15 0.945
4 -1.75 -0.213 0.0845
5 1.51 1.15 0.176
6 -1.58 0.264 0.120
7 1.75 0.903 0.120
8 -0.719 -0.145 0.880
9 1.52 1.31 0.634
10 -1.81 1.21 0.574
# … with 228 more rows
General Transformations
There are many other steps available in the recipes package, far more than can be covered in one post. Further steps can be found on the package’s website under the reference section.
If a transformation is not formally included via a step_* function, then it can most likely be created manually using the step_mutate() function, which implements the mutate() function from the dplyr package. New variables are created just as they would be when using mutate() directly. Printing out the recipe object shows the code used to create the new variable.
set.seed(1915)
recipe_09_obj <- recipe_obj %>%
step_mutate(random_id = sample.int(n = nrow(train_tbl)))
recipe_09_obj
Recipe
Inputs:
role #variables
outcome 1
predictor 6
Operations:
Variable mutation for sample.int(n = nrow(train_tbl))
Passing this new recipe to the prep() and juice() functions returns the data set with the newly created variable.
# A tibble: 240 × 8
island bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex species rando…³
<fct> <dbl> <dbl> <int> <int> <fct> <fct> <int>
1 Biscoe 36.5 16.6 181 2850 fema… Adelie 208
2 Biscoe 52.5 15.6 221 5450 male Gentoo 178
3 Dream 49.7 18.6 195 3600 male Chinst… 60
4 Biscoe 40.6 18.6 183 3550 male Adelie 231
5 Biscoe 44.9 13.8 212 4750 fema… Gentoo 73
6 Biscoe 39.6 17.7 186 3500 fema… Adelie 197
7 Biscoe 45.8 14.2 219 4700 fema… Gentoo 78
8 Dream 45.9 17.1 190 3575 fema… Chinst… 132
9 Biscoe 46.1 13.2 211 4500 fema… Gentoo 11
10 Biscoe 37.7 16 183 3075 fema… Adelie 29
# … with 230 more rows, and abbreviated variable names ¹flipper_length_mm,
# ²body_mass_g, ³random_id
Non-Transforming Steps
Lastly, aside from pre-processing transformations, other steps are also available in the recipes package for manipulating a data set that do not result in a new or transformed feature. Three examples are implementations of the dplyr verbs select(), filter(), and arrange().
The step_select() function, as seen earlier, selects columns to remain in the data set after passing the recipe through prep() and juice() or bake(). Selection helpers such as all_outcomes(), all_predictors(), and all_numeric_predictors() can be used to help select columns. The step_filter() function filters rows based on conditions on one or more columns. The step_arrange() function sorts the rows by the values of one or more columns.
recipe_10_obj <- recipe_obj %>%
step_select(all_outcomes(), all_numeric_predictors()) %>%
step_filter(flipper_length_mm < 200) %>%
step_arrange(species)
recipe_10_obj
Recipe
Inputs:
role #variables
outcome 1
predictor 6
Operations:
Variables selected all_outcomes(), all_numeric_predictors()
Row filtering using flipper_length_mm < 200
Row arrangement using species
Passing this new recipe to the prep() and juice() functions returns the data set with the requested selection(s), filter(s), and arrangement(s).
# A tibble: 130 × 5
species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <dbl> <dbl> <int> <int>
1 Adelie 36.5 16.6 181 2850
2 Adelie 40.6 18.6 183 3550
3 Adelie 39.6 17.7 186 3500
4 Adelie 37.7 16 183 3075
5 Adelie 37.3 20.5 199 3775
6 Adelie 39.2 18.6 190 4250
7 Adelie 37 16.9 185 3000
8 Adelie 42.1 19.1 195 4000
9 Adelie 38.8 17.2 180 3800
10 Adelie 36 18.5 186 3100
# … with 120 more rows
Notes
In addition to the recipes package website, another excellent resource for feature engineering is the Feature Engineering and Selection book by Max Kuhn and Kjell Johnson.
This post is based on a presentation that was given on the date listed. It may be updated from time to time to fix errors, detail new functions, and/or remove deprecated functions so the packages and R version will likely be newer than what was available at the time.
The R session information used for this post:
R version 4.2.1 (2022-06-23)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 14.0
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] rsample_1.1.1 recipes_1.0.4 dplyr_1.0.10
loaded via a namespace (and not attached):
[1] tidyselect_1.2.0 xfun_0.40 purrr_0.3.5
[4] listenv_0.8.0 splines_4.2.1 lattice_0.20-45
[7] vctrs_0.6.3 generics_0.1.3 htmltools_0.5.3
[10] yaml_2.3.5 utf8_1.2.2 survival_3.3-1
[13] prodlim_2019.11.13 rlang_1.1.1 pillar_1.8.1
[16] withr_2.5.0 glue_1.6.2 lifecycle_1.0.3
[19] lava_1.7.1 stringr_1.5.0 timeDate_4022.108
[22] future_1.29.0 codetools_0.2-18 evaluate_0.16
[25] knitr_1.40 fastmap_1.1.0 parallel_4.2.1
[28] class_7.3-20 fansi_1.0.3 furrr_0.3.1
[31] Rcpp_1.0.9 renv_0.16.0 ipred_0.9-13
[34] jsonlite_1.8.0 parallelly_1.32.1 digest_0.6.29
[37] stringi_1.7.12 grid_4.2.1 hardhat_1.2.0
[40] cli_3.6.1 tools_4.2.1 magrittr_2.0.3
[43] tibble_3.1.8 tidyr_1.2.1 future.apply_1.10.0
[46] pkgconfig_2.0.3 ellipsis_0.3.2 MASS_7.3-57
[49] Matrix_1.4-1 lubridate_1.9.0 timechange_0.1.1
[52] gower_1.0.1 rmarkdown_2.16 rstudioapi_0.14
[55] R6_2.5.1 globals_0.16.2 rpart_4.1.16
[58] nnet_7.3-17 compiler_4.2.1