rstudio::conf 2020 Tidy Forecasting Workshop (Day 1)

Data Science
R
Time Series
Author

Robert Lankford

Published

February 4, 2020

This post documents my experience in the Tidy Time Series and Forecasting in R workshop at rstudio::conf 2020 that took place in San Francisco, CA during January 27-30, 2020.

The workshop took place over two days. This post covers the first day.


Following what was presented at the two-day workshop, this first post will be “everything that is not forecasting”. The first day focused on understanding the architecture that underpins the tidyverts, how to visualize time series, and ways to transform a time series.

Background

I have recently started working on more time series forecasting projects. As these requests keep increasing, I have been searching for a more streamlined approach to time series modeling and forecasting in R. While R has a variety of tools to handle time series, I wanted to use something that has both a consistent API (so that my documentation remains consistent between projects) and the ability to experiment with many models as quickly as possible.

I stumbled across the tidyverts family of packages about halfway through my first batch of forecasting projects and have been working to incorporate them into my workflow.

Last week, I had the opportunity to go to San Francisco for rstudio::conf 2020 and take the Tidy Time Series and Forecasting in R workshop with Dr. Rob Hyndman. In this post, I will document some of the most interesting and useful things I picked up at the workshop.

Setup

The following packages are required:

The following data sets are used throughout this post. These data are sourced from the tsibbledata package.

  1. Monthly retail turnover for department stores in Victoria, Australia
  2. Quarterly sheep meat production in Victoria, Australia
  3. Monthly turnover for food retail in Victoria, Australia
victoria_dept_stores_tsbl <- aus_retail %>% 
  filter(
    Industry == "Department stores",
    State == "Victoria",
    year(Month) >= 2000
  ) %>% 
  select(Month, Turnover)

victoria_dept_stores_tsbl
# A tsibble: 228 x 2 [1M]
      Month Turnover
      <mth>    <dbl>
 1 2000 Jan     228 
 2 2000 Feb     196.
 3 2000 Mar     224.
 4 2000 Apr     261.
 5 2000 May     265.
 6 2000 Jun     299.
 7 2000 Jul     206.
 8 2000 Aug     240.
 9 2000 Sep     251.
10 2000 Oct     270.
# … with 218 more rows
victoria_sheep_tsbl <- aus_livestock %>% 
  filter(State == "Victoria", Animal == "Sheep")

victoria_sheep_tsbl
# A tsibble: 558 x 4 [1M]
# Key:       Animal, State [1]
      Month Animal State      Count
      <mth> <fct>  <fct>      <dbl>
 1 1972 Jul Sheep  Victoria  757500
 2 1972 Aug Sheep  Victoria  685300
 3 1972 Sep Sheep  Victoria  639700
 4 1972 Oct Sheep  Victoria  876100
 5 1972 Nov Sheep  Victoria 1047900
 6 1972 Dec Sheep  Victoria  962900
 7 1973 Jan Sheep  Victoria 1025100
 8 1973 Feb Sheep  Victoria  908400
 9 1973 Mar Sheep  Victoria  919800
10 1973 Apr Sheep  Victoria  240300
# … with 548 more rows
victoria_food_retail_tsbl <- aus_retail %>% 
  filter(State == "Victoria", Industry == "Food retailing") %>% 
  select(Month, Turnover)

victoria_food_retail_tsbl
# A tsibble: 441 x 2 [1M]
      Month Turnover
      <mth>    <dbl>
 1 1982 Apr     310.
 2 1982 May     310.
 3 1982 Jun     314.
 4 1982 Jul     320.
 5 1982 Aug     300.
 6 1982 Sep     316.
 7 1982 Oct     344.
 8 1982 Nov     352.
 9 1982 Dec     408.
10 1983 Jan     330.
# … with 431 more rows

The tsibble Object

Turning a tibble (or a regular data.frame) into a tsibble is done with the as_tsibble() function. Specify the index argument as the name of your time index column. When printing a tsibble, the top line will tell you:

  1. That it is a tsibble object
  2. The dimension of the tsibble (rows by columns)
  3. The frequency of the time index in brackets (below, 1D is a frequency of one day)
dates <- seq.Date(
  from = as_date("2020-01-01"), 
  to   = as_date("2020-01-05"), 
  by   = "day"
)

data_tbl  <- tibble(time = dates, value = c("a", "b", "c", "d", "e"))
data_tsbl <- as_tsibble(data_tbl, index = time)
data_tbl
# A tibble: 5 × 2
  time       value
  <date>     <chr>
1 2020-01-01 a    
2 2020-01-02 b    
3 2020-01-03 c    
4 2020-01-04 d    
5 2020-01-05 e    
data_tsbl
# A tsibble: 5 x 2 [1D]
  time       value
  <date>     <chr>
1 2020-01-01 a    
2 2020-01-02 b    
3 2020-01-03 c    
4 2020-01-04 d    
5 2020-01-05 e    

Time Series Plots

Often times, simply plotting time series data is overlooked. It can be one of the most important steps in understanding your data well enough to produce a reasonable forecast. There are many methods the tidyverts provides to visualize time series data. Most of those functions are found in the feasts package.

Basic Time Series Plot

The most basic plot for time series data is an ordered line plot. This can easily be handled with the autoplot() function. The output is a ggplot2 object, so you can add other layers such as labels and themes.

victoria_dept_stores_tsbl %>% 
  autoplot(Turnover) +
  expand_limits(y = 0) +
  theme_minimal() +
  scale_y_continuous(
    labels = scales::label_dollar(prefix = "$", suffix = "M")
  ) +
  labs(
    title    = "Yearly Retail Turnover for Victoria, Australia",
    subtitle = "Sourced from the tsibbledata package",
    caption  = "Original source: Australian Bureau of Statistics",
    x        = "",
    y        = "Australian Dollars"
  )

To tease out possible seasonality, the gg_season() function can be used. This plots an individual line plot for each “one-level-up” time period specified in your tsibble. In the example below, the tsibble has monthly data specified, so each line plot is a single year (a year is “one-level-up” from a month). This plot helps to see if there are seasonal patterns that repeat, for this example, in the same months every year.

victoria_dept_stores_tsbl %>% 
  gg_season(Turnover, labels = "right") +
  expand_limits(y = 0) +
  theme_minimal() +
  scale_y_continuous(
    labels = scales::label_dollar(prefix = "$", suffix = "M")
  ) +
  labs(
    title    = "Seasonality of Yearly Retail Turnover for Victoria",
    subtitle = "Generated with the 'gg_season()' function",
    x        = "",
    y        = "Australian Dollars"
  )

Seasonal Subseries Plot

Another good way to examine seasonality is to fully separate out each season. This can be done with the gg_subseries() function. As seen below, department store turnover (unsurprisingly) peaks each December. From the previous plot, we saw that this repeats every year in the data. Whereas the last plot plots out a line for each “one-step-up” time index, the plot below plots out a line for each time index. In these examples, the previous plot plotted a line for each year (across all months in those years), the plot below plots a line for each month (across all years, grouping on the month).

victoria_dept_stores_tsbl %>% 
  gg_subseries(Turnover) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90)) +
  scale_y_continuous(
    labels = scales::label_dollar(prefix = "$", suffix = "M"), 
    limits = c(0, NA)
  ) +
  labs(
    title    = "Seasonality of Yearly Retail Turnover for Victoria",
    subtitle = "Generated with the 'gg_subseries()' function",
    x        = "",
    y        = "Australian Dollars"
  )

Lag Plot

Often we want to see how the current values of a time series relate to their past values. This can be done with the gg_lag() function, which plots each value with its n-th lag. Below, we can see how the lags correlate across months. There is some correlation across the lags, but the 12th lag shows the strongest. This is not surprising, as we have already seen that there is yearly seasonality in this data.

victoria_dept_stores_tsbl %>% 
  gg_lag(Turnover, geom = "point", lags = 1:12) +
  theme_bw() +
  scale_x_continuous(
    labels = scales::label_dollar(prefix = "$", suffix = "M"),
    limits = c(0, NA), 
    breaks = c(0, 300, 600)
  ) +
  scale_y_continuous(
    labels = scales::label_dollar(prefix = "$", suffix = "M"), 
    limits = c(0, NA), 
    breaks = c(0, 300, 600)
  ) +
  labs(
    title    = "Monthly Lagged Correlations of Retail Turnover for Victoria",
    subtitle = "Generated with the 'gg_lag()' function",
    x        = "Australian Dollars (original)",
    y        = "Australian Dollars (lagged)"
  )

Autocorrelation Plot

To further investigate, we can calculate and plot the autocorrelations. The autocorrelation function (ACF) calculates the correlation of the observations against various lags. This can be done using the ACF() function along with the autoplot() function.

victoria_dept_stores_tsbl %>% 
  ACF(Turnover, lag_max = 24) %>% 
  autoplot() +
  theme_minimal() +
  labs(
    title    = "Autocorrelation Plot of Retail Turnover for Victoria",
    subtitle = "Generated with the 'gg_lag()' function",
    x        = "Lag (1 month)",
    y        = "ACF"
  )

Transformations

It is common in time series problems to apply some transformation to the data to stabilize its variance. The following plot shows a time series where the variance changes quite drastically over time.

victoria_sheep_tsbl %>% 
  autoplot(Count) +
  theme_minimal() +
  scale_y_continuous(
    labels = scales::label_comma(scale = 1e-6, suffix = "M")
  ) +
  labs(
    title    = "Quarterly Sheep Meat Production for Victoria",
    subtitle = "Sourced from the tsibbledata package",
    caption  = "Original source: Australian Bureau of Statistics",
    x        = "Quarter",
    y        = "Count of Slaughtered Sheep"
  )

Common transformations include:

  • log
  • square root
  • inverse
victoria_sheep_tsbl %>% 
  rename(Original = Count) %>% 
  mutate(
    Log           = log(Original),
    `Square Root` = sqrt(Original),
    Inverse       = 1 / Original
  ) %>% 
  pivot_longer(cols = Original:Inverse, names_to = "Transformation") %>% 
  as_tsibble(key = Transformation, index = Month) %>% 
  
  autoplot(value) +
  
  facet_wrap(~ Transformation, ncol = 1, scales = "free_y") +
  theme_bw() +
  theme(legend.position = "none") +
  labs(
    title    = "Quarterly Sheep Meat Production for Victoria",
    subtitle = "Different transformations applied to original series",
    x        = "Quarter",
    y        = "(Transformation of the) Count of Slaughtered Sheep"
  )

Some transformations seem to not stabilize the variance enough, while others seem to go too far.

Another common transformation methodology is a Box-Cox Transformation. The lambda value controls the type and strength of the transformation. While there is not necessarily a widely accepted formal method for determining the optimal lambda value, the Guerrero method can often be used as a good starting point. We can implement this method by passing the guerrero() function into the features() function from the fabletools package.

victoria_sheep_tsbl %>% 
  features(Count, guerrero) %>% 
  mutate(across(where(is.numeric), round, digits = 2)) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Animal State lambda_guerrero
Sheep Victoria -0.21

While the Guerrero method implemented above results in a negative lambda value, Dr. Hyndman mentioned that he has never used a negative lambda and recommends that it be avoided.

Another option for finding a reasonable lambda value is to simply try several values of lambda and visually inspect the results. The Box-Cox transformed values can be calculated using the box_cox() function. Below, we try a few common values of lambda and plot the results.

tibble(lambda = c(0, 0.5, 1, 2)) %>% 
  mutate(
    victoria_sheep_transformed = map(
      .x = lambda,
      .f = ~ victoria_sheep_tsbl %>% 
        as_tibble() %>% 
        mutate(Count = box_cox(Count, .x))
    )
  ) %>% 
  unnest(victoria_sheep_transformed) %>% 
  mutate(lambda = as.factor(lambda)) %>% 
  
  ggplot(aes(x = Month, y = Count, color = lambda)) +
  geom_line() +
  facet_wrap(~ lambda, ncol = 1, scales = "free_y") +
  theme_bw() +
  theme(legend.position = "none") +
  labs(
    title    = "Box-Cox Transformations of Quarterly Sheep Meat Production",
    subtitle = "Different values of lambda applied to original series",
    x        = "Quarter",
    y        = "(Transformation of the) Count of Slaughtered Sheep"
  )

None of these values look that great.

To demonstrate what a successful variance stabilization would look like, we can look at what happens with the following data set:

victoria_food_retail_tsbl %>% 
  rename(Original = Turnover) %>% 
  mutate(
    `Square Root` = Original ^ (1/2),
    Log           = log(Original)
  ) %>% 
  pivot_longer(
    cols         = -Month, 
    names_to     = "Transformation",
    names_ptypes = list(
      Transformation = factor(levels = c("Original", "Square Root", "Log"))
    )
  ) %>% 
  as_tsibble(index = Month, key = Transformation) %>% 
  
  autoplot(value) +
  
  facet_wrap(~ Transformation, ncol = 1, scales = "free_y") +
  theme_bw() +
  theme(legend.position = "none") +
  labs(
    title    = "Yearly Food Retail Turnover for Victoria, Australia",
    subtitle = "Different transformations applied to the original series",
    x        = "Quarter",
    y        = "(Transformation of) Australian Dollars ($M)"
  )

Recap

The first day of the workshop focused on all the foundation work to build up to forecasting. We covered:

  1. What a tsibble is
  2. time series plots
  3. time series transformations

The next parts will document day 2 of the workshop. Day 2 consisted of:

  1. Building time series models
  2. Validating time series models
  3. Producing forecasts
  4. Determining forecast accuracy

Notes

This post is based on a presentation that was given on the date listed. It may be updated from time to time to fix errors, detail new functions, and/or remove deprecated functions so the packages and R version will likely be newer than what was available at the time.

The R session information used for this post:

R version 4.2.1 (2022-06-23)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 14.1.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] kableExtra_1.3.4  ggplot2_3.4.0     fable_0.3.2       feasts_0.3.0     
 [5] fabletools_0.3.2  tsibbledata_0.4.1 tsibble_1.1.3     stringr_1.4.1    
 [9] lubridate_1.9.0   timechange_0.1.1  purrr_0.3.5       tidyr_1.2.1      
[13] dplyr_1.0.10     

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.0     xfun_0.40            colorspace_2.0-3    
 [4] vctrs_0.6.3          generics_0.1.3       viridisLite_0.4.1   
 [7] htmltools_0.5.3      yaml_2.3.5           utf8_1.2.2          
[10] rlang_1.1.1          pillar_1.8.1         glue_1.6.2          
[13] withr_2.5.0          rappdirs_0.3.3       distributional_0.3.1
[16] lifecycle_1.0.3      munsell_0.5.0        anytime_0.3.9       
[19] gtable_0.3.1         rvest_1.0.3          evaluate_0.16       
[22] labeling_0.4.2       knitr_1.40           fastmap_1.1.0       
[25] fansi_1.0.3          highr_0.9            Rcpp_1.0.9          
[28] renv_0.16.0          scales_1.2.1         webshot_0.5.4       
[31] jsonlite_1.8.0       systemfonts_1.0.4    farver_2.1.1        
[34] digest_0.6.29        stringi_1.7.12       grid_4.2.1          
[37] cli_3.6.1            tools_4.2.1          magrittr_2.0.3      
[40] tibble_3.1.8         pkgconfig_2.0.3      ellipsis_0.3.2      
[43] xml2_1.3.3           rmarkdown_2.16       svglite_2.1.0       
[46] httr_1.4.4           rstudioapi_0.14      R6_2.5.1            
[49] compiler_4.2.1