This post documents my experience in the Tidy Time Series and Forecasting in R workshop at rstudio::conf 2020 that took place in San Francisco, CA during January 27-30, 2020.
The workshop took place over two days. This post covers the first day.
Following what was presented at the two-day workshop, this first post will be “everything that is not forecasting”. The first day focused on understanding the architecture that underpins the tidyverts
, how to visualize time series, and ways to transform a time series.
Background
I have recently started working on more time series forecasting projects. As these requests keep increasing, I have been searching for a more streamlined approach to time series modeling and forecasting in R. While R has a variety of tools to handle time series, I wanted to use something that has both a consistent API (so that my documentation remains consistent between projects) and the ability to experiment with many models as quickly as possible.
I stumbled across the tidyverts
family of packages about halfway through my first batch of forecasting projects and have been working to incorporate them into my workflow.
Last week, I had the opportunity to go to San Francisco for rstudio::conf 2020 and take the Tidy Time Series and Forecasting in R workshop with Dr. Rob Hyndman. In this post, I will document some of the most interesting and useful things I picked up at the workshop.
Setup
The following packages are required:
The following data sets are used throughout this post. These data are sourced from the tsibbledata
package.
- Monthly retail turnover for department stores in Victoria, Australia
- Quarterly sheep meat production in Victoria, Australia
- Monthly turnover for food retail in Victoria, Australia
victoria_dept_stores_tsbl <- aus_retail %>%
filter(
Industry == "Department stores",
State == "Victoria",
year(Month) >= 2000
) %>%
select(Month, Turnover)
victoria_dept_stores_tsbl
# A tsibble: 228 x 2 [1M]
Month Turnover
<mth> <dbl>
1 2000 Jan 228
2 2000 Feb 196.
3 2000 Mar 224.
4 2000 Apr 261.
5 2000 May 265.
6 2000 Jun 299.
7 2000 Jul 206.
8 2000 Aug 240.
9 2000 Sep 251.
10 2000 Oct 270.
# … with 218 more rows
victoria_sheep_tsbl <- aus_livestock %>%
filter(State == "Victoria", Animal == "Sheep")
victoria_sheep_tsbl
# A tsibble: 558 x 4 [1M]
# Key: Animal, State [1]
Month Animal State Count
<mth> <fct> <fct> <dbl>
1 1972 Jul Sheep Victoria 757500
2 1972 Aug Sheep Victoria 685300
3 1972 Sep Sheep Victoria 639700
4 1972 Oct Sheep Victoria 876100
5 1972 Nov Sheep Victoria 1047900
6 1972 Dec Sheep Victoria 962900
7 1973 Jan Sheep Victoria 1025100
8 1973 Feb Sheep Victoria 908400
9 1973 Mar Sheep Victoria 919800
10 1973 Apr Sheep Victoria 240300
# … with 548 more rows
victoria_food_retail_tsbl <- aus_retail %>%
filter(State == "Victoria", Industry == "Food retailing") %>%
select(Month, Turnover)
victoria_food_retail_tsbl
# A tsibble: 441 x 2 [1M]
Month Turnover
<mth> <dbl>
1 1982 Apr 310.
2 1982 May 310.
3 1982 Jun 314.
4 1982 Jul 320.
5 1982 Aug 300.
6 1982 Sep 316.
7 1982 Oct 344.
8 1982 Nov 352.
9 1982 Dec 408.
10 1983 Jan 330.
# … with 431 more rows
The tsibble
Object
Turning a tibble
(or a regular data.frame
) into a tsibble
is done with the as_tsibble()
function. Specify the index
argument as the name of your time index column. When printing a tsibble
, the top line will tell you:
- That it is a
tsibble
object - The dimension of the
tsibble
(rows by columns) - The frequency of the time index in brackets (below, 1D is a frequency of one day)
data_tbl
# A tibble: 5 × 2
time value
<date> <chr>
1 2020-01-01 a
2 2020-01-02 b
3 2020-01-03 c
4 2020-01-04 d
5 2020-01-05 e
data_tsbl
# A tsibble: 5 x 2 [1D]
time value
<date> <chr>
1 2020-01-01 a
2 2020-01-02 b
3 2020-01-03 c
4 2020-01-04 d
5 2020-01-05 e
Time Series Plots
Often times, simply plotting time series data is overlooked. It can be one of the most important steps in understanding your data well enough to produce a reasonable forecast. There are many methods the tidyverts
provides to visualize time series data. Most of those functions are found in the feasts
package.
Basic Time Series Plot
The most basic plot for time series data is an ordered line plot. This can easily be handled with the autoplot()
function. The output is a ggplot2
object, so you can add other layers such as labels and themes.
victoria_dept_stores_tsbl %>%
autoplot(Turnover) +
expand_limits(y = 0) +
theme_minimal() +
scale_y_continuous(
labels = scales::label_dollar(prefix = "$", suffix = "M")
) +
labs(
title = "Yearly Retail Turnover for Victoria, Australia",
subtitle = "Sourced from the tsibbledata package",
caption = "Original source: Australian Bureau of Statistics",
x = "",
y = "Australian Dollars"
)
To tease out possible seasonality, the gg_season()
function can be used. This plots an individual line plot for each “one-level-up” time period specified in your tsibble
. In the example below, the tsibble
has monthly data specified, so each line plot is a single year (a year is “one-level-up” from a month). This plot helps to see if there are seasonal patterns that repeat, for this example, in the same months every year.
victoria_dept_stores_tsbl %>%
gg_season(Turnover, labels = "right") +
expand_limits(y = 0) +
theme_minimal() +
scale_y_continuous(
labels = scales::label_dollar(prefix = "$", suffix = "M")
) +
labs(
title = "Seasonality of Yearly Retail Turnover for Victoria",
subtitle = "Generated with the 'gg_season()' function",
x = "",
y = "Australian Dollars"
)
Seasonal Subseries Plot
Another good way to examine seasonality is to fully separate out each season. This can be done with the gg_subseries()
function. As seen below, department store turnover (unsurprisingly) peaks each December. From the previous plot, we saw that this repeats every year in the data. Whereas the last plot plots out a line for each “one-step-up” time index, the plot below plots out a line for each time index. In these examples, the previous plot plotted a line for each year (across all months in those years), the plot below plots a line for each month (across all years, grouping on the month).
victoria_dept_stores_tsbl %>%
gg_subseries(Turnover) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
scale_y_continuous(
labels = scales::label_dollar(prefix = "$", suffix = "M"),
limits = c(0, NA)
) +
labs(
title = "Seasonality of Yearly Retail Turnover for Victoria",
subtitle = "Generated with the 'gg_subseries()' function",
x = "",
y = "Australian Dollars"
)
Lag Plot
Often we want to see how the current values of a time series relate to their past values. This can be done with the gg_lag()
function, which plots each value with its n-th lag. Below, we can see how the lags correlate across months. There is some correlation across the lags, but the 12th lag shows the strongest. This is not surprising, as we have already seen that there is yearly seasonality in this data.
victoria_dept_stores_tsbl %>%
gg_lag(Turnover, geom = "point", lags = 1:12) +
theme_bw() +
scale_x_continuous(
labels = scales::label_dollar(prefix = "$", suffix = "M"),
limits = c(0, NA),
breaks = c(0, 300, 600)
) +
scale_y_continuous(
labels = scales::label_dollar(prefix = "$", suffix = "M"),
limits = c(0, NA),
breaks = c(0, 300, 600)
) +
labs(
title = "Monthly Lagged Correlations of Retail Turnover for Victoria",
subtitle = "Generated with the 'gg_lag()' function",
x = "Australian Dollars (original)",
y = "Australian Dollars (lagged)"
)
Autocorrelation Plot
To further investigate, we can calculate and plot the autocorrelations. The autocorrelation function (ACF) calculates the correlation of the observations against various lags. This can be done using the ACF()
function along with the autoplot()
function.
Transformations
It is common in time series problems to apply some transformation to the data to stabilize its variance. The following plot shows a time series where the variance changes quite drastically over time.
victoria_sheep_tsbl %>%
autoplot(Count) +
theme_minimal() +
scale_y_continuous(
labels = scales::label_comma(scale = 1e-6, suffix = "M")
) +
labs(
title = "Quarterly Sheep Meat Production for Victoria",
subtitle = "Sourced from the tsibbledata package",
caption = "Original source: Australian Bureau of Statistics",
x = "Quarter",
y = "Count of Slaughtered Sheep"
)
Common transformations include:
- log
- square root
- inverse
victoria_sheep_tsbl %>%
rename(Original = Count) %>%
mutate(
Log = log(Original),
`Square Root` = sqrt(Original),
Inverse = 1 / Original
) %>%
pivot_longer(cols = Original:Inverse, names_to = "Transformation") %>%
as_tsibble(key = Transformation, index = Month) %>%
autoplot(value) +
facet_wrap(~ Transformation, ncol = 1, scales = "free_y") +
theme_bw() +
theme(legend.position = "none") +
labs(
title = "Quarterly Sheep Meat Production for Victoria",
subtitle = "Different transformations applied to original series",
x = "Quarter",
y = "(Transformation of the) Count of Slaughtered Sheep"
)
Some transformations seem to not stabilize the variance enough, while others seem to go too far.
Another common transformation methodology is a Box-Cox Transformation. The lambda
value controls the type and strength of the transformation. While there is not necessarily a widely accepted formal method for determining the optimal lambda value, the Guerrero method can often be used as a good starting point. We can implement this method by passing the guerrero()
function into the features()
function from the fabletools
package.
victoria_sheep_tsbl %>%
features(Count, guerrero) %>%
mutate(across(where(is.numeric), round, digits = 2)) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Animal | State | lambda_guerrero |
---|---|---|
Sheep | Victoria | -0.21 |
While the Guerrero method implemented above results in a negative lambda value, Dr. Hyndman mentioned that he has never used a negative lambda and recommends that it be avoided.
Another option for finding a reasonable lambda value is to simply try several values of lambda and visually inspect the results. The Box-Cox transformed values can be calculated using the box_cox()
function. Below, we try a few common values of lambda and plot the results.
tibble(lambda = c(0, 0.5, 1, 2)) %>%
mutate(
victoria_sheep_transformed = map(
.x = lambda,
.f = ~ victoria_sheep_tsbl %>%
as_tibble() %>%
mutate(Count = box_cox(Count, .x))
)
) %>%
unnest(victoria_sheep_transformed) %>%
mutate(lambda = as.factor(lambda)) %>%
ggplot(aes(x = Month, y = Count, color = lambda)) +
geom_line() +
facet_wrap(~ lambda, ncol = 1, scales = "free_y") +
theme_bw() +
theme(legend.position = "none") +
labs(
title = "Box-Cox Transformations of Quarterly Sheep Meat Production",
subtitle = "Different values of lambda applied to original series",
x = "Quarter",
y = "(Transformation of the) Count of Slaughtered Sheep"
)
None of these values look that great.
To demonstrate what a successful variance stabilization would look like, we can look at what happens with the following data set:
victoria_food_retail_tsbl %>%
rename(Original = Turnover) %>%
mutate(
`Square Root` = Original ^ (1/2),
Log = log(Original)
) %>%
pivot_longer(
cols = -Month,
names_to = "Transformation",
names_ptypes = list(
Transformation = factor(levels = c("Original", "Square Root", "Log"))
)
) %>%
as_tsibble(index = Month, key = Transformation) %>%
autoplot(value) +
facet_wrap(~ Transformation, ncol = 1, scales = "free_y") +
theme_bw() +
theme(legend.position = "none") +
labs(
title = "Yearly Food Retail Turnover for Victoria, Australia",
subtitle = "Different transformations applied to the original series",
x = "Quarter",
y = "(Transformation of) Australian Dollars ($M)"
)
Recap
The first day of the workshop focused on all the foundation work to build up to forecasting. We covered:
- What a
tsibble
is - time series plots
- time series transformations
The next parts will document day 2 of the workshop. Day 2 consisted of:
- Building time series models
- Validating time series models
- Producing forecasts
- Determining forecast accuracy
Notes
This post is based on a presentation that was given on the date listed. It may be updated from time to time to fix errors, detail new functions, and/or remove deprecated functions so the packages and R version will likely be newer than what was available at the time.
The R session information used for this post:
R version 4.2.1 (2022-06-23)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 14.1.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] kableExtra_1.3.4 ggplot2_3.4.0 fable_0.3.2 feasts_0.3.0
[5] fabletools_0.3.2 tsibbledata_0.4.1 tsibble_1.1.3 stringr_1.4.1
[9] lubridate_1.9.0 timechange_0.1.1 purrr_0.3.5 tidyr_1.2.1
[13] dplyr_1.0.10
loaded via a namespace (and not attached):
[1] tidyselect_1.2.0 xfun_0.40 colorspace_2.0-3
[4] vctrs_0.6.3 generics_0.1.3 viridisLite_0.4.1
[7] htmltools_0.5.3 yaml_2.3.5 utf8_1.2.2
[10] rlang_1.1.1 pillar_1.8.1 glue_1.6.2
[13] withr_2.5.0 rappdirs_0.3.3 distributional_0.3.1
[16] lifecycle_1.0.3 munsell_0.5.0 anytime_0.3.9
[19] gtable_0.3.1 rvest_1.0.3 evaluate_0.16
[22] labeling_0.4.2 knitr_1.40 fastmap_1.1.0
[25] fansi_1.0.3 highr_0.9 Rcpp_1.0.9
[28] renv_0.16.0 scales_1.2.1 webshot_0.5.4
[31] jsonlite_1.8.0 systemfonts_1.0.4 farver_2.1.1
[34] digest_0.6.29 stringi_1.7.12 grid_4.2.1
[37] cli_3.6.1 tools_4.2.1 magrittr_2.0.3
[40] tibble_3.1.8 pkgconfig_2.0.3 ellipsis_0.3.2
[43] xml2_1.3.3 rmarkdown_2.16 svglite_2.1.0
[46] httr_1.4.4 rstudioapi_0.14 R6_2.5.1
[49] compiler_4.2.1