Practical Session 3: Managing date formats and plotting

Session inject

You have been provided with one MS Excel file (tsa_practice.xlsx) containing 2 sheets, one for each of two different diseases (dis1, dis2); and one csv file (tsa_pumala.csv) about Puumala virus infections in Finland.

Expected Learning Outcomes

By the end of the session, participants should be able to:

Manage surveillance datasets with different date formats
Create specific time series objects using tsibble
Plot surveillance data against time

Source code
Show the code# Install pacman if not installed already, and activate it
if (!require("pacman")) install.packages("pacman")

# Install, update and activate libraries
pacman::p_load(
  here, 
  rio, 
  skimr,
  tsibble,
  ISOweek,
  tidyverse
)

3.1 - Visualization exercise

Objective: to assess visually the reported number of cases of dis1 by ISO (epidemiological) week or calendar month for all the data provided. What diseases do you think dis1 is, judging from the cases’ distribution in time?

Import and explore data

Begin with importing your data to R from these Excel/CSV files and examine them using the well-known str, summary and skim functions

Show the code

dis1 <- import(here("data", "tsa_practice.xlsx"), which = "dis1")

Since we will work with time series data, we could use some time to explore the most relevant variable: date (in this case, the year variable). dis1 is a dataset containing weekly counts of a certain disease between 1981 and 1989. A typical year has 52 weeks, thus we can count the number of times each year appear in the data using table, tabyl or count function, your go-to tools for categorical variables

Show the code

table(dis1$year)


1981 1982 1983 1984 1985 1986 1987 1988 1989 
  52   52   52   52   52   52   52   52   15

Show the code

janitor::tabyl(dis1$year)

 dis1$year  n    percent
      1981 52 0.12064965
      1982 52 0.12064965
      1983 52 0.12064965
      1984 52 0.12064965
      1985 52 0.12064965
      1986 52 0.12064965
      1987 52 0.12064965
      1988 52 0.12064965
      1989 15 0.03480278

Each one provides different outputs and possesses unique capabilities. In general, tabyl should be your preferred basic function, thanks to its tidy integration and format output. It also enables some customization and data manipulation.

Time series data format

R has a collection of packages for handling time series data. Some of these packages are part of the tidyverse family of packages. Particularly, it includes the tsibble package, which intends to create a data infrastructure for easier manipulation and handling of temporal data, and adapts the principles of tidy data).

According to the help description, a tsibble object is defined by

An index, as a variable with inherent ordering from past to present. Mandatory for a ts object
A key, as a set of variables that define observational units over time, i.e. region, state, age group, etc.. Optional for a ts object
Uniquely identified observations by an index and a key (if defined).

To convert the original dis1 dataset to a tsibble object, we use the as_tsibble() function and specify the index, i.e. the variable specifying the time unit of interest (year and week variables in our case). We are not using key variables for now. Variables that are not used for index or key purposes, such as the cases variables, are considered as measured variables.

To define the index, we can make use of the tsibble built-in functions such as make_yearweek (or yearmonth, yearquarter, etc., see documentation)

Show the code

dis1_ts <- dis1 %>%
  mutate(date_index = make_yearweek(year = year, week = week)) %>%
  as_tsibble(index = date_index)

head(dis1_ts, 5)

# A tsibble: 5 x 4 [1W]
   year  week cases date_index
  <dbl> <dbl> <dbl>     <week>
1  1981     1     1   1981 W01
2  1981     2     3   1981 W02
3  1981     3    NA   1981 W03
4  1981     4     4   1981 W04
5  1981     5     3   1981 W05

You can see how the number of dis1 cases is distributed in time using the ggplot package.

Show the code

ggplot(data = dis1_ts) +
    geom_line(mapping = aes(x = date_index, y = cases)) +
    scale_x_yearweek(date_labels = "%Y") +
    labs(x = "Date", y = "Number of Cases", title = "Disease 1 data") +
    theme_bw() + 
        theme(
            plot.title = element_text(face = "bold", 
                                      size = 12),
            legend.background = element_rect(fill = "white", 
                                             size = 4, 
                                             colour = "white"),
            # legend.justification = c(0, 1),
            legend.position = "bottom",
            panel.grid.minor = element_blank()
        )

One tip: if you design ggplot themes through more complex theme() customizations, you can save it in an object and later use it in your plots. This way, you only need to write the code once which helps keeping the consistency across your entire plots

Show the code

# We can save theme modifications into a single object and then use it in new plots
tsa_theme <- theme_bw() + 
        theme(
            plot.title = element_text(face = "bold", 
                                      size = 12),
            legend.background = element_rect(fill = "white", 
                                             size = 4, 
                                             colour = "white"),
            # legend.justification = c(0, 1),
            legend.position = "bottom",
            panel.grid.minor = element_blank()
        )

Note from the plot that there are missing values in the data.

If data points are collected at regular time interval, we can correct missing data by providing a default value or interpolating missing values from other data. The tsibble package has the fill_gaps function, which fills missing values with a pre-specified default value. It is quite common to replaces NAs with its previous observation for each time point in a time series analysis, which is easily done using the fill function from tidyr package.

A quick overview of implicit missing values with tsibble is available on vignette("implicit-na").

Note

For time series data visualization, there are specific packages and functions you can use. As the focus of this training is on understanding principles of time series analysis rather than visualisation of time series, we will mainly use the ggplot functions for the remaining exercises (also so you get to practise more complex plots)

Pumala data

Let’s now work with the Pumala data

Here you have one variable for the year, one variable for the month and one variable with the complete date in days, but in a text (chr class) format.

With the complete date, you can generate ISO weeks, but first you need the date variable to be converted from text to something R recognises as a date. The lubridate package has a number of convenient functions for converting text strings to dates.

Show the code

dis3 <- dis3 %>%
  mutate(my_date = dmy(date_str))

str(dis3$my_date)

 Date[1:3844], format: "1995-01-02" "1995-01-02" "1995-01-03" "1995-01-03" "1995-01-04" ...

As a complementary note, check the help for strftime to learn about different date formats in R (?strftime).

To convert to ISO week, we can use the ISOweek function from the ISOweek package, which creates a new variable representing the ISO week. We could then also create a new variable representing the first Monday of each ISO week. Although the popular lubridate package has an isoweek function (there is also a similar function in the surveillance package), we use the ISOweek package here as it has the ISOweek2date function. If epidemiological weeks are required, use the EpiWeek package.

The paste command concatenates text.

Here we are adding "-01" onto the end of the ISO week variable (which is formatted something like "1995-W01"), to indicate that we want the first day of that week, and then supplying that to the ISOweek2date function, which converts that to a date.

Have a look at the new variables that have been created. date_isowk is the ISO week variable, in string format, and isodate is the Monday of each ISO week. Note that the years 1998 and 2004 each have an ISO week 53.

You have several observations in the same week since you have data from different days and one value corresponds to one case. Use the count function to aggregate the data.

Show the code

dis3_v2 <- dis3 %>%
    count(date_isowk)

head(dis3_v2)

  date_isowk  n
1   1995-W01 13
2   1995-W02 15
3   1995-W03  6
4   1995-W04  5
5   1995-W05 11
6   1995-W06 10

Show the code

dis3_ts <- dis3_v2 %>%
    mutate(date_index = yearweek(x = date_isowk, week_start = 1L)) %>%
    as_tsibble(index = date_index)


ggplot(data = dis3_ts) +
    geom_line(mapping = aes(x = date_index, y = n)) +
    scale_x_yearweek(date_labels = "%Y") +
    labs(x = "Date", 
         y = "Number of Cases", 
         title = "Disease 1 data") +
    tsa_theme

Aggregating cases by month is another possibility.

Show the code

dis3_agg <- dis3 %>%
    count(year, month)

dis3_ts_v2 <- dis3_agg %>%
    mutate(date_index = make_yearmonth(year = year, month = month)) %>%
    as_tsibble(index = date_index)

ggplot(data = dis3_ts_v2) +
    geom_line(mapping = aes(x = date_index, y = n)) +
    scale_x_yearweek(date_labels = "%Y") +
    labs(x = "Date", 
         y = "Number of Cases", 
         title = "Disease 1 data") +
    tsa_theme

3.2 Optional task

Import your data to R from the dis2 excel file.

Show the code

dis2 <- import(here("data", "tsa_practice.xlsx"), which = "dis2")
View(dis2)

Inspect the data.

You have separate columns containing the counts. To plot this data, you need to first reshape your dataset by converting it from the current wide format to a long format.

The function pivot_longer can perform such transformation.

Show the code

dis2l <- dis2 %>%
  pivot_longer(
      cols = -year, 
      names_to = "month", 
      values_to = "case"
  ) %>%
  mutate(month = as_factor(month)) %>%  # as_factor sets levels in the order they appear
  arrange(year, month)

str(dis2l)

tibble [528 × 3] (S3: tbl_df/tbl/data.frame)
 $ year : num [1:528] 1928 1928 1928 1928 1928 ...
 $ month: Factor w/ 12 levels "Jan","Feb","Mar",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ case : num [1:528] 609 1516 4952 7466 11155 ...

Show the code

head(dis2l)

# A tibble: 6 × 3
   year month  case
  <dbl> <fct> <dbl>
1  1928 Jan     609
2  1928 Feb    1516
3  1928 Mar    4952
4  1928 Apr    7466
5  1928 May   11155
6  1928 Jun    7002

Show the code

dis2l_agg <- dis2l %>%
    mutate(date_index = make_yearmonth(year = year, month = month)) %>%
    as_tsibble(index = date_index)

head(dis2l_agg)

# A tsibble: 6 x 4 [1M]
   year month  case date_index
  <dbl> <fct> <dbl>      <mth>
1  1928 Jan     609  1928 ene.
2  1928 Feb    1516  1928 feb.
3  1928 Mar    4952  1928 mar.
4  1928 Apr    7466  1928 abr.
5  1928 May   11155  1928 may.
6  1928 Jun    7002  1928 jun.

Show the code

ggplot(data = dis2l_agg) +
    geom_line(mapping = aes(x = date_index, y = case)) +
    scale_x_yearweek(date_labels = "%Y") +
    labs(x = "Date", 
         y = "Number of Cases", 
         title = "Disease 3 data") +
    tsa_theme

dis1 corresponds to salmonellosis cases, dis2 to measles cases in New York.

# Practical Session 3: Managing date formats and plotting {.unnumbered} ## Session inject {#learning-3 .unnumbered} You have been provided with one MS Excel file (`tsa_practice.xlsx`) containing 2 sheets, one for each of two different diseases (**dis1**, **dis2**); and one csv file (`tsa_pumala.csv`) about Puumala virus infections in Finland. ### Expected Learning Outcomes By the end of the session, participants should be able to: - Manage surveillance datasets with different date formats - Create specific time series objects using `tsibble` - Plot surveillance data against time ### Source code ```{r task-3-1-source-Rscript, message=FALSE, warning=FALSE} # Install pacman if not installed already, and activate it if (!require("pacman")) install.packages("pacman") # Install, update and activate libraries pacman::p_load( here, rio, skimr, tsibble, ISOweek, tidyverse ) ``` ## 3.1 - Visualization exercise - **Objective**: to assess visually the reported number of cases of dis1 by ISO (epidemiological) week or calendar month for all the data provided. What diseases do you think dis1 is, judging from the cases' distribution in time? ### Import and explore data Begin with importing your data to R from these Excel/CSV files and examine them using the well-known `str`, `summary` and `skim` functions ```{r, task-3-1-import-xls} dis1 <- import(here("data", "tsa_practice.xlsx"), which = "dis1") ``` ```{r task-3-1-str-dis1} #| eval: false #| include: false str(dis1) summary(dis1) skim(dis1) %>% as_tibble() ``` Since we will work with time series data, we could use some time to explore the most relevant variable: **date** (in this case, the `year` variable). `dis1` is a dataset containing weekly counts of a certain disease between 1981 and 1989. A typical year has 52 weeks, thus we can count the number of times each year appear in the data using `table`, `tabyl` or `count` function, your go-to tools for categorical variables ```{r, task-3-1-table-dis1} table(dis1$year) ``` ```{r} janitor::tabyl(dis1$year) ``` Each one provides different outputs and possesses unique capabilities. In general, `tabyl` should be your preferred basic function, thanks to its tidy integration and format output. It also enables some customization and data manipulation. ### Time series data format R has a collection of packages for handling time series data. Some of these packages are part of the `tidyverse` family of packages. Particularly, it includes the `tsibble` package, which intends to create a data infrastructure for easier manipulation and handling of temporal data, and adapts the [principles of tidy data](https://tidyr.tidyverse.org/articles/tidy-data.html)). According to the help description, a `tsibble` object is defined by - An `index`, as a variable with inherent ordering from past to present. Mandatory for a ts object - A `key`, as a set of variables that define observational units over time, i.e. region, state, age group, etc.. Optional for a ts object - Uniquely identified observations by an index and a key (if defined). To convert the original `dis1` dataset to a `tsibble` object, we use the `as_tsibble()` function and specify the `index`, i.e. the variable specifying the time unit of interest (`year` and `week` variables in our case). We are not using `key` variables for now. Variables that are not used for index or key purposes, such as the `cases` variables, are considered as measured variables. To define the index, we can make use of the `tsibble` built-in functions such as `make_yearweek` (or yearmonth, yearquarter, etc., see documentation) ```{r, task-3-1-tsibble-week-base} dis1_ts <- dis1 %>% mutate(date_index = make_yearweek(year = year, week = week)) %>% as_tsibble(index = date_index) head(dis1_ts, 5) ``` You can see how the number of dis1 cases is distributed in time using the `ggplot` package. ```{r, task-3-1-plot-tsibble-week-tidy} ggplot(data = dis1_ts) + geom_line(mapping = aes(x = date_index, y = cases)) + scale_x_yearweek(date_labels = "%Y") + labs(x = "Date", y = "Number of Cases", title = "Disease 1 data") + theme_bw() + theme( plot.title = element_text(face = "bold", size = 12), legend.background = element_rect(fill = "white", size = 4, colour = "white"), # legend.justification = c(0, 1), legend.position = "bottom", panel.grid.minor = element_blank() ) ``` **One tip**: if you design ggplot themes through more complex `theme()` customizations, you can save it in an object and later use it in your plots. This way, you only need to write the code once which helps keeping the consistency across your entire plots ```{r} # We can save theme modifications into a single object and then use it in new plots tsa_theme <- theme_bw() + theme( plot.title = element_text(face = "bold", size = 12), legend.background = element_rect(fill = "white", size = 4, colour = "white"), # legend.justification = c(0, 1), legend.position = "bottom", panel.grid.minor = element_blank() ) ``` Note from the plot that there are missing values in the data. If data points are collected at regular time interval, we can correct missing data by providing a default value or interpolating missing values from other data. The `tsibble` package has the `fill_gaps` function, which fills missing values with a pre-specified default value. It is quite common to replaces `NA`s with its previous observation for each time point in a time series analysis, which is easily done using the `fill` function from `tidyr` package. A quick overview of implicit missing values with `tsibble` is available on `vignette("implicit-na")`. ::: callout-note For time series data visualization, there are specific packages and functions you can use. As the focus of this training is on understanding principles of time series analysis rather than visualisation of time series, we will mainly use the `ggplot` functions for the remaining exercises (also so you get to practise more complex plots) ::: ### Pumala data Let's now work with the Pumala data ```{r task-3-1-dis3-import} #| include: false dis3 <- import(here("data", "tsa_pumala.csv")) view(dis3) ``` Here you have one variable for the year, one variable for the month and one variable with the complete date in days, but in a text (`chr` class) format. With the complete date, you can generate ISO weeks, but first you need the date variable to be converted from text to something R recognises as a date. The `lubridate` package has a number of convenient functions for converting text strings to dates. ```{r, task-3-1-dis3-date-format-base} dis3 <- dis3 %>% mutate(my_date = dmy(date_str)) str(dis3$my_date) ``` As a complementary note, check the help for `strftime` to learn about different date formats in R (`?strftime`). To convert to ISO week, we can use the `ISOweek` function from the `ISOweek` package, which creates a new variable representing the ISO week. We could then also create a new variable representing the first Monday of each ISO week. Although the popular `lubridate` package has an `isoweek` function (there is also a similar function in the `surveillance` package), we use the `ISOweek` package here as it has the `ISOweek2date` function. If epidemiological weeks are required, use the `EpiWeek` package. ```{r task-3-1-dis3-isoweek-base} #| include: false dis3 <- dis3 %>% mutate( date_isowk = ISOweek(my_date), isodate = ISOweek2date(paste(date_isowk, "-1", sep = "")) ) view(dis3) ``` The `paste` command concatenates text. Here we are adding `"-01"` onto the end of the ISO week variable (which is formatted something like `"1995-W01"`), to indicate that we want the first day of that week, and then supplying that to the `ISOweek2date` function, which converts that to a date. Have a look at the new variables that have been created. `date_isowk` is the ISO week variable, in string format, and `isodate` is the Monday of each ISO week. Note that the years 1998 and 2004 each have an ISO week 53. You have several observations in the same week since you have data from different days and one value corresponds to one case. Use the `count` function to aggregate the data. ```{r, task-3-1-dis3-isoweek-aggr-base} dis3_v2 <- dis3 %>% count(date_isowk) head(dis3_v2) ``` ```{r, task-3-1-dis3-isoweek-plot-tidy} dis3_ts <- dis3_v2 %>% mutate(date_index = yearweek(x = date_isowk, week_start = 1L)) %>% as_tsibble(index = date_index) ggplot(data = dis3_ts) + geom_line(mapping = aes(x = date_index, y = n)) + scale_x_yearweek(date_labels = "%Y") + labs(x = "Date", y = "Number of Cases", title = "Disease 1 data") + tsa_theme ``` Aggregating cases by month is another possibility. ```{r, task-3-1-dis3-month-plot-tidy} dis3_agg <- dis3 %>% count(year, month) dis3_ts_v2 <- dis3_agg %>% mutate(date_index = make_yearmonth(year = year, month = month)) %>% as_tsibble(index = date_index) ggplot(data = dis3_ts_v2) + geom_line(mapping = aes(x = date_index, y = n)) + scale_x_yearweek(date_labels = "%Y") + labs(x = "Date", y = "Number of Cases", title = "Disease 1 data") + tsa_theme ``` ## 3.2 Optional task {#solution-3-1-1 .unnumbered} Import your data to R from the dis2 excel file. ```{r, task-3-1-opt-dis2-import} dis2 <- import(here("data", "tsa_practice.xlsx"), which = "dis2") View(dis2) ``` Inspect the data. ```{r task-3-1-opt-dis2-inspect} #| include: false str(dis2) head(dis2) summary(dis2) ``` You have separate columns containing the counts. To plot this data, you need to first reshape your dataset by converting it from the current wide format to a long format. The function `pivot_longer` can perform such transformation. ```{r, task-3-1-opt-dis2-pivot-longer-tidy} dis2l <- dis2 %>% pivot_longer( cols = -year, names_to = "month", values_to = "case" ) %>% mutate(month = as_factor(month)) %>% # as_factor sets levels in the order they appear arrange(year, month) str(dis2l) head(dis2l) dis2l_agg <- dis2l %>% mutate(date_index = make_yearmonth(year = year, month = month)) %>% as_tsibble(index = date_index) head(dis2l_agg) ggplot(data = dis2l_agg) + geom_line(mapping = aes(x = date_index, y = case)) + scale_x_yearweek(date_labels = "%Y") + labs(x = "Date", y = "Number of Cases", title = "Disease 3 data") + tsa_theme ``` **dis1** corresponds to salmonellosis cases, **dis2** to measles cases in New York.