Practical Session 4: Smoothing and Trends

Show the code
rm(list=ls())
# Install pacman if not installed already, and activate it
if (!require("pacman")) install.packages("pacman")

# Install, update and activate libraries
pacman::p_load(
  here, 
  rio, 
  skimr,
  tsibble,
  ISOweek,
  slider,    # for rolling means
  pander,
  tidyverse,
  gtsummary
)


# Create tsa_theme
tsa_theme <- theme_bw() + 
        theme(
            plot.title = element_text(face = "bold", 
                                      size = 12),
            legend.background = element_rect(fill = "white", 
                                             size = 4, 
                                             colour = "white"),
            legend.justification = c(0, 1),
            legend.position = "bottom",
            panel.grid.minor = element_blank()
        )

Mortality Surveillance data in Spain

The number of deaths in a country can be used as a surveillance indicator. An example of such an application is EuroMOMO, which monitors number of fatalities in European countries on a weekly basis (https://www.euromomo.eu/). The weekly time series on the number of deathls are routinely used to follow seasonal infections, for example influenza and covid 19 during the winter, but also heatwaves in the summer.

In case studies 4 to 7, we will analyse the number of deaths retrieved from the Spanish national statistical institute (INE) (https://ine.es/). Each case study will build up from the analysis conducted in the previous case study. The data contains the number of fatalities per week for the entire country, from 2010 to 2019, and includes variables on year, week, total number of deaths (cases), and total population; the same data is also included for men and women.

Expected learning outcomes

By the end of the session, participants should be able to:

  • Describe, test and fit a trend in surveillance data (simple smoothing and regression);

  • Assess and interpret the significance of trend in surveillance data.

Save any changes to a new dataset named `mortagg.

Task 4.1

Assess visually how the total registered number of deaths has been behaving in the years with available data.

Discuss your results with your peers.

Moving average and direct statistical modelling of trends

Moving averages are simple methods to visualise the general trend of a series after removing some of the random day-to-day variation by smoothing the data. This allows you to browse your data for periodicity and observe the general trend. In other words, smoothing the data may remove “noise” from your time series and can facilitate visual interpretation.

Moving averages model a time series by calculating the numerical mean of the values adjacent to it. It is calculated for each observation, moving along the time axis: e.g. at each time \(t\) and for a window of 5 time units, one way of calculating the moving average is by using the observations at \(t-2\), \(t-1\), \(t\), \(t+1\) and \(t+2\).

Help for Task 4.1

Load the mortagg.Rdata dataset.

Show the code
load(here("data", "mortagg2.Rdata"))

Inspect the data.

Show the code
str(mortagg)
'data.frame':   521 obs. of  8 variables:
 $ cases_m: num  4207 4409 4229 4252 4196 ...
 $ cases_f: num  4019 4264 4122 3837 3976 ...
 $ cases  : num  8226 8673 8351 8089 8172 ...
 $ year   : num  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
 $ week   : num  1 2 3 4 5 6 7 8 9 10 ...
 $ pop    : num  46486621 46486621 46486621 46486621 46486621 ...
 $ pop_f  : num  23504349 23504349 23504349 23504349 23504349 ...
 $ pop_m  : num  2.3e+07 2.3e+07 2.3e+07 2.3e+07 2.3e+07 ...
Show the code
summary(mortagg)
    cases_m        cases_f         cases            year           week      
 Min.   :3276   Min.   :2941   Min.   : 6316   Min.   :2010   Min.   : 1.00  
 1st Qu.:3648   1st Qu.:3464   1st Qu.: 7106   1st Qu.:2012   1st Qu.:14.00  
 Median :3839   Median :3684   Median : 7510   Median :2015   Median :27.00  
 Mean   :3969   Mean   :3819   Mean   : 7788   Mean   :2015   Mean   :26.55  
 3rd Qu.:4194   3rd Qu.:4047   3rd Qu.: 8217   3rd Qu.:2017   3rd Qu.:40.00  
 Max.   :5597   Max.   :5877   Max.   :11471   Max.   :2019   Max.   :53.00  
      pop               pop_f              pop_m         
 Min.   :46418884   Min.   :23504349   Min.   :22806445  
 1st Qu.:46486621   1st Qu.:23612439   1st Qu.:22831107  
 Median :46497393   Median :23621034   Median :22889600  
 Mean   :46608292   Mean   :23669979   Mean   :22938313  
 3rd Qu.:46712650   3rd Qu.:23719207   3rd Qu.:23016783  
 Max.   :46918951   Max.   :23902168   Max.   :23099009  
Show the code
view(mortagg)
Show the code
skimr::skim(mortagg) 
Data summary
Name mortagg
Number of rows 521
Number of columns 8
_______________________
Column type frequency:
numeric 8
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
cases_m 0 1 3969.15 454.44 3276 3648 3839 4194 5597 ▇▇▃▁▁
cases_f 0 1 3818.53 544.44 2941 3464 3684 4047 5877 ▆▇▃▁▁
cases 0 1 7787.68 992.14 6316 7106 7510 8217 11471 ▇▇▃▁▁
year 0 1 2014.50 2.87 2010 2012 2015 2017 2019 ▇▇▇▇▇
week 0 1 26.55 15.05 1 14 27 40 53 ▇▇▇▇▇
pop 0 1 46608291.50 163065.99 46418884 46486621 46497393 46712650 46918951 ▇▁▅▂▂
pop_f 0 1 23669978.80 102454.81 23504349 23612439 23621034 23719207 23902168 ▂▇▆▂▂
pop_m 0 1 22938312.61 100399.88 22806445 22831107 22889600 23016783 23099009 ▇▅▁▇▅

Create a time series object with tsibble, using the total case counts and plot the time series.

Show the code
mortz <-
  mortagg %>%
  mutate(date_index = make_yearweek(year = year, week = week)) %>%
  as_tsibble(index = date_index)


ggplot(data = mortz) +
  geom_line(mapping = aes(x = date_index, y = cases)) +
  scale_x_yearweek(date_labels = "%Y-%W", date_breaks = "1 year") +
  labs(x = "Year Week", y = "Number of Deaths", title = "Spain: number of deaths per week") +
  tsa_theme

Show the code
## Adjusting y-axis scale
ggplot(data = mortz) +
  geom_line(mapping = aes(x = date_index, y = cases)) +
  scale_x_yearweek(date_labels = "%Y-%W", date_breaks = "1 year") +
  scale_y_continuous(limits = c(0, NA)) +
  labs(x = "Date", y = "Number of Deaths", title = "Spain: number of deaths per week") +
  tsa_theme

Create a similar plot for men and women.

Show the code
## Men and Women in wide format

colors <- c("cases_m"="green", "cases_f"="black")

ggplot(data = mortz, aes(x = date_index)) +
  geom_line(mapping = aes(y = cases_m, colour = "cases_m"), lwd=1) +
  geom_line(mapping = aes(y = cases_f, colour = "cases_f"), lwd=1) + 
  scale_x_yearweek(date_labels = "%Y-%W", date_breaks = "1 year") +
  scale_y_continuous(limits = c(0, NA)) +
  labs(x = "Week", y = "Number of Deaths", title = "Spain: number of deaths per week and by sex", 
         colour = "Legend") +
  scale_color_manual(values = colors, name="", labels = c("Men", "Women")) +
  tsa_theme

Task 4.2

Assess visually the existence of a long-term trend and periodicity in the mortaagg dataset by using smoothing techniques.

Consider: where would you centre the smoothing window and why? What is the effect of using different smoothing window centring options? Test!

Of all the methods applied here (moving average with different windows, loess with different spans), which one do you think is the best for eliminating seasonality? Where would you centre the smoothing window and why? What would happen if you used an even greater window? Test!

Help for Task 4.2

According to the slider package description, slider is a package for rolling windows analysis. It means that a given function is repeatedly applied to different “windows” of your data as you step through it. Typical examples of applications of rolling window functions include moving averages or cumulative sums. An introduction vignette for slider can be found by running the following command: vignette("slider")

For each record, create the following various types of moving average.

  • MA5a: the 5-week moving average, centered on cases.

  • MA5b: the 5-week moving average of cases and the 4 previous weeks.

  • MA5c: the 5-week moving average of the 5 previous weeks.

Compare results.

Show the code
mortzma <-
  mortz %>%
  mutate(
    MA5a = slide_index_dbl(
      .x = cases,
      .i = date_index,
      .f = mean,
      na.rm = TRUE,
      .before = 2,
      .after = 2,
      .complete = TRUE
    ),
    MA5b = slide_index_dbl(
      .x = cases,
      .i = date_index,
      .f = mean,
      na.rm = TRUE,
      .before = 4,
      .complete = TRUE
    ),
    MA5c = slide_index_dbl(
      .x = cases,
      .i = date_index,
      .f = function(x) mean(x[-6], na.rm = TRUE),
      .before = 5,
      .complete = TRUE
    )
  )

# view first 10 lines of data
head(mortzma, 10)
# A tsibble: 10 x 12 [1W]
   cases_m cases_f cases  year  week    pop  pop_f  pop_m date_index  MA5a  MA5b
     <dbl>   <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>     <week> <dbl> <dbl>
 1    4207    4019  8226  2010     1 4.65e7 2.35e7 2.30e7   2010 W01   NA    NA 
 2    4409    4264  8673  2010     2 4.65e7 2.35e7 2.30e7   2010 W02   NA    NA 
 3    4229    4122  8351  2010     3 4.65e7 2.35e7 2.30e7   2010 W03 8302.   NA 
 4    4252    3837  8089  2010     4 4.65e7 2.35e7 2.30e7   2010 W04 8253.   NA 
 5    4196    3976  8172  2010     5 4.65e7 2.35e7 2.30e7   2010 W05 8183  8302.
 6    4127    3854  7981  2010     6 4.65e7 2.35e7 2.30e7   2010 W06 8131. 8253.
 7    4294    4028  8322  2010     7 4.65e7 2.35e7 2.30e7   2010 W07 8015  8183 
 8    4121    3968  8089  2010     8 4.65e7 2.35e7 2.30e7   2010 W08 7887. 8131.
 9    3824    3687  7511  2010     9 4.65e7 2.35e7 2.30e7   2010 W09 7853. 8015 
10    3921    3612  7533  2010    10 4.65e7 2.35e7 2.30e7   2010 W10 7710. 7887.
# ℹ 1 more variable: MA5c <dbl>

The slide_index_dbl is a slider package function designed to run a pre-specified function on a rolling window of a numeric (_dbl) variable, ordered by an index time (date or date-time) variable. A full description of this function can be found by running the following command: ?slide_index_dbl

The .x argument provides the vector with the numbers that will be used for computation.

The .i argument defines the vector with the time variable that will be used for ordering the data.

The .f argument indicates the function that will be used to perform the intended computations. In this case, an arithmetic mean computation is carried out by using the mean function. The na.rm = TRUE is a varying argument included as a ... argument of slide_index_dbl. (Run the expression args("slide_index_dbl") and take notice on the position/sequence of this function’s arguments).

The .before argument defines the number of observations before the central time point of a given time window to use in the rolling window computation. In the example above for a 5 time points centered moving average, in MA5a, 2 observations are used before the central point, and 2 observations are used after it, using the .after argument.

The complete argument indicates where the computation should be carried in rolling windows that have complete observations.

We applied a more complicated function to compute MA5c, taking a backwards moving window of six values but omitting the latest value (current time point) from the calculation of the average.

Show the code
## wide format dataset ---

colors <- c("cases"="black", "MA5a"="red", "MA5b"="blue", "MA5c"="brown")

ggplot(data = mortzma, mapping = aes(x = date_index)) +
  geom_line(mapping = aes(y = cases, colour = "cases")) +
  geom_line(mapping = aes(y = MA5a, colour = "MA5a")) +
  geom_line(mapping = aes(y = MA5b, colour = "MA5b")) +
  geom_line(mapping = aes(y = MA5c, colour = "MA5c")) +
  scale_x_yearweek(date_labels = "%Y-%W", 
                   date_breaks = "1 year") +
  scale_y_continuous(limits = c(0, NA)) +
  labs(x = "Week", 
       y = "Number of deaths", 
       title = "Spain: number of deaths - moving averages",
       colour="Legend") +
  scale_color_manual(values = colors, name="") +
  tsa_theme

We observe that the calculation is similar across these various methods, but is not aligned to the series in the same way. MA5a is centered in the middle of the period used to calculate the mean. MA5b is placed at the end of the period. MA5c is placed one step forward (smoothing functions can be used for forecasting the following point). The “models” provided are similar for a 5-week window, but the lag is different.

Moving average is only one way of smoothing. Other ways of smoothing the data to get a general idea of the trend include, for example, LOESS (locally estimated scatterplot smoothing) smoothing, where the contribution of surrounding observations is weighted, i.e. it is not the arithmetical mean for each set (window) of observations.

Why do you think we have chosen these windows for the moving average?

To better observe the general trend, we need to find the length of the moving average that will erase the seasonal component. Various lengths can be tried; here we have used 25, 51 and 103.

Show the code
mortzma2 <-
  mortz %>%
  mutate(
    MA25 = slide_index_dbl(
      .x = cases,
      .i = date_index,
      .f = mean,
      na.rm = TRUE,
      .before = 12,
      .after = 12,
      .complete = TRUE
    ),
    MA51 = slide_index_dbl(
      .x = cases,
      .i = date_index,
      .f = mean,
      na.rm = TRUE,
      .before = 25,
      .after=25,
      .complete = TRUE
    ),
    MA103 = slide_index_dbl(
      .x = cases,
      .i = date_index,
      .f = mean,
      na.rm = TRUE,
      .before = 51,
      .after = 51,
      .complete = TRUE
    )
  )

## Visually inspect data
slice(mortzma2, 20:30)
# A tsibble: 11 x 12 [1W]
   cases_m cases_f cases  year  week    pop  pop_f  pop_m date_index  MA25  MA51
     <dbl>   <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>     <week> <dbl> <dbl>
 1    3867    3436  7303  2010    20 4.65e7 2.35e7 2.30e7   2010 W20 7181.   NA 
 2    3758    3446  7204  2010    21 4.65e7 2.35e7 2.30e7   2010 W21 7121.   NA 
 3    3605    3595  7200  2010    22 4.65e7 2.35e7 2.30e7   2010 W22 7095.   NA 
 4    3440    3076  6516  2010    23 4.65e7 2.35e7 2.30e7   2010 W23 7057.   NA 
 5    3407    3124  6531  2010    24 4.65e7 2.35e7 2.30e7   2010 W24 7003.   NA 
 6    3632    3201  6833  2010    25 4.65e7 2.35e7 2.30e7   2010 W25 6954.   NA 
 7    3690    3567  7257  2010    26 4.65e7 2.35e7 2.30e7   2010 W26 6925. 7303.
 8    3764    3625  7389  2010    27 4.65e7 2.35e7 2.30e7   2010 W27 6899. 7309.
 9    3697    3638  7335  2010    28 4.65e7 2.35e7 2.30e7   2010 W28 6879. 7314.
10    3570    3349  6919  2010    29 4.65e7 2.35e7 2.30e7   2010 W29 6860. 7316.
11    3476    3209  6685  2010    30 4.65e7 2.35e7 2.30e7   2010 W30 6854. 7323.
# ℹ 1 more variable: MA103 <dbl>
Show the code
slice(mortzma2, 45:55)
# A tsibble: 11 x 12 [1W]
   cases_m cases_f cases  year  week    pop  pop_f  pop_m date_index  MA25  MA51
     <dbl>   <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>     <week> <dbl> <dbl>
 1    3771    3563  7334  2010    45 4.65e7 2.35e7 2.30e7   2010 W45 7476. 7405.
 2    3817    3561  7378  2010    46 4.65e7 2.35e7 2.30e7   2010 W46 7562. 7400.
 3    4049    3516  7565  2010    47 4.65e7 2.35e7 2.30e7   2010 W47 7628  7394.
 4    4001    3794  7795  2010    48 4.65e7 2.35e7 2.30e7   2010 W48 7691. 7395.
 5    4058    3754  7812  2010    49 4.65e7 2.35e7 2.30e7   2010 W49 7746. 7401.
 6    4034    3668  7702  2010    50 4.65e7 2.35e7 2.30e7   2010 W50 7812. 7404.
 7    4061    3892  7953  2010    51 4.65e7 2.35e7 2.30e7   2010 W51 7866. 7406.
 8    4484    4056  8540  2010    52 4.65e7 2.35e7 2.30e7   2010 W52 7900. 7402.
 9    4593    4315  8908  2011     1 4.67e7 2.36e7 2.30e7   2011 W01 7927. 7403.
10    4276    4194  8470  2011     2 4.67e7 2.36e7 2.30e7   2011 W02 7962. 7395.
11    4271    4175  8446  2011     3 4.67e7 2.36e7 2.30e7   2011 W03 7962. 7394.
# ℹ 1 more variable: MA103 <dbl>
Show the code
slice(mortzma2, 95:105)
# A tsibble: 11 x 12 [1W]
   cases_m cases_f cases  year  week    pop  pop_f  pop_m date_index  MA25  MA51
     <dbl>   <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>     <week> <dbl> <dbl>
 1    3824    3513  7337  2011    43 4.67e7 2.36e7 2.30e7   2011 W43 7490. 7772.
 2    3818    3643  7461  2011    44 4.67e7 2.36e7 2.30e7   2011 W44 7571. 7784.
 3    3737    3698  7435  2011    45 4.67e7 2.36e7 2.30e7   2011 W45 7669. 7791.
 4    3789    3644  7433  2011    46 4.67e7 2.36e7 2.30e7   2011 W46 7789. 7806.
 5    3744    3602  7346  2011    47 4.67e7 2.36e7 2.30e7   2011 W47 7958. 7807.
 6    3971    3708  7679  2011    48 4.67e7 2.36e7 2.30e7   2011 W48 8132. 7811.
 7    4025    3824  7849  2011    49 4.67e7 2.36e7 2.30e7   2011 W49 8289. 7817.
 8    4148    3882  8030  2011    50 4.67e7 2.36e7 2.30e7   2011 W50 8393. 7816.
 9    4354    4054  8408  2011    51 4.67e7 2.36e7 2.30e7   2011 W51 8494. 7805.
10    4476    4277  8753  2011    52 4.67e7 2.36e7 2.30e7   2011 W52 8546. 7789.
11    4425    4304  8729  2012     1 4.68e7 2.37e7 2.31e7   2012 W01 8595. 7797.
# ℹ 1 more variable: MA103 <dbl>

You can use View to examine the first rows of mortzma2.

Show the code
View(mortzma2)

Plot the times series with the different moving averages.

Show the code
## wide format dataset ---

colors <- c("cases"="black", "MA25"="red", "MA51"="blue", "MA103"="orange")

ggplot(data = mortzma2, mapping = aes(x = date_index)) +
  geom_line(mapping = aes(y = cases, colour = "cases"), linewidth=1) +
  geom_line(mapping = aes(y = MA25, colour = "MA25"), linewidth=1.2) +
  geom_line(mapping = aes(y = MA51, colour = "MA51"), linewidth=1.2) +
  geom_line(mapping = aes(y = MA103, colour = "MA103"), linewidth=1.2) +
  scale_x_yearweek(date_labels = "%Y-%W", 
                   date_breaks = "1 year") +
  scale_y_continuous(limits = c(0, NA)) +
  labs(x = "Year-Week", 
       y = "Number of deaths", 
       title = "Spain: number of deaths - moving averages with 26, 51 and 103-week windows",
       colour="Legend") +
  scale_color_manual(values = colors, name="", breaks = names(colors)[c(1,2,3,4)]) +
  tsa_theme

Moving average is only one way of smoothing. Other ways of smoothing the data to get a general idea of the trend include, for example, LOESS (locally estimated scatterplot smoothing) smoothing, where the contribution of surrounding observations is weighted, i.e. it is not the arithmetical mean for each set (window) of observations.

The geom_smooth provides a smoothing curve using a LOESS method.

Show the code
ggplot(mortz, mapping = aes(x = date_index, y = cases)) +
  geom_point(alpha = 0.5) +
  geom_smooth(se = TRUE) +
  scale_x_yearweek(date_labels = "%Y-%W", 
                   date_breaks = "1 year") +
  scale_y_continuous(limits = c(0, NA)) +
  labs(x = "Date", 
       y = "Number of Deaths", 
       title = "Loess smoothing") +
  tsa_theme

We can change the degree of smoothing by adding a span option. The span controls the amount of smoothing for the default loess smoother; smaller numbers produce wigglier lines, and larger numbers produce smoother lines.

Show the code
ggplot(data = mortz, mapping = aes(x = date_index, y = cases)) +
  geom_point(alpha = 0.5) +
  geom_smooth(se = TRUE, span = 0.1) +
  scale_x_yearweek(date_labels = "%Y-%W", date_breaks = "1 year") +
  scale_y_continuous(limits = c(0, NA)) +
  labs(x = "Date", y = "Number of Deaths", title = "Loess smoothing (span = 0.1)") +
  tsa_theme

In ggplot(), we can also use stat_smooth, which provides a smoothing curve using a LOESS method, and were wee can change the degree of smoothing by adding a span option. The span controls the amount of smoothing for the default loess smoother; smaller numbers produce wigglier lines, and larger numbers produce smoother lines.

Show the code
ggplot(data = mortz, mapping = aes(x = date_index, y = cases)) +
  geom_point(alpha = 0.5) +
  stat_smooth(se = TRUE, span=0.1) +
  scale_x_yearweek(date_labels = "%Y-%W", date_breaks = "1 year") +
  scale_y_continuous(limits = c(0, NA)) +
  labs(x = "Year-Week", y = "Number of deaths", 
    title = "Spain: number of deaths - loess smoothing (span = 0.1)") +
  tsa_theme

Test different values of span. Comment on the lines provided by the different smoothing windows. Which one do you think is the best for eliminating seasonality? What would happen if you used an even greater window?

Task 4.3

Assess statistically whether the registered number of deaths has been stable in the years available in your dataset.

Help for Task 4.3

Using regression methods against the time variable is a simple and familiar way to look at trends and test the slope. The kind of regression and interpretation you will use depends on the nature of your dependent variable (outcome), in this case, the number (count) of registered deaths. Time will be represented by a variable that simply enumerates the observations; in other words, time is a variable with values from 1 (the first observation in time; here week 1, 2010) to n (the last observation in time; here week 52, 2019).

Show the code
mortz <-
  mortz %>%
  mutate(index = seq.int(from = 1, to = nrow(.)))

mort_poissontrend <- glm(formula = cases ~ index,
                          family = "poisson",
                          data = mortz
)

Check model results.

Show the code
summary(mort_poissontrend)

Call:
glm(formula = cases ~ index, family = "poisson", data = mortz)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) 8.910e+00  1.007e-03 8850.25   <2e-16 ***
index       1.895e-04  3.302e-06   57.39   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 62509  on 520  degrees of freedom
Residual deviance: 59215  on 519  degrees of freedom
AIC: 64841

Number of Fisher Scoring iterations: 4

The glm function fits a linear regression model to the log of the weekly number of deaths. You can use the names function to see the various components of the output of the glm function, as above. As mentioned in the previous session, the str function is also often useful for looking at the “structure” of the output of an R function. Summary provides a quick function for checking the results.

Variables in R models are specified using notation along the lines of response_variable ~ explanatory_variable_1 + explanatory_variable_2 etc. Check the material from the MVA module. A more detailed explanation of how to build model formulas in R can also be found in the Details section of the help page of the formula function: ?stats::formula.

summary, when given the results of fitting a regression model, will provide a summary of the fitted parameters (coefficients); in this case the parameter of index is the trend.

Identify and interpret the intercept and the trend. In a Poisson model on the log(y), the coefficient needs to be exponentiated in order to interpret it for the output y (that is, the original y without the log).

Plot the fitted values against the observed ones.

Show the code
colors <- c("observed"="black", "fitted"="blue")

ggplot(data = mortz, mapping = aes(x = date_index)) +
  geom_point(mapping = aes(y = cases, colour="observed"), alpha = 0.5) +
  geom_line(mapping = aes(y = fitted(mort_poissontrend), colour = "fitted"), linewidth=1.1) +
  scale_y_continuous(limits = c(0, NA)) +
  scale_x_yearweek(date_labels = "%Y-%W", date_breaks = "1 year") +
  labs(x = "Year-Week", y = "Number of deaths", title = "Spain: number of deaths per week with fitted trend") +
  scale_color_manual(values = colors, name="") +
  tsa_theme

The trend as incidence rate ratio (IRR) provides the average relative change in the number of deaths from one week to the next.

Show the code
# Trend in the original scale of the outcome cases (deaths) wit 95% confidence intervals. 
IRR_trend <- exp(cbind(Estimate=mort_poissontrend$coef, confint(mort_poissontrend)))

(IRR_trend['index', ] - 1) *100
  Estimate      2.5 %     97.5 % 
0.01894927 0.01830203 0.01959652 

The result above indicates that from one week to another, the number of deaths increases in average 0,019% per week, with a 95% confidence interval of [0,018%; 0,020%].

Another way in R to obtain the incidence rate ratio (IRR) is:

{r: task-4-3-IRR-gtsummary} mort_poissontrend %>% gtsummary::tbl_regression( exponentiate = T, estimate_fun = ~style_number(.x, digits = 4) )

This code creates a regression table using the tbl_regression() function from the gtsummary package. It has two arguments: - exponentiate = T (or TRUE) This tells R to exponentiate the coefficients. - estimate_fun = ~style_number(.x, digits = 4) This controls how the numbers in your table are formatted. .x represents each coefficient value, and style_number() formats it to 4 decimal places.

Task 4.4

How could you take the changing population size of Spain into account in your analyses?

Help for Task 4.4

To take the population into account, we need to include the population in the model. This is done by adding a term called offset(). The parameter beta for the offset, in this case for the population, will be forced to be 1; in other words, the offset will not affect the outcome. The advantage is that we can interpret the IRR of time in terms of mortality rate = cases/population.

Show the code
mort_poissontrend_pop <- glm(cases ~  index + offset(log(pop)),
                          data = mortz,
                          family = "poisson"
                        )

summary(mort_poissontrend_pop)

Call:
glm(formula = cases ~ index + offset(log(pop)), family = "poisson", 
    data = mortz)

Coefficients:
              Estimate Std. Error  z value Pr(>|z|)    
(Intercept) -8.746e+00  1.006e-03 -8691.66   <2e-16 ***
index        1.863e-04  3.299e-06    56.47   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 62588  on 520  degrees of freedom
Residual deviance: 59399  on 519  degrees of freedom
AIC: 65025

Number of Fisher Scoring iterations: 4
Show the code
# Trend in the original scale of the outcome cases (deaths) with 95% confidence intervals. 
IRR_trend_pop <- exp(cbind(Estimate=mort_poissontrend_pop$coef, confint(mort_poissontrend_pop)))

(IRR_trend_pop['index', ] - 1) *100
  Estimate      2.5 %     97.5 % 
0.01863368 0.01798691 0.01928046 

The summary() function will only provide the coefficients in the original log scale of the poisson model. Another way of obtaining the IRR is by using tbl_regression with the argument exponentiate = TRUE:

mort_poissontrend_pop %>%         
  gtsummary::tbl_regression(            
    exponentiate = TRUE,
    estimate_fun = ~style_number(.x, digits = 4)
  )

plot the predicted and the original values:

Show the code
ggplot(data = mortz, mapping = aes(x = date_index)) +
  geom_point(mapping = aes(y = cases/pop), alpha = 0.5) +
  geom_line(mapping = aes(y = fitted(mort_poissontrend_pop)/pop), colour = "blue", lwd=1.1) +
  scale_y_continuous(limits = c(0, NA)) +
  scale_x_yearweek(date_labels = "%Y-%W", 
                   date_breaks = "1 year") +
  labs(
    x = "Week", y = "Mortality rate",
    title = "Spain: Mortality rate per week with fitted trend"
  ) +
  tsa_theme

Compare the trend and IRR in the models with and without population.

For communication, it might be better to present the results in terms of number of deaths per 100.000 inhabitants (population), which would then be called the mortality rate. We will now plot the same trend but as rate per 100.000 inhabitants. For this purpose we must adjust both the observed and the fitted values by multiplying with 100.000.

Show the code
mortz<- mortz %>%
  mutate(
    mortality_rate = cases/pop*100000
  )


ggplot(data = mortz, mapping = aes(x = date_index)) +
  geom_point(mapping = aes(y = mortality_rate), alpha = 0.5) +
  geom_line(mapping = aes(y = fitted(mort_poissontrend_pop)/pop*100000), colour = "blue", lwd=1.1) +
  scale_y_continuous(limits = c(0, NA)) +
  scale_x_yearweek(date_labels = "%Y-%W", 
                   date_breaks = "1 year") +
  labs(
    x = "Week", y = "Mortality rate (per 100.000 pop)",
    title = "Spain: Mortality rate per 100.000 population per week with fitted trend"
  ) +
  tsa_theme

How do you interpret the IRR in the model for mortality rate?

Task 4.5 (Optional)

Carry out the same analysis for men and women. Do you see differences between the two? where?

Save

Save mortagg to be used in the next practical.

Show the code
# Save the mortagg and mortz objects.
#save(mortagg, mortz, file = here("data", "mortagg_case_4.RData"))