Practical Session 6: Residuals and autocorrelation

Residuals patterns in Surveillance data

Expected learning outcomes

By the end of this session, participants should be able to:

  • decide on remaining actions when modelling surveillance data after you have accounted for trend and periodicity

  • detect autocorrelation in the residuals of models on surveillance data

Task 6.1

Start by fitting the model with trend and two cosines and two sines.

Assuming you are happy with that Poisson regression model from the previous session, check whether this regression model was appropriate in this case.

Show the code
load(here("data","mortagg2_case_5.RData"))

Start by fitting the model with trend and two cosines and two sines (same as in practical 5).

Then, assuming you are happy with that model, check whether this regression model was appropriate in this case.

Plot the residuals from the model.

Show the code
mortz <-
    mortz %>%
    mutate(res=residuals(mort_trendsin2cos2)
    )
Show the code
ggplot(data = mortz) +
    geom_point(
        mapping = aes(x = date_index, y = res)
    ) +
    scale_x_yearweek(date_labels = "%Y-%W", 
                   date_breaks = "1 year") +
    labs(x = "Year week", 
         y = "Residuals", 
         title = "Spain: number of deaths. Residuals model trend + 2 sine + 2 cosine."
    ) +
    tsa_theme

Start by examining the residuals against time. Do you see any additional patterns?

Task 6.2

Is there any other periodicity that can be observed in the residuals?

To assess whether there is any more periodicity in the data, one can produce a periodogram of the residuals (use code from case study 5).

Show the code
mort_residual_period <- TSA::periodogram(mortz$res)

Show the code
mort_residual_period_recip <-
    tibble(
        freq = mort_residual_period$freq,
        spec = mort_residual_period$spec
    ) %>%
    arrange(desc(spec)) %>% 
    mutate(reciprocal_freq = 1 / freq)  # here in weeks

view(mort_residual_period_recip)

# breaks every 13, equivalent to every 3 months approximately.
ggplot(data = mort_residual_period_recip, aes(x = reciprocal_freq, y = 2*spec)) +
    geom_line() +
    scale_x_continuous(limits = c(0, 160), breaks=seq(0, 160, 13)) +
    labs(x = "Period (weeks)", 
         y = "Spectral density",
         title="Periodogram residuals of model with trend + 2 sine + 2 cosine.") +
    tsa_theme

Plot the residuals again, but now using lines.

Show the code
max.index <- max(mortz$index)

ggplot(data = mortz) +
    geom_line(
        mapping = aes(x = index, y = res)
    ) +
    labs(
        x = "Week",
        y = "Residuals",
        title = "Regression model with trend + 2 sine + 2 cosine."
    ) +
    scale_x_continuous(limits = c(0, max.index), breaks=seq(0, max.index, 77)) +
    tsa_theme

Based on these latest plots, do you see a patterns that can explain the periodogram? Does the periodicity makes sense in this case?

Task 6.3

Check if there is autocorrelation that needs to be included in the model.

Note that there are acf functions in more than one package, so to be sure you are using the using the base R function here use stats::acf. Graph the autocorrelation on 50 (week) lags.

Show the code
mort_res_acf <- stats::acf(mortz$res, lag.max = 52, type="correlation", plot=TRUE)

Show the code
mort_res_acf

Autocorrelations of series 'mortz$res', by lag

     0      1      2      3      4      5      6      7      8      9     10 
 1.000  0.787  0.584  0.420  0.304  0.212  0.152  0.087  0.044  0.037  0.061 
    11     12     13     14     15     16     17     18     19     20     21 
 0.079  0.114  0.116  0.108  0.094  0.064  0.046  0.053  0.052  0.040  0.031 
    22     23     24     25     26     27     28     29     30     31     32 
 0.022  0.031  0.042  0.059  0.036 -0.002 -0.030 -0.061 -0.080 -0.098 -0.114 
    33     34     35     36     37     38     39     40     41     42     43 
-0.092 -0.057 -0.041 -0.025 -0.012 -0.004 -0.036 -0.066 -0.093 -0.128 -0.164 
    44     45     46     47     48     49     50     51     52 
-0.193 -0.219 -0.245 -0.257 -0.254 -0.239 -0.198 -0.150 -0.097 

The acf function estimates and plots autocorrelations. The lag zero vertical line is omitted in this plot because it is always 1, as it shows the correlation of each observation with itself (lag zero).

The first vertical line shows the correlation between each observation and the observation just before it (lag one). The second shows the correlation between each observation and the observation before last (lag two). And so on.

The blue dotted line is the 95% confidence interval. Vertical lines which reach above or below the confidence limits indicate significant autocorrelations, and is based on the number of observations.

If there was no significant autocorrelation, only the first vertical line would exceed the confidence interval.

Test for significant autocorrelations in the residuals.

Show the code
Box.test(mortz$res, type = "Ljung-Box")

    Box-Ljung test

data:  mortz$res
X-squared = 324.38, df = 1, p-value < 2.2e-16

The Ljung-Box is a portmanteau-type test that examines the null hypothesis of independence of the residuals in a time series. A significant \(p\)-value suggests that there is autocorrelation in the data, that is, the observations are not independent. A portmanteau-type test are those where the alternative hypothesis is not well defined, but the null hypothesis (here independence) is well defined.

In the acf plot, the first 7 lags where statistically significant. We will add them one by one, checking everytime if the added lag(s) improve the fit of the model.

Show the code
mortz <-
    mortz %>%
    mutate(
        lag1 =  dplyr::lag(cases, n=1),   
        lag2 =  dplyr::lag(cases, n=2),   
        lag3 =  dplyr::lag(cases, n=3),   
        lag4 =  dplyr::lag(cases, n=4),   
        lag5 =  dplyr::lag(cases, n=5),
        lag6 =  dplyr::lag(cases, n=6),    
        lag7 =  dplyr::lag(cases, n=7), 
        lag8 =  dplyr::lag(cases, n=8)  

   )
   
   
head(as.data.frame(mortz), 10)
   cases_m cases_f cases year week      pop    pop_f    pop_m date_index index
1     4207    4019  8226 2010    1 46486621 23504349 22982272   2010 W01     1
2     4409    4264  8673 2010    2 46486621 23504349 22982272   2010 W02     2
3     4229    4122  8351 2010    3 46486621 23504349 22982272   2010 W03     3
4     4252    3837  8089 2010    4 46486621 23504349 22982272   2010 W04     4
5     4196    3976  8172 2010    5 46486621 23504349 22982272   2010 W05     5
6     4127    3854  7981 2010    6 46486621 23504349 22982272   2010 W06     6
7     4294    4028  8322 2010    7 46486621 23504349 22982272   2010 W07     7
8     4121    3968  8089 2010    8 46486621 23504349 22982272   2010 W08     8
9     3824    3687  7511 2010    9 46486621 23504349 22982272   2010 W09     9
10    3921    3612  7533 2010   10 46486621 23504349 22982272   2010 W10    10
   mortality_rate     cos52     sin52     sin26      cos26          res lag1
1       0.1769541 0.9927089 0.1205367 0.2393157  0.9709418  -3.31833758   NA
2       0.1865698 0.9709418 0.2393157 0.4647232  0.8854560  -0.07499005 8226
3       0.1796431 0.9350162 0.3546049 0.6631227  0.7485107  -4.71589070 8673
4       0.1740071 0.8854560 0.4647232 0.8229839  0.5680647  -8.24732497 8351
5       0.1757925 0.8229839 0.5680647 0.9350162  0.3546049  -7.51002029 8089
6       0.1716838 0.7485107 0.6631227 0.9927089  0.1205367  -9.27104986 8172
7       0.1790192 0.6631227 0.7485107 0.9927089 -0.1205367  -4.73392309 7981
8       0.1740071 0.5680647 0.8229839 0.9350162 -0.3546049  -6.04334841 8322
9       0.1615734 0.4647232 0.8854560 0.8229839 -0.5680647 -10.89953779 8089
10      0.1620466 0.3546049 0.9350162 0.6631227 -0.7485107  -8.79571866 7511
   lag2 lag3 lag4 lag5 lag6 lag7 lag8
1    NA   NA   NA   NA   NA   NA   NA
2    NA   NA   NA   NA   NA   NA   NA
3  8226   NA   NA   NA   NA   NA   NA
4  8673 8226   NA   NA   NA   NA   NA
5  8351 8673 8226   NA   NA   NA   NA
6  8089 8351 8673 8226   NA   NA   NA
7  8172 8089 8351 8673 8226   NA   NA
8  7981 8172 8089 8351 8673 8226   NA
9  8322 7981 8172 8089 8351 8673 8226
10 8089 8322 7981 8172 8089 8351 8673
Show the code
tail(as.data.frame(mortz), 10)    
    cases_m cases_f cases year week      pop    pop_f    pop_m date_index index
512    3866    3719  7585 2019   43 46918951 23902168 23016783   2019 W43   512
513    3913    3839  7752 2019   44 46918951 23902168 23016783   2019 W44   513
514    4019    3714  7733 2019   45 46918951 23902168 23016783   2019 W45   514
515    4143    3934  8077 2019   46 46918951 23902168 23016783   2019 W46   515
516    4213    4045  8258 2019   47 46918951 23902168 23016783   2019 W47   516
517    4287    4080  8367 2019   48 46918951 23902168 23016783   2019 W48   517
518    4099    3963  8062 2019   49 46918951 23902168 23016783   2019 W49   518
519    4325    4050  8375 2019   50 46918951 23902168 23016783   2019 W50   519
520    4198    4252  8450 2019   51 46918951 23902168 23016783   2019 W51   520
521    4248    4133  8381 2019   52 46918951 23902168 23016783   2019 W52   521
    mortality_rate     cos52         sin52         sin26      cos26
512      0.1616618 0.5680647 -8.229839e-01 -9.350162e-01 -0.3546049
513      0.1652211 0.6631227 -7.485107e-01 -9.927089e-01 -0.1205367
514      0.1648161 0.7485107 -6.631227e-01 -9.927089e-01  0.1205367
515      0.1721479 0.8229839 -5.680647e-01 -9.350162e-01  0.3546049
516      0.1760056 0.8854560 -4.647232e-01 -8.229839e-01  0.5680647
517      0.1783288 0.9350162 -3.546049e-01 -6.631227e-01  0.7485107
518      0.1718282 0.9709418 -2.393157e-01 -4.647232e-01  0.8854560
519      0.1784993 0.9927089 -1.205367e-01 -2.393157e-01  0.9709418
520      0.1800978 1.0000000 -2.449213e-15 -4.898425e-15  1.0000000
521      0.1786272 0.9927089  1.205367e-01  2.393157e-01  0.9709418
             res lag1 lag2 lag3 lag4 lag5 lag6 lag7 lag8
512  -0.11491902 7373 7376 7033 7089 7058 6986 6948 6939
513  -0.08713756 7585 7373 7376 7033 7089 7058 6986 6948
514  -2.47822215 7752 7585 7373 7376 7033 7089 7058 6986
515  -1.04352843 7733 7752 7585 7373 7376 7033 7089 7058
516  -1.64895152 8077 7733 7752 7585 7373 7376 7033 7089
517  -3.15314754 8258 8077 7733 7752 7585 7373 7376 7033
518  -9.15903419 8367 8258 8077 7733 7752 7585 7373 7376
519  -8.32865535 8062 8367 8258 8077 7733 7752 7585 7373
520  -9.89024246 8375 8062 8367 8258 8077 7733 7752 7585
521 -12.68685589 8450 8375 8062 8367 8258 8077 7733 7752

As the data is lagged, the missing value (NA) must be added at the beginning. For lag 1, the value for 2010 week 1 is shifted to 2010 week 2, and therefore week 1 is now NA. That also means that the last value in 2019 week 52 is lost and has been replaced by what it wasw before 2019 week 51. If you are not familiar with lags, take some time to study how the data has been shifted.

Another side effect of lagging the data, is that the number of observations decreases. At the start, we had 521 observations (years and weeks); after lag 1, we have 520; after lag 2, we have 519; etc. This affects when we wish to compare the models using the likelihood ratio test (LRT), since the test requires that the same dataset be used in the two models being compared. For that reason, the original model will be refitted, but each time using the same data as in the model with lags.

No we will add one lag at a time to our model with trend, cos52, sin52, cos26 and sin26, and test if adding each lag improves the fit. This could be automatized, but we will do it “by hand” this time, in order to secure that we understand what is going on.

Normally, all lags are added. Even if only lag 2 is significant, lag 1 will also be included in the model. This is because it is difficult to imagine that only the observation lagged 2 is important, but not the one lagged 1.

Show the code
# Model with trend+2sin+2cos+lag1
mort_trendsin2cos2_ac <- update(mort_trendsin2cos2, .~. + lag1,
                                data=mortz)

  
summary(mort_trendsin2cos2_ac)

Call:
glm(formula = cases ~ index + sin52 + cos52 + sin26 + cos26 + 
    lag1, family = "poisson", data = mortz)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) 8.242e+00  7.284e-03 1131.60  < 2e-16 ***
index       5.969e-05  3.802e-06   15.70  < 2e-16 ***
sin52       1.693e-02  1.086e-03   15.58  < 2e-16 ***
cos52       3.567e-02  1.021e-03   34.93  < 2e-16 ***
sin26       5.267e-03  8.706e-04    6.05 1.45e-09 ***
cos26       1.926e-02  7.232e-04   26.63  < 2e-16 ***
lag1        8.928e-05  9.854e-07   90.60  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 62485  on 519  degrees of freedom
Residual deviance:  4217  on 513  degrees of freedom
  (1 observation deleted due to missingness)
AIC: 9842.1

Number of Fisher Scoring iterations: 3
Show the code
# Model with trend+2sin+2cos and first row removed.
mort_trendsin2cos2_removeNA <- update(mort_trendsin2cos2, .~.,
                                data=mortz[-1,])
                          
summary(mort_trendsin2cos2_removeNA)

Call:
glm(formula = cases ~ index + sin52 + cos52 + sin26 + cos26, 
    family = "poisson", data = mortz[-1, ])

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) 8.895e+00  1.017e-03 8744.99   <2e-16 ***
index       2.256e-04  3.324e-06   67.86   <2e-16 ***
sin52       9.034e-02  7.142e-04  126.48   <2e-16 ***
cos52       1.022e-01  7.006e-04  145.90   <2e-16 ***
sin26       5.077e-02  7.054e-04   71.97   <2e-16 ***
cos26       3.312e-02  7.047e-04   47.00   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 62485  on 519  degrees of freedom
Residual deviance: 12277  on 514  degrees of freedom
AIC: 17900

Number of Fisher Scoring iterations: 3
Show the code
lrtest(mort_trendsin2cos2_removeNA, mort_trendsin2cos2_ac)
Likelihood ratio test

Model 1: cases ~ index + sin52 + cos52 + sin26 + cos26
Model 2: cases ~ index + sin52 + cos52 + sin26 + cos26 + lag1
  #Df  LogLik Df  Chisq Pr(>Chisq)    
1   6 -8943.9                         
2   7 -4914.1  1 8059.6  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Show the code
## Lag 2: update mdoel with lag 1
mort_trendsin2cos2_ac <- update(mort_trendsin2cos2_ac, .~. + lag2,
                                data=mortz)

  
summary(mort_trendsin2cos2_ac)

Call:
glm(formula = cases ~ index + sin52 + cos52 + sin26 + cos26 + 
    lag1 + lag2, family = "poisson", data = mortz)

Coefficients:
              Estimate Std. Error  z value Pr(>|z|)    
(Intercept)  8.266e+00  7.645e-03 1081.267   <2e-16 ***
index        6.726e-05  3.860e-06   17.426   <2e-16 ***
sin52        2.134e-02  1.162e-03   18.365   <2e-16 ***
cos52        3.601e-02  1.025e-03   35.144   <2e-16 ***
sin26        7.362e-03  8.941e-04    8.234   <2e-16 ***
cos26        1.713e-02  7.469e-04   22.938   <2e-16 ***
lag1         1.058e-04  1.790e-06   59.101   <2e-16 ***
lag2        -1.987e-05  1.810e-06  -10.978   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 62387  on 518  degrees of freedom
Residual deviance:  4087  on 511  degrees of freedom
  (2 observations deleted due to missingness)
AIC: 9703.3

Number of Fisher Scoring iterations: 3
Show the code
# Model with trend+2sin+2cos and rows removed.
mort_trendsin2cos2_removeNA <- update(mort_trendsin2cos2, .~.,
                                data=mortz[-(1:2),])
                          
summary(mort_trendsin2cos2_removeNA)

Call:
glm(formula = cases ~ index + sin52 + cos52 + sin26 + cos26, 
    family = "poisson", data = mortz[-(1:2), ])

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) 8.895e+00  1.021e-03 8709.41   <2e-16 ***
index       2.255e-04  3.335e-06   67.62   <2e-16 ***
sin52       9.034e-02  7.143e-04  126.48   <2e-16 ***
cos52       1.022e-01  7.017e-04  145.67   <2e-16 ***
sin26       5.077e-02  7.056e-04   71.95   <2e-16 ***
cos26       3.312e-02  7.058e-04   46.93   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 62387  on 518  degrees of freedom
Residual deviance: 12277  on 513  degrees of freedom
AIC: 17889

Number of Fisher Scoring iterations: 3
Show the code
lrtest(mort_trendsin2cos2_removeNA, mort_trendsin2cos2_ac)
Likelihood ratio test

Model 1: cases ~ index + sin52 + cos52 + sin26 + cos26
Model 2: cases ~ index + sin52 + cos52 + sin26 + cos26 + lag1 + lag2
  #Df  LogLik Df  Chisq Pr(>Chisq)    
1   6 -8938.4                         
2   8 -4843.6  2 8189.5  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Show the code
##### Continue to test the remaining significant lags. #############################

How many lags were included in the best model? Does the model make sense?

Which model would you choose to present to your boss?

Task 6.3 (Optional)

As a sensitivity analysis, add yet one more lag (not significant in the acf) to the last model. Does that lag improve the model?

Task 6.4 (Optional)

Carry out similar analyzes for men and for women.

Show the code
#save(mortagg, mortz, file = here("data", "mortagg2_case_6.RData"))

#load(here("data","mortagg2_case_6.RData"))