Show the code
load(here("data","mortagg2_case_5.RData"))By the end of this session, participants should be able to:
decide on remaining actions when modelling surveillance data after you have accounted for trend and periodicity
detect autocorrelation in the residuals of models on surveillance data
Start by fitting the model with trend and two cosines and two sines.
Assuming you are happy with that Poisson regression model from the previous session, check whether this regression model was appropriate in this case.
load(here("data","mortagg2_case_5.RData"))Start by fitting the model with trend and two cosines and two sines (same as in practical 5).
Then, assuming you are happy with that model, check whether this regression model was appropriate in this case.
Plot the residuals from the model.
mortz <-
mortz %>%
mutate(res=residuals(mort_trendsin2cos2)
)ggplot(data = mortz) +
geom_point(
mapping = aes(x = date_index, y = res)
) +
scale_x_yearweek(date_labels = "%Y-%W",
date_breaks = "1 year") +
labs(x = "Year week",
y = "Residuals",
title = "Spain: number of deaths. Residuals model trend + 2 sine + 2 cosine."
) +
tsa_themeStart by examining the residuals against time. Do you see any additional patterns?
Is there any other periodicity that can be observed in the residuals?
To assess whether there is any more periodicity in the data, one can produce a periodogram of the residuals (use code from case study 5).
mort_residual_period <- TSA::periodogram(mortz$res)mort_residual_period_recip <-
tibble(
freq = mort_residual_period$freq,
spec = mort_residual_period$spec
) %>%
arrange(desc(spec)) %>%
mutate(reciprocal_freq = 1 / freq) # here in weeks
view(mort_residual_period_recip)
# breaks every 13, equivalent to every 3 months approximately.
ggplot(data = mort_residual_period_recip, aes(x = reciprocal_freq, y = 2*spec)) +
geom_line() +
scale_x_continuous(limits = c(0, 160), breaks=seq(0, 160, 13)) +
labs(x = "Period (weeks)",
y = "Spectral density",
title="Periodogram residuals of model with trend + 2 sine + 2 cosine.") +
tsa_themePlot the residuals again, but now using lines.
max.index <- max(mortz$index)
ggplot(data = mortz) +
geom_line(
mapping = aes(x = index, y = res)
) +
labs(
x = "Week",
y = "Residuals",
title = "Regression model with trend + 2 sine + 2 cosine."
) +
scale_x_continuous(limits = c(0, max.index), breaks=seq(0, max.index, 77)) +
tsa_themeBased on these latest plots, do you see a patterns that can explain the periodogram? Does the periodicity makes sense in this case?
Check if there is autocorrelation that needs to be included in the model.
Note that there are acf functions in more than one package, so to be sure you are using the using the base R function here use stats::acf. Graph the autocorrelation on 50 (week) lags.
mort_res_acf <- stats::acf(mortz$res, lag.max = 52, type="correlation", plot=TRUE)mort_res_acf
Autocorrelations of series 'mortz$res', by lag
0 1 2 3 4 5 6 7 8 9 10
1.000 0.787 0.584 0.420 0.304 0.212 0.152 0.087 0.044 0.037 0.061
11 12 13 14 15 16 17 18 19 20 21
0.079 0.114 0.116 0.108 0.094 0.064 0.046 0.053 0.052 0.040 0.031
22 23 24 25 26 27 28 29 30 31 32
0.022 0.031 0.042 0.059 0.036 -0.002 -0.030 -0.061 -0.080 -0.098 -0.114
33 34 35 36 37 38 39 40 41 42 43
-0.092 -0.057 -0.041 -0.025 -0.012 -0.004 -0.036 -0.066 -0.093 -0.128 -0.164
44 45 46 47 48 49 50 51 52
-0.193 -0.219 -0.245 -0.257 -0.254 -0.239 -0.198 -0.150 -0.097
The acf function estimates and plots autocorrelations. The lag zero vertical line is omitted in this plot because it is always 1, as it shows the correlation of each observation with itself (lag zero).
The first vertical line shows the correlation between each observation and the observation just before it (lag one). The second shows the correlation between each observation and the observation before last (lag two). And so on.
The blue dotted line is the 95% confidence interval. Vertical lines which reach above or below the confidence limits indicate significant autocorrelations, and is based on the number of observations.
If there was no significant autocorrelation, only the first vertical line would exceed the confidence interval.
Test for significant autocorrelations in the residuals.
Box.test(mortz$res, type = "Ljung-Box")
Box-Ljung test
data: mortz$res
X-squared = 324.38, df = 1, p-value < 2.2e-16
The Ljung-Box is a portmanteau-type test that examines the null hypothesis of independence of the residuals in a time series. A significant \(p\)-value suggests that there is autocorrelation in the data, that is, the observations are not independent. A portmanteau-type test are those where the alternative hypothesis is not well defined, but the null hypothesis (here independence) is well defined.
In the acf plot, the first 7 lags where statistically significant. We will add them one by one, checking everytime if the added lag(s) improve the fit of the model.
mortz <-
mortz %>%
mutate(
lag1 = dplyr::lag(cases, n=1),
lag2 = dplyr::lag(cases, n=2),
lag3 = dplyr::lag(cases, n=3),
lag4 = dplyr::lag(cases, n=4),
lag5 = dplyr::lag(cases, n=5),
lag6 = dplyr::lag(cases, n=6),
lag7 = dplyr::lag(cases, n=7),
lag8 = dplyr::lag(cases, n=8)
)
head(as.data.frame(mortz), 10) cases_m cases_f cases year week pop pop_f pop_m date_index index
1 4207 4019 8226 2010 1 46486621 23504349 22982272 2010 W01 1
2 4409 4264 8673 2010 2 46486621 23504349 22982272 2010 W02 2
3 4229 4122 8351 2010 3 46486621 23504349 22982272 2010 W03 3
4 4252 3837 8089 2010 4 46486621 23504349 22982272 2010 W04 4
5 4196 3976 8172 2010 5 46486621 23504349 22982272 2010 W05 5
6 4127 3854 7981 2010 6 46486621 23504349 22982272 2010 W06 6
7 4294 4028 8322 2010 7 46486621 23504349 22982272 2010 W07 7
8 4121 3968 8089 2010 8 46486621 23504349 22982272 2010 W08 8
9 3824 3687 7511 2010 9 46486621 23504349 22982272 2010 W09 9
10 3921 3612 7533 2010 10 46486621 23504349 22982272 2010 W10 10
mortality_rate cos52 sin52 sin26 cos26 res lag1
1 0.1769541 0.9927089 0.1205367 0.2393157 0.9709418 -3.31833758 NA
2 0.1865698 0.9709418 0.2393157 0.4647232 0.8854560 -0.07499005 8226
3 0.1796431 0.9350162 0.3546049 0.6631227 0.7485107 -4.71589070 8673
4 0.1740071 0.8854560 0.4647232 0.8229839 0.5680647 -8.24732497 8351
5 0.1757925 0.8229839 0.5680647 0.9350162 0.3546049 -7.51002029 8089
6 0.1716838 0.7485107 0.6631227 0.9927089 0.1205367 -9.27104986 8172
7 0.1790192 0.6631227 0.7485107 0.9927089 -0.1205367 -4.73392309 7981
8 0.1740071 0.5680647 0.8229839 0.9350162 -0.3546049 -6.04334841 8322
9 0.1615734 0.4647232 0.8854560 0.8229839 -0.5680647 -10.89953779 8089
10 0.1620466 0.3546049 0.9350162 0.6631227 -0.7485107 -8.79571866 7511
lag2 lag3 lag4 lag5 lag6 lag7 lag8
1 NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA
3 8226 NA NA NA NA NA NA
4 8673 8226 NA NA NA NA NA
5 8351 8673 8226 NA NA NA NA
6 8089 8351 8673 8226 NA NA NA
7 8172 8089 8351 8673 8226 NA NA
8 7981 8172 8089 8351 8673 8226 NA
9 8322 7981 8172 8089 8351 8673 8226
10 8089 8322 7981 8172 8089 8351 8673
tail(as.data.frame(mortz), 10) cases_m cases_f cases year week pop pop_f pop_m date_index index
512 3866 3719 7585 2019 43 46918951 23902168 23016783 2019 W43 512
513 3913 3839 7752 2019 44 46918951 23902168 23016783 2019 W44 513
514 4019 3714 7733 2019 45 46918951 23902168 23016783 2019 W45 514
515 4143 3934 8077 2019 46 46918951 23902168 23016783 2019 W46 515
516 4213 4045 8258 2019 47 46918951 23902168 23016783 2019 W47 516
517 4287 4080 8367 2019 48 46918951 23902168 23016783 2019 W48 517
518 4099 3963 8062 2019 49 46918951 23902168 23016783 2019 W49 518
519 4325 4050 8375 2019 50 46918951 23902168 23016783 2019 W50 519
520 4198 4252 8450 2019 51 46918951 23902168 23016783 2019 W51 520
521 4248 4133 8381 2019 52 46918951 23902168 23016783 2019 W52 521
mortality_rate cos52 sin52 sin26 cos26
512 0.1616618 0.5680647 -8.229839e-01 -9.350162e-01 -0.3546049
513 0.1652211 0.6631227 -7.485107e-01 -9.927089e-01 -0.1205367
514 0.1648161 0.7485107 -6.631227e-01 -9.927089e-01 0.1205367
515 0.1721479 0.8229839 -5.680647e-01 -9.350162e-01 0.3546049
516 0.1760056 0.8854560 -4.647232e-01 -8.229839e-01 0.5680647
517 0.1783288 0.9350162 -3.546049e-01 -6.631227e-01 0.7485107
518 0.1718282 0.9709418 -2.393157e-01 -4.647232e-01 0.8854560
519 0.1784993 0.9927089 -1.205367e-01 -2.393157e-01 0.9709418
520 0.1800978 1.0000000 -2.449213e-15 -4.898425e-15 1.0000000
521 0.1786272 0.9927089 1.205367e-01 2.393157e-01 0.9709418
res lag1 lag2 lag3 lag4 lag5 lag6 lag7 lag8
512 -0.11491902 7373 7376 7033 7089 7058 6986 6948 6939
513 -0.08713756 7585 7373 7376 7033 7089 7058 6986 6948
514 -2.47822215 7752 7585 7373 7376 7033 7089 7058 6986
515 -1.04352843 7733 7752 7585 7373 7376 7033 7089 7058
516 -1.64895152 8077 7733 7752 7585 7373 7376 7033 7089
517 -3.15314754 8258 8077 7733 7752 7585 7373 7376 7033
518 -9.15903419 8367 8258 8077 7733 7752 7585 7373 7376
519 -8.32865535 8062 8367 8258 8077 7733 7752 7585 7373
520 -9.89024246 8375 8062 8367 8258 8077 7733 7752 7585
521 -12.68685589 8450 8375 8062 8367 8258 8077 7733 7752
As the data is lagged, the missing value (NA) must be added at the beginning. For lag 1, the value for 2010 week 1 is shifted to 2010 week 2, and therefore week 1 is now NA. That also means that the last value in 2019 week 52 is lost and has been replaced by what it wasw before 2019 week 51. If you are not familiar with lags, take some time to study how the data has been shifted.
Another side effect of lagging the data, is that the number of observations decreases. At the start, we had 521 observations (years and weeks); after lag 1, we have 520; after lag 2, we have 519; etc. This affects when we wish to compare the models using the likelihood ratio test (LRT), since the test requires that the same dataset be used in the two models being compared. For that reason, the original model will be refitted, but each time using the same data as in the model with lags.
No we will add one lag at a time to our model with trend, cos52, sin52, cos26 and sin26, and test if adding each lag improves the fit. This could be automatized, but we will do it “by hand” this time, in order to secure that we understand what is going on.
Normally, all lags are added. Even if only lag 2 is significant, lag 1 will also be included in the model. This is because it is difficult to imagine that only the observation lagged 2 is important, but not the one lagged 1.
# Model with trend+2sin+2cos+lag1
mort_trendsin2cos2_ac <- update(mort_trendsin2cos2, .~. + lag1,
data=mortz)
summary(mort_trendsin2cos2_ac)
Call:
glm(formula = cases ~ index + sin52 + cos52 + sin26 + cos26 +
lag1, family = "poisson", data = mortz)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 8.242e+00 7.284e-03 1131.60 < 2e-16 ***
index 5.969e-05 3.802e-06 15.70 < 2e-16 ***
sin52 1.693e-02 1.086e-03 15.58 < 2e-16 ***
cos52 3.567e-02 1.021e-03 34.93 < 2e-16 ***
sin26 5.267e-03 8.706e-04 6.05 1.45e-09 ***
cos26 1.926e-02 7.232e-04 26.63 < 2e-16 ***
lag1 8.928e-05 9.854e-07 90.60 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 62485 on 519 degrees of freedom
Residual deviance: 4217 on 513 degrees of freedom
(1 observation deleted due to missingness)
AIC: 9842.1
Number of Fisher Scoring iterations: 3
# Model with trend+2sin+2cos and first row removed.
mort_trendsin2cos2_removeNA <- update(mort_trendsin2cos2, .~.,
data=mortz[-1,])
summary(mort_trendsin2cos2_removeNA)
Call:
glm(formula = cases ~ index + sin52 + cos52 + sin26 + cos26,
family = "poisson", data = mortz[-1, ])
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 8.895e+00 1.017e-03 8744.99 <2e-16 ***
index 2.256e-04 3.324e-06 67.86 <2e-16 ***
sin52 9.034e-02 7.142e-04 126.48 <2e-16 ***
cos52 1.022e-01 7.006e-04 145.90 <2e-16 ***
sin26 5.077e-02 7.054e-04 71.97 <2e-16 ***
cos26 3.312e-02 7.047e-04 47.00 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 62485 on 519 degrees of freedom
Residual deviance: 12277 on 514 degrees of freedom
AIC: 17900
Number of Fisher Scoring iterations: 3
lrtest(mort_trendsin2cos2_removeNA, mort_trendsin2cos2_ac)Likelihood ratio test
Model 1: cases ~ index + sin52 + cos52 + sin26 + cos26
Model 2: cases ~ index + sin52 + cos52 + sin26 + cos26 + lag1
#Df LogLik Df Chisq Pr(>Chisq)
1 6 -8943.9
2 7 -4914.1 1 8059.6 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Lag 2: update mdoel with lag 1
mort_trendsin2cos2_ac <- update(mort_trendsin2cos2_ac, .~. + lag2,
data=mortz)
summary(mort_trendsin2cos2_ac)
Call:
glm(formula = cases ~ index + sin52 + cos52 + sin26 + cos26 +
lag1 + lag2, family = "poisson", data = mortz)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 8.266e+00 7.645e-03 1081.267 <2e-16 ***
index 6.726e-05 3.860e-06 17.426 <2e-16 ***
sin52 2.134e-02 1.162e-03 18.365 <2e-16 ***
cos52 3.601e-02 1.025e-03 35.144 <2e-16 ***
sin26 7.362e-03 8.941e-04 8.234 <2e-16 ***
cos26 1.713e-02 7.469e-04 22.938 <2e-16 ***
lag1 1.058e-04 1.790e-06 59.101 <2e-16 ***
lag2 -1.987e-05 1.810e-06 -10.978 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 62387 on 518 degrees of freedom
Residual deviance: 4087 on 511 degrees of freedom
(2 observations deleted due to missingness)
AIC: 9703.3
Number of Fisher Scoring iterations: 3
# Model with trend+2sin+2cos and rows removed.
mort_trendsin2cos2_removeNA <- update(mort_trendsin2cos2, .~.,
data=mortz[-(1:2),])
summary(mort_trendsin2cos2_removeNA)
Call:
glm(formula = cases ~ index + sin52 + cos52 + sin26 + cos26,
family = "poisson", data = mortz[-(1:2), ])
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 8.895e+00 1.021e-03 8709.41 <2e-16 ***
index 2.255e-04 3.335e-06 67.62 <2e-16 ***
sin52 9.034e-02 7.143e-04 126.48 <2e-16 ***
cos52 1.022e-01 7.017e-04 145.67 <2e-16 ***
sin26 5.077e-02 7.056e-04 71.95 <2e-16 ***
cos26 3.312e-02 7.058e-04 46.93 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 62387 on 518 degrees of freedom
Residual deviance: 12277 on 513 degrees of freedom
AIC: 17889
Number of Fisher Scoring iterations: 3
lrtest(mort_trendsin2cos2_removeNA, mort_trendsin2cos2_ac)Likelihood ratio test
Model 1: cases ~ index + sin52 + cos52 + sin26 + cos26
Model 2: cases ~ index + sin52 + cos52 + sin26 + cos26 + lag1 + lag2
#Df LogLik Df Chisq Pr(>Chisq)
1 6 -8938.4
2 8 -4843.6 2 8189.5 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##### Continue to test the remaining significant lags. #############################How many lags were included in the best model? Does the model make sense?
Which model would you choose to present to your boss?
As a sensitivity analysis, add yet one more lag (not significant in the acf) to the last model. Does that lag improve the model?
Carry out similar analyzes for men and for women.
#save(mortagg, mortz, file = here("data", "mortagg2_case_6.RData"))
#load(here("data","mortagg2_case_6.RData"))