Practical Session 4: Smoothing and Trends

rm(list = ls())

# Install pacman if not installed already, and activate it
if (!require("pacman")) install.packages("pacman")

# Install, update and activate libraries
pacman::p_load(
  here,
  rio,
  skimr,
  tsibble,
  ISOweek,
  slider,    # for rolling means
  pander,
  tidyverse,
  gtsummary
)

# Create tsa_theme
tsa_theme <- theme_bw() +
  theme(
    plot.title = element_text(face = "bold", size = 12),
    legend.background = element_rect(fill = "white", size = 4, colour = "white"),
    legend.justification = c(0, 1),
    legend.position = "bottom",
    panel.grid.minor = element_blank()
  )
Mortality Surveillance data in Spain
The number of deaths in a country can be used as a surveillance indicator. An example of such an application is EuroMOMO, which monitors the number of fatalities in European countries on a weekly basis (https://www.euromomo.eu/). Weekly time series of the number of deaths are routinely used to follow seasonal infections, for example influenza and COVID-19 during the winter, but also heatwaves in the summer.
In case studies 4 to 7, we will analyse the number of deaths retrieved from the Spanish national statistical institute (INE) (https://ine.es/). Each case study builds on the analysis conducted in the previous one. The data contains the number of fatalities per week for the entire country, from 2010 to 2019, and includes variables on year, week, total number of deaths (cases), and total population; the same data is also included for men and women.
Expected learning outcomes
By the end of the session, participants should be able to:
Describe, test and fit a trend in surveillance data (simple smoothing and regression);
Assess and interpret the significance of trend in surveillance data.
Save any changes to a new dataset named mortagg.
Task 4.1
Assess visually how the total registered number of deaths has been behaving in the years with available data.
Discuss your results with your peers.
Moving average and direct statistical modelling of trends
Moving averages are simple methods to visualise the general trend of a series after removing some of the random day-to-day variation by smoothing the data. This allows you to browse your data for periodicity and observe the general trend. In other words, smoothing the data may remove “noise” from your time series and can facilitate visual interpretation.
Moving averages model a time series by calculating the numerical mean of the values adjacent to it. It is calculated for each observation, moving along the time axis: e.g. at each time \(t\) and for a window of 5 time units, one way of calculating the moving average is by using the observations at \(t-2\), \(t-1\), \(t\), \(t+1\) and \(t+2\).
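As a toy illustration, consider a short made-up vector (purely illustrative, not the mortality data) and the centred 5-point mean at time t = 3:

```r
# Toy series, purely illustrative (not the mortality data)
x <- c(8, 6, 7, 5, 9, 4, 6)

# Centred 5-point moving average at t = 3 uses x[1], ..., x[5],
# i.e. the observations at t-2, t-1, t, t+1 and t+2
mean(x[1:5])   # (8 + 6 + 7 + 5 + 9) / 5 = 7
```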
Help for Task 4.1
Load the mortagg2.Rdata dataset (it contains the mortagg data frame).
load(here("data", "mortagg2.Rdata"))
Inspect the data.
str(mortagg)
'data.frame': 521 obs. of 8 variables:
$ cases_m: num 4207 4409 4229 4252 4196 ...
$ cases_f: num 4019 4264 4122 3837 3976 ...
$ cases : num 8226 8673 8351 8089 8172 ...
$ year : num 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
$ week : num 1 2 3 4 5 6 7 8 9 10 ...
$ pop : num 46486621 46486621 46486621 46486621 46486621 ...
$ pop_f : num 23504349 23504349 23504349 23504349 23504349 ...
$ pop_m : num 2.3e+07 2.3e+07 2.3e+07 2.3e+07 2.3e+07 ...
summary(mortagg)
cases_m cases_f cases year week
Min. :3276 Min. :2941 Min. : 6316 Min. :2010 Min. : 1.00
1st Qu.:3648 1st Qu.:3464 1st Qu.: 7106 1st Qu.:2012 1st Qu.:14.00
Median :3839 Median :3684 Median : 7510 Median :2015 Median :27.00
Mean :3969 Mean :3819 Mean : 7788 Mean :2015 Mean :26.55
3rd Qu.:4194 3rd Qu.:4047 3rd Qu.: 8217 3rd Qu.:2017 3rd Qu.:40.00
Max. :5597 Max. :5877 Max. :11471 Max. :2019 Max. :53.00
pop pop_f pop_m
Min. :46418884 Min. :23504349 Min. :22806445
1st Qu.:46486621 1st Qu.:23612439 1st Qu.:22831107
Median :46497393 Median :23621034 Median :22889600
Mean :46608292 Mean :23669979 Mean :22938313
3rd Qu.:46712650 3rd Qu.:23719207 3rd Qu.:23016783
Max. :46918951 Max. :23902168 Max. :23099009
view(mortagg)
skimr::skim(mortagg)
Data summary

Name: mortagg
Number of rows: 521
Number of columns: 8
Column type frequency: numeric (8)
Group variables: None

Variable type: numeric

skim_variable  n_missing  complete_rate         mean         sd        p0       p25       p50       p75      p100  hist
cases_m                0              1      3969.15     454.44      3276      3648      3839      4194      5597  ▇▇▃▁▁
cases_f                0              1      3818.53     544.44      2941      3464      3684      4047      5877  ▆▇▃▁▁
cases                  0              1      7787.68     992.14      6316      7106      7510      8217     11471  ▇▇▃▁▁
year                   0              1      2014.50       2.87      2010      2012      2015      2017      2019  ▇▇▇▇▇
week                   0              1        26.55      15.05         1        14        27        40        53  ▇▇▇▇▇
pop                    0              1  46608291.50  163065.99  46418884  46486621  46497393  46712650  46918951  ▇▁▅▂▂
pop_f                  0              1  23669978.80  102454.81  23504349  23612439  23621034  23719207  23902168  ▂▇▆▂▂
pop_m                  0              1  22938312.61  100399.88  22806445  22831107  22889600  23016783  23099009  ▇▅▁▇▅
Create a time series object with tsibble, using the total case counts and plot the time series.
mortz <- mortagg %>%
  mutate(date_index = make_yearweek(year = year, week = week)) %>%
  as_tsibble(index = date_index)

ggplot(data = mortz) +
  geom_line(mapping = aes(x = date_index, y = cases)) +
  scale_x_yearweek(date_labels = "%Y-%W", date_breaks = "1 year") +
  labs(x = "Year Week", y = "Number of Deaths",
       title = "Spain: number of deaths per week") +
  tsa_theme
## Adjusting y-axis scale
ggplot(data = mortz) +
  geom_line(mapping = aes(x = date_index, y = cases)) +
  scale_x_yearweek(date_labels = "%Y-%W", date_breaks = "1 year") +
  scale_y_continuous(limits = c(0, NA)) +
  labs(x = "Date", y = "Number of Deaths",
       title = "Spain: number of deaths per week") +
  tsa_theme
Create a similar plot for men and women.
## Men and Women in wide format
colors <- c("cases_m" = "green", "cases_f" = "black")

ggplot(data = mortz, aes(x = date_index)) +
  geom_line(mapping = aes(y = cases_m, colour = "cases_m"), lwd = 1) +
  geom_line(mapping = aes(y = cases_f, colour = "cases_f"), lwd = 1) +
  scale_x_yearweek(date_labels = "%Y-%W", date_breaks = "1 year") +
  scale_y_continuous(limits = c(0, NA)) +
  labs(x = "Week", y = "Number of Deaths",
       title = "Spain: number of deaths per week and by sex", colour = "Legend") +
  # Name the legend labels explicitly so "Men" maps to cases_m and
  # "Women" to cases_f (unnamed labels would pair with the sorted breaks)
  scale_color_manual(values = colors, name = "",
                     labels = c("cases_m" = "Men", "cases_f" = "Women")) +
  tsa_theme
Task 4.2
Assess visually the existence of a long-term trend and periodicity in the mortagg dataset by using smoothing techniques.
Consider: where would you centre the smoothing window and why? What is the effect of using different smoothing window centring options? Test!
Of all the methods applied here (moving average with different windows, loess with different spans), which one do you think is the best for eliminating seasonality? What would happen if you used an even greater window? Test!
Help for Task 4.2
According to the slider package description, slider is a package for rolling window analysis: a given function is repeatedly applied to different "windows" of your data as you step through it. Typical applications of rolling window functions include moving averages and cumulative sums. An introductory vignette for slider can be found by running the following command: vignette("slider")
For each record, create the following various types of moving average.
MA5a: the 5-week moving average, centered on cases.
MA5b: the 5-week moving average of cases and the 4 previous weeks.
MA5c: the 5-week moving average of the 5 previous weeks.
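Compare results. The three moving averages can be computed with the slide_index_dbl function from the slider package:

```r
mortzma <- mortz %>%
  mutate(
    # centred: 2 weeks before and 2 weeks after the current week
    MA5a = slide_index_dbl(
      .x = cases, .i = date_index, .f = mean, na.rm = TRUE,
      .before = 2, .after = 2, .complete = TRUE
    ),
    # trailing: current week and the 4 previous weeks
    MA5b = slide_index_dbl(
      .x = cases, .i = date_index, .f = mean, na.rm = TRUE,
      .before = 4, .complete = TRUE
    ),
    # the 5 previous weeks only: take a 6-value backwards window
    # and drop the last (current) value before averaging
    MA5c = slide_index_dbl(
      .x = cases, .i = date_index,
      .f = function(x) mean(x[-6], na.rm = TRUE),
      .before = 5, .complete = TRUE
    )
  )

# view first 10 lines of data
head(mortzma, 10)
```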
The slide_index_dbl is a slider package function designed to run a pre-specified function on a rolling window of a numeric (_dbl) variable, ordered by an index time (date or date-time) variable. A full description of this function can be found by running the following command: ?slide_index_dbl
The .x argument provides the vector with the numbers that will be used for computation.
The .i argument defines the vector with the time variable that will be used for ordering the data.
The .f argument indicates the function that will be used to perform the intended computations. In this case, an arithmetic mean computation is carried out by using the mean function. The na.rm = TRUE is a varying argument included as a ... argument of slide_index_dbl. (Run the expression args("slide_index_dbl") and take note of the position/sequence of this function's arguments).
The .before argument defines the number of observations before the central time point of a given time window to use in the rolling window computation. In the example above for a 5 time points centered moving average, in MA5a, 2 observations are used before the central point, and 2 observations are used after it, using the .after argument.
The .complete argument indicates whether the computation should be carried out only in rolling windows that have complete observations.
We applied a more complicated function to compute MA5c, taking a backwards moving window of six values but omitting the latest value (current time point) from the calculation of the average.
## wide format dataset ---
colors <- c("cases" = "black", "MA5a" = "red", "MA5b" = "blue", "MA5c" = "brown")

ggplot(data = mortzma, mapping = aes(x = date_index)) +
  geom_line(mapping = aes(y = cases, colour = "cases")) +
  geom_line(mapping = aes(y = MA5a, colour = "MA5a")) +
  geom_line(mapping = aes(y = MA5b, colour = "MA5b")) +
  geom_line(mapping = aes(y = MA5c, colour = "MA5c")) +
  scale_x_yearweek(date_labels = "%Y-%W", date_breaks = "1 year") +
  scale_y_continuous(limits = c(0, NA)) +
  labs(x = "Week", y = "Number of deaths",
       title = "Spain: number of deaths - moving averages", colour = "Legend") +
  scale_color_manual(values = colors, name = "") +
  tsa_theme
We observe that the smoothed values are similar across these various methods, but they are not aligned to the series in the same way. MA5a is centred in the middle of the period used to calculate the mean. MA5b is placed at the end of the period. MA5c is placed one step forward (smoothing functions can be used for forecasting the following point). The "models" provided are similar for a 5-week window, but the lag is different.
Moving average is only one way of smoothing. Other ways of smoothing the data to get a general idea of the trend include, for example, LOESS (locally estimated scatterplot smoothing) smoothing, where the contribution of surrounding observations is weighted, i.e. it is not the arithmetical mean for each set (window) of observations.
Why do you think we have chosen these windows for the moving average?
To better observe the general trend, we need to find the length of the moving average that will erase the seasonal component. Various lengths can be tried; here we have used 25, 51 and 103.
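These longer, centred moving averages can be computed in the same way:

```r
mortzma2 <- mortz %>%
  mutate(
    # 25-week window: 12 weeks before and after the current week
    MA25 = slide_index_dbl(
      .x = cases, .i = date_index, .f = mean, na.rm = TRUE,
      .before = 12, .after = 12, .complete = TRUE
    ),
    # 51-week window: 25 weeks before and after
    MA51 = slide_index_dbl(
      .x = cases, .i = date_index, .f = mean, na.rm = TRUE,
      .before = 25, .after = 25, .complete = TRUE
    ),
    # 103-week window: 51 weeks before and after
    MA103 = slide_index_dbl(
      .x = cases, .i = date_index, .f = mean, na.rm = TRUE,
      .before = 51, .after = 51, .complete = TRUE
    )
  )

## Visually inspect data
slice(mortzma2, 20:30)
slice(mortzma2, 45:55)
slice(mortzma2, 95:105)
```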
You can use View to examine the first rows of mortzma2.
View(mortzma2)
Plot the times series with the different moving averages.
## wide format dataset ---
colors <- c("cases" = "black", "MA25" = "red", "MA51" = "blue", "MA103" = "orange")

ggplot(data = mortzma2, mapping = aes(x = date_index)) +
  geom_line(mapping = aes(y = cases, colour = "cases"), linewidth = 1) +
  geom_line(mapping = aes(y = MA25, colour = "MA25"), linewidth = 1.2) +
  geom_line(mapping = aes(y = MA51, colour = "MA51"), linewidth = 1.2) +
  geom_line(mapping = aes(y = MA103, colour = "MA103"), linewidth = 1.2) +
  scale_x_yearweek(date_labels = "%Y-%W", date_breaks = "1 year") +
  scale_y_continuous(limits = c(0, NA)) +
  labs(x = "Year-Week", y = "Number of deaths",
       title = "Spain: number of deaths - moving averages with 25, 51 and 103-week windows",
       colour = "Legend") +
  scale_color_manual(values = colors, name = "", breaks = names(colors)) +
  tsa_theme
The geom_smooth function provides a smoothing curve using a LOESS method.
ggplot(mortz, mapping = aes(x = date_index, y = cases)) +
  geom_point(alpha = 0.5) +
  geom_smooth(se = TRUE) +
  scale_x_yearweek(date_labels = "%Y-%W", date_breaks = "1 year") +
  scale_y_continuous(limits = c(0, NA)) +
  labs(x = "Date", y = "Number of Deaths", title = "Loess smoothing") +
  tsa_theme
We can change the degree of smoothing by adding a span option. The span controls the amount of smoothing for the default loess smoother; smaller numbers produce wigglier lines, and larger numbers produce smoother lines.
ggplot(data = mortz, mapping = aes(x = date_index, y = cases)) +
  geom_point(alpha = 0.5) +
  geom_smooth(se = TRUE, span = 0.1) +
  scale_x_yearweek(date_labels = "%Y-%W", date_breaks = "1 year") +
  scale_y_continuous(limits = c(0, NA)) +
  labs(x = "Date", y = "Number of Deaths", title = "Loess smoothing (span = 0.1)") +
  tsa_theme
In ggplot(), we can also use stat_smooth, which likewise provides a smoothing curve using a LOESS method and where we can change the degree of smoothing with the span option.
ggplot(data = mortz, mapping = aes(x = date_index, y = cases)) +
  geom_point(alpha = 0.5) +
  stat_smooth(se = TRUE, span = 0.1) +
  scale_x_yearweek(date_labels = "%Y-%W", date_breaks = "1 year") +
  scale_y_continuous(limits = c(0, NA)) +
  labs(x = "Year-Week", y = "Number of deaths",
       title = "Spain: number of deaths - loess smoothing (span = 0.1)") +
  tsa_theme
Test different values of span. Comment on the lines provided by the different smoothing windows. Which one do you think is the best for eliminating seasonality? What would happen if you used an even greater window?
Task 4.3
Assess statistically whether the registered number of deaths has been stable in the years available in your dataset.
Help for Task 4.3
Using regression methods against the time variable is a simple and familiar way to look at trends and test the slope. The kind of regression and interpretation you will use depends on the nature of your dependent variable (outcome), in this case, the number (count) of registered deaths. Time will be represented by a variable that simply enumerates the observations; in other words, time is a variable with values from 1 (the first observation in time; here week 1, 2010) to n (the last observation in time; here week 52, 2019).
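Create the enumeration variable and fit a Poisson regression of the weekly counts on it, then check the model results with summary():

```r
mortz <- mortz %>%
  mutate(index = seq.int(from = 1, to = nrow(.)))

mort_poissontrend <- glm(
  formula = cases ~ index,
  family = "poisson",
  data = mortz
)

summary(mort_poissontrend)
```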
Call:
glm(formula = cases ~ index, family = "poisson", data = mortz)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 8.910e+00 1.007e-03 8850.25 <2e-16 ***
index 1.895e-04 3.302e-06 57.39 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 62509 on 520 degrees of freedom
Residual deviance: 59215 on 519 degrees of freedom
AIC: 64841
Number of Fisher Scoring iterations: 4
The glm function here fits a Poisson regression model, i.e. a generalised linear model of the weekly number of deaths with a log link. You can use the names function to see the various components of the output of the glm function. As mentioned in the previous session, the str function is also often useful for looking at the "structure" of the output of an R function. summary provides a quick check of the results.
Variables in R models are specified using notation along the lines of response_variable ~ explanatory_variable_1 + explanatory_variable_2 etc. Check the material from the MVA module. A more detailed explanation of how to build model formulas in R can also be found in the Details section of the help page of the formula function: ?stats::formula.
summary, when given the results of fitting a regression model, will provide a summary of the fitted parameters (coefficients); in this case the parameter of index is the trend.
Identify and interpret the intercept and the trend. In a Poisson model on the log(y), the coefficient needs to be exponentiated in order to interpret it for the output y (that is, the original y without the log).
Plot the fitted values against the observed ones.
colors <- c("observed" = "black", "fitted" = "blue")

ggplot(data = mortz, mapping = aes(x = date_index)) +
  geom_point(mapping = aes(y = cases, colour = "observed"), alpha = 0.5) +
  geom_line(mapping = aes(y = fitted(mort_poissontrend), colour = "fitted"), linewidth = 1.1) +
  scale_y_continuous(limits = c(0, NA)) +
  scale_x_yearweek(date_labels = "%Y-%W", date_breaks = "1 year") +
  labs(x = "Year-Week", y = "Number of deaths",
       title = "Spain: number of deaths per week with fitted trend") +
  scale_color_manual(values = colors, name = "") +
  tsa_theme
The trend as incidence rate ratio (IRR) provides the average relative change in the number of deaths from one week to the next.
# Trend in the original scale of the outcome cases (deaths) with 95% confidence intervals.
IRR_trend <- exp(cbind(Estimate = mort_poissontrend$coef, confint(mort_poissontrend)))
(IRR_trend['index', ] - 1) * 100
The result above indicates that, from one week to the next, the number of deaths increases on average by 0.019% per week, with a 95% confidence interval of [0.018%; 0.020%].
Another way in R to obtain the incidence rate ratio (IRR) is:
This code creates a regression table using the tbl_regression() function from the gtsummary package. It has two arguments:
- exponentiate = T (or TRUE): tells R to exponentiate the coefficients.
- estimate_fun = ~style_number(.x, digits = 4): controls how the numbers in the table are formatted; .x represents each coefficient value, and style_number() formats it to 4 decimal places.
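Based on that description, the call would look along these lines (a sketch using the arguments described above):

```r
# Sketch: IRR table for the trend model via gtsummary
mort_poissontrend %>%
  tbl_regression(
    exponentiate = TRUE,                        # report exp(coef), i.e. the IRR
    estimate_fun = ~ style_number(.x, digits = 4)  # 4 decimal places
  )
```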
Task 4.4
How could you take the changing population size of Spain into account in your analyses?
Help for Task 4.4
To take the population into account, we need to include the population in the model. This is done by adding a term called offset(). The parameter beta for the offset, in this case for the population, will be forced to be 1; in other words, the offset will not affect the outcome. The advantage is that we can interpret the IRR of time in terms of mortality rate = cases/population.
mort_poissontrend_pop <- glm(
  cases ~ index + offset(log(pop)),
  data = mortz,
  family = "poisson"
)

summary(mort_poissontrend_pop)
Call:
glm(formula = cases ~ index + offset(log(pop)), family = "poisson",
data = mortz)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.746e+00 1.006e-03 -8691.66 <2e-16 ***
index 1.863e-04 3.299e-06 56.47 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 62588 on 520 degrees of freedom
Residual deviance: 59399 on 519 degrees of freedom
AIC: 65025
Number of Fisher Scoring iterations: 4
# Trend in the original scale of the outcome cases (deaths) with 95% confidence intervals.
IRR_trend_pop <- exp(cbind(Estimate = mort_poissontrend_pop$coef, confint(mort_poissontrend_pop)))
(IRR_trend_pop['index', ] - 1) * 100
The summary() function only provides the coefficients in the original log scale of the Poisson model. Another way of obtaining the IRR is by using tbl_regression with the argument exponentiate = TRUE:
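A sketch of that call for the population-offset model (formatting as in the earlier table):

```r
# Sketch: IRR table for the model with the population offset
mort_poissontrend_pop %>%
  tbl_regression(
    exponentiate = TRUE,
    estimate_fun = ~ style_number(.x, digits = 4)
  )
```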
Compare the trend and IRR in the models with and without population.
For communication, it might be better to present the results in terms of the number of deaths per 100,000 inhabitants, which is then called the mortality rate. We will now plot the same trend, but as a rate per 100,000 inhabitants. For this purpose we must adjust both the observed and the fitted values by dividing by the population and multiplying by 100,000.
mortz <- mortz %>%
  mutate(mortality_rate = cases / pop * 100000)

ggplot(data = mortz, mapping = aes(x = date_index)) +
  geom_point(mapping = aes(y = mortality_rate), alpha = 0.5) +
  geom_line(mapping = aes(y = fitted(mort_poissontrend_pop) / pop * 100000),
            colour = "blue", linewidth = 1.1) +
  scale_y_continuous(limits = c(0, NA)) +
  scale_x_yearweek(date_labels = "%Y-%W", date_breaks = "1 year") +
  labs(x = "Week", y = "Mortality rate (per 100,000 pop)",
       title = "Spain: Mortality rate per 100,000 population per week with fitted trend") +
  tsa_theme
How do you interpret the IRR in the model for mortality rate?
Task 4.5 (Optional)
Carry out the same analysis for men and women. Do you see differences between the two? Where?
Save
Save mortagg to be used in the next practical.
# Save the mortagg and mortz objects.
#save(mortagg, mortz, file = here("data", "mortagg_case_4.RData"))
# Practical Session 4: Smoothing and Trends {.unnumbered}```{r, knitr-global-chunk-04t}rm(list=ls())# Install pacman if not installed already, and activate itif (!require("pacman")) install.packages("pacman")# Install, update and activate librariespacman::p_load( here, rio, skimr, tsibble, ISOweek, slider, # for rolling means pander, tidyverse, gtsummary)# Create tsa_themetsa_theme <-theme_bw() +theme(plot.title =element_text(face ="bold", size =12),legend.background =element_rect(fill ="white", size =4, colour ="white"),legend.justification =c(0, 1),legend.position ="bottom",panel.grid.minor =element_blank() )```## Mortality Surveillance data in Spain {#title-4-1 .unnumbered}The number of deaths in a country can be used as a surveillance indicator. An example of such an application is EuroMOMO, which monitors number of fatalities in European countries on a weekly basis (https://www.euromomo.eu/). The weekly time series on the number of deathls are routinely used to follow seasonal infections, for example influenza and covid 19 during the winter, but also heatwaves in the summer.In case studies 4 to 7, we will analyse the number of deaths retrieved from the Spanish national statistical institute (INE) (https://ine.es/). Each case study will build up from the analysis conducted in the previous case study. 
The data contains the number of fatalities per week for the entire country, from 2010 to 2019, and includes variables on year, week, total number of deaths (`cases`), and total population; the same data is also included for men and women.## Expected learning outcomes {#learn-4 .unnumbered}By the end of the session, participants should be able to:- Describe, test and fit a trend in surveillance data (simple smoothing and regression);- Assess and interpret the significance of trend in surveillance data.Save any changes to a new dataset named \``mortagg`.## Task 4.1 {#task-4-1 .unnumbered}Assess visually how the total registered number of deaths has been behaving in the years with available data.Discuss your results with your peers.### Moving average and direct statistical modelling of trends {#task-4-1-1a .unnumbered}Moving averages are simple methods to visualise the general trend of a series after removing some of the random day-to-day variation by smoothing the data. This allows you to browse your data for periodicity and observe the general trend. In other words, smoothing the data may remove “noise” from your time series and can facilitate visual interpretation.Moving averages model a time series by calculating the numerical mean of the values adjacent to it. It is calculated for each observation, moving along the time axis: e.g. 
at each time $t$ and for a window of 5 time units, one way of calculating the moving average is by using the observations at $t-2$, $t-1$, $t$, $t+1$ and $t+2$.## Help for Task 4.1 {#solution-4-1 .unnumbered .unnumbered}Load the `mortagg.Rdata` dataset.```{r, task-4-1-import-data}load(here("data", "mortagg2.Rdata"))```Inspect the data.```{r, task-4-1-summary-mort}str(mortagg)summary(mortagg)view(mortagg)``````{r, task-4-1-describe-mort}skimr::skim(mortagg) ```Create a time series object with `tsibble`, using the total case counts and plot the time series.```{r, task-4-1-mort-aggr-week-plot-tidy}mortz <- mortagg %>%mutate(date_index =make_yearweek(year = year, week = week)) %>%as_tsibble(index = date_index)ggplot(data = mortz) +geom_line(mapping =aes(x = date_index, y = cases)) +scale_x_yearweek(date_labels ="%Y-%W", date_breaks ="1 year") +labs(x ="Year Week", y ="Number of Deaths", title ="Spain: number of deaths per week") + tsa_theme``````{r, task-4-1-mort-aggr-week-plot2-tidy}## Adjusting y-axis scaleggplot(data = mortz) +geom_line(mapping =aes(x = date_index, y = cases)) +scale_x_yearweek(date_labels ="%Y-%W", date_breaks ="1 year") +scale_y_continuous(limits =c(0, NA)) +labs(x ="Date", y ="Number of Deaths", title ="Spain: number of deaths per week") + tsa_theme```Create a similar plot for men and women.```{r, task-opt-4-1-mort-sex-aggr-week-plots-tidy}## Men and Women in wide formatcolors <-c("cases_m"="green", "cases_f"="black")ggplot(data = mortz, aes(x = date_index)) +geom_line(mapping =aes(y = cases_m, colour ="cases_m"), lwd=1) +geom_line(mapping =aes(y = cases_f, colour ="cases_f"), lwd=1) +scale_x_yearweek(date_labels ="%Y-%W", date_breaks ="1 year") +scale_y_continuous(limits =c(0, NA)) +labs(x ="Week", y ="Number of Deaths", title ="Spain: number of deaths per week and by sex", colour ="Legend") +scale_color_manual(values = colors, name="", labels =c("Men", "Women")) + tsa_theme```## Task 4.2 {#task-4-2 .unnumbered .unnumbered}Assess visually the 
existence of a long-term trend and periodicity in the `mortaagg` dataset by using smoothing techniques.Consider: where would you centre the smoothing window and why? What is the effect of using different smoothing window centring options? Test!Of all the methods applied here (moving average with different windows, loess with different spans), which one do you think is the best for eliminating seasonality? Where would you centre the smoothing window and why? What would happen if you used an even greater window? Test!## Help for Task 4.2 {#solution-4-2 .unnumbered}According to the `slider` package description, `slider` is a package for rolling windows analysis. It means that a given function is repeatedly applied to different “windows” of your data as you step through it. Typical examples of applications of rolling window functions include moving averages or cumulative sums. An introduction vignette for slider can be found by running the following command: `vignette("slider")`For each record, create the following various types of moving average.- `MA5a`: the 5-week moving average, centered on cases.- `MA5b`: the 5-week moving average of cases and the 4 previous weeks.- `MA5c`: the 5-week moving average of the 5 previous weeks.Compare results.```{r, task-4-2-mort-MAx}mortzma <- mortz %>%mutate(MA5a =slide_index_dbl(.x = cases,.i = date_index,.f = mean,na.rm =TRUE,.before =2,.after =2,.complete =TRUE ),MA5b =slide_index_dbl(.x = cases,.i = date_index,.f = mean,na.rm =TRUE,.before =4,.complete =TRUE ),MA5c =slide_index_dbl(.x = cases,.i = date_index,.f =function(x) mean(x[-6], na.rm =TRUE),.before =5,.complete =TRUE ) )# view first 10 lines of datahead(mortzma, 10)```The `slide_index_dbl` is a `slider` package function designed to run a pre-specified function on a rolling window of a numeric (`_dbl`) variable, ordered by an index time (date or date-time) variable. 
A full description of this function can be found by running the following command: `?slide_index_dbl`The `.x` argument provides the vector with the numbers that will be used for computation.The `.i` argument defines the vector with the time variable that will be used for ordering the data.The `.f` argument indicates the function that will be used to perform the intended computations. In this case, an arithmetic mean computation is carried out by using the `mean` function. The `na.rm = TRUE` is a varying argument included as a `...` argument of `slide_index_dbl`. (Run the expression `args("slide_index_dbl")` and take notice on the position/sequence of this function's arguments).The `.before` argument defines the number of observations before the central time point of a given time window to use in the rolling window computation. In the example above for a 5 time points centered moving average, in `MA5a`, 2 observations are used before the central point, and 2 observations are used after it, using the `.after` argument.The `complete` argument indicates where the computation should be carried in rolling windows that have complete observations.We applied a more complicated function to compute `MA5c`, taking a backwards moving window of six values but omitting the latest value (current time point) from the calculation of the average.```{r task-4-2-mort-MAx-plots-tidy, message=FALSE, warning=FALSE}## wide format dataset ---colors <-c("cases"="black", "MA5a"="red", "MA5b"="blue", "MA5c"="brown")ggplot(data = mortzma, mapping =aes(x = date_index)) +geom_line(mapping =aes(y = cases, colour ="cases")) +geom_line(mapping =aes(y = MA5a, colour ="MA5a")) +geom_line(mapping =aes(y = MA5b, colour ="MA5b")) +geom_line(mapping =aes(y = MA5c, colour ="MA5c")) +scale_x_yearweek(date_labels ="%Y-%W", date_breaks ="1 year") +scale_y_continuous(limits =c(0, NA)) +labs(x ="Week", y ="Number of deaths", title ="Spain: number of deaths - moving averages",colour="Legend") 
+scale_color_manual(values = colors, name="") + tsa_theme```We observe that the calculation is similar across these various methods, but is not aligned to the series in the same way. `MA5a` is centered in the middle of the period used to calculate the mean. `MA5b` is placed at the end of the period. `MA5c` is placed one step forward (smoothing functions can be used for forecasting the following point). The "models" provided are similar for a 5-week window, but the *lag* is different.Moving average is only one way of smoothing. Other ways of smoothing the data to get a general idea of the trend include, for example, LOESS (locally estimated scatterplot smoothing) smoothing, where the contribution of surrounding observations is weighted, i.e. it is not the arithmetical mean for each set (window) of observations.Why do you think we have chosen these windows for the moving average?To better observe the general trend, we need to find the length of the moving average that will erase the seasonal component. 
Various lengths can be tried; here we have used 25, 51 and 103.```{r, task-4-2-mort-MAx-longer-length}mortzma2 <- mortz %>%mutate(MA25 =slide_index_dbl(.x = cases,.i = date_index,.f = mean,na.rm =TRUE,.before =12,.after =12,.complete =TRUE ),MA51 =slide_index_dbl(.x = cases,.i = date_index,.f = mean,na.rm =TRUE,.before =25,.after=25,.complete =TRUE ),MA103 =slide_index_dbl(.x = cases,.i = date_index,.f = mean,na.rm =TRUE,.before =51,.after =51,.complete =TRUE ) )## Visually inspect dataslice(mortzma2, 20:30)slice(mortzma2, 45:55)slice(mortzma2, 95:105)```You can use `View` to examine the first rows of `mortzma2.````{r, task-4-2-view-mortzma2, eval=FALSE}View(mortzma2)```Plot the times series with the different moving averages.```{r, task-4-2-mort-MAx-longer-length-plots-tidy}## wide format dataset ---colors <-c("cases"="black", "MA25"="red", "MA51"="blue", "MA103"="orange")ggplot(data = mortzma2, mapping =aes(x = date_index)) +geom_line(mapping =aes(y = cases, colour ="cases"), linewidth=1) +geom_line(mapping =aes(y = MA25, colour ="MA25"), linewidth=1.2) +geom_line(mapping =aes(y = MA51, colour ="MA51"), linewidth=1.2) +geom_line(mapping =aes(y = MA103, colour ="MA103"), linewidth=1.2) +scale_x_yearweek(date_labels ="%Y-%W", date_breaks ="1 year") +scale_y_continuous(limits =c(0, NA)) +labs(x ="Year-Week", y ="Number of deaths", title ="Spain: number of deaths - moving averages with 26, 51 and 103-week windows",colour="Legend") +scale_color_manual(values = colors, name="", breaks =names(colors)[c(1,2,3,4)]) + tsa_theme```Moving average is only one way of smoothing. Other ways of smoothing the data to get a general idea of the trend include, for example, LOESS (locally estimated scatterplot smoothing) smoothing, where the contribution of surrounding observations is weighted, i.e. 
it is not the arithmetical mean for each set (window) of observations.The `geom_smooth` provides a smoothing curve using a LOESS method.```{r task-4-2-mort-MAx-scatterplot-tidy, message=FALSE, warning=FALSE}ggplot(mortz, mapping =aes(x = date_index, y = cases)) +geom_point(alpha =0.5) +geom_smooth(se =TRUE) +scale_x_yearweek(date_labels ="%Y-%W", date_breaks ="1 year") +scale_y_continuous(limits =c(0, NA)) +labs(x ="Date", y ="Number of Deaths", title ="Loess smoothing") + tsa_theme```We can change the degree of smoothing by adding a `span` option. The `span` controls the amount of smoothing for the default loess smoother; smaller numbers produce wigglier lines, and larger numbers produce smoother lines.```{r, task-4-2-mort-MAx-scatterplot-span-tidy2}ggplot(data = mortz, mapping =aes(x = date_index, y = cases)) +geom_point(alpha =0.5) +geom_smooth(se =TRUE, span =0.1) +scale_x_yearweek(date_labels ="%Y-%W", date_breaks ="1 year") +scale_y_continuous(limits =c(0, NA)) +labs(x ="Date", y ="Number of Deaths", title ="Loess smoothing (span = 0.1)") + tsa_theme```In ggplot(), we can also use `stat_smooth`, which provides a smoothing curve using a LOESS method, and were wee can change the degree of smoothing by adding a `span` option. The `span` controls the amount of smoothing for the default loess smoother; smaller numbers produce wigglier lines, and larger numbers produce smoother lines.```{r, task-4-2-mort-MAx-scatterplot-span-tidy}ggplot(data = mortz, mapping =aes(x = date_index, y = cases)) +geom_point(alpha =0.5) +stat_smooth(se =TRUE, span=0.1) +scale_x_yearweek(date_labels ="%Y-%W", date_breaks ="1 year") +scale_y_continuous(limits =c(0, NA)) +labs(x ="Year-Week", y ="Number of deaths", title ="Spain: number of deaths - loess smoothing (span = 0.1)") + tsa_theme```Test different values of span. Comment on the lines provided by the different smoothing windows. Which one do you think is the best for eliminating seasonality? 
What would happen if you used an even greater window?

## Task 4.3 {#task-4-3 .unnumbered}

Assess statistically whether the registered number of deaths has been stable in the years available in your dataset.

## Help for Task 4.3 {#solution-4-3 .unnumbered}

Using regression methods against the time variable is a simple and familiar way to look at trends and test the slope. The kind of regression and interpretation you will use depends on the nature of your dependent variable (outcome), in this case, the number (count) of registered deaths. Time will be represented by a variable that simply enumerates the observations; in other words, time is a variable with values from 1 (the first observation in time; here week 1, 2010) to `n` (the last observation in time; here week 52, 2019).

```{r, task-4-3-regression-poi}
mortz <- mortz %>%
  mutate(index = seq.int(from = 1, to = nrow(.)))

mort_poissontrend <- glm(
  formula = cases ~ index,
  family = "poisson",
  data = mortz
)
```

Check the model results.

```{r, task-4-3-regression-lm-summary}
summary(mort_poissontrend)
```

The `glm` function fits a Poisson regression model, i.e. a linear model on the log of the expected weekly number of deaths. You can use the `names` function to see the various components of the output of the `glm` function, as above. As mentioned in the previous session, the `str` function is also often useful for looking at the "structure" of the output of an R function. `summary` provides a quick way of checking the results.

Variables in R models are specified using notation along the lines of `response_variable ~ explanatory_variable_1 + explanatory_variable_2`, etc. Check the material from the MVA module. 
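As a minimal, self-contained illustration of this formula notation (using a made-up toy data frame, not the mortality data):

```r
# Toy example of R's model formula notation (made-up data, not the mortality series)
toy <- data.frame(
  y  = c(2, 4, 5, 8, 9, 12),
  x1 = 1:6,
  x2 = c(0, 1, 0, 1, 0, 1)
)

# response_variable ~ explanatory_variable_1 + explanatory_variable_2
m_toy <- lm(y ~ x1 + x2, data = toy)
coef(m_toy)  # one intercept plus one coefficient per explanatory variable
```

The same formula syntax is used by `glm`; only the `family` argument changes the type of model fitted.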
A more detailed explanation of how to build model formulas in R can also be found in the *Details* section of the help page of the `formula` function: `?stats::formula`.

`summary`, when given the results of fitting a regression model, will provide a summary of the fitted parameters (coefficients); in this case, the parameter of `index` is the trend.

Identify and interpret the intercept and the trend. In a Poisson model on the log(y), the coefficient needs to be exponentiated in order to interpret it for the output y (that is, the original y without the log).

Plot the fitted values against the observed ones.

```{r, task-4-3-regression-glm-plot-tidy}
colors <- c("observed" = "black", "fitted" = "blue")

ggplot(data = mortz, mapping = aes(x = date_index)) +
  geom_point(mapping = aes(y = cases, colour = "observed"), alpha = 0.5) +
  geom_line(mapping = aes(y = fitted(mort_poissontrend), colour = "fitted"), linewidth = 1.1) +
  scale_y_continuous(limits = c(0, NA)) +
  scale_x_yearweek(date_labels = "%Y-%W", date_breaks = "1 year") +
  labs(x = "Year-Week", y = "Number of deaths",
       title = "Spain: number of deaths per week with fitted trend") +
  scale_color_manual(values = colors, name = "") +
  tsa_theme
```

The trend as an incidence rate ratio (IRR) provides the average relative change in the number of deaths from one week to the next.

```{r, task-opt-4-3-IRR}
# Trend in the original scale of the outcome cases (deaths) with 95% confidence intervals. 
IRR_trend <- exp(cbind(Estimate = mort_poissontrend$coef, confint(mort_poissontrend)))
(IRR_trend['index', ] - 1) * 100
```

The result above indicates that, from one week to the next, the number of deaths increases on average by 0.019% per week, with a 95% confidence interval of \[0.018%; 0.020%\].

Another way in R to obtain the incidence rate ratio (IRR) is:

```{r, task-4-3-IRR-gtsummary}
mort_poissontrend %>%
  gtsummary::tbl_regression(
    exponentiate = TRUE,
    estimate_fun = ~ style_number(.x, digits = 4)
  )
```

This code creates a regression table using the `tbl_regression()` function from the `gtsummary` package. It has two arguments:

- `exponentiate = TRUE`: this tells R to exponentiate the coefficients.
- `estimate_fun = ~style_number(.x, digits = 4)`: this controls how the numbers in your table are formatted. `.x` represents each coefficient value, and `style_number()` formats it to 4 decimal places.

## Task 4.4 {#task-4-4 .unnumbered}

How could you take the changing population size of Spain into account in your analyses?

## Help for Task 4.4 {#solution-4-4 .unnumbered}

To take the population into account, we need to include the population in the model. This is done by adding a term called `offset()`. The parameter beta for the offset, in this case for the population, will be forced to be 1; in other words, the offset will not affect the outcome. The advantage is that we can interpret the IRR of time in terms of the mortality rate = cases/population.

```{r, task-opt-4-4-regression-pois}
mort_poissontrend_pop <- glm(
  cases ~ index + offset(log(pop)),
  data = mortz,
  family = "poisson"
)

summary(mort_poissontrend_pop)

# Trend in the original scale of the outcome cases (deaths) with 95% confidence intervals. 
IRR_trend_pop <- exp(cbind(Estimate = mort_poissontrend_pop$coef, confint(mort_poissontrend_pop)))
(IRR_trend_pop['index', ] - 1) * 100
```

The `summary()` function will only provide the coefficients in the original log scale of the Poisson model. 
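A weekly percentage this small can be hard to grasp; compounding it over the 52 weeks of a year gives a more tangible annual figure. A minimal sketch, taking the 0.019% weekly increase reported above as the worked number:

```r
# Compound a weekly IRR over 52 weeks to approximate the annual change
weekly_irr <- 1.00019              # IRR corresponding to a 0.019% weekly increase
annual_irr <- weekly_irr^52        # compounding over the 52 weeks of a year
round((annual_irr - 1) * 100, 2)   # percent change per year, roughly 1%
```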
Another way of obtaining the IRR is by using `tbl_regression` with the argument `exponentiate = TRUE`:

```{r, task-4-4-regression-pois-pop-IRR}
mort_poissontrend_pop %>%
  gtsummary::tbl_regression(
    exponentiate = TRUE,
    estimate_fun = ~ style_number(.x, digits = 4)
  )
```

Plot the predicted and the original values:

```{r, task-opt-4-4-regression-pois-pop_plot-tidy}
ggplot(data = mortz, mapping = aes(x = date_index)) +
  geom_point(mapping = aes(y = cases / pop), alpha = 0.5) +
  geom_line(mapping = aes(y = fitted(mort_poissontrend_pop) / pop), colour = "blue", linewidth = 1.1) +
  scale_y_continuous(limits = c(0, NA)) +
  scale_x_yearweek(date_labels = "%Y-%W", date_breaks = "1 year") +
  labs(x = "Week", y = "Mortality rate",
       title = "Spain: Mortality rate per week with fitted trend") +
  tsa_theme
```

Compare the trend and IRR in the models with and without population.

For communication, it might be better to present the results in terms of the number of deaths per 100,000 inhabitants (population), which would then be called the mortality rate. We will now plot the same trend, but as a rate per 100,000 inhabitants. For this purpose we must adjust both the observed and the fitted values by multiplying by 100,000.

```{r, task-opt-4-3-regression-pois-mortrate-plot-tidy}
mortz <- mortz %>%
  mutate(mortality_rate = cases / pop * 100000)

ggplot(data = mortz, mapping = aes(x = date_index)) +
  geom_point(mapping = aes(y = mortality_rate), alpha = 0.5) +
  geom_line(mapping = aes(y = fitted(mort_poissontrend_pop) / pop * 100000), colour = "blue", linewidth = 1.1) +
  scale_y_continuous(limits = c(0, NA)) +
  scale_x_yearweek(date_labels = "%Y-%W", date_breaks = "1 year") +
  labs(x = "Week", y = "Mortality rate (per 100,000 pop)",
       title = "Spain: Mortality rate per 100,000 population per week with fitted trend") +
  tsa_theme
```

How do you interpret the IRR in the model for mortality rate?

## Task 4.5 (Optional) {#task-4-5 .unnumbered}

Carry out the same analysis for men and women. Do you see differences between the two? 
If so, where?

## Save

Save `mortagg` to be used in the next practical.

```{r, task-4-save}
# Save the mortagg and mortz objects.
#save(mortagg, mortz, file = here("data", "mortagg_case_4.RData"))
```
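For Task 4.5, a minimal sketch of how the stratified analysis could be set up, assuming the sex-specific weekly counts are held in columns named `cases_men` and `cases_women` (these column names are assumptions; check the actual names in your dataset and adapt):

```r
# Hypothetical sketch: the same Poisson trend model fitted separately by sex.
# cases_men and cases_women are assumed column names; adapt to your data.
mort_trend_men   <- glm(cases_men   ~ index, family = "poisson", data = mortz)
mort_trend_women <- glm(cases_women ~ index, family = "poisson", data = mortz)

# Compare the weekly trends (IRR) from the two models
exp(coef(mort_trend_men)["index"])
exp(coef(mort_trend_women)["index"])
```

The two IRRs (and their confidence intervals, via `confint`) can then be compared to judge whether the trend differs between men and women.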