Show the code
rota <- import(here("data", "rotavirus.csv"))
view(rota)By the end of the case study, participants will be able to:
Rotavirus is a very common and potentially serious infection of the large bowel, mainly affecting young babies. Nearly every child will have at least one episode of rotavirus gastroenteritis by five years of age. People of any age can be affected, but the illness is more severe in young infants. The rotavirus immunisation programme was introduced in the UK on 1 July 2013 (week 27) with the objective of preventing a significant number of young infants from developing rotavirus infection. It may also provide some additional protection to the wider population through herd immunity. The aim of the rotavirus immunisation programme is to provide two doses of vaccine to infants from six weeks of age and before 24 weeks of age. The first dose of vaccine is offered at approximately eight weeks of age and the second dose at least four weeks after the first dose.
High coverage was rapidly achieved for the first cohort of children offered rotavirus vaccine routinely in England and this has been maintained throughout the first fourteen months of the routine programme. Over this period, rotavirus vaccine coverage for children in the routine cohort averaged 93.3% for one dose and 88.3% for two doses.1
You have been given a dataset for notified Rotavirus infections in England and Wales from week 23/2009 up to week 42/2018 (rotavirus.csv).
Assess the impact of the introduction of the vaccination scheme in England and Wales for the whole population. Has the vaccine introduction had an effect on trend or periodicity patterns? Has the vaccine introduction affected the timing of the yearly outbreaks?
Open the rotavirus dataset, which is a line list of all notified rotavirus infections in the UK between June 2009 and December 2014. Prepare your data and test if the introduction of the vaccine has had an impact on the notified number of cases of rotavirus infections in the UK. Has the trend changed? Has the periodicity changed?
Prepare the rotavirus.csv for time-series analysis. You need to manipulate the dataset before being able to use it for time series analysis.
rota <- import(here("data", "rotavirus.csv"))
view(rota)As the data in rota shows above, this dataset is structured in a “wide” format. You can see two variables representing dates, and all other variables containing case counts are organised horizontally. Alternatively, we can store the same information by converting the first dataset to a “long” format, where one variable contains the different age group categories of cases (cases_cat), and another displays the corresponding case counts (n) for each age group and year-week.
In order to use ggplot2, remember that the data should be organised in a long format.
The function as_tsibble() is what indicates to R that the data is a time series. index represents the time variable and key represents any categorical variable that splits the data into several time series (eg: location, age group…).
## convert to long format
rota_lg <- rota %>%
pivot_longer(
cols = -c(year, week),
names_to = "cases_cat",
values_to = "n"
)
## recode
rota_lg <- rota_lg %>%
mutate(cases_cat = case_when(
cases_cat == "case1" ~ "Cases <1 year",
cases_cat == "case2" ~ "Cases 1 to 4 years",
cases_cat == "case3" ~ "Cases 5 to 14 years",
cases_cat == "case4" ~ "Cases 15 to 64 years",
cases_cat == "case5" ~ "Cases 65+ years",
cases_cat == "cases" ~ "Total cases",
TRUE ~ NA_character_
))
## turn into ts format and make sure the data is complete
rota_lg <- rota_lg %>%
mutate(year_week = make_yearweek(year = year, week = week)) %>%
group_by(cases_cat) %>%
mutate(index = seq.int(from = 1, to = length(unique(year_week)))) %>%
ungroup() %>%
as_tsibble(index = year_week, key = cases_cat) %>%
complete(index, cases_cat, fill=list(n = 0))Plot the total number of cases by age group against time. Note that .
ggplot(data = rota_lg) +
geom_line (mapping = aes(x = year_week, y = n), alpha = 0.3) +
facet_wrap(facets = vars(cases_cat),
scales = "free_y") +
scale_y_continuous(limits = c(0, NA)) +
scale_x_yearweek(date_labels = "%Y-%W",
date_breaks = "1 year") +
labs(
x = "Year",
y = "Weekly case counts",
title = "Rotavirus cases\n(note differing y scales)"
) +
tsa_theme## Notes:
# You can get line breaks in labels using \n
# You can also make all y-axes the same by using scales = "fixed_y"We want to see whether the number of cases changed after the introduction of the vaccine.
For that, first we need to state in the dataset when the vaccine was available. Firstly, we are adding one new vaccine variable to the rota_ts time series. It will be 0 for weeks before week 27 2013, and 1 thereafter.
We have also created sine and cosine variables so that we can model annual seasonality as before.
rota_ts <- rota_lg %>%
mutate(
vaccine = case_when(
year >= 2014 ~ 1,
year <= 2012 ~ 0,
year == 2013 & week >= 27 ~ 1,
year == 2013 & week < 27 ~ 0,
TRUE ~ NA_real_
),
sin52 = sin(2 * pi * index / 52),
cos52 = cos(2 * pi * index / 52)
)
## subset to total cases
rota_ts <- rota_ts %>%
filter(cases_cat == "Total cases")Run first a separate model for the time before and after the introduction of the vaccine in England and Wales for all population. Can you detect any change in trend?
Note in the code that we are building further on the tbl_regression function, adding options.
What we saw before:
add_glance_table() to add key parameters at the bottom of the table.table_merge() to take several tables built with tbl_regression and put them in one big table.What we are adding now: - style_pvalue to format the pvalue (here with max 3 digits) - modify_caption to add a table title
rotamodel_prevacc_p <- glm(n ~ index,
data = rota_ts %>% filter(vaccine == 0),
family = "poisson"
)
#If the `summary` command is used for a `glm` Poisson model, the log rate ratios are reported by
#default. They need to be exponentiated
summary_rotamodel_prevacc_p <- rotamodel_prevacc_p %>%
gtsummary::tbl_regression(
exponentiate = TRUE,
pvalue_fun = ~style_pvalue(.x, digits = 3),
estimate_fun = ~style_number(.x, digits = 4)
) %>%
gtsummary::add_glance_table(include = c(AIC, deviance, df.residual)) %>%
gtsummary::modify_caption("Rotavirus model pre-vaccination - Poisson")
summary_rotamodel_prevacc_p| Characteristic | IRR | 95% CI | p-value |
|---|---|---|---|
| index | 1.0028 | 1.0027, 1.0029 | <0.001 |
| AIC | 78,347 | ||
| Deviance | 76,884 | ||
| Residual df | 210 | ||
| Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio | |||
Is a Poisson model the best option?
Before we have a look at the period post-vaccination, is a Poisson model really the best model to use here? We need to check whether our data exhibit overdispersion.
Overdispersion occurs when the mean of the data is not equal to its variance, violating the fundamental assumption of the Poisson distribution. If this happens, the model underestimates the standard error of the coefficients, biasing results towards narrower confidence intervals and artifically lower p-values.
Let’s check for overdispersion. We can calculate an overdispersion parameter from a model as the ratio of the deviance and the residuals degrees of freedom. Also, proper statistical test are available, for example using the AER library and the dispersiontest function
# Quick check for overdispersion: if check > 1 then use negative binomial
deviance <- summary(rotamodel_prevacc_p)$deviance
residual.df <- summary(rotamodel_prevacc_p)$df.residual
check <- deviance / residual.df
check[1] 366.1138
# Proper check for overdispersion:
AER::dispersiontest(rotamodel_prevacc_p, trafo = NULL, alternative = c("greater"))
Overdispersion test
data: rotamodel_prevacc_p
z = 7.2434, p-value = 2.188e-13
alternative hypothesis: true dispersion is greater than 1
sample estimates:
dispersion
446.172
The overdispersion parameter is fairly greater that 1, and the overdispersion test is significant, so we can safely conclude that a negative binomial model is more appropiate. Let’s compute it with the glm.nb function, and compare with the previous one
#so best to use a negative binomial here
rotamodel_prevacc_nb <- glm.nb(n ~ index,
data = rota_ts %>% filter(vaccine == 0))
# Table with results
summary_rotamodel_prevacc_nb <- rotamodel_prevacc_nb %>%
gtsummary::tbl_regression(
exponentiate = TRUE,
pvalue_fun = ~style_pvalue(.x, digits = 3),
estimate_fun = ~style_number(.x, digits = 4)
) %>%
gtsummary::add_glance_table(include = c(AIC, deviance, df.residual)) %>%
gtsummary::modify_caption("Rotavirus model pre-vaccination - Negative Binomial")
# Let's compare the two models
comparison_p_nb <- gtsummary::tbl_merge(
tbls = list(summary_rotamodel_prevacc_p,
summary_rotamodel_prevacc_nb),
tab_spanner = c("Poisson model", "Negative Binomial model")
)
comparison_p_nb| Characteristic |
Poisson model
|
Negative Binomial model
|
||||
|---|---|---|---|---|---|---|
| IRR | 95% CI | p-value | IRR | 95% CI | p-value | |
| index | 1.0028 | 1.0027, 1.0029 | <0.001 | 1.0029 | 1.0005, 1.0054 | 0.015 |
| AIC | 78,347 | 2,854 | ||||
| Deviance | 76,884 | 248 | ||||
| Residual df | 210 | 210 | ||||
| Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio | ||||||
The AIC comparison proves how the negative binomial model outperforms the Poisson model. Also, realize how the index coefficient shifted from a tiny negative to a tiny positive overal trend, and how the p-value has greatly increased (but it still significant)
Let’s use negative binomial models for the rest of the practical.
Now let’s look at the period post-introduction of the vaccine.
We will fit a model for the post-vaccination period and compare it with the pre-vaccination period
rotamodel_postvacc <- glm.nb(n ~ index,
data = rota_ts %>% filter(vaccine == 1))
# Table with results
summary_rotamodel_postvacc <- rotamodel_postvacc %>%
gtsummary::tbl_regression(
exponentiate = TRUE,
pvalue_fun = ~style_pvalue(.x, digits = 3),
estimate_fun = ~style_number(.x, digits = 4)) %>%
gtsummary::add_glance_table(include = c(AIC, deviance, df.residual)) %>%
gtsummary::modify_caption("Rotavirus model post-vaccination")
# Comparisson between models
comparison_pre_post <- gtsummary::tbl_merge(
tbls = list(summary_rotamodel_prevacc_nb,
summary_rotamodel_postvacc),
tab_spanner = c("Pre-vaccination", "Post-vaccination")
)
comparison_pre_post| Characteristic |
Pre-vaccination
|
Post-vaccination
|
||||
|---|---|---|---|---|---|---|
| IRR | 95% CI | p-value | IRR | 95% CI | p-value | |
| index | 1.0029 | 1.0005, 1.0054 | 0.015 | 0.9977 | 0.9967, 0.9988 | <0.001 |
| AIC | 2,854 | 2,830 | ||||
| Deviance | 248 | 293 | ||||
| Residual df | 210 | 274 | ||||
| Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio | ||||||
We can see a difference in trends, but the two models cannot be directly compared.
Now we want to run one single model for both periods (before/after the introduction of the vaccine)
rotamodel_trend <- glm.nb(n ~ index + vaccine,
data = rota_ts)
# Table with results
summary_rotamodel_trend <- rotamodel_trend %>%
gtsummary::tbl_regression(
exponentiate = TRUE,
pvalue_fun = ~style_pvalue(.x, digits = 3),
estimate_fun = ~style_number(.x, digits = 4)) %>%
gtsummary::add_glance_table(include = c(AIC, deviance, df.residual)) %>%
gtsummary::modify_caption("Rotavirus model with trend and vaccine intervention")
summary_rotamodel_trend| Characteristic | IRR | 95% CI | p-value |
|---|---|---|---|
| index | 0.9995 | 0.9983, 1.0006 | 0.341 |
| vaccine | 0.2624 | 0.1940, 0.3551 | <0.001 |
| AIC | 5,768 | ||
| Deviance | 544 | ||
| Residual df | 485 | ||
| Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio | |||
How would you interpret these results?
We can plot the model against the raw data.
rota_ts <- rota_ts %>%
mutate(fitted_trend = fitted(rotamodel_trend))
ggplot(data = rota_ts) +
geom_point(mapping = aes(x = year_week, y = n), alpha = 0.3) +
geom_line(mapping = aes(x = year_week, y = fitted_trend),
alpha = 0.7, colour = "green") +
scale_y_continuous(limits = c(0, NA)) +
scale_x_yearweek(date_labels = "%Y-%W",
date_breaks = "1 year") +
labs(x = "Year",
y = "Weekly case counts",
title = "Rotavirus cases: model with trend and intervention"
) +
tsa_themeNow we want to run one single model for both periods (before/after the introduction of the vaccine), this time including seasonality.
rotamodel_seasonality <- glm.nb(n ~ index + sin52 + cos52 + vaccine,
data = rota_ts)
summary_rotamodel_seasonality <- rotamodel_seasonality %>%
gtsummary::tbl_regression(
exponentiate = TRUE,
pvalue_fun = ~style_pvalue(.x, digits = 3),
estimate_fun = ~style_number(.x, digits = 4)) %>%
gtsummary::add_glance_table(include = c(AIC, deviance, df.residual)) %>%
gtsummary::modify_caption("Rotavirus model with seasonality, trend and vaccine intervention")
summary_rotamodel_seasonality| Characteristic | IRR | 95% CI | p-value |
|---|---|---|---|
| index | 0.9989 | 0.9982, 0.9996 | 0.001 |
| sin52 | 0.4245 | 0.3951, 0.4560 | <0.001 |
| cos52 | 2.0527 | 1.9092, 2.2071 | <0.001 |
| vaccine | 0.4885 | 0.4034, 0.5919 | <0.001 |
| AIC | 5,225 | ||
| Deviance | 506 | ||
| Residual df | 483 | ||
| Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio | |||
We can plot the model against the raw data again.
rota_ts <- rota_ts %>%
mutate(fitted_seasonality = fitted(rotamodel_seasonality))
ggplot(data = rota_ts) +
geom_point(mapping = aes(x = year_week, y = n), alpha = 0.3) +
geom_line(mapping = aes(x = year_week, y = fitted_seasonality),
alpha = 0.7, colour = "green") +
scale_y_continuous(limits = c(0, NA)) +
scale_x_yearweek(date_labels = "%Y-%W",
date_breaks = "1 year") +
labs(x = "Year",
y = "Weekly case counts",
title = "Rotavirus cases: model with trend, seasonality, and intervention"
) +
tsa_themeTo be able to comment on any change in trend after the introduction of the vaccine, we need to either include an interaction term in the model or include a time variable that starts at the introduction of the vaccine. Let’s here use an interaction term.
rotamodel_winteraction <- glm.nb(n ~ index * vaccine + sin52 + cos52,
data = rota_ts)
summary_rotamodel_trendinteraction <- rotamodel_winteraction %>%
gtsummary::tbl_regression(
exponentiate = TRUE,
pvalue_fun = ~style_pvalue(.x, digits = 3),
estimate_fun = ~style_number(.x, digits = 4)) %>%
gtsummary::add_glance_table(include = c(AIC, deviance, df.residual)) %>%
gtsummary::modify_caption("Rotavirus model with seasonality, trend, vaccine and interaction between intervention and trend")
summary_rotamodel_trendinteraction| Characteristic | IRR | 95% CI | p-value |
|---|---|---|---|
| index | 1.0015 | 1.0003, 1.0027 | 0.015 |
| vaccine | 0.9833 | 0.7076, 1.3693 | 0.919 |
| sin52 | 0.4290 | 0.4001, 0.4600 | <0.001 |
| cos52 | 2.0434 | 1.9036, 2.1935 | <0.001 |
| index * vaccine | 0.9962 | 0.9948, 0.9977 | <0.001 |
| AIC | 5,202 | ||
| Deviance | 505 | ||
| Residual df | 482 | ||
| Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio | |||
You can visualize the effect of the interaction term using plot_model() from the sjPlot library
sjPlot::plot_model(rotamodel_winteraction, type ="int") Have there been any changes in seasonality, too?
Note how we can use brackets to introduce interaction terms for several variables.
rotamodel_full <- glm.nb(n ~ (sin52 + cos52 + index)*vaccine,
data = rota_ts)
summary_rotamodel_full <- rotamodel_full %>%
gtsummary::tbl_regression(
exponentiate = TRUE,
pvalue_fun = ~style_pvalue(.x, digits = 3),
estimate_fun = ~style_number(.x, digits = 4)) %>%
gtsummary::add_glance_table(include = c(AIC, deviance, df.residual)) %>%
gtsummary::modify_caption("Rotavirus model with seasonality, trend, vaccine and interaction term")
summary_rotamodel_fullsjPlot::plot_model(rotamodel_full, type ="int")[[1]]
[[2]]
[[3]]
Let’s plot the full model estimate.
rota_ts <- rota_ts %>%
mutate(fitted_full = fitted(rotamodel_full))
ggplot(data = rota_ts) +
geom_point(mapping = aes(x = year_week, y = n), alpha = 0.3) +
geom_line(mapping = aes(x = year_week, y = fitted_full),
alpha = 0.7, colour = "green") +
scale_y_continuous(limits = c(0, NA)) +
scale_x_yearweek(date_labels = "%Y-%W", date_breaks = "1 year") +
labs(x = "Year",
y = "Weekly case counts",
title = "Rotavirus cases: model with trend, seasonality, intervention \n and interaction terms"
) +
tsa_themeWe can now compare all models next to each other:
comparison_all <- gtsummary::tbl_merge(
tbls = list(summary_rotamodel_seasonality,
summary_rotamodel_trendinteraction,
summary_rotamodel_full),
tab_spanner = c("Model without interaction terms",
"Interaction with trend only",
"Full model")
)
comparison_all| Characteristic |
Model without interaction terms
|
Interaction with trend only
|
Full model
|
||||||
|---|---|---|---|---|---|---|---|---|---|
| IRR | 95% CI | p-value | IRR | 95% CI | p-value | IRR | 95% CI | p-value | |
| index | 0.9989 | 0.9982, 0.9996 | 0.001 | 1.0015 | 1.0003, 1.0027 | 0.015 | 1.0006 | 0.9997, 1.0016 | 0.185 |
| sin52 | 0.4245 | 0.3951, 0.4560 | <0.001 | 0.4290 | 0.4001, 0.4600 | <0.001 | 0.2551 | 0.2361, 0.2756 | <0.001 |
| cos52 | 2.0527 | 1.9092, 2.2071 | <0.001 | 2.0434 | 1.9036, 2.1935 | <0.001 | 2.1715 | 1.9979, 2.3602 | <0.001 |
| vaccine | 0.4885 | 0.4034, 0.5919 | <0.001 | 0.9833 | 0.7076, 1.3693 | 0.919 | 0.9175 | 0.7086, 1.1897 | 0.504 |
| AIC | 5,225 | 5,202 | 4,945 | ||||||
| Deviance | 506 | 505 | 496 | ||||||
| Residual df | 483 | 482 | 480 | ||||||
| index * vaccine | 0.9962 | 0.9948, 0.9977 | <0.001 | 0.9971 | 0.9959, 0.9982 | <0.001 | |||
| sin52 * vaccine | 2.6312 | 2.3698, 2.9216 | <0.001 | ||||||
| cos52 * vaccine | 0.8702 | 0.7800, 0.9709 | 0.010 | ||||||
| Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio | |||||||||
What about age? Can you see a difference in impact of vaccination depending on age category?
Discuss with colleagues how you can determine the number of cases prevented as a result of the introduction of the vaccine (Hint: refer to Practical 7!).
Rotavirus infant immunisation programme 2014/15: Vaccine uptake report for England (Published: June 2015; PHE publications gateway number: 2015141)↩︎