Univariable analysis

1. Learning outcomes

At the end of the session, participants will be able to:

  • Perform hypothesis tests.

  • Estimate risk ratios (also called relative risk) for categorical data.

  • Interpret the univariable results.

  • (Optional) Investigate dose-response relationships in categorical data.

2. Story/plot description

You have just estimated risk ratios (RR) of individual food items manually to better understand the concept. RR give you an idea of which food items could be the culprit(s) of the outbreak.

Now, you will practice performing hypothesis testing and calculating RR in R to investigate the associations of suspicious food items with the disease.

Note that, although this is not an exhaustive list of tests, the following can be useful for this exercise (relevant for comparisons between two groups):

  • For continuous variables:

    • shapiro.test() (for checking if normally distributed)

    • t.test() (for normal distributions)

    • wilcox.test() (for non-parametric testing. Used to determine if two numeric samples are from the same distribution, when their populations are not normally distributed or have unequal variance.)

  • For categorical variables:

    • chisq.test() (if all comparisons include at least 5 cases)

    • fisher.test() (if there are < 5 cases for any comparison)

3. Questions/Assignments

3.1. Install packages and load libraries

Code
# Load the required libraries into the current R session:
pacman::p_load(rio, 
               here, 
               tidyverse, 
               skimr,
               plyr,
               janitor,
               lubridate,
               gtsummary, 
               flextable,
               officer,
               epikit, 
               apyramid, 
               scales,
               EpiStats,
               broom)

3.2. Import your data

Code
# Import the raw data set:
copdata <- rio::import(here::here("data", "Spetses_clean2_2024.rds"))

3.3. Hypothesis tests for other variables

Check if the following variables are associated with being a case: age, sex, class and group.

a) age

With the Shapiro-Wilk test we check if the variables are following the normal distribution. The null hypothesis is that the data follow a normal distribution, therefore, rejecting the null hypothesis means that the data do not follow the normal distribution. A p-value below the cutoff for rejecting the null hypothesis, e.g., a p-value<0.05 means that we reject the null hypothesis that the data follow the normal distribution. For age, the p-value is <0.05, therefore we reject the null hypothesis that the data are normally distributed. As we see in the graph most frequently reported age is <20 years.

Code
# Check if age overall follows a normal distribution:
shapiro.test(copdata$age)

    Shapiro-Wilk normality test

data:  copdata$age
W = 0.31302, p-value < 2.2e-16
Code
# Can simply have a look at
hist(copdata$age)

Code
# Looking only at the students:
students <- copdata %>% 
  filter(group == "student")
hist(students$age)

Age overall (nor within the students’ group) is not normally distributed.

We compare the age for cases and non-cases using the Wilcoxon test that is used when the data are not normally distributed. The null hypothesis is that there is no difference in the age between the two groups compared. Given that p-value>0.05 we do not reject the null hypothesis.

Code
# Perform Wilcoxon rank sum test on age and sex:
wilcox.test(age ~ case, 
            data = copdata)

    Wilcoxon rank sum test with continuity correction

data:  age by case
W = 15934, p-value = 0.1512
alternative hypothesis: true location shift is not equal to 0

b) sex

Code
copdata %>% 
  select(sex, case) %>% 
  tbl_summary(by = case) %>% 
  add_p()
Characteristic FALSE, N = 1611 TRUE, N = 2161 p-value2
sex 0.3
    female 96 (60%) 117 (54%)
    male 65 (40%) 99 (46%)
1 n (%)
2 Pearson’s Chi-squared test

What do these results tell you - is there an association between sex and being a case?

Do these results differ from what you expected when looking at the descriptive figures (case proportions stratified by sex, or the age sex pyramid)? If so, why do you think this is?

c) class

Code
copdata %>% 
  select(class, case) %>% 
  tbl_summary(by = case) %>% 
  add_p()
Characteristic FALSE, N = 1611 TRUE, N = 2161 p-value2
class 0.090
    1 63 (43%) 68 (34%)
    2 44 (30%) 57 (29%)
    3 38 (26%) 73 (37%)
    Unknown 16 18
1 n (%)
2 Pearson’s Chi-squared test

In this case you have a 2x3 contingency table of class against disease status, we can use a chi-square here. P-value of the association between class and age is p-val = 0.09, suggesting that there is an association at the p-val = 0.20 level among at least one of the 3 different classes. Fellows may want to investigate this further, but in the remaining of the study this is not explored further.

A hypothesis could be that one of the classes sat together and, for whatever reason, they ate more of the contaminated food item at those tables. This could be further studied if we had the spatial distribution of where the people sat at the dinner. Another hypothesis is that one of the classes differ from other classes in some characteristics that made them more susceptible or more exposed to the infected food item.

d) group

Code
copdata %>% 
  select(group, case) %>% 
  tbl_summary(by = case) %>% 
  add_p()
Characteristic FALSE, N = 1611 TRUE, N = 2161 p-value2
group 0.2
    teacher 9 (5.6%) 6 (2.8%)
    student 152 (94%) 210 (97%)
1 n (%)
2 Pearson’s Chi-squared test

P-value of the association between group and age is p-val = 0.2, suggesting that there may be an association at the p-val = 0.20 level. Explanations are like above, with variable class.

Let’s do all together

Code
copdata %>% 
  select(sex, class, group, case) %>% 
  tbl_summary(by = case) %>% 
  add_p()
Characteristic FALSE, N = 1611 TRUE, N = 2161 p-value2
sex 0.3
    female 96 (60%) 117 (54%)
    male 65 (40%) 99 (46%)
class 0.090
    1 63 (43%) 68 (34%)
    2 44 (30%) 57 (29%)
    3 38 (26%) 73 (37%)
    Unknown 16 18
group 0.2
    teacher 9 (5.6%) 6 (2.8%)
    student 152 (94%) 210 (97%)
1 n (%)
2 Pearson’s Chi-squared test

3.4. Risk Ratios

The risk ratios of each food item (including the 2x2 table) are reported below. The output of the CS() command is two tables: one with the 2x2 table and one with the risk difference, the risk ratio and the attributable fraction among exposed as well as the attributable fraction among the population (and the confidence intervals for all the estimates). The Chi-square and the p-value are also reported. In the second part, a table with all the food items is printed including attack rates for exposed and unexposed as well as risk ratios and the 95% confidence intervals (CI ll and CI ul, for the lower and upper interval) and p-values.

a) Calculate 95% CI Risk Ratios for food

To see if food items (note these are categorical variables) are associated with being a case, calculate risk ratios and 95% confidence intervals for food items. You can calculate this individually for each food item (using CSTable()), or all at once (see hint below).

Create a food_vars vector containing food variables of interest and use the function CSTable() of the EpiStats package.

Using this hint, you will create a table with the name of exposure variables, the total number of exposed, the number of exposed cases, the attack rate among the exposed, the total number of unexposed, the number of unexposed cases, the attack rate among the unexposed, risk ratios, 95% percent confidence intervals, and p-values. Amazing, isn’t it? Have a look at ??CSTable to learn more.

Code
# You could use the EpiStats package for each food item
CS(copdata, "case", "feta")
$df1
          Cases Non Cases Total Risk
Exposed     156       115   271 0.58
Unexposed    60        42   102 0.59
Total       216       157   373 0.58

$df2
                Point estimate 95%CI.ll 95%CI.ul
Risk difference          -0.01    -0.12     0.10
Risk ratio                0.98     0.81     1.19
Prev. frac. ex.           0.02    -0.19     0.19
Prev. frac. pop           0.02       NA       NA
chi2(1)                   0.05       NA       NA
Pr>chi2                  0.826       NA       NA
Code
CS(copdata, "case", "sardines")
$df1
          Cases Non Cases Total Risk
Exposed     150       105   255 0.59
Unexposed    65        52   117 0.56
Total       215       157   372 0.58

$df2
                Point estimate 95%CI.ll 95%CI.ul
Risk difference           0.03    -0.08     0.14
Risk ratio                1.06     0.87     1.28
Attr. frac. ex.           0.06    -0.14     0.22
Attr. frac. pop           0.04       NA       NA
chi2(1)                   0.35       NA       NA
Pr>chi2                  0.553       NA       NA
Code
CS(copdata, "case", "eggplant")
$df1
          Cases Non Cases Total Risk
Exposed     123        93   216 0.57
Unexposed    83        60   143 0.58
Total       206       153   359 0.57

$df2
                Point estimate 95%CI.ll 95%CI.ul
Risk difference          -0.01    -0.12     0.09
Risk ratio                0.98     0.82     1.18
Prev. frac. ex.           0.02    -0.18     0.18
Prev. frac. pop           0.01       NA       NA
chi2(1)                   0.04       NA       NA
Pr>chi2                  0.837       NA       NA
Code
CS(copdata, "case", "moussaka")
$df1
          Cases Non Cases Total Risk
Exposed     201       137   338 0.59
Unexposed    14        22    36 0.39
Total       215       159   374 0.57

$df2
                Point estimate 95%CI.ll 95%CI.ul
Risk difference           0.21     0.04     0.37
Risk ratio                1.53     1.01     2.32
Attr. frac. ex.           0.35     0.01     0.57
Attr. frac. pop           0.32       NA       NA
chi2(1)                   5.64       NA       NA
Pr>chi2                  0.018       NA       NA
Code
# You can save time (and probably typos!) by creating a vector for food variables...
food_vars <- c("feta", "sardines", "eggplant", "moussaka", 
               "orzo", "greeksal", "dessert", "bread",  
               "champagne", "beer", "redwine", "whitewine")

# ...and using EpiStats::CSTable() to run all variables together!
CSTable(copdata, "case", food_vars)
$df
          Tot.Exp Cases.Exp AR.Exp% Tot.Unexp Cases.Unexp AR.Unexp%   RR CI.ll
orzo          338       202   59.76        36          13     36.11 1.65  1.06
moussaka      338       201   59.47        36          14     38.89 1.53  1.01
champagne     316       187   59.18        48          21     43.75 1.35  0.97
greeksal      211       114   54.03       154          95     61.69 0.88  0.73
dessert       149        90   60.40       198         106     53.54 1.13  0.94
beer          281       166   59.07        78          41     52.56 1.12  0.89
redwine        80        42   52.50       259         150     57.92 0.91  0.72
sardines      255       150   58.82       117          65     55.56 1.06  0.87
whitewine     260       150   57.69        98          54     55.10 1.05  0.85
bread         342       196   57.31        29          16     55.17 1.04  0.74
feta          271       156   57.56       102          60     58.82 0.98  0.81
eggplant      216       123   56.94       143          83     58.04 0.98  0.82
          CI.ul p(Chi2)
orzo       2.58   0.006
moussaka   2.32   0.018
champagne  1.89   0.044
greeksal   1.04   0.144
dessert    1.36   0.202
beer       1.42   0.303
redwine    1.14   0.393
sardines   1.28   0.553
whitewine  1.29   0.659
bread      1.46   0.823
feta       1.19   0.826
eggplant   1.18   0.837

b) Prepare the RR table for publication

Use flextable() and set_header_labels().

Code
rr_tbl <- CSTable(copdata, "case", food_vars) %>% 
  as.data.frame() %>% 
  rownames_to_column() %>% 
  flextable() %>% 
   set_header_labels(
     values = c("Food Item",
                "Total exposed",     
               "Cases exposed", 
               "AR among exposed",    
               "Total unexposed",
               "Cases unexposed",
               "AR among unexposed",
               "RR",         
               "95% lower CI",             
               "95% upper CI",
               "p-value"))

· What can you infer from this table?

· Considering the relative risks, which food or drink items do you think were most likely to be the vehicle(s) of infection in this outbreak?

· Do you think there are any confounders or effect modifiers? If so, how would you investigate these further?

The interesting results here are that the food items that are most suspicious are orzo, moussaka and champagne. Orzo as such is unlikely to be contaminated, but as you can see in the picture (and from the dinner night), it was served with pesto! Maybe it was the pesto? Who-ho-ho!

Before one jumps into conclusions, consider that this result could be due to confounding! Maybe orzo was “clean” but eaten by all the people who ate the food item that actually was contaminated!(Optional)

3.5. (Optional) Dose Response

Check for a dose-response relationship between the food items with the highest RR values (the top 3) and being a case.

a) Orzo

Using epitools::riskratio function:

Code
epitools::riskratio(copdata$orzoD,
                    copdata$case,
                    conf.level = 0.95)
$data
         Outcome
Predictor FALSE TRUE Total
    0        23   13    36
    1        32   38    70
    2        76  107   183
    3        28   57    85
    Total   159  215   374

$measure
         risk ratio with 95% C.I.
Predictor estimate     lower    upper
        0 1.000000        NA       NA
        1 1.503297 0.9257878 2.441057
        2 1.619168 1.0310538 2.542742
        3 1.857014 1.1730804 2.939696

$p.value
         two-sided
Predictor  midp.exact fisher.exact chi.square
        0          NA           NA         NA
        1 0.081364161  0.100893180 0.07613202
        2 0.015337953  0.017069765 0.01373981
        3 0.002010616  0.002370887 0.00162311

$correction
[1] FALSE

attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"

Using a binomial regression:

Code
# Binomial regression for RRs. 
# The outcome needs to be exponentiated so we can interpret it properly!
binom_orzoD <- glm(case ~ orzoD, data = copdata, 
             family = binomial(link = "log"))

# To get exponentiated:
binom_orzoD_exp <- glm(case ~ orzoD, data = copdata, 
                       family = binomial(link = "log")) %>% 
  tidy(exponentiate = TRUE, 
       conf.int = TRUE)

binom_orzoD_exp

What do these results tell you?

Results suggest a dose response relationship of having eaten orzo, pointing towards orzoas the potential vehicle. The higher the amount of orzothey ate, the stronger is the association (RR) with getting ill/being a case.

b) Moussaka

Using epitools::riskratio function:

Code
epitools::riskratio(copdata$moussakaD,                     
                    copdata$case,                     
                    conf.level = 0.95)
$data
         Outcome
Predictor FALSE TRUE Total
    0        19   15    34
    1        30   40    70
    2        78  106   184
    3        33   54    87
    Total   160  215   375

$measure
         risk ratio with 95% C.I.
Predictor estimate     lower    upper
        0 1.000000        NA       NA
        1 1.295238 0.8431805 1.989659
        2 1.305797 0.8769752 1.944304
        3 1.406897 0.9314233 2.125090

$p.value
         two-sided
Predictor midp.exact fisher.exact chi.square
        0         NA           NA         NA
        1 0.22186384    0.2951163 0.21193013
        2 0.15357032    0.1884220 0.14587265
        3 0.07956697    0.1017336 0.07298506

$correction
[1] FALSE

attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"

Using a binomial regression:

Code
# Let's get the results directly exponentiated
binom_moussakaD_exp <- glm(case ~ moussakaD, data = copdata, 
                       family = binomial(link = "log")) %>% 
  tidy(exponentiate = TRUE, 
       conf.int = TRUE)

binom_moussakaD_exp

c) Champagne

Using epitools::riskratio function:

Code
epitools::riskratio(copdata$champagneD,                     
                    copdata$case,                     
                    conf.level = 0.95)
$data
         Outcome
Predictor FALSE TRUE Total
    0        27   21    48
    1        93  111   204
    2        12   32    44
    3        24   44    68
    Total   156  208   364

$measure
         risk ratio with 95% C.I.
Predictor estimate     lower    upper
        0 1.000000        NA       NA
        1 1.243697 0.8812613 1.755193
        2 1.662338 1.1502014 2.402507
        3 1.478992 1.0260296 2.131923

$p.value
         two-sided
Predictor  midp.exact fisher.exact  chi.square
        0          NA           NA          NA
        1 0.188922591  0.201367850 0.183280359
        2 0.005584928  0.006247816 0.004961941
        3 0.027627736  0.036334120 0.025117641

$correction
[1] FALSE

attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"

Using a binomial regression:

Code
# Let's get the results directly exponentiated
binom_champagneD_exp <- glm(case ~ champagneD, data = copdata, 
                       family = binomial(link = "log")) %>% 
  tidy(exponentiate = TRUE, 
       conf.int = TRUE)

binom_champagneD_exp

3.6 Summary

As a summary of what you’ve done above, answer these questions:

  • Is the respondents’ sex associated with being a case?

  • Is the school class associated with being a case?

  • Which foods increase the risk of being a case?

  • (Optional) Is there a dose-response relationship between the food items and being a case?

  • What do you think is the most likely culprit(s) of this outbreak at this point? Are there any risk factors you would like to highlight?