1. Learning outcomes

At the end of the session, participants will be able to:

  • Apply case definition criteria to a dataset using logical arguments in R

2. Story/plot description

You will now create a new column in the data set to hold the case definition you decided in a previous step during your investigation. You can call this column case and set it to TRUE if the individual meets the case definition criteria and FALSE if not. You will use this column later on for any calculations needed (descriptive statistics, two-by-two tables to compute measures of association, etc.) to figure out the culprit of this outbreak.

3. Questions/Assignments

If you closed your R project, please open it again and load the clean dataset. We strongly recommend you use the Spetses_clean1_2024.rds. This is just to be sure all needed changes in that file for the code to work have been done (as you may have changed the file in a slightly different way than we have).

To be sure we all start from the same case definition, let’s agree that a case was defined as a person who:

  • attended the school dinner on 11 November 2006 (i.e. is on the linelist)
  • ate a meal at the school dinner (i.e. was exposed)
  • fell ill after the start of the meal
  • fell ill within the time period of interest after the school dinner
  • suffered from diarrhoea with or without blood, or vomiting

Non cases (not-ill) were defined as people who:

  • attended the school dinner on 11 November 2006 (i.e. are on the linelist)
  • ate a meal at the school dinner (i.e. were exposed)
  • did not fall ill within the time period of interest
  • did not develop diarrhoea (with or without blood) or vomiting

3.1 Install packages (if needed) and load libraries

Code
# Load the required libraries into the current R session:
pacman::p_load(rio, 
               here, 
               tidyverse, 
               skimr,
               plyr,
               janitor,
               lubridate,
               gtsummary, 
               flextable,
               officer,
               epikit, 
               apyramid, 
               scales)

3.2 Import your data

Code
# Import the clean data set:
copdata <- rio::import(here::here("data", "Spetses_clean1_2024.rds"))

3.3 Identify the variables you need to apply the case definition criteria.

The variables we need from the dataset to apply the above case definition are: meal, onset_datetime, diarrhoea, bloody and vomiting.

3.4 Create a new case column to hold the binary case definition variable. Let’s think about how to do this little by little:

a) Ate a meal at the school dinner

You decide to exclude any people from the cohort who didn’t eat at the dinner, because we specifically hypothesised a food item to be the vehicle of infection in this outbreak. Thus, filter your dataset to those who ate a meal: Keep in your dataset only those who ate a meal.

filter by those with meal == TRUE.

What are some of the implications this decision may lead to? (excluding any people from the cohort who didn’t eat at the dinner)

Code
copdata <- copdata %>% 
  filter(meal == TRUE)

Seven of the respondents actually said they did not eat the meal, but when it came to the questions about which food items they ate, they provided answers! This issue could have been minimised by:

- At the survey state, one could adjust the design of an electronic questionnaire to prevent key questions from being skipped. This can come with both pros and cons. Allow fellows to discuss if time allows.

- Explore your data further, realise this is the case, and recode the meal variable for these individuals as TRUE. => This would be the way to go, but is not what we did in our example because we tried to keep it simple, and also because it is good to show that you may not always clean the data perfectly, and that has consequences: You can highlight the importance or really explore your data in depth.

By making the above decision, we may be missing cases and non-cases people, and thus, modifying the final estimate of our measure of association. => It is very important to know your data, explore it deeply and try to clean it as well as possible. Every step one makes when cleaning the data may have a consequence, and we should be aware of it when making the data cleaning decisions and when interpreting the results.

b) Fell ill after the start of the meal

We define “fell ill” as any person having had diarrhoea with OR without blood, OR vomiting. To capture this information easily, you will create a new gastrosymptoms variable. This variable will indicate that the person had one OR (“or” is R is achieved by using |) more of the clinical symptoms in your definition.

Note that we the concept of having eaten a meal is already included as per one of the steps above.

Create a new gastrosymptoms column in your line list with mutate() and case_when().

What are some of the implications this decision may lead to? (defining “fell ill” as any person who reported having diarrhoea with OR without blood, OR vomiting)

Having one clinical symptom enough to be considered a potential case at this point may be considered too unspecific (low specificity). For example, a person who ate at the dinner party and developed diarrhoea for other reasons other than food poisoning (say, they recently started on antibiotics known for unbalancing the intestinal flora and causing diarrhoea) could be misclassified as a potential case. (Note we talk about potential case, and not case; that is because here we are not talking about cases per-se yet, but this decision has implications for when applying the case definition below).

Moreover, those who did not report clinical symptoms will be defined as non-cases. Thus, we are assuming that these individuals did not develop symptoms because they didn’t report them. The missing values could be due to, for example, them skipping the questions in the questionnaire. Some individuals may be reluctant to report symptoms, due to shame, fear of repercussion, or others. It is important to think ahead, before the interview, about ways to minimise these situations. For example, through questionnaire design, you may impede skipping questions; you could promote trust by using the right interviewers (in some cases this will be someone from the community, in others someone form specific NGOs, someone of a specific race or gender, etc); choose to carry out online questionnaires vs in person (or vice versa, depending on the situation), etc.

Code
copdata <- copdata %>% 
  mutate(gastrosymptoms = case_when(
    # Those had diarrhoea...
    diarrhoea == TRUE |
      #or bloody diarrhoea...
    bloody == TRUE |
      # or vomiting, are marked as TRUE (fell ill after the meal)
    vomiting == TRUE ~ TRUE,
    # The rest are FALSE. This includes those who ate a meal but had no symptoms (did not fell ill after the meal)
    .default = FALSE)
    )

c) Fell ill within the time period of interest

  • Hmmm… what is the time period of interest? You could start calculating the incubation period, which can be defined by calculating the time between exposure (the meal) and onset of symptoms, and then looking at the distribution of these time differences. In this outbreak, incubation periods are easy to calculate, because everyone was exposed at (roughly) the same time and on the same day (eating the meal at the school dinner party).
  • Dinner was served at 18:00 roughly for everyone. Create a new meal_datetime variable with this information for all people in your dataset (as per 05 Oct 2024, at 18:00h).
Code
# Start with copdata:
copdata <- copdata %>% 
  # Create new column for meal date and time:
  mutate(meal_datetime = lubridate::ymd_hm("2024-10-05 18:00"))
  • You can calculate the incubation period for each participant by subtracting onset_datetime - meal_datetime.
  • Then, take the median() of that column to calculate a median incubation period, this will help you start having some hypothesis about the type of pathogen we are dealing with until the lab results come back.

Be aware of the type of variable incubation is (check class()), as well as of missing valuesNA.

The incubation period of a disease is the time from an exposure resulting in infection until the onset of disease. Because infection usually occurs immediately after exposure, the incubation period is generally the duration of the period from the onset of infection to the onset of disease – Rothman, Greenland, Lash (2017): Modern Epidemiology, 3rd edition

Code
copdata <- copdata %>% 
  mutate(incubation = onset_datetime - meal_datetime,
         incubation = as.numeric (incubation))


median(as.numeric(copdata$incubation), na.rm = TRUE)
[1] 15

If you pay attention, there were two individuals who don’t have a time recorded, only a date, the day of the dinner. These people probably got sick the same night they had the dinner, but the exact time was not recoded. They way we’ve managed the data will insert “00:00” in a missing value of a time. This means that the 2 people that got sick the day of the dinner have been recorded sick before the dinner (in the early morning of the day of the dinner). The two people have a negative incubation period! This should not happen. If you feel the force is with you, you can modify this and fix the error (if you do this, your results will be a bit different than your colleagues, but similar). We will continue with this “erroneous” data, for easiness. But note that in real life, as soon as you realise there is a coding error like this, you would have to go back to your cleaning code and fix the error! This may happen a couple of times during your outbreak investigation, and is totally normal. Cleaning data is not an easy job!

We see that the median incubation time is 15 hours. This is useful information, as incubation periods tend to be relatively pathogen-specific.

Based on this, say you define and limit the maximum incubation period to 48 hours after the meal, as the data points to a fast-acting bacterial toxin or a virus. That is, dinner participants should have developed (onset_datetime) at least one symptom (diarrhoea, bloody or vomiting) 48h after meal_datetime to become a case.

What are some of the implications this decision may lead to? (implications of this case definition)

All those not developing at least one symptom (diarrhoea with OR without blood, OR vomiting) 48h after the dinner are considered non-cases. This could (depending on how you decide to analyse your data) include those who had no symptoms at all, those who have missing data on the onset_datetime variable, and/or those who had symptoms before eating the meal. This is a reminder that you need to be both careful and aware of the implications of your data analysis decisions. If a person had clinical symptoms before eating the meal, they are considered as not-cases. However, it could be that a person had symptoms before the meal, and yet, still got infected by the pathogen when eating their meal (bad luck, we know…). According to this definition, we would be missing that case.

d) Finally, with this information you can create a new case column to hold the binary (TRUE/FALSE) case definition variable.

Code
copdata <- copdata %>% 
  mutate(case = case_when(
    # Those who had symptoms <48h from the meal are cases (TRUE)
    gastrosymptoms == TRUE & 
      onset_datetime >= meal_datetime &
      onset_datetime <= (meal_datetime + days(2)) ~ TRUE,
    # Those who had symptoms >48h from the meal are non-cases (FALSE)
    gastrosymptoms == TRUE & 
      onset_datetime > (meal_datetime + days(2)) ~ FALSE,
    # The rest are considered non-cases. Including, those who had no symptoms at all, who have missing data on the onset_datetime variable, or who had symptoms before eating the meal 
    .default = FALSE)
  )

Note that we may be incurring in misclassification bias with the code above. The last section indicates that if a person had clinical symptoms before eating the meal, they are considered as non-cases. However, it could be that a person had symptoms before the meal, and yet, still got infected by the pathogen when eating their meal (bad luck, we know…).

Moreover, if you remember, there were a couple of people with an dayonset, but no starthour. The code we used (lubridate::ymd_h with argument truncated = 2) results in dates with missing starthour being converted to date-time, with the missing time being set to 00:00 (midnight). This means that these two people don’t fulfill the case definition criteria because we marked their symptoms started early in the morning of Nov 11 (at 00:00), before the meal time (18:00), and thus, they did not “fell ill within the time period of interest”.

The two situations above are a reminder that you need to be both careful and aware of the implications of your data analysis decisions.

What do you think are the risks of mis-classifying cases as non-cases in your analysis?

We will have bias either towards or away from the null, depending on the proportions of subjects misclassified.

Code
# Tabulate cases:
janitor::tabyl(dat = copdata, case)

Let’s have a look at how many people ate a meal, had symptoms, and were considered as cases after applying our case definition:

Code
copdata %>% 
  summarise(atemeal = sum(meal == TRUE),
            hadsympt = sum(gastrosymptoms == TRUE),
            nb_cases = sum(case == TRUE)
            )

4. Export clean data

Finally, we can save the cleaned data set as a new file Spetses_clean2_YOURINITIALS, under the data folder, before we proceeding with descriptive analysis.

Code
rio::export(x = copdata, 
            file = here::here("data", "Spetses_clean2_2024.rds"))