Data import and cleaning

1. Learning outcomes

At the end of the session, participants will be able to:

  • Import datasets in R
  • Apply basic data management techniques to explore and familiarise themselves with the dataset
  • Apply strategies for data cleaning

2. Story/plot description

You and your team have carried out hundreds of interviews in a very short time. Well done!

You will now have your first look at the data. Note that in real life you would have ideally done a pilot of the questionnaire and you would have prepared in advance a codebook and be familiar with the names and format of all variables.

Explore and clean the data you have collected and save a new version of the cleaned dataset in your computer. Be sure to make use of the R code that was provided to you previously, during the R practicals. You can also use the code you wrote during the Applied Epi course. You are expected to look at the help functions and google as well. Try to understand what the functions are doing and what their arguments are.

Work on your own computer and help your fellow colleagues as needed. The lead of the outbreak will be in charge of keeping the time (unless you designate another person, but be sure that someone keeps track of time).

Open R Studio and set up a new R project, and create a data and a scripts folder.

3. Tips you could follow

3.1. Install packages and load libraries

  • Load the R packages you will be using. Update this list of packages as you dig into the code: You may realise you need packages you didn’t think about at the beginning. Remember that it is a good practice to load tydiverse at the end of your package list, to avoid masking functions.
Code
# Check if the 'pacman' package is installed, if not install it:
if (!requireNamespace("pacman", quietly = TRUE)) install.packages("pacman")

# Load the required libraries into the current R session:
pacman::p_load(rio, 
               here, 
               tidyverse, 
               skimr,
               plyr,
               janitor,
               lubridate,
               gtsummary, 
               flextable,
               officer,
               epikit, 
               apyramid, 
               scales)

3.2. Import your data

  • You can find the spetses_school.csv file in EVA. Feel free to explore the codebook as well.
Code
# Import the raw data set:
copdata <- rio::import(here::here("data", "spetses_school.csv"))

3.3. Explore and clean your data

  • Have a look at the structure of your data. Here you have some questions that may help you get started:

    • How many observations and variables does the dataset contain?

    • What types of variables do you have and what types of values are recorded?

    • Do any of the values of the other variables look implausible? Which ones and why? What will you do about it?

  • Go ahead and clean your data as you see fit.

You could use head(), dim(), str(), or skim() to have a quick look at the data and get an idea of which variables you want to explore further. You can also do some tables

Code
head(copdata)
Code
dim(copdata)
[1] 397  40
Code
str(copdata)
'data.frame':   397 obs. of  40 variables:
 $ id        : chr  "sp-2-001" "sp-3-003" "sp-1-005" "sp-2-006" ...
 $ sex       : chr  "male" "female" "female" "male" ...
 $ age       : int  18 18 17 17 18 180 16 15 43 16 ...
 $ group     : int  1 1 1 1 1 1 0 1 0 1 ...
 $ class     : int  2 3 1 2 3 2 NA 1 NA 1 ...
 $ diarrhoea : int  1 NA NA NA 1 1 NA 0 1 NA ...
 $ bloody    : int  0 NA NA NA 0 0 NA 0 NA NA ...
 $ vomiting  : int  0 NA NA NA 0 0 NA 0 NA NA ...
 $ abdo      : int  1 NA 1 NA 1 1 NA 0 NA NA ...
 $ nausea    : int  0 NA 1 NA 1 0 NA 1 1 NA ...
 $ fever     : int  NA NA NA NA 0 0 NA 0 NA NA ...
 $ headache  : int  0 NA 1 NA 1 0 NA 1 1 NA ...
 $ jointpain : int  0 NA NA NA 0 0 NA 0 1 NA ...
 $ starthour : int  9 NA NA NA 15 15 NA NA 3 NA ...
 $ meal      : int  1 1 1 0 1 1 1 1 1 1 ...
 $ feta      : int  1 0 NA 0 1 1 1 1 1 1 ...
 $ fetaD     : int  2 0 NA 0 2 2 2 1 2 2 ...
 $ sardines  : int  1 0 NA 0 1 1 1 0 1 1 ...
 $ sardinesD : int  2 0 NA 0 2 2 2 0 2 2 ...
 $ eggplant  : int  0 0 NA 0 1 1 1 0 NA 1 ...
 $ eggplantD : int  0 0 NA 0 2 2 2 0 NA 2 ...
 $ moussaka  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ moussakaD : int  2 1 0 0 2 2 1 3 2 2 ...
 $ orzo      : int  1 1 1 0 1 1 1 1 1 1 ...
 $ orzoD     : int  3 3 1 0 2 2 2 3 2 3 ...
 $ greeksal  : int  1 1 NA 0 1 1 1 1 1 1 ...
 $ greeksalD : int  1 3 NA 0 2 2 2 2 2 2 ...
 $ bread     : int  1 1 1 0 1 1 1 1 1 1 ...
 $ breadD    : int  2 3 1 0 2 2 2 2 2 2 ...
 $ dessert   : int  1 1 NA 0 1 1 0 1 NA NA ...
 $ dessertD  : int  2 3 NA 0 2 2 0 1 NA NA ...
 $ champagne : int  1 1 0 1 1 1 1 1 1 1 ...
 $ champagneD: int  1 1 0 3 1 1 2 1 1 1 ...
 $ beer      : int  1 0 0 1 1 1 NA 1 NA 1 ...
 $ beerD     : int  3 0 0 3 2 3 NA 1 NA 2 ...
 $ redwine   : int  0 1 0 1 0 0 1 0 1 0 ...
 $ redwineD  : int  0 3 0 3 0 0 2 0 2 0 ...
 $ whitewine : int  0 0 0 1 1 1 1 1 NA 0 ...
 $ whitewineD: int  0 0 0 3 3 2 1 3 NA 0 ...
 $ dayonset  : chr  "6oct2024" "" "" "" ...
Code
skimr::skim(copdata)
Data summary
Name copdata
Number of rows 397
Number of columns 40
_______________________
Column type frequency:
character 3
numeric 37
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
id 0 1 8 8 0 397 0
sex 0 1 4 6 0 2 0
dayonset 0 1 0 8 175 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1.00 18.57 9.92 8 16 17 18 180 ▇▁▁▁▁
group 0 1.00 0.96 0.20 0 1 1 1 1 ▁▁▁▁▇
class 36 0.91 1.94 0.84 1 1 2 3 3 ▇▁▆▁▇
diarrhoea 141 0.64 0.82 0.39 0 1 1 1 1 ▂▁▁▁▇
bloody 200 0.50 0.03 0.17 0 0 0 0 1 ▇▁▁▁▁
vomiting 179 0.55 0.30 0.46 0 0 0 1 1 ▇▁▁▁▃
abdo 152 0.62 0.85 0.35 0 1 1 1 1 ▂▁▁▁▇
nausea 170 0.57 0.76 0.43 0 1 1 1 1 ▂▁▁▁▇
fever 223 0.44 0.26 0.44 0 0 0 1 1 ▇▁▁▁▃
headache 174 0.56 0.63 0.48 0 0 1 1 1 ▅▁▁▁▇
jointpain 207 0.48 0.16 0.37 0 0 0 0 1 ▇▁▁▁▂
starthour 177 0.55 12.49 4.92 3 9 9 15 21 ▁▇▁▆▃
meal 9 0.98 0.97 0.17 0 1 1 1 1 ▁▁▁▁▇
feta 16 0.96 0.71 0.45 0 0 1 1 1 ▃▁▁▁▇
fetaD 16 0.96 1.32 1.00 0 0 2 2 3 ▆▅▁▇▂
sardines 17 0.96 0.67 0.47 0 0 1 1 1 ▃▁▁▁▇
sardinesD 17 0.96 1.34 1.04 0 0 2 2 3 ▆▂▁▇▂
eggplant 30 0.92 0.59 0.49 0 0 1 1 1 ▆▁▁▁▇
eggplantD 30 0.92 1.14 1.05 0 0 1 2 3 ▇▂▁▇▂
moussaka 15 0.96 0.89 0.31 0 1 1 1 1 ▁▁▁▁▇
moussakaD 14 0.96 1.83 0.91 0 1 2 2 3 ▂▃▁▇▃
orzo 15 0.96 0.88 0.32 0 1 1 1 1 ▁▁▁▁▇
orzoD 15 0.96 1.81 0.91 0 1 2 2 3 ▂▃▁▇▃
greeksal 24 0.94 0.57 0.50 0 0 1 1 1 ▆▁▁▁▇
greeksalD 24 0.94 1.08 1.06 0 0 1 2 3 ▇▂▁▆▂
bread 18 0.95 0.91 0.29 0 1 1 1 1 ▁▁▁▁▇
breadD 18 0.95 1.75 0.71 0 2 2 2 3 ▁▂▁▇▁
dessert 42 0.89 0.42 0.49 0 0 0 1 1 ▇▁▁▁▆
dessertD 42 0.89 0.83 1.06 0 0 0 2 3 ▇▁▁▃▁
champagne 25 0.94 0.87 0.34 0 1 1 1 1 ▁▁▁▁▇
champagneD 25 0.94 1.37 0.93 0 1 1 2 3 ▂▇▁▂▃
beer 30 0.92 0.78 0.42 0 1 1 1 1 ▂▁▁▁▇
beerD 35 0.91 1.94 1.23 0 1 3 3 3 ▃▂▁▂▇
redwine 50 0.87 0.23 0.42 0 0 0 0 1 ▇▁▁▁▂
redwineD 52 0.87 0.45 0.92 0 0 0 0 3 ▇▁▁▁▁
whitewine 31 0.92 0.73 0.45 0 0 1 1 1 ▃▁▁▁▇
whitewineD 36 0.91 1.58 1.21 0 0 2 3 3 ▆▅▁▅▇
Code
names(copdata)
 [1] "id"         "sex"        "age"        "group"      "class"     
 [6] "diarrhoea"  "bloody"     "vomiting"   "abdo"       "nausea"    
[11] "fever"      "headache"   "jointpain"  "starthour"  "meal"      
[16] "feta"       "fetaD"      "sardines"   "sardinesD"  "eggplant"  
[21] "eggplantD"  "moussaka"   "moussakaD"  "orzo"       "orzoD"     
[26] "greeksal"   "greeksalD"  "bread"      "breadD"     "dessert"   
[31] "dessertD"   "champagne"  "champagneD" "beer"       "beerD"     
[36] "redwine"    "redwineD"   "whitewine"  "whitewineD" "dayonset"  

Let’s explore and manage some variables in detail.

Age

Through visual exploration of the age histogram we see that there is at least one very high value, likely implausible. You can then create a cross-tabulation of variables age and group to have a better idea of how your data looks like.

Code
# Have a look at the histogram  
hist(copdata$age)   

Code
# Create cross-tab with the group variable:  
janitor::tabyl(dat = copdata, 
               var1 = age, 
               var2 = group)

Note that group is coded as 0 and 1, and these may be difficult to interpret when they mean something other than “no” and “yes”, respectively. From the codebook, you know that teachers are represented by 0, and students by 1. Let’s change this to make our lives easier:

Code
# Convert group to a factor and label 0 as teacher, 1 as student:
copdata <- copdata %>% 
  mutate(group = factor(group, 
                        labels = c("teacher", "student")))

Now, have a look at your cross-tab again:

Code
janitor::tabyl(dat = copdata, 
               var1 = age, 
               var2 = group) 

With this table, we can more easily identify ages that are likely to be typographic errors. Specifically:

  • There is one teacher aged 16 (likely digit reversal - should be 61)

  • There is one student aged 8 (likely missing a digit - should be 18)

  • There is one student aged 180 (likely has an extra digit - should be 18)

Assuming you have contacted the school to make sure your suspicions about the actual ages are correct, we can now correct them, using case_when(). We create logical conditions to identify the incorrect ages, combining the values for age with the group they belong to:

Code
# Update incorrect ages to the correct values with case_when:  
copdata <- copdata %>%       
  mutate(age =                         
           case_when(                            
             # Where respondent is 16 and a teacher, change age to 61:
             age == 16 & group == "teacher" ~ 61,              
             # where respondent is 8 or 180 and a student, change age to 18:
             age == 8 & group == "student" ~ 18,              
             age == 180 & group == "student" ~ 18,              
             # Keep remaining values as is:              
             .default = as.numeric(age)   
             # if .default is not working, try:
             # TRUE ~ age
             )                    
         ) 

Cartoon showing a table with creature type (kraken, dragon, or cyclops) and age (baby, teen, or adult). The three creatures listed are adding a new column named “danger”, which contains the word “extreme!” if the type is “kraken”, or “high” for any other type. Stylized text reads “dplyr::case_when() - IF ELSE...but you love it? An example of code is shown: mutate(danger = case_when(type == “kraken” ~ “extreme!”, TRUE ~ “high”). Learn more about case_when.

  • Create summary table for the dose response columns (those finishing with a capital “D”) to have a look at their distribution.

Use gtsummary::tbl_summary().

Code
#| label: check_dose_cols
#| tbl-cap: no caption

# Create summary table for dose response columns: 
drtable <- copdata %>%       
  # Select all the columns with column names that end in upper case 'D':   
  select(ends_with("D", ignore.case = FALSE)) %>%       
  # Create the summary table, excluding missing values:   
  gtsummary::tbl_summary(missing = "no") 

  # Print the summary table: 
drtable
Characteristic N = 3971
fetaD
    0 110 (29%)
    1 78 (20%)
    2 155 (41%)
    3 38 (10.0%)
sardinesD
    0 125 (33%)
    1 35 (9.2%)
    2 184 (48%)
    3 36 (9.5%)
eggplantD
    0 151 (41%)
    1 41 (11%)
    2 147 (40%)
    3 28 (7.6%)
moussakaD
    0 42 (11%)
    1 70 (18%)
    2 184 (48%)
    3 87 (23%)
orzoD
    0 44 (12%)
    1 70 (18%)
    2 183 (48%)
    3 85 (22%)
greeksalD
    0 162 (43%)
    1 52 (14%)
    2 126 (34%)
    3 33 (8.8%)
breadD
    0 35 (9.2%)
    1 51 (13%)
    2 267 (70%)
    3 26 (6.9%)
dessertD
    0 206 (58%)
    1 33 (9.3%)
    2 87 (25%)
    3 29 (8.2%)
champagneD
    0 49 (13%)
    1 208 (56%)
    2 45 (12%)
    3 70 (19%)
beerD
    0 81 (22%)
    1 41 (11%)
    2 57 (16%)
    3 183 (51%)
redwineD
    0 266 (77%)
    1 31 (9.0%)
    2 21 (6.1%)
    3 27 (7.8%)
whitewineD
    0 100 (28%)
    1 73 (20%)
    2 67 (19%)
    3 121 (34%)
1 n (%)

3.4. Modify variables format

There are some “column types” you would like to modify because either R needs some specific types of data when working with certain variables, or because it will be easier for you to visualise and interpret the data. For example, the variable “group” is encoded as 0 and 1, when you create a table later on, you don’t want to see 0’s and 1’s because that doesn’t mean anything. It is better to change this variable to a factor, with explicit labels, so they actually mean something when they come out in your table.

Cartoon of cute fuzzy monsters dressed up as different X-men characters, working together to add a new column to an existing data frame. Stylized title text reads “dplyr::mutate - add columns, keep existing.” Learn more about dplyr::mutate.

You can use mutate to modify the type of the following variables:

Table 1: Variable types to modify
Variable name Original Desired Hint
sex character factor mutate(), as.factor()
class integer factor mutate(), as.factor()
All the clinical symptom variables integer logical mutate(across()), as.logical()
All the food variables representing the amount of specific foods eaten (those finishing with a capital “D”) integer factor mutate(across()), as.factor()
dayonset character date lubridate::dmy()
starthour and dayonset together

integer (starthour)

date (dayonset)

POSIXct, POSIXt lubridate::ymd_h() could have inside stringr::str_glue() with dayonset and starthour

Sex, group and class

Let’s start transforming one-by-one the first two variables in the table: sex, and class.

Code
copdata <- copdata %>% 
  dplyr::mutate(
    sex = as.factor(sex),
    class = as.factor(class))

Symptoms and Food variables

For these variables, we are going to show you a couple of different ways to carry out the same variable type transformation in a set of variables, so you don’t need to do one variable at a time. We are showing you these ways so you see alternative ways to do the same thing.

  1. For the variables that are clinical symptoms, we will list them one by one and show you the use of mutate(across( )).
Code
copdata <- copdata %>% 
  dplyr::mutate(
    # clinical symptoms
    across(.cols = c(diarrhoea, bloody, vomiting,
             abdo, nausea, fever,headache, jointpain), 
           .fns = ~ as.logical(.)
           )
    )
  1. For the variables that are food doses, we will show you how to first create a vector of names, following by using mutate(across(all_of( ))) on this vector.
Code
# Create a vector with all the food variables representing the amount of specific foods items eaten (those finishing with a capital "D")
# One way of doing it:
food_dose <- copdata %>% 
    dplyr::select(
      ends_with("D", ignore.case = FALSE)) %>% 
    names()

# Another way of doing it:
# food_dose <- c("fetaD", "sardinesD", "eggplantD", "moussakaD", 
#                 "orzoD", "greeksalD", "dessertD", "breadD", 
#                 "champagneD", "beerD", "redwineD", "whitewineD")


copdata <- copdata %>% 
  dplyr::mutate(
    # food dose variables
    across(.cols = all_of(food_dose), 
           .fns = ~as.factor(.))) 

The tilde (~) above is used to apply the transformation as.logical(.) to each selected column, which in our case is either all columns included in food_items and food_dose.

Date and time variables

Why may you want to transform the variable starthour to date-time?

Fuzzy cartoon monsters in construction hats sorting a pile of “Y” “M” and “D”s into separate piles labeled “Years”, “Months” and “Days”. Title text reads “Lubridate: wrangle times and dates!” Learn more about date-time wrangling with lubridate.

You can use lubridate::dmy() to mutate the dayonset variable into a date variable. Note that we are using the function dmy() because dates are formatted as day, then month (abbreviated character string), then year (i.e. “12nov2006”).

Code
# Have a look at how the data is stored
head(copdata$dayonset)
[1] "6oct2024" ""         ""         ""         "6oct2024" "7oct2024"
Code
class(copdata$dayonset)
[1] "character"
Code
# Update copdata:
copdata <- copdata %>% 
  # Change column to date class:
  dplyr::mutate(
    dayonset = lubridate::dmy(dayonset))

# Check class of updated column:
class(copdata$dayonset)
[1] "Date"
Code
# Have a look at your data now:
head(copdata$dayonset)
[1] "2024-10-06" NA           NA           NA           "2024-10-06"
[6] "2024-10-07"

Having a variable that defines “time” in an outbreak investigation can be very useful when creating a case definition. An hour of the day, without a date associated with it doesn’t help you much, thus, you should merge together day and time of onset of symptoms into a single variable. Moreover, you will be using this combined variable later on to estimate an incubation period and create your epicurve. We can combine these two variables by using the lubridate::ymd_h() function.

Before we proceed, it would be wise to check if any respondents have a value for dayonset but not starthour, or vice versa. The lubridate date-time conversion functions do not have an explicit argument for dealing with missing values, but the truncated = … argument can help prevent spurious date-times being derived from a date-time combination where one value is missing.

We can check if we have any missing values by cross-tabulating starthour with dayonset:

Code
# Cross-tabulate dayonset with starthour:
janitor::tabyl(dat = copdata, 
               var1 = starthour, 
               var2 = dayonset)

This shows us that there are two respondents who had an onset date, but are missing onset time (starthour). Since starthour is represented by 1 - 2 digits, we can specify that we want lubridate to also parse date-time combinations that are truncated by up to two digits:

Code
copdata <- copdata %>% 
  # Combine dayonset and starthour in a new date time variable:
  mutate(onset_datetime = 
           lubridate::ymd_h(
             str_glue("{dayonset}, {starthour}"), 
                                           # Deal with missing starthour:
                                           truncated = 2))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `onset_datetime = lubridate::ymd_h(str_glue("{dayonset},
  {starthour}"), truncated = 2)`.
Caused by warning:
!  175 failed to parse.

Note that we needed to use str_glue() to concatenate dayonset and starthour together before we could convert the variable to a date-time object. This is because the ymd_h() function expects a single character string, containing both the date and the time, as input.

The argument truncated = 2 will result in dates with missing starthour still being converted to date-time, with the missing time being set to 00:00 (midnight). Whether you want to deal with missing starthour in this way or prefer to code these date-times as NA will depend on how you want them to be represented in your analysis.

Now we can check that everything in the new combined date-time variable has parsed correctly:

Code
head(copdata$dayonset)
[1] "2024-10-06" NA           NA           NA           "2024-10-06"
[6] "2024-10-07"
Code
head(copdata$starthour)
[1]  9 NA NA NA 15 15
Code
head(copdata$onset_datetime)
[1] "2024-10-06 09:00:00 UTC" NA                       
[3] NA                        NA                       
[5] "2024-10-06 15:00:00 UTC" "2024-10-07 15:00:00 UTC"

3.5. Export clean data

Save the cleaned data set before proceeding with using your case definition to identify cases in your dataset. Use the .rds format, as it preserves column classes. This ensures you will have only minimal cleaning to do after importing the data into R at the next inject.

Save your clean data as a new file “Spetses_clean1_YOURINITIALS.rds”, under the data folder of your Rproject.

Use rio::export().

Code
rio::export(x = copdata, 
            file = here::here("data", "Spetses_clean1_2024.rds"))