This case study was originally written by the following authors:
An R companion guide for the computer practicals accompanying the case study was developed in 2017 by:
This was updated in 2018 by:
A significant update was made in 2022, to replace base R code with
more modern functions from tidyverse
packages and to add an
alternative mixed-methods based approach for section 5 following the
introduction of this approach in the STATA version of the case study
guide by Emily White Johansson in 2022. R code was also added for
sessions 1 and 2. The R guide was updated in 2022 by:
The latest updates for the current version address feedback from EPIET facilitators and fellows and also include some minor bug fixes. These updates were implemented in 2023 by:
All copyrights and licenses of the original document apply here as well.
This guide has been designed for use in two different ways; which one to follow will depend on how experienced you are in using R for analysis, and also your familiarity with the statistical concepts being covered. We recommend that facilitators choose which method to follow after discussing with their groups to determine the level of R experience and background knowledge on the statistical concepts.
We recommend that regardless of which method you choose, the learning objectives remain focused on gaining an understanding of the statistical and epidemiological concepts to apply, particularly around sampling and survey design.
Method 1 - run as a paper discussion-based case study
Choose this method if participants have limited to no experience with
R, or the group is very mixed, or there is less background familiarity
with the statistical concepts. The tasks in sessions 1 to 3 can mostly
be completed by hand with a calculator; for sessions 4 and 5 the results
tables can be the focus of the discussion. If you are using this method,
we recommend you click on the button to hide code
at the
top of the html version of this document. This will make the text easier
to read. The R coding tips
sections can be skipped, and
participants can refer to these later in their own time if they
wish.
Method 2 - run as a computer practical
Choose this method if at least the majority of participants in the group are already using R in their work and ideally, have also undertaken the Outbreak and MVA module computer practicals in R. We recommend that a Microsoft Teams meeting be set up for each group (even face to face groups) so that facilitators can share their screen (with the sound muted if in person). Facilitators should install all the required packages on the computer they will be using, and attempt to knit this R markdown file prior to the course.
It is also recommended that participants attempt to install all the required packages before starting the course. It is advisable to allocate some time outside of that allocated for this case study, to troubleshoot package install problems, at the beginning of the course. Prior experience has shown that installation problems can be time consuming to resolve.
We suggest that the case study begins as for the discussion-based method, with groups reading the background information and considering the questions and how they would go about completing the tasks together. Participants may then attempt to run the code on their computers, first reading and following the R coding tips and then by running each R markdown chunk and inspecting and discussing the output. For any troubleshooting, facilitators can share their screens to demonstrate solutions or explain particular aspects of how the code works, using live demonstration techniques.
If participants are encountering difficult to resolve problems with R, these are most likely to be due to their R set up and the security settings on their computers (which are becoming increasingly complex in most institutions). To avoid running out of time for conceptual epidemiological discussions, we recommend encouraging participants to pair up with other group members that do not have these issues, for the rest of the session. Further troubleshooting with setup and package install issues can then be addressed during the breaks.
This guide is not intended to be an introduction to R, so if you wish
to follow the case study by running the R code, it is essential to have
a working knowledge of R syntax (particularly the syntax used by
dplyr
and tidyverse
packages). Understanding
the R universe, syntax and the way that packages work to undertake data
cleaning, management and manipulation tasks is a massive topic, and
outside the scope of this guide. If you are relatively new to R,
however, you may find it helpful to work through the main data
management chapters in the online Epidemiologist R handbook prior
to undertaking this case study. This is a relatively new resource that
was created especially for epidemiologists who want to get started using
R. Wherever possible, in this guide we have used the approaches
demonstrated in the Epidemiologist R handbook to undertake the generic
data manipulation tasks; you may therefore also find it useful as a
reference.
The R coding tips for each session focus principally on explaining the R functions necessary to undertake the analytical aspects of the task in question. Some guidance is also provided on ways to present the results. When a new function is introduced, a description of how it works is included in the text, and each line of code is also annotated.
Note that some prior knowledge about descriptive, single variable and multivariable analysis is expected from previous modules (i.e. Outbreak module, MVA).
This guide is for participants who wish to undertake the RAS vaccine coverage case study using R as the primary analysis software. All the information required to complete the case study has been included. The following materials have been provided in the participant pack:
This guide, available in three formats:
data
folder containing the required data sets for
this case study:
Sampling_frames.xlsx
vaccine4.dta
vaccine5.dta
RAS_VCE_R_packages2install.R
: R script for
installing required packages
RAS_VCE_R_guide.Rmd
: R markdown with coding
solutions for each task
You may find it easier to view and work through the html file in your internet browser, as it has the following advantages:
code
buttons to reveal or hide the code
chunksThe R markdown file can be used to run each chunk of code and explore the results as is, or alternatively, to modify the code if participants wish to explore the R functions in more depth.
The case study is broken into 5 sessions, each of which consist of one or more tasks to complete. For each session, you are presented with the following sub-sections:
A note on similarities with the STATA case study guide:
This R guide has been aligned as far as possible with the original STATA guide; this means that the same statistical methods are used in both guides. However, you may find minor differences in the numeric results; this is usually due to differences in rounding or minor differences in the calculations between the two software platforms. None the less, the results from both platforms should yield the same interpretation.
It is advisable to have up-to-date versions of R and RStudio installed. The code in this guide was updated and tested using:
Note that if you are using a Microsoft Windows operating system, you
will also need to have the separate software Rtools
installed on your computer. Some of the packages required for this case
study come from a GitHub repository, and packages there are typically
not stored in a Windows-compatible format. Rtools
converts
GitHub packages to a windows-compatible format during the installation
process and compiles them on your computer. Note that your
Rtools
version should be compatible with your R version.
You can download and install the relevant Rtools version from the CRAN
repository here. Note you
may need administrator rights to install it.
This guide uses packages from the tidyverse
which are
being used more and more for data analysis, due to their easy to read
and efficient coding syntax.
The packages required for this case study can be installed by running
the code below. This code has also been made available as a separate R
script in the participant pack, called
RAS_VCE_R_packages2install.R
. It uses the
pload()
function from the pacman
package as
this will check the user’s library for required packages, install them
if they are missing, and then load them into the current R session.
First, check that pacman
is installed and install it if
not:
# Packages to install *before* running pacman pload():
#####################################################
# Ensures the package "pacman" is installed
if (!require("pacman")) install.packages("pacman", quiet = TRUE)
# Ensure the package "remotes" is installed (used to compile packages)
if (!require("remotes")) install.packages("remotes", quiet = TRUE)
Now you can use the pacman::p_load()
function to check
for and install (if not already in your R library) packages from the
CRAN repository, and the sister function
pacman::p_load_gh()
to do the same for packages from GitHub
repositories.
Note that these functions also load installed packages into your R session, so you will need to rerun this chunk every time you start a new R session:
# Install and/or load required packages:
########################################
pacman::p_load(
# Project and file management
#############################
here, # file paths relative to R project root folder
rio, # import/export of many types of data
rmarkdown, # reading and printing RAS R markdown guide
knitr, # reading and printing RAS R markdown guide
# Package install and management
################################
pacman, # package install/load
remotes, # install from github
# General data management
#########################
tidyverse, # includes many packages for tidy data wrangling:
#dplyr, # data management
#tidyr, # data management
#ggplot2, # data visualization
#stringr, # work with strings and characters
#forcats, # work with factors
#lubridate, # work with dates
#purrr # iteration and working with lists
janitor, # data cleaning
matchmaker, # data cleaning
# Summary tables:
#########################
flextable, # Creating printable summary tables
officer, # Fine tuning flextables
gtsummary, # making descriptive and statistical tables
smd, # Helper functions for gtsummary tables
labelled, # Labelling variables and values
scales, # helper functions
Hmisc, # Summary functions
# Analysis:
#########################
pps, # Probability proportional to size (PPS) sampling
sampler, # Calculate sample size
srvyr, # Create survey design and calculate vaccination coverage
survey, # Support for srvyr functions
fixest, # Multivariable models with multiple fixed effects
broom.mixed # Extract tidy table of mixed effects model results
)
# Check for and install Github packages:
pacman::p_load_gh(
# Packages for survey analysis and mixed effects:
#################################################
"R4EPI/sitrep" # Function wrappers for survey analysis
)
Note that the first time you run this code, it will install any missing packages from the relevant repository. You will need a stable internet connection for this, and it will take several minutes. Do not try to run any other code until the package installs are completely finished.
We recommend that when you run this code for the first time, you check the output carefully for any errors that could indicate a failure to install some packages. If you do encounter any errors or other difficulties, please ask a facilitator for help during the pre-case study session that is dedicated to this.
A useful test to ensure packages have been installed correctly, is to close RStudio after running this code the first time, then reopen RStudio and run this code chunk. If you find that R is still trying to install packages, this likely means that some packages failed to install the first time. If all packages and their dependencies installed correctly the first time, rerunning this code chunk should simply load the packages into your R environment.
File organisation:
The R practical material for this case study is available in a zip
folder called RAS_VCE_participant_pack
. You can download
this folder here
or from the EVA platform. Once you have downloaded it, unzip the file
and save it on your computer. We recommend saving this file to a
location that is not connected to OneDrive or to any other Shared Drive,
if possible, as this can sometimes interfere with package installs or
other functionality.
There is an R project file (RAS_VCE_R_project.Rproj
) in
this folder and we suggest you begin this practical session by double
clicking on this file to open it.
In the Files
pane of your RStudio window, you should see
a folder called rcode
. Click on this folder in the RStudio
pane to view the contents. Once you are inside this folder, you should
see an R markdown file called RAS_VCE_R_code_2022.Rmd
.
Click on this file to open it. It contains all the code chunks that are
also present in this guide, but without explanatory text. This file is
for you to run and explore the code in each chunk, modify and experiment
with the code if you wish, and make your own notes.
The data sets have been stored in a sub-folder called
data
.
Note that the relative file paths used for importing data in this guide assume the above folder structure.
Importing data sets:
The data sets can be easily imported using the import()
function from the rio
package, combined with the
here
package to construct relative file paths.
Most of the data sets are STATA .dta
files, while the
data required for the first three excercises is saved in a Microsoft
Excel workbook. Both file types can be read in to R using
rio::import()
and specifying the relevant arguments.
The data dictionary for each data set is included in the appendices at the end of this guide.
This case study is built around a national survey that used stratified cluster sampling to estimate the vaccination coverage of six year-old children in Greece and to identify predicting factors of complete and timely vaccination status. Teaching methods will include presentations, interactive exercises and exercises using computers.
Objectives
The aim of this case study is to equip participants with the tools to design and conduct a cross-sectional study to estimate vaccination coverage and analyse data from a cross-sectional survey. At the end of the course, participants should have acquired the following skills:
1. Epidemiological skills:
2. Computer skills:
Case study sessions
This case study consists of the following 5 sessions:
Session 1: Sampling
Session 2: Sample size calculation
Session 3: From analysis plan to data collection
Session 4: Descriptive statistics with weighted proportions
Session 5: Analysis of associations with having a fully vaccinated child
Duration of the case study: 7-9 hours (additional 3-4 hours for the optional sessions)
Vaccine-preventable diseases are still a public health burden worldwide, largely due to suboptimal vaccination coverage, and constitute an important component of the health care debate in many countries. To improve vaccination coverage and reach high immunisation levels, the reasons for inadequate vaccinations need to be identified and addressed.
Vaccination in Greece
In Greece, vaccines included in the National Vaccination Programme (NVP) (Box 1) are provided free of charge to all residents (including immigrants) in primary health care centres or health insurance clinics. Childhood vaccination coverage is not monitored routinely. Several ad-hoc studies have been conducted occasionally at local level using non-representative samples. Most of the current knowledge on immunisation uptake is based on two national population-based surveys conducted in 1997 and in 2001 using representative samples of children attending the first year of Greek Grammar school (children born in 1991) and of two-year olds (children born in 1997), respectively. Results of those surveys showed high vaccination coverage [93.5% and 98.9% for the 3rd dose of diphtheria-tetanus toxoid and pertusis (DTP) vaccine; and 89.1% and 93.7% for the 1st dose of measles-mumps-rubella (MMR) vaccine in 1997 and 2001, respectively]. However, an outbreak of measles in 2006 revealed low coverage among the Roma minority (1% of the population) and immigrants (8% of the population) who accounted for 55% and 15% of all cases, respectively.
Box 1- Recommended childhood vaccination schedule according to the National Vaccination Programme (NVP) at the time of the study in Greece
Recommended age | DTaPa | Hibb | IPVa | MMRa | HepBb | MCVc | PCV7c | Varc | BCG | dT |
---|---|---|---|---|---|---|---|---|---|---|
2 months | Yes | Yes | Yes | Yes | Yes | Yes | ||||
4 months | Yes | Yes | Yes | Yes | Yes | Yes | ||||
6 months | Yes | Yes | Yes | |||||||
6-18 months | Yes | Yes | ||||||||
12-15 months | Yes | Yes | Yes | |||||||
12-18 months | Yes | |||||||||
15-18 months | Yes | Yes | ||||||||
4-6 years | Yes | Yes | Yes | |||||||
6 years | Yes | |||||||||
11-18 years | Yes | |||||||||
13-18 years | Yes | |||||||||
>18 years | Yes | |||||||||
aDTaP, IPV & MMR were introduced in the NVP before 1990 | ||||||||||
bHib & HepB introduced in 1998 and 2002, respectively | ||||||||||
cMCV, PCV7 and Var introduced in 2006 (although MCV was available in pharmacies since 2001) |
Scenario
You work at the national public health institute and you are asked to provide data on vaccination coverage of children in Greece and identify the potential reasons for low vaccination uptake.
Acronym or term | Meaning |
---|---|
BCG | Bacillus Calmette-Guérin |
DTP | Diphtheria-Tetanus-Pertusis vaccine |
DTaP | Diphtheria-acellular Tetanus-Pertusis vaccine |
dt | reduced diphtheria toxoid-Tetanus toxoid |
Hib | Haemophilus Influenzae type b |
IPV | Inactivated Poliomyelitis Vaccine |
MMR | Measles Mumps Rubella vaccine |
OPV | Oral Poliomyelitis Vaccine |
HepB | Hepatitis B |
HepA | Hepatitis A |
MCV | Meningococcal Conjugate Vaccine |
Men C | Meningitis C |
PCV7 | Pneumococcal Heptavalent Vaccine |
PR | Prevalence Ratio |
RR | Risk ratio / Rate ratio |
OR | Odds Ratio |
95% CI | 95% Confidence Intervals |
Session overview
This session focuses on how to select a representative sample from a specified population, using basic and more complex sampling methods.
By the end of this session, participants should be able to:
This session consists of 6 Tasks:
Background:
The aim of the study was to estimate the vaccination coverage among 6-year-old children in Greece and identify reasons for incomplete vaccination status.
The specific objectives were to:
Task instructions:
Discuss and answer the following questions:
What sampling frames might be available?
What are some necessary prerequisites for using them?
What options of sampling schemes would you consider?
What are the advantages and disadvantages of each one?
Explain in plain language the sampling strategy chosen by the researchers and describe the different steps of sampling.
In the selected sampling design:
“Samples of approximately equal size were selected in each region”:
Researcher selected strategies:
Q 1 - 2 (Sampling frames and prerequisites):
You may consider the following sampling frames:
In this survey the researchers decided to take a sample of 1st grade school children.To assess whether this study population is appropriate, it would be necessary to have information on:
A full list of school children was impossible to obtain. Nevertheless, it was feasible to get a list of all schools in the country, with number of first grade classes in each school.
Q 3 - 4 (Sampling schemes, advantages and disadvantages):
The following options of sampling schemes should be considered:
A. Sampling of children (final sampling units):
B. Two-stage cluster sampling:
C. Multi-stage cluster sampling:
D. Sampling of school-classes or schools (clusters, primary sampling units):
E. Systematic sample of school classes or schools:
F. Stratification by region:
G. Stratification by region and by urban/rural area:
Fig. 1.1 - Map of Greece with regions used in stratification
The researchers decided to use the following sampling strategy: stratified cluster sampling of first-grade school children, using school classes as clusters and stratification by region (see map) and urban/rural area; approximately equal sample size was used in each region, and within each region sample size in urban/rural areas was proportional to size; all children in each school-class were selected.
Q 5 (Explain chosen sampling strategy):
A national sample of first grade school children was selected using the following sampling strategy:
Q 6 (Metrics for chosen sample design):
The following metrics describe the chosen sample design:
Q 7 (Rationale for selecting samples of equal size per region):
The researchers had set as their objective to estimate vaccination coverage by geographical region. Therefore, the sample size would have to be adequate in each and every region in order to have estimates of acceptable precision. However, the researchers chose to get samples of equal size in each region. (In fact, they aimed at a somewhat higher precision for the region of Attiki, and therefore the sample size for this region was somewhat larger).
As the population of each region is different, opting for equal sample size in each region leads to different sampling fractions for each region. This means that weighting should be used in the analysis (for national level results) to take account of the different sampling fractions of the regions. This increases the complexity of the analysis.
Background:
Stratified sampling is used when the population consists of distinct subgroups (strata), which differ with respect to the feature under study. In this study, researchers expected urban and rural areas in each region to be different in terms of vaccination coverage. Therefore, the same proportion of clusters (school classes) was taken from each stratum within a region to ensure that they were all adequately represented.
Your first task is to calculate the number of classes to sample in each region and rural / urban stratum.
Task instructions:
Using the data provided in table 1.2 below (which can be imported
into R from the table1_2
worksheet of the
Sampling_frames.xlsx
Microsoft Excel workbook in the
data
sub-folder):
Within each region, calculate the number of clusters (school classes) to sample from;
Calculate the number of classes for each urban / rural stratum relative to the population size;
Aim to include 50 school classes from each region;
Assume that all school classes have an equal number of students;
Append your results to table 1.2 (percentage and number of classes to sample for each urban and rural stratum in each of the 6 regions) to complete it.
Region | Urban | Rural | TOTAL |
---|---|---|---|
Thraki | 11624 | 6884 | 18508 |
Macedonia-Thessalia | 116322 | 47101 | 163423 |
Ipeiros-Ionian Islands | 11446 | 15449 | 26895 |
Peloponissos-Sterea | 55770 | 44132 | 99902 |
Attiki | 174493 | 1667 | 176160 |
Crete-Aegean Islands | 36374 | 24752 | 61126 |
TOTAL | 406029 | 139985 | 546014 |
R coding tips:
First, import the summary data from table 1.2.
This table is currently in wide format, but to calculate the number
of classes to sample from in each strata, it needs to be converted to
long format (i.e. ‘tidy data’) so that a single column defines the
strata (urban or rural) and another column contains the population for
each stratum. We can do this with the tidyr::pivot_longer()
function.
We can then calculate the relative proportions in each stratum and
from there determine the number of school classes to sample (out of 50)
for each stratum within each region. We will then use the
qflextable()
function from the flextable
package to display the results in a printable format. This function
takes a data frame and formats it nicely.
Table 1.2 - Updated with proportions and number of classes to sample per stratum
# Import sample frame for regions:
regions <- rio::import(here::here("data", "Sampling_frames.xlsx"),
which = "table1_2")
# Create new table for results:
nclasses2sample <- regions %>%
# Convert table to tidy data:
tidyr::pivot_longer(cols = c(Urban, Rural), # Convert to tidy format
names_to = "Stratum", # Store strata in new col
values_to = "Population") %>% # Store population in new col
# Calculate proportions and number of classes to sample:
mutate(Proportion = Population/Total, # Calculate proportions
Nclasses = round(50*Proportion, digits = 0))# Calculate number of classes
#################
# Print results:
################
# Convert to a flextable:
nclasses2sample_pub <- flextable::qflextable(nclasses2sample) %>%
flextable::set_table_properties(width = 1, layout = "autofit")
# Print:
nclasses2sample_pub
Region | Total | Stratum | Population | Proportion | Nclasses |
---|---|---|---|---|---|
Thraki | 18,508 | Urban | 11,624 | 0.628052734 | 31 |
Thraki | 18,508 | Rural | 6,884 | 0.371947266 | 19 |
Macedonia-Thessalia | 163,423 | Urban | 116,322 | 0.711784755 | 36 |
Macedonia-Thessalia | 163,423 | Rural | 47,101 | 0.288215245 | 14 |
Ipeiros-Ionian Islands | 26,895 | Urban | 11,446 | 0.425580963 | 21 |
Ipeiros-Ionian Islands | 26,895 | Rural | 15,449 | 0.574419037 | 29 |
Peloponissos-Sterea | 99,902 | Urban | 55,770 | 0.558247082 | 28 |
Peloponissos-Sterea | 99,902 | Rural | 44,132 | 0.441752918 | 22 |
Attiki | 176,160 | Urban | 174,493 | 0.990537012 | 50 |
Attiki | 176,160 | Rural | 1,667 | 0.009462988 | 0 |
Crete-Aegean Islands | 61,126 | Urban | 36,374 | 0.595065929 | 30 |
Crete-Aegean Islands | 61,126 | Rural | 24,752 | 0.404934071 | 20 |
Background:
Table 1.3 is part of the sampling frame for the stratum “rural areas of the region of Peloponissos-Sterea”, where all schools have only one first grade school-class.
Questions to answer:
Task instructions:
Generate a simple random sample of 22 school classes using random number selection in a computer programme;
Identify the selected school classes in the sampling frame (table 1.3);
Calculate the sampling fraction in terms of PSU.
R coding tips:
The sampling frame (list of school classes in the Peleponissos-Sterea
region) can be imported from the Sampling_frames.xlsx
workbook (table1_3
is the name of the worksheet).
The simple random sample of 22 classes can then be identified using
the dplyr
convenience function slice_sample()
to return a random subset of 22 rows of the original sampling frame.
Note that you must ensure you always get the same number back by
using the set.seed()
function - this way if you send your
script to someone else they would be able to reproduce the same sample.
The function set.seed()
must be used right before the
function selecting the sample.
Table 1.3 - Simple Random Sample of 22 school classes
# Import the sampling frame of school classes in Peleponissos-Sterea:
class_frame <- rio::import(here::here("data", "Sampling_frames.xlsx"),
which = "table1_3")
# Set seed to ensure you always get produce the same result
# (the number in set.seed can be anything you want)
set.seed(26598)
# Perform a simple random sample to select 22 classes from the list:
msrs <- class_frame %>%
# Subset the simple random sample of 22 classes:
slice_sample(n = 22) %>%
# Arrange results in ascending order by ID:
dplyr::arrange(ID)
#################
# Print results:
################
# Convert to flextable:
msrs_pub <- flextable::qflextable(msrs) %>%
flextable::set_table_properties(width = 1, layout = "autofit")
# Print:
msrs_pub
ID | School_code |
---|---|
1 | ACH-013 |
2 | ACH-014 |
4 | ACH-027 |
5 | ACH-036 |
14 | ACH-105 |
25 | ACH-137 |
37 | ACH-185 |
44 | ACH-200 |
52 | ACH-230 |
53 | ACH-265 |
56 | ACH-276 |
58 | ACH-290 |
63 | ACH-336 |
75 | ACH-430 |
79 | ILIA-016 |
83 | ILIA-028 |
89 | ILIA-036 |
90 | ILIA-037 |
93 | ILIA-043 |
95 | ILIA-045 |
97 | ILIA-048 |
102 | ILIA-062 |
The sampling fraction is the number of units sampled divided by the total number of units in the sampling frame, i.e.:
\[ Sample Fraction = \frac{n}{N} \]
where:
\(n\) is the sample size of PSU (school classes) and
\(N\) is the total number of PSU (school classes) included in the sampling frame.
# Calculate the sampling fraction for this example:
sample_fraction <- round(22/nrow(class_frame), digits = 2)
#################
# Print results:
################
sample_fraction
## [1] 0.22
Background:
This section is optional. Systematic sampling is similar to simple random sampling, except that the total sampling frame is divided by the desired number of sampling units to select and every qth row of the sampling frame is selected according to the result. This can be a useful sampling strategy when subjects are being prospectively recruited to a study, when the total size of the sampling frame is not yet clear or when it is necessary to ensure consistent sampling over multiple days with an unknown number of subjects recruited per day.
Questions to answer:
Task instructions:
Use the same sampling frame as for the previous section
(class_frame
);
Divide the total number of classes in the sampling frame by the desired number of units to sample (n=22) to calculate the size of the interval (q); round down q if it is not a whole number;
Randomly select one between 1 and q;
Select every qth row of the sampling frame to produce the systematic sample;
Limit the number of returned rows to the desired number of units to sample (n=22).
R coding tips:
After calculating the size of the groups to divide the sampling frame
into, we can select a random number from this group size using the base
R function sample.int()
.
Remember to use the set.seed()
function to ensure you
always get the same number back.
This random number can then be fed into the
dplyr::slice()
command as shown below to select every
qth row of the sampling frame. Note that if q is small
/ an early number in the sequence, an excess of rows may be selected. If
this happens, the returned rows can be limited to the desired sample
size by using the dplyr::slice_head()
function:
Table 1.4 - Systematic sample of school classes
# Define the sampling interval - the n to systematically sample
# (i.e. every nth row to choose):
sample_n <- as.integer(nrow(class_frame)/22)
# Set seed to ensure you always get produce the same result
set.seed(26598)
# Calculate the random starting point for systematic sampling:
sample_start <- sample.int(n = sample_n, size = 1)
# Define rows systematically (every nth row) to filter the sampling frame:
sample_rows <- seq.int(sample_start, nrow(class_frame), sample_n)
# Now subset the sampling frame and filter for every nth row:
msystematic <- class_frame %>%
# Select every nth row from the sampling frame:
slice(sample_rows) %>%
# Limit the number of returned rows to the desired sample size (22)
slice_head(n = 22)
#################
# Print results:
################
# Convert to flextable:
msystematic_pub <- flextable::qflextable(msystematic) %>%
flextable::set_table_properties(width = 1, layout = "autofit")
msystematic_pub
ID | School_code |
---|---|
4 | ACH-027 |
8 | ACH-055 |
12 | ACH-093 |
16 | ACH-107 |
20 | ACH-118 |
24 | ACH-129 |
28 | ACH-159 |
32 | ACH-173 |
36 | ACH-184 |
40 | ACH-191 |
44 | ACH-200 |
48 | ACH-215 |
52 | ACH-230 |
56 | ACH-276 |
60 | ACH-294 |
64 | ACH-338 |
68 | ACH-359 |
72 | ACH-375 |
76 | ACH-519 |
80 | ILIA-020 |
84 | ILIA-029 |
88 | ILIA-034 |
Advantages:
As mentioned in the background to this task, systematic sampling can be useful when you need to undertake prospective recruitment from a theoretical sampling frame of unknown size (like the number of patients attending a clinic on a given day). However, in the context of this case study, we already have a well defined and complete sampling frame (school classes, which also list the students in each class). Because of this, a systematic sampling strategy won’t provide any benefits over other methods, and we can adopt the most straight-forward approach – simple random sampling!
Disadvantages:
Bias can be introduced, if the sampling units are grouped in such a way that particular characteristics occur at regular intervals Therefore, the assumption made with systematic sampling is that there is no underlying pattern in the sampling frame.
Background:
The researchers originally chose to use school classes as the sampling units, because it was assumed that all classes had similar numbers of students. An alternative sampling unit would be the schools themselves; however unlike for classes, the number of students in each school varies widely.
Probability proportional to size (PPS) sampling is a method that takes varying sample sizes in the sample units into account. This helps to avoid under-representing one subgroup in a study and yields more accurate results.
In this case study, you have been provided with a list of schools in
urban areas of Attiki, which also indicates the number of grade 1
students in each school and the number of classes. This data set can be
imported from the table1_5
worksheet in the
Sampling_frames.xlsx
workbook. Schools can be selected
adjusting for the size of the student population, by performing PPS
sampling. The column containing the number of students in each school is
used as input.
Questions to answer:
Task instructions:
Use the sampling frame for Attiki schools in urban areas
presented in table1_5
Select 12 schools by probability proportional to size sampling without replacement
List the schools that have been selected
R coding tips:
There are several packages available on CRAN that can perform PPS
sampling. In the example code below, we will use the
sampford()
function from the pps
package to
perform the sampling without replacement, which uses the classic method
developed by Sampford
(1967).
We can use the indices in the results to filter the sampling frame
for the 12 selected schools (using dplyr::slice()
as
above).
Note that an alternative would be to use
dplyr::slice_sample()
and supply probabilities for each
school to the weight_by
argument. To do this you would have
to calculate probabilities for each school:
school_frame %>% mutate(probability = Students / sum(Students))
.
We opted for using the pps::sampford()
function, as the
documentation on methods is more transparent.
Table 1.5 - School classes selected by Probability Proportional to Size sampling
# Import the sampling frame of urban schools with grade 1 classes in Attiki:
school_frame <- rio::import(here::here("data", "Sampling_frames.xlsx"),
which = "table1_5")
# Perform PPS sampling to select 12 schools:
mpps <- school_frame %>%
# Select schools by PPS without replacement:
slice(pps::sampford(size = Students, n = 12)) %>%
# Arrange in ascending order by ID:
arrange(ID)
#################
# Print results:
################
# Convert to flextable:
mpps_pub <- flextable::qflextable(mpps) %>%
flextable::set_table_properties(width = 1, layout = "autofit")
# Print:
mpps_pub
ID | Code | School | Classes | Students |
---|---|---|---|---|
14 | ATT-937 | ΔΗΜΟΤΙΚΟ ΚΕΡΑΤΕΑΣ 3ο | 4 | 112 |
15 | ATT-938 | ΔΗΜΟΤΙΚΟ ΣΑΡΩΝΙΔΑΣ | 4 | 96 |
16 | ATT-943 | ΔΗΜΟΤΙΚΟ ΑΘΗΝΩΝ 115ο | 4 | 76 |
22 | ATT-956 | ΔΗΜΟΤΙΚΟ ΜΑΡΟΥΣΙ 11ο | 4 | 72 |
23 | ATT-966 | ΔΗΜΟΤΙΚΟ ΗΛΙΟΥΠΟΛΗΣ 13ο | 5 | 105 |
25 | ATT-974 | ΔΗΜΟΤΙΚΟ ΝΕΑ ΣΜΥΡΝΗ 8ο | 3 | 81 |
32 | ATT-096 | ΔΗΜΟΤΙΚΟ ΖΩΓΡΑΦΟΥ 10ο | 1 | 18 |
35 | ATT-103 | ΔΗΜΟΤΙΚΟ ΑΘΗΝΩΝ 138ο | 5 | 100 |
36 | ATT-104 | ΔΗΜΟΤΙΚΟ ΤΑΥΡΟΥ 5ο | 5 | 105 |
41 | ATT-114 | ΔΗΜΟΤΙΚΟ ΜΕΛΙΣΣΙΑ 3ο | 3 | 72 |
43 | ATT-120 | ΔΗΜΟΤΙΚΟ ΑΧΑΡΝΩΝ 12ο | 2 | 42 |
47 | ATT-138 | ΔΗΜΟΤΙΚΟ ΑΓΙΟΙ ΑΝΑΡΓΥΡΟΙ 9ο | 4 | 88 |
Selection “without replacement” suggests that after selection, a school is NOT included in the draw and may not be selected again. Therefore, if a school is “selected” for a second or a third time, this is not taken into account.
Background:
Cluster sampling and stratification was used in this survey.
Questions to answer:
Cluster sampling and stratified sampling:
1. Cluster sampling:
Basic idea:
group sampling units (or, better, use “natural” grouping of sampling units; e.g. students grouped in school-classes or schools)
select a sample of these groups (clusters).
Effect on precision:
decrease in precision of the outcome measure (e.g. proportion vaccinated) compared to simple random sampling (for the same sample size), in other words, the confidence intervals in a cluster sample tend to be wider than those from a simple random sample.
homogeneity between clusters for the outcome measure (e.g. proportion vaccinated) and heterogeneity within clusters for the outcome measure. This is an advantage: precision increased
Extreme situation:
all clusters are 100% homogeneous between them with regard to the outcome measure, e.g. there is exactly the same vaccination coverage in all school-classes
as if clusters were part of the same simple random sample
Logistics: - Since the units (individuals, children) are grouped, then it saves time and effort by not having to travel to reach other units (individuals, children); in one visit one can reach several units (individuals, children).
2. Stratified sampling:
Basic idea:
group sampling units with the same or similar characteristics (e.g. region, gender, age group); the characteristic is not the outcome of interest.
select a sample in each group (stratum) separately
Effect on precision:
Increase in precision compared to simple random sampling (for the same sample size), that is, the confidence intervals in stratified sampling tend to be narrower than in simple random sampling.
heterogeneity between strata with regard to the outcome measure (e.g. different proportion vaccinated in each stratum) and homogeneity within strata with regard to the outcome measure. This is an advantage: precision increased
Extreme situation:
all strata are 100% heterogeneous between themselves, e.g. extreme differences in vaccination coverage in different regions
as if strata are comprised of totally different populations
advantage in employing separate sampling procedures rather than mixing the populations
We will discuss these issues in more depth in the following session.
Session overview
One of the most important factors to consider in the design of a survey is the choice of an appropriate sample size. Studies that are too small may fail to detect important effects or obtain an imprecise estimate. Studies that are larger than necessary are a waste of time, money and other resources. In this session, you will perform all the necessary calculations to obtain the required sample size for a single proportion. The formulas are provided for each approach, so that you can initially perform your own calculations using a calculator. The R coding tips for this session then show you how to perform the same calculations using R.
A presentation on “Sample size calculation” should precede this session.
This session consists of 3 Tasks:
Task 2.1 – Sample size calculation for a single proportion
Task 2.2 – Sample size calculation for a single proportion using the design effect
Task 2.3 – Sample size calculation using R
Note that this case study covers only sample size calculations for a single proportion based on the precision of a study. It is not intended to cover sample size calculations based on the assessments of sample size or power for studies that compare proportions, means or incidence rates between two groups.
Background:
The formula used to calculate confidence intervals, assuming simple random sampling can be rearranged to calculate sample size as below:
\[ n = \frac {z^2 · p · (1-p)}{d^2} \]
However, this does not take population size into account. An adjusted version of the formula that includes a finite population correction is presented below:
\[ n = \frac {z^2 · p · (1-p) ·N } {[d^2 · (N-1)] + [z^2· p(1-p)]} \]
where:
\(n\) = sample size
\(N\) = total population
\(d\) = precision of the estimate or Standard Error (SE)
\(z\) = number of standard errors away from the mean of the sampling distribution (\(z = 1.96\) for 95% CI)
\(p\) = estimated proportion in the sample
Note that this expression incorporates the finite population correction (fpc) \(1 - (\frac {n}{N})\) . If the population is very large relative to the sample (i.e. small sampling fraction), the fpc approximates 1 and can be reasonably omitted. If the entire population is included in the sample, the fpc becomes 0 and so does the standard error. As a rule of thumb, the fpc should not be ignored if the sampling fraction \((\frac {n}{N})\) exceeds 5%-10% or if the total sampling population is less than 10 000 people. In that case, if the fpc is not used in the formula, the confidence intervals will be wider and the sample size larger than necessary. Otherwise, the size of the total population from which the sample is drawn has little effect in practice, and may be ignored.
Previous vaccination coverage surveys in the country have shown that vaccination coverage exceeded 85% for most vaccines. As the country was divided into 6 regions for this study, each region was considered as a separate survey and calculations were performed for each one to give the sample size with a fixed precision. The researchers decided to estimate the vaccine coverage for each of the 6 regions of the country with sufficient precision of ± 4%. This way, the precision of the overall national estimate was somewhat better than that for any single region. They also considered the response rate for a previous vaccine coverage survey of similar magnitude, which was 87.5%. The sample size was adjusted with a buffer to account for an estimated 12.5% of children in the sampling frame being non-respondents.
The following table presents the total number of pupils in the target class and age group for each region.
Region | Number of pupils (sampling frame) |
---|---|
Thraki | 4201 |
Macedonia-Thessalia | 31045 |
Ipeiros-Ionian Islands | 5055 |
Peloponnisos-Sterea Ellada | 17741 |
Attiki | 30586 |
Kriti-Aegean Islands | 12218 |
Greece (total) | 100846 |
Questions to answer:
Parameters required to calculate sample size:
The required sample size should be determined by focusing on the primary objective of the study (e.g. the estimation of vaccination coverage among 6-year-old children in Greece in this survey). To calculate the number of participants required to achieve the objectives of the survey, you should specify the following:
Expected frequency of the principal outcome measure (based on the main objective of the study) (e.g. expected vaccination coverage): The expected frequency may be obtained from previous surveys of similar subject matter. Note that the sample size will be largest when \(p = 0.5\) and it decreases slowly as the difference between \(p\) and \(0.5\) increases. Hence, if the expected frequency is unknown, choosing \(0.5\) for \(p\) will always provide enough observations, irrespective of the true population proportion.
The desired precision of the study: To select the appropriate sample size it is necessary to decide how precise you would like your estimate (e.g. vaccination coverage) to be. The amount of “precision” (sampling error) is represented by the width of the confidence interval around your estimate. The narrower the confidence interval (CI), the greater the precision of the estimate, but the higher the number of subjects that you need to include in the study. It is therefore necessary to decide on the width of an acceptable confidence interval. Hence, you should specify to within how many percentage points of the population value your sample estimate should be. For example, if the expected vaccination coverage is 50%, you may wish your sample estimate to be within 1%, giving a confidence interval of 49% to 51%.
The quantity z, that represents the number of standard errors away from the mean of the sampling distribution: For a level of significance of 5% (95% CI), \(z = 1.960\), whereas for significance levels of 10% (90% CI) or 1% (99% CI), \(z = 1.645\) or \(z = 2.576\), respectively.
Regional sample size calculation - task description:
Calculate the single proportion sample size required for each region, assuming simple random sampling;
Use the per region student population from table 2.1, previous vaccination coverage estimate and the desired precision presented in the background information above in your calculations;
Adjust your calculations to take into account the lower response rate (87.5%) achieved in previous surveys.
R coding tips:
The sampling frame sizes by region in table 2.1 can be imported into
R from the table2_1
worksheet in the
Sampling_frames.xlsx
workbook.
To calculate sample size for each region, we can use the
sampler
package. In the first instance, the
rsampcalc()
function can be used for each region. The
function arguments correspond to the parameters above as follows:
\(N\) = population (i.e. number of students in each region from table 2.1)
\(e\) = tolerable margin of error (i.e. desired precision of +/- 4%)
\(ci\) = confidence interval (i.e. 95% - the default value)
\(p\) = anticipated response distribution (convert 85% vaccination coverage to proportion i.e. 0.85)
\(over\) = desired over-sampling proportion (convert 87.5% response to inverse proportion i.e. 1 - 0.875)
Note that the last argument, over
can be used to
increase the sample size in order to compensate for non-responders
(individuals in the sample who choose not to participate). The
compensation is done by taking the inverse proportion of the response
rate from previous surveys (1 - 0.875) and adding this to the calculated
sample size as a buffer.
To first have a look at the sample sizes without this buffer, leave this argument on its default setting (0).
We will add the sample size calculations with and without a buffer to
the table, using the dplyr
package. We will then use the
qflextable()
function from the flextable
package to display the results in a printable format.
Table 2.1 - Estimated sample size by region and urban / rural residence area
#########################
# Import data
#########################
# Import the sampling frame of urban schools with grade 1 classes in Attiki:
region_frame <- rio::import(here::here("data", "Sampling_frames.xlsx"),
which = "table2_1")
#########################
# Calculate sample size
#########################
# Calculate sample size (number of students to survey) for each region:
region_frame <- region_frame %>%
# Create new column fpc and calculate per-region sample size
# (with finite population correction):
mutate(fpc = sampler::rsampcalc(N = Student_pop,
e = 4,
ci = 95,
p = 0.85,
over = 0)) %>%
# Create new column buffer and calculate sample size again
# this time adding a buffer for a response rate of 87.5%:
mutate(buffer = sampler::rsampcalc(N = Student_pop,
e = 4,
ci = 95,
p = 0.85,
over = 1 - 0.875))
#########################
# Present in a table
#########################
rfpub <- region_frame %>%
# Convert data.frame to flextable:
flextable::qflextable()
# Print the table:
rfpub
Region | Student_pop | fpc | buffer |
---|---|---|---|
Thraki | 4,201 | 286 | 322 |
Macedonia-Thessalia | 31,045 | 304 | 342 |
Ipeiros-Ionian islands | 5,055 | 289 | 326 |
Peloponnisos-Sterea Ellada | 17,741 | 301 | 339 |
Attiki | 30,586 | 304 | 342 |
Kriti-Aegean islands | 12,218 | 299 | 337 |
If you calculate sample size using the first formula (without a finite population correction) this yields:
\[ \frac {1.96^2 · 0.85 · (1-0.85)}{0.04^2} = 306 \]
In the fpc
results column in the table (which represents
sample size with a finite population correction but without adjustment
for non-responders), you can see that regions with a population size of
more than 10 000 (such as Macedonia-Thessalia or Attiki) have sample
sizes very close to 306, whereas the regions with populations less than
10 000 (such as Thraki and Ipeiros-Ionian Islands) have smaller sample
sizes when the finite population correction is applied.
The buffer
results column shows the sample size with a
buffer to account for an estimated 12.5% of the sample frame being
non-responders. This can also be calculated from the original sample
size with a calculator using the following formula:
\[ buffered = n \cdot (1 + pnr) \]
where:
\(buffered\) is the sample size with a buffer added for the estimated proportion of non-responders
\(n\) is the original sample size (with or without the finite population correction)
\(pnr\) is the estimated proportion of non-responders
Background:
So far, you have calculated the required sample size, assuming simple random sampling. However, the more complex sampling methods used in this study result in additional variability in the sample estimate and therefore require a larger sample size. For example, in this sampling design there is variability:
between clusters (school classes)
between pupils within the clusters
To account for the additional variability at the different stages of complex designs, the sample size and sample estimates can be adjusted by a factor known as the design effect (\(deff\)). This compares the variance (i.e. the square of the Standard Error (SE)) of estimates from the more complex design used, to the variance that would come from the same sample size if simple random sampling had been used.
For cluster sampling, the variance can be calculated with the following formula:
\[ SE^2 = \frac {\sum(p_i-p)^2} {m \cdot (m-1)} \cdot (1-m/M) \]
where:
\(SE^2\) is the variance (square of the Standard Error)
\(p_i\) is the proportion (e.g. vaccination coverage) in each cluster
\(p\) is the estimated proportion for the whole population (e.g. 85% vaccination coverage)
\(m\) is the number of clusters (PSU) selected in the sample (e.g. 342 school classes in this study)
\(M\) is the total number of clusters in the population (e.g. XX school classes in the whole country)
The design effect can then be calculated by:
\[ deff = \frac {SE^2 \text { from complex design}} {SE^2 \text { from simple random sampling}} \]
The sample size will increase by the amount of the design effect. For example, if the design effect is estimated as 1.5, this means that in order to obtain the same precision, 50% more individuals must be studied with the complex design than with the simple random sampling strategy.
When cluster sampling is used, subjects belonging to a particular cluster (e.g. household, community, school class) tend to have characteristics which make them more similar to each other than subjects belonging to other clusters. If the outcome of interest is more likely to occur within certain clusters than others, the estimates of prevalence will be less precise than if individuals were selected at random. For this reason, the sample size required to achieve a certain level of precision must be larger for cluster sampling than simple random sampling. As mentioned earlier, to allow for the additional variability, the design effect should be estimated. Apart from the previously presented formula, the design effect can also be calculated with the intra-cluster correlation coefficient, or \(rho\) :
\[ deff = 1 + (n - 1) \cdot rho \]
where:
\(n\) is the average number of subjects per cluster and
\(rho\) is the intra-class correlation coefficient or rate of homogeneity for the outcome of interest.
This is a measure of how different people are within a cluster (compared to people chosen randomly). This coefficient can range, in theory, from very small negative values or zero, when subjects within each cluster tend to be very diverse or representative of the sampling population (heterogeneity), to a maximum of one when subjects within each cluster are similar, but differ from cluster to cluster (homogeneity). In practice, the value of \(rho\) may be between \(0.05\) and \(0.4\) depending on the amount of variation between clusters and the variables of interest.
The value of \(rho\) (or equivalently of \(deff\)) can be estimated from previous surveys of similar design and subject matter. If the intra-class correlation coefficient is not available, plausible values must be estimated. Alternatively, to estimate \(rho\), you could perform a pilot study, sampling an equal number of people (\(n\)) from some clusters and then estimating the vaccination coverage and the design effect. Subsequently, you could rearrange the above formula to calculate \(rho\) instead of \(deff\).
The previous round of this vaccination coverage survey (conducted in 1997) estimated a \(rho\) of \(0.05\). The mean number of pupils in first grade school classes was \(20\).
Questions to answer:
Number of subjects required for complex design:
Using the formula to calculate design effect by comparing the variance for simple and complex sampling, you should get:
\[ \frac {0.0265^2} {0.0246^2} = \text {design effect of } 1.16 \]
This suggests that for the complex design you would need a 16% higher number of subjects compared to those needed for simple random sampling to achieve the same precision. This means that you would need:
\[ 200 \times 1.16 = 232 \text { subjects} \]
Role of selected cluster size on required sample size:
If the number of subjects in each cluster (\(n\)) is large, the product \((n-1) \cdot rho\) and the design effect will also be large, suggesting a bigger required sample size. To minimise the design effect, you should minimise the size of each cluster (\(n\)). Therefore, it follows from the formula that a large number of small clusters is better – provided that other things are equal - rather than a small number of large clusters.
Recalculating regional sample size taking into account the design effect:
Calculating the design effect with \(n=20\) and \(rho=0.05\), you should get:
\[ 1 + (20 - 1) \cdot 0.05 = \text {design effect of } 1.95 \]
Note that \(rho\) is more likely to be constant from one survey to another, rather than the design effect that depends on the cluster sample size.
You previously estimated that you would need 350 children from each region, if simple random sampling were used. A design effect of 1.95 suggests that you would need almost twice the number of children you would have recruited if you had used simple random sampling, e.g.:
\[ (350 \cdot 1.95) \cdot 6 \text { regions} = 4095 \text { children in total} \]
The number of clusters (school classes) required for a study can be estimated by dividing the total number of children required by the average cluster (school class) size. As the mean number of pupils in first grade school classes was 20, you should get:
\[ \frac {4095} {20} = 205 \text { clusters (school classes)} \]
Note that stratification usually leads to a small reduction in the standard error of the overall estimate \(p\) (and / or equivalently in the sample size), compared to the error that would have been obtained if the survey had not been stratified. However, the improvement in precision cannot be quantified adequately to allow its use in sample-size calculations. Therefore, in this study where stratification was used, the required sample size to achieve the same precision, may have been smaller compared to the one that would have been required if stratification had not been used. You may also notice that the overall sample size used in the study was slightly different from the one estimated here. This is because some minorities were oversampled to achieve more accurate estimates for these groups.
Task description for R code:
Use a rho of 0.05 and mean cluster (class) size of 20 to calculate the design effect;
Adjust the sample sizes calculated in task 2.1 to account for the design effect;
Calculate the total number of clusters required using the adjusted sample size.
R coding tips:
As for the calculation by hand / calculator, we will use the same formula to calculate the design effect in R. Note that R can be used as a calculator for simple mathematical operations; this allows these calculations to be incorporated into an R script and then applied to a column of values.
Using the table region_frame
that we created in the
previous task, we will then multiply each original sample size in the 6
regions by the calculated design effect. We will display the results in
an updated flextable
.
Next, we can add the adjusted sample sizes of all the regions together to obtain the total sample size needed for the study.
Finally, we can divide the total adjusted sample size (sum of number of students to survey in the 6 regions) by the average cluster (class) size (20).
Note that we will use the base R function round()
to
round up sample sizes to have no digits after the decimal point, since
sample sizes need to be expressed in whole numbers.
Table 2.3 - Regional sample size with adjustments for population, non-responders & design effect
# Calculate the design effect given n = 20 and rho = 0.05:
deff <- 1 + (20 - 1) * 0.05
# Add new rho-adjusted sample size calculation to the table:
region_frame <- region_frame %>%
# Calculate sample size with adjustment for the design effect:
mutate(ndeff = round(buffer * deff, digits = 0))
# Calculate the number of clusters (school classes) required:
nclusters <- round((sum(region_frame$ndeff)) / 20, digits = 0)
#################
# Print results:
################
# Print number of clusters required:
nclusters
## [1] 196
# Convert to flextable:
rfpub <- flextable::qflextable(region_frame) %>%
flextable::set_table_properties(width = 1, layout = "autofit")
# Print table:
rfpub
Region | Student_pop | fpc | buffer | ndeff |
---|---|---|---|---|
Thraki | 4,201 | 286 | 322 | 628 |
Macedonia-Thessalia | 31,045 | 304 | 342 | 667 |
Ipeiros-Ionian islands | 5,055 | 289 | 326 | 636 |
Peloponnisos-Sterea Ellada | 17,741 | 301 | 339 | 661 |
Attiki | 30,586 | 304 | 342 | 667 |
Kriti-Aegean islands | 12,218 | 299 | 337 | 657 |
Note that the total number of clusters (school classes) required in the study is slightly smaller when the sample sizes are calculated with the finite population of each region (\(n = 195\)), compared to sample size calculated without taking into account regional differences in population (\(n = 205\)).
Session overview
In this session, you will consider how the analysis plan leads to a data collection instrument.
The session consists of the following task:
To complete the work in this session you will need:
Background:
Following the definitions of the objectives, you should:
define the indicators, for each objective
define data to be collected, for each indicator
define data source for each piece of information to be collected
define the data collection method, for each piece of information to be collected
Methods used to collect data are divided into quantitative and qualitative methods. To select the most appropriate data collection method for your study, you should ask yourself the following four questions:
Will the chosen method help to meet the objectives of the study?
Is the method appropriate for my study design?
Is the method feasible and practical?
Do I have the funds to use this method?
Task instructions:
Based on the objectives of the study (presented in session 1) answer the following questions:
Which indicators would you use for each objective?
Based on these indicators, what type of data would you like to collect?
Choose the appropriate method to collect these data.
Create a table for your answers with the following columns:
a. Objective
b. Indicator
c. Data required
d. Data source(s)
e. Data collection method(s)
Data collection plan:
The study aimed to collect information on:
Objective 1 - vaccination status of children, nationally, by geographic area and by minority group
Objective 2 - demographic characteristics of children and parents/guardians
Objective 3 - parental perceptions on immunisation and attitudes towards vaccination
Objective 4 - barriers to vaccination as perceived by parents/guardians
Indicators, possible data sources and data collection methods to achieve each objective are listed in the tables below.
Parameter | Details |
---|---|
Indicator |
|
Data type(s) |
|
Data source(s) |
|
Data collection |
|
Parameter | Details |
---|---|
Indicator | Descriptive:
Analytic:
|
Data type(s) |
|
Data source(s) |
|
Data collection |
|
Parameter | Details |
---|---|
Indicator | Descriptive:
Analytic:
|
Data type(s) |
|
Data source(s) |
|
Data collection |
|
Parameter | Details |
---|---|
Indicator | Descriptive:
Analytic:
|
Data type(s) |
|
Data source(s) |
|
Data collection |
|
The researchers used three sources to collect this information:
School registry to obtain basic information about the school, including a list of all selected school classes and pupils and basic demographic characteristics of the selected children. An extraction form is provided in Annex 1.
Child vaccination booklet to gather information regarding vaccination status of children. In Greece, each child has a paper health booklet or vaccination card. The booklets/vaccination cards were photocopied and information was abstracted in a Vaccination Abstraction Form (VAF, Annex 3) that was based on the National Vaccination Programme (NVP) of Greece (Box 1).
Questionnaire regarding beliefs and attitudes of parents/guardians towards immunization, perceived barriers to vaccination and parental socioeconomic characteristics. Qualitative methods could have been used to gather information on perception on immunization.
The self-administered questionnaire to gather information on parental perceptions of parents/guardians towards immunization of their children (Annex 2) addressed the following issues:
Beliefs on perceived benefits or harms of vaccination
Perceived safety of immunizations
Accuracy of recommendations provided by physicians
Perceived financial, structural or health care service barriers to vaccination
Sources of information that parents/guardians trust on immunization
Demographic and social characteristics of parents
Health status of child
Identification number of respondent
The survey team consisted of 234 health professionals who received special training for the fieldwork and worked all over the country. A manual of field operations contained a detailed plan of organisational aspects and described the tasks and procedures of the study. The investigators visited the selected schools twice. During the first visit, members of the survey team handed an explanatory letter to the school headmaster providing detailed information regarding the study and asking for their cooperation. Subsequently, they handed a package for the parents/guardians containing:
a letter explaining the study and asking them to provide their child’s vaccination booklet
an anonymous self-administered questionnaire asking their attitudes towards immunisation
During the second visit, the investigators received the completed questionnaires and photocopied the child’s vaccination booklets.
The check-list of actions for fieldworkers is provided for information in Annex 4. The investigators then entered the data onto computers, cleaned the data and prepared for the analysis.
Session overview
You now have a clean data set and you are ready to analyse the data. In this session, you will:
check if the sample is representative of the target population
calculate the response rate and measure the potential differences between respondents and non-respondents
estimate frequencies (or means, as appropriate) of selected covariates
calculate sampling weights for each stratum
estimate vaccine coverage nationally and by geographic region using the weighted proportion of sampled children vaccinated , including coveragefor different vaccines
This session consists of the following tasks:
Task 4.1 - Check if the sample is representative of the target population
Task 4.2 - Response and differences between respondents and non-respondents
Task 4.3 - Calculate sampling weights
Task 4.4 - Calculate weighted proportions
Task 4.5 - Estimate vaccination coverage
To complete the work in this session, you will need:
The database vaccine4.dta
, which is derived from
two other data sets (vaccine.dta
and
school.dta
):
school.dta
contains basic information from the
school registry, including a list of all selected classrooms and pupils
and basic demographic characteristics of the children
vaccine.dta
contain the vaccination status of
children from the vaccination booklet.
A description of the variables of the two datasets is provided in
Annex 5 and 6. You will find the vaccine4.dta
data set in
the data
folder for this case study (note: although
vaccine4.dta
is a STATA data set, it can be opened in R
with the rio
package - see R coding tips below).
Background:
The first step is to check whether the sampled population is representative of the target or source population. If this is true, you can extrapolate the study results to the population from which the sample was drawn.
Questions to answer:
Representativeness of the sampled population:
You could compare some available characteristics of the target population with those of the sampled population. For example, in this study, the age profile of participant children (mean age 6.76 years) closely resembled that of the total first year Grammar school pupils in Greece (mean age 6.68; \(p = 0.16\)). In addition, participant children did not differ significantly from all first year Grammar school children in terms of gender distribution (\(p = 0.39\)) and urban-to-rural proportion (\(p = 0.86\)). These suggest that the sample was representative of the source population, at least in terms of these characteristics (age, gender and urban-rural place of residence). Representativeness in terms of age and sex (and other characteristics of relevance) are suggestive that the outcome in the sample may represent the outcome in the total population; however it is impossible for us to know without sampling the whole population.
Background:
The investigators compared the characteristics of respondents and non-respondents and compared them in a table. In the data set, respondents are those for whom a vaccine book was received; non-respondents are children for whom a vaccine book was not received.
Task description:
The table below lists the variables required for this task, their names and encoding:
No. | Variable label | Variable name | Factor level encoding |
---|---|---|---|
1. | Receipt of vaccine book ( response) | vaccrec | No / no n -response = 0 Yes / response = 1 |
2. | Age of child | age | In years (with decimals) |
3. | Gender | gender | Female = 0 Male = 1 |
4. | Urban or rural area | urban | Rural = 0 Urban = 1 |
5. | Minority group status | minority | General p opulation = 0 Roma = 1 Greek Muslims = 2 I mmigrants = 3 |
6. | Country of birth | country1 | Other country = 0 Greece = 1 |
R coding tips:
The vaccine4.dta
data only contains data for sampled
students, but we can check for any differences between those that did
(respondents) and didn’t (non-respondents) have a vaccine book (note
that subjects who did not have a vaccine book were excluded from
subsequent analyses). This is encoded in the vaccrec
variable.
We will first import the data with the rio::import()
function. This very flexible function can be used to import many file
types, including .xlsx, .csv and (in this case) .dta (the data file was
exported from STATA after some initial cleaning steps).
Next, you may wish to explore the data. Using the
dplyr::describe()
command as well as the information in the
data dictionary (annex 5 and 6) will give you an overview of the data
set.
To ensure that the variables of interest are summarised correctly, we will convert them from the binary numeric format output from STATA to categorical labelled factors:
#################################################
# IMPORT DATA, LABEL VARIABLES AND FACTOR LEVELS
#################################################
# Import the vaccine coverage study data for session 4:
vaccine <- rio::import(here::here("data", "vaccine4.dta")) %>%
# Convert vaccrec to a factor and add nice display labels:
mutate(vaccrec = factor(vaccrec,
levels = c(0, 1),
labels = c("Non-respondents", "Respondents"))) %>%
# Convert gender to a labelled factor:
mutate(gender = factor(gender,
levels = c(0, 1),
labels = c("Female", "Male"))) %>%
# Convert residence area type to a labelled factor:
mutate(urban = factor(urban,
levels = c(0, 1),
labels = c("Rural", "Urban"))) %>%
# Convert minority to a factor and add nice display labels:
mutate(minority = factor(minority,
levels = c(0, 1, 2, 3),
labels = c("General population",
"Roma",
"Greek Muslims",
"Immigrants"))) %>%
# Convert country1 (whether from Greece or not) to a labelled factor:
mutate(country1 = factor(country1,
levels = c(1, 0),
labels = c("Greece", "Other country"))) %>%
# Label variables with pretty names for the table:
## variable name = variable label
labelled::set_variable_labels(vaccrec = "Response status",
age = "Age in years",
gender = "Gender of pupils",
urban = "Area type",
minority = "Minority group",
country1 = "Country of birth")
Now that our data is in the correct format, we can complete table 4.2
with summary statistics. To do this we will use the
tbl_summary()
function from the gtsummary
package. This is a very useful package that facilitates creation of
nicely formatted and publishable summary tables, where you can specify
what methods you want to use to summarize the data. You can also choose
to apply different tests or summary methods to categorical and numeric
variables, with the tidy::select()
commands
all_categorical()
and all_continuous()
,
respectively.
To create the summary table, we stratify by the vaccrec
variable (which indicates if a vaccine book was received or not and is a
proxy for survey response). We will include the following summary
statistics in the table:
t.test()
)chisq.test()
)You can use the results of these tests to determine if the respondents are sufficiently representative of the whole population in the sampling frame, or if there is a bias in the non-responder group that needs to be taken into account. This is quite complex code so we break it in to three parts. First we use gtsummary to create the statistical table, then we use gtsummary to format the table and finally we create a publication-ready table with flextable.
The first step (creating the statistical table) is the most complex -
but this is an important first step to understand as it will allow you
to use a large variety of statistical tests on your data in the future!
The gtsummary::tbl_summary()
function takes several
arguments within lists()
, while this seems like a scary
concept, all that it does is allows you to define several different
arguments at the same time.
#################################################
# CREATE SUMMARY TABLE BY RESPONSE STATE
#################################################
# Step 1: create the statistical table with p-values
# Summarise the demographics of the two groups (+/- vaccine book) in a table:
reptable <- vaccine %>%
# Select subset of demographic variables to include in the summary table:
dplyr::select(vaccrec, age, gender, urban, minority, country1) %>%
# Create summary table with gtsummary:
gtsummary::tbl_summary(
# Stratify by group (no vaccine book = no response, has book = response):
by = vaccrec,
# Add summary stats (mean + SD for continuous, proportions for categorical):
statistic = list(all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"),
# Exclude NA (missing values) from the table:
missing = "no") %>%
# Add tests of statistical significance of differences between groups:
add_p(test = list(all_continuous() ~ "t.test", # Continuous variables
all_categorical() ~ "chisq.test"), # Categorical variables
# Identify grouping variable:
group = vaccrec,
# Define any test arguments that deviate from the default:
test.args = list(all_tests("t.test") ~ list(var.equal = FALSE),
all_tests("chisq.test") ~ list(simulate.p.value = TRUE)))
The second step (formatting the table) allows you to use
gtsummary
functions to choose how the table looks and add
in some extras, such as a total count column, labels (in bold and
italicised) and column headers.
# Step 2: format the statistical table
# Use the table created above
reptable <- reptable %>%
# Add total numbers to column headers:
add_overall() %>%
# Add descriptors of the stats presented to row labels:
add_stat_label() %>%
# Individually label which row had which statistical test:
separate_p_footnotes() %>%
# Make variable names bold and italics:
bold_labels() %>%
italicize_labels() %>%
# Tidy up column headers so stats are on second line:
modify_header(update = list(all_stat_cols(FALSE) ~ "**{level}**\nN = {n}",
stat_0 ~ "**Overall**\nN = {N}"))
Finally we can use flextable
as done previously to
create a publication-ready table.
# Use the table created above
reptable <- reptable %>%
# Convert to flextable to left-align to page margin:
as_flex_table() %>%
# Autofit columns to fit headers and content on one line:
flextable::set_table_properties(width = 1, layout = "autofit")
#################
# Print results
################
reptable
Characteristic | Overall | Non-respondents | Respondents | p-value |
---|---|---|---|---|
Age in years, Mean (SD) | 6.57 (0.53) | 6.65 (0.68) | 6.56 (0.51) | 0.0131 |
Gender of pupils, n (%) | 0.152 | |||
Female | 2,180 (50%) | 261 (53%) | 1,919 (50%) | |
Male | 2,180 (50%) | 231 (47%) | 1,949 (50%) | |
Area type, n (%) | <0.0012 | |||
Rural | 974 (22%) | 76 (15%) | 898 (23%) | |
Urban | 3,413 (78%) | 433 (85%) | 2,980 (77%) | |
Minority group, n (%) | <0.0012 | |||
General population | 3,544 (82%) | 323 (74%) | 3,221 (83%) | |
Roma | 79 (1.8%) | 27 (6.2%) | 52 (1.3%) | |
Greek Muslims | 328 (7.6%) | 38 (8.7%) | 290 (7.5%) | |
Immigrants | 352 (8.2%) | 49 (11%) | 303 (7.8%) | |
Country of birth, n (%) | 0.22 | |||
Greece | 4,023 (94%) | 379 (93%) | 3,644 (94%) | |
Other country | 245 (5.7%) | 29 (7.1%) | 216 (5.6%) | |
1Welch Two Sample t-test | ||||
2Pearson's Chi-squared test with simulated p-value |
Summary of respondent vs. non-respondent characteristics:
In this study, respondent and non-respondent children did not differ in their sex distribution. But they differed significantly in terms of e.g. minority group (\(p<0.001\)), as immigrants or Roma were less likely to participate in the study. However, due to the high response (88%) the corresponding distributions between participant and all sampled children were similar.
Sources of potential bias:
Selection bias may be introduced if non-respondents differ significantly in terms of some characteristics compared with respondents and these characteristics are associated with vaccination coverage; e.g. if the age distribution of respondents is statistically significant different from that of non-respondents. Children without a vaccination booklet (non-respondents) may be less likely to have been vaccinated (e.g. their parents may be less likely to have complied with vaccination recommendations) and therefore the vaccination coverage of respondents may have been overestimated.
Definition of main outcomes:
The investigators used the following definitions:
A. Complete vaccination status (vacful
):
Children were considered fully vaccinated if they had received all of the following vaccinations:
5 doses of poliomyelitis vaccine
5 doses of DTP vaccine
2 doses of MMR vaccine
3 doses of hepatitis B (HBV) vaccine and
full vaccination for Haemophilus Influenza type b (Hib) ^[full Hib immunisation was considered as having received one of the following:]
B. Age-appropriate vaccination (vactime
)
Children were considered age-appropriately vaccinated if they were both completely immunised and if they had received their vaccinations at an appropriate age according to the national vaccination schedule^[i.e. if they had received all of the following, according to the NVP:].
As mentioned earlier, the following sampling strategy was chosen: The
country was stratified into six regions, which, apart from the capital
region (Attiki), were further subdivided into urban and rural areas, to
create 11 strata in total (strata
). In each region school
classes (clusters = school
) were randomly selected with
probability proportional to the total number of first year grammar
school pupils. The sampling frame of all school classrooms in the
country for the academic year 2004-2005, was obtained from the Greek
Ministry of Education. The list contained 108 538 pupils from both
public and private schools.
Task description:
Based on the above information, did all children have the same probability of being selected regardless of where they lived?
Which formula would you use to calculate the probability of selecting each child?
Taking into account the selected sampling scheme, how would you estimate the vaccination coverage of children in the country; would you calculate simple proportions, i.e.:
\[ \text {vaccination coverage} = \frac {\text {number of children vaccinated}}{\text {total number of children}} \]
If not, why not?
How would you calculate the sampling weights in this study?
Calculating and equalising the probability of selection:
As you saw in session 2, the sampling fractions used for each region were unequal (e.g. approximately 20% of children were sampled in Thraki, but only 3% in Attiki). This suggests that the probability of being selected was higher for children living in a region where a higher sampling fraction was used (e.g. Thraki) compared to regions with a lower sampling fraction (e.g. Attiki).
Note that according to the multiplicative law of probabilities, the probability of two independent events is given by the product of their individual probabilities. Two events are independent when the value taken by the first event tells us nothing about the value taken by the other, and vice versa. The probability (\(\pi\)) of selecting a pupil, is the product of the probability of selecting a cluster (school class) (\(\pi \text { cluster}\)) with the probability of selecting an individual when his class (cluster) has been selected (\(\pi \text { pupil in cluster}\)).
\[ \pi = \pi \text{ cluster} \cdot \pi \text{ pupil in cluster} \]
Since all pupils were included within a selected cluster, \(\pi \text{ pupil in cluster} = 1\). So, the probability of an individual pupil being selected is equal to the probability of a cluster (school class) being selected:
\[ \pi \text{ cluster} = \frac {n \text{ cluster}} {\sum (n \text{ cluster})} \]
where:
\(n \text { cluster}\) is the number of clusters included in the sample and;
\(\sum(n \text{ cluster})\) is the total number of clusters in the population.
Utility of simple proportions to estimate vaccine coverage:
The simple proportions in the sample have no real use, as the distribution in the sample does not reflect the distribution in the sampled population. To allow for different probabilities of selection of children or different levels of non-response, you need to weight the observations. An estimate for each stratum may be calculated by treating each stratum as a separate survey. A stratified estimate for the whole country may then be calculated by weighting the stratum estimates by the stratum populations. In other words, you need to use sampling weights for each stratum. The idea behind sampling weights is that they are inversely proportional to the sampling fractions. So, if you have under-sampled a particular stratum, you can give it a weight that is higher than average; and conversely if you have over-sampled a particular stratum, you give it a weight that is lower than average.
Sample weighting strategies:
There are several sampling weights that could be used in this study, such as:
the proportion of the 6 year-old population in each stratum divided by the total population of 6-year olds in the country (derived from the latest census).
the proportion of the total number of school classes in each stratum divided by the total number of school-classes in the country.
the proportion of first year Grammar school classes (6-year old pupils) in each stratum divided by the total first year Grammar school classes in the country
The researchers calculated sample weights for each stratum, according to the number of 6-year old pupils in each stratum, which was derived from the sampling frame. They preferred the weights based on number of children, because this allowed for different class sizes in different strata.
Calculating sampling weights:
Table 4.3
,
calculate the sampling fraction and sampling weights for each of the 11
strata and add them to the table.R coding tips:
Table 4.2 can be imported into R from the table4_2
worksheet in the Sampling_frames.xlsx
workbook. This table
contains the names, IDs and urban / rural status of the 11 strata, along
with the total student population from the sampling frame and the number
of students selected in each stratum.
The sampling fraction and weights can easily be calculated and added
in two new columns to the imported table with dplyr
using
the formulas below.
The sampling fraction is given by:
\[ Sample fraction = \frac{n}{N} \]
The sampling weights are the inverse of the sample fraction:
\[Weight = \frac{1}{(\frac{n}{N})}\]
###########################################
# CALCULATE SAMPLE FRACTIONS AND WEIGHTS
###########################################
# Import the summary information on the 11 strata (table 4.2):
strata_frame <- rio::import(here::here("data", "Sampling_frames.xlsx"),
which = "table4_2") %>%
# Add a new column and calculate the sample fractions for each stratum:
mutate(Sample_fraction = Selected_pupils / Total_pupils) %>%
# Add a new column and calculate the sample weights for each stratum:
mutate(Sample_weights = 1 / Sample_fraction)
Now we can print the calculated sample fractions and weights in a
flextable
:
Table 4.2 - Per-stratum sampling fractions and sample weights
# Show summary in a flextable:
tab42pub <- strata_frame %>%
# Convert to flextable:
flextable::qflextable() %>%
# Autofit column widths to text:
flextable::set_table_properties(width = 1, layout = "autofit")
# Print the table:
tab42pub
Stratum_id | Stratum_name | Area_type | Total_pupils | Selected_pupils | Sample_fraction | Sample_weights |
---|---|---|---|---|---|---|
11 | Ipiros | urban | 3,282 | 435 | 0.13254113 | 7.544828 |
12 | Ipiros | rural | 1,773 | 247 | 0.13931190 | 7.178138 |
21 | Aegean | urban | 8,609 | 477 | 0.05540713 | 18.048218 |
22 | Aegean | rural | 3,609 | 188 | 0.05209199 | 19.196809 |
31 | Macedonia | urban | 24,311 | 572 | 0.02352844 | 42.501748 |
32 | Macedonia | rural | 6,734 | 109 | 0.01618652 | 61.779817 |
41 | Attica | urban | 30,345 | 1,029 | 0.03391003 | 29.489796 |
51 | Sterea | urban | 12,928 | 512 | 0.03960396 | 25.250000 |
52 | Sterea | rural | 4,813 | 158 | 0.03282776 | 30.462025 |
61 | Thrace | urban | 3,069 | 388 | 0.12642555 | 7.909794 |
62 | Thrace | rural | 1,132 | 272 | 0.24028269 | 4.161765 |
Finally, we can add the calculated sample weights to the
vaccine
data set with a dplyr::left_join()
.
This function allows you to merge to datasets, in the case of
left_join()
it keeps all the rows from the first dataset
provided and tries to find matching rows to combine from the second one
provided. Read more about joins in the Epidemiologist R Handbook chapter
on joins.
# Add calculated fraction and weights to vaccine data set:
vaccine <- vaccine %>%
# Join the fraction and weights from strata_frame to the vaccine data set:
left_join(y = dplyr::select(.data = strata_frame, # Identify second data set
# Identify columns to add:
Stratum_id, # Column to match on in 2nd data
Sample_fraction, # Sample fractions to add to data
Sample_weights), # Sample weights to add to data
# Indicate ID columns to join by:
by = c("strata" = "Stratum_id"))
Background:
You will now calculate weighted vaccination coverage as proportions.
You can think of a weighted proportion as:
a weighted average of each observation of 0 and 1, where the weights for each stratum are equal to \(\frac {N_i}{n_i}\) (i.e. inversely proportional to the sampling fraction of the ith stratum), where \(N_i\) is the total population of the stratum \(i\) and \(n_i\) is the sample size of stratum \(i\); or
as a weighted average of the stratum specific proportions, where the weights are equal to \(N_i\) (i.e. proportional to the stratum population sizes).
The formula for any weighted average of y, using weight w, is:
\[ \text {weighted average} = \frac {\sum(w_i \cdot y_i)}{\sum(w_i)} \]
where:
\(w_i\) is the weight to apply and
\(y_i\) is the estimate to weight
So, there are two formulas that can be used:
Formula 4.1: \(\frac {\sum(\frac {N_i}{n_i} \cdot x_i)}{\sum(\frac {N_i}{n_i})}\) where the \(\sum\) is over all observations and \(x_i = 0\) or \(x_i = 1\), or;
Formula 4.2: \(\frac {\sum(N_i \cdot p_i)}{\sum(N_i)}\) where \(p_i\) is the stratum-specific proportion
The researchers aimed to obtain separate estimates for the urban and rural sectors of the population in each region (strata) as well as of the whole country.
Task description:
First, restrict the vaccine
data set to respondents
only (children for whom a vaccination booklet was available, i.e.
vaccrec == "Respondents"
).
Next, calculate the vaccination coverage of MMR-2 (variable name
mmr2yn
) for each stratum and add the results in a new
column to Table 4.2
(from the previous task).
Suppose there are only three strata (strata 11, 12 and 21) in the
country. Based on the information provided in Table 4.2
,
calculate the weighted proportion of children vaccinated with MMR-2 in
the country (three strata only). Use a handheld calculator or code in
R.
Finally, calculate the weighted proportion of children vaccinated with MMR-2 in the whole country, using all 11 strata.
R coding tips:
A. Preparing the data set:
First we need to convert the vaccrec
and
strata
variables to character class, as this is what the
survey
package expects.
We will also filter on the vaccrec
variable using
dplyr::filter()
and selecting only those records where
vaccrec == "Respondents"
.
###########################################
# PREPARE AND FILTER THE DATA SET
###########################################
# Create new table of weighted MMR-2 vaccine coverage by strata:
s44data <- vaccine %>%
# Convert vaccrec and strata to character:
mutate(vaccrec = as.character(vaccrec),
strata = as.character(strata)) %>%
# Filter for respondents only:
dplyr::filter(vaccrec == "Respondents")
B. Create the survey design:
You will need to set up a stratified survey design, which can be
achieved with the srvyr
package
as_survey_design()
function. This takes the following
arguments:
ids - this is for the variable containing cluster IDs - in this case we will not use it and set to 11
weights - enter the name of the
Sample_weights
variable here
strata - enter the name of the variable containing the
strata
here
# Create the survey design:
s44design <- s44data %>%
# Create survey design:
srvyr::as_survey_design(ids = 1,
weights = Sample_weights,
strata = strata)
C. Calculate weighted proportions (vaccine coverage estimates):
You can then apply this survey design object to the
tab_survey()
function from the sitrep
package,
which will calculate the weighted proportions, as shown below. Note that
the sitrep
package is not yet available on CRAN, but it can
be installed from Github (see package install section). For other ways
to calculate weighted proportions, have a look at the Survey
chapter in the Epidemiologist R handbook.
The sitrep::tab_survey()
function takes the following
arguments:
The data set with the survey design applied (can be piped in)
The variable to calculate proportions on
strata - the variable containing the strata
keep - which of the values or factor levels of the
variable of interest to calculate proportions for (for vaccination
coverage, only vaccinated individuals need to be kept,
e.g. mmr2yn == 1
)
method - which method to use to calculate the weighted
proportions. Note that the default method is logit
, but we
have selected the xlogit
method as this is closer to the
approach taken in STATA.
wide - whether to present the results in a wide or long
format table (long format is better for multiple variables; to set to
long format use wide = FALSE
).
deff - whether to include the design effect in the
results table (TRUE
) or not (FALSE
).
pretty - whether to convert results to percentages
(TRUE
) or not (FALSE
). We have selected
FALSE
because this is not the final results table and we
want to do some further formatting on the numbers.
A note on statistical methods:
The sitrep::tab_survey()
function uses methods to
calculate the proportions, confidence intervals and design effect from
the survey::svyciprop()
function. If you type
?survey::svyciprop
in the console to look at the help page
for this function, you will find further details on the different
methods available. The method we have chosen, xlogit
computes confidence intervals for the calculated proportions in the
following way:
Estimate the mean and standard error
Apply a logit transform to the mean
Use the delta method to transform the standard error estimate
Compute a confidence interval for the logit mean
Back-transform to the probability scale
Further discussion on this method and its origins is available in this StackExchange post.
# Calculate weighted vaccine coverage for MMR-2:
mmr2cov <- s44design %>%
# Create weighted proportions:
sitrep::tab_survey(mmr2yn, # The variable to calculate proportions on
strata = strata, # The variable containing the strata
keep = c(1), # The value of interest (mmr2 vaccinated)
method = "xlogit",# Log transformation of the mean
wide = FALSE, # Long format better for a lot of strata
deff = TRUE, # Include the design effect in the output
pretty = FALSE) %>% # Do not convert to percent (yet)
# Convert strata back to numeric for merging:
mutate(strata = strtoi(strata))
D. Calculate the weighted population fraction:
Next, we will merge the results (weighted proportions) to the
strata_frame
table we were working with earlier, using a
dplyr::left_join()
as before.
Then we can calculate the weighted population fraction as follows:
\[ \text {Weighted population fraction} = N_i \cdot p_i \]
where:
\(N_i\) is the total number of pupils in the study population per stratum and
\(p_i\) is the calculated weighted proportion of MMR-2 vaccinated children from the study.
We will store the weighted population fraction in a new variable
called pop_frac
.
###########################################
# CALCULATE TOTAL WEIGHTED VACCINE COVERAGE
###########################################
# Add the results to the strata_frame table:
strata_frame <- strata_frame %>%
# Add MMR-2 weighted vaccine coverage to strata_frame:
dplyr::left_join(y = dplyr::select(.data = mmr2cov,
strata,
proportion,
proportion_low,
proportion_upp),
by = c("Stratum_id" = "strata")) %>%
# Calculate weighted fraction of population:
mutate(pop_frac = Total_pupils * proportion)
E. Calculate total weighted proportions for the country:
You have been asked to first calculate the total weighted vaccination coverage estimate for the first three strata only, and then for all 11 strata. In both cases, this can be achieved by:
\[ \text {Total weighted VCE} = \frac {\sum(wp_i \cdot N_i)}{\sum(N_i)} \]
where:
\(wp_i\) are the weighted proportions for each stratum and
\(N_i\) is the population for each stratum
We can use the dplyr::slice()
function to extract the
first three rows of data to summarise (corresponding to the first three
strata).
Then we can use dplyr::summarise()
to calculate total
weighted VCE using the above formula.
Finally we can use dplyr::pull()
to pull the number out
of a data frame and then print the result.
Task 4.4 - total weighted vaccination coverage for three strata
################################################
# CALCULATE TOTAL WEIGHTED VACCINATION COVERAGE
################################################
# Calculate total VCE for the first 3 strata:
vc_total_03 <- strata_frame %>%
# Keep the first three rows:
slice(1:3) %>%
# Calculate the percentage:
summarise(round((sum(pop_frac) / sum(Total_pupils)) * 100, digits = 2)) %>%
# Get the number out of the data.frame:
pull()
# Print the results:
vc_total_03
## [1] 73.21
Task 4.4 - total weighted vaccination coverage for eleven strata
# Calculate total VCE for all 11 strata:
vc_total_11 <- strata_frame %>%
# Calculate the percentage:
summarise(round(sum(pop_frac) / sum(Total_pupils) * 100, digits = 2)) %>%
# Get the number out of the dataframe
pull()
# Print the results:
vc_total_11
## [1] 75.98
F. Present per-stratum weighted proportions in a table:
Finally, we will tidy up the format of the results (round up to two
decimal places etc. for easier reading) and publish the results to a
flextable
. We will not go into details about
flextable
formatting syntax here, but you can read more
about this in the Epidemiologist R handbook.
Table 4.4 - Weighted vaccination coverage estimates by region and area classification, Greece
###########################################
# DISPLAY STRATIFIED RESULTS IN A TABLE
###########################################
# format the contents of the table
mmr2pubtable <- strata_frame %>%
# Convert vaccine coverage estimates and 95% CIs to percent:
mutate(across(
.cols = tidyr::starts_with("proportion"),
.fns = ~ scales::percent(.x, accuracy = 0.01, suffix = NULL))) %>%
# Combine 95% CIs in one column:
tidyr::unite(col = "mmr2_95ci",
tidyr::starts_with("proportion_"),
sep = " - ",
remove = TRUE) %>%
# Select columns for the final table:
dplyr::select(1:7, proportion, mmr2_95ci) %>%
# Reduce Sample fraction to two decimal places for display:
mutate(across(
.cols = tidyr::starts_with("Sample_"),
.fns = ~ round(.x, digits = 2)))
We can then use the above tablea and print it using
flextable
.
# Create a pretty flextable of the results:
mmr2pubtable <- mmr2pubtable %>%
# Convert to a flextable:
flextable::flextable() %>%
# Add nice theme:
flextable::theme_booktabs(bold_header = TRUE) %>%
# Add labels:
flextable::set_header_labels(values = list(Stratum_id = "Stratum",
Stratum_name = "Area",
Area_type = "Type",
Total_pupils = "Population",
Selected_pupils = "Surveyed",
Sample_fraction = "Fraction",
Sample_weights = "Weights",
proportion = "MMR-2 VCE",
mmr2_95ci = "95% CI")) %>%
# Left-align first column (stratum ID):
flextable::align(j = 1, align = "left", part = "all") %>%
# Right-align last 2 columns (weighted VCE and 95% CI):
flextable::align(j = 8:9, align = "right", part = "all") %>%
# Adjust column widths to fit text:
flextable::set_table_properties(width = 1, layout = "autofit")
#################
# Print results
################
mmr2pubtable
Stratum | Area | Type | Population | Surveyed | Fraction | Weights | MMR-2 VCE | 95% CI |
---|---|---|---|---|---|---|---|---|
11 | Ipiros | urban | 3,282 | 435 | 0.13 | 7.54 | 67.97 | 63.11 - 72.47 |
12 | Ipiros | rural | 1,773 | 247 | 0.14 | 7.18 | 63.88 | 57.37 - 69.91 |
21 | Aegean | urban | 8,609 | 477 | 0.06 | 18.05 | 77.14 | 72.93 - 80.86 |
22 | Aegean | rural | 3,609 | 188 | 0.05 | 19.20 | 72.09 | 64.85 - 78.34 |
31 | Macedonia | urban | 24,311 | 572 | 0.02 | 42.50 | 78.24 | 74.52 - 81.55 |
32 | Macedonia | rural | 6,734 | 109 | 0.02 | 61.78 | 83.96 | 75.55 - 89.87 |
41 | Attica | urban | 30,345 | 1,029 | 0.03 | 29.49 | 76.67 | 73.68 - 79.41 |
51 | Sterea | urban | 12,928 | 512 | 0.04 | 25.25 | 72.14 | 67.86 - 76.05 |
52 | Sterea | rural | 4,813 | 158 | 0.03 | 30.46 | 66.21 | 58.04 - 73.51 |
61 | Thrace | urban | 3,069 | 388 | 0.13 | 7.91 | 82.57 | 78.04 - 86.33 |
62 | Thrace | rural | 1,132 | 272 | 0.24 | 4.16 | 75.00 | 69.19 - 80.03 |
Background:
You will now estimate the vaccination coverage of children for
different vaccines and in different settings, using
srvyr::as_survey_design()
to define the survey design, and
sitrep::tab_survey()
to estimate vaccination coverage in
R.
You will create and present your results in two tables, with the following objectives:
Table 4.5.1 - explore the impact of different survey designs on vaccine coverage estimates;
Table 4.5.2 - estimate vaccine coverage for different sub-populations in the study
Task description:
A - Table 4.5.1:
Calculate the proportion of children that were fully vaccinated
(using the variable vacful
) and corresponding 95%
confidence intervals with the following survey designs:
as if simple random sampling were used;
allowing for weights by strata
;
allowing for weights by strata
and clustering by
school
class;
allowing for weights, clustering and stratification (separate result for each stratum).
You will also need to calculate:
the design effect (for each estimate of vaccination coverage)
the intra-class correlation coefficient or \(rho\) (for the whole study).
B - Table 4.5.2:
Estimate vaccination coverage for all the vaccines in the data set separately (including the total sub-population \(N\), 95% CI and the design effect) for the following sub-groups:
overall population
different minority groups (minority
)
urban and rural areas (urban
)
Then perform the same calculations for these sub-groups, this time looking at:
complete vaccination (vacful
)
timeliness of vaccination (vactime
)
Combine your results in a table and discuss any differences you observe between sub-population groups, coverage for different vaccines, or other points of interest.
R coding tips:
We recommend you begin with the vaccine
data set that
you already imported and modified during the preceding tasks in session
4.
You will be using the following key R coding strategies to create the tables, most of which you have already encountered during previous tasks:
dplyr
package for all data manipulation tasks and
for piping %>%
lines of code together;
srvyr::as_survey_design()
to define the survey
design;
sitrep::tab_survey()
to estimate vaccination
coverage;
flextable::qflextable()
and other
flextable
functions to create and format results
tables;
purrr::map()
to iterate through each vaccine and
estimate vaccination coverage for each one.
The map()
functions from the purrr
package
(which is part of the tidyverse
) simplify the process of
looping through multiple variables and performing the same operations on
each of them (like a for loop).
Remember that the formula for calculating the design effect \(deff\) was presented in task 2.2. You can
use the same formula to solve for \(rho\) (the intra-cluster correlation
coefficient) by extracting the design effect from the output of the
sitrep::tab_survey()
command that you will be using to
estimate vaccine coverage.
Code for table 4.5.1:
A - simple random sampling:
In the first code chunk, we will set the survey design for simple
random sampling. Note that there is no design effect, because there are
no strata, clusters or weights in this design. We can represent this
with 1. We create a survey design object using
srvyr::as_survey_design()
, which is nothing other than a
dataframe with some extra information about clustering, weighting and
strata.
#####################################
# 01. SIMPLE RANDOM SAMPLING
#####################################
## Define a survey design object to be used for creating tables
vaccine_srs <- vaccine %>%
# Convert vaccrec to character:
mutate(vaccrec = as.character(vaccrec)) %>%
# Filter for respondents only:
dplyr::filter(vaccrec == "Respondents") %>%
# Create survey design for SRS:
srvyr::as_survey_design(ids = 1, # No cluster variable
weights = NULL, # No weights
strata = NULL) # No strata
We can then use our survey design object to get a proportions table
using sitrep::tab_survey()`. We will later combine the
below table to results of other survey design tables for comparison.
# Use this first result as the base row of table 4.5.1
# Results for the other survey designs can be appended to the same table
# Create new table of full vaccine coverage by (a) simple random sampling:
table451a <- vaccine_srs %>%
# Create proportions for fully vaccinated:
sitrep::tab_survey(vacful, # The variable to calculate proportions on
strata = NULL, # Simple random sampling has no strata
keep = c(1), # The value of interest (fully vaccinated)
method = "xlogit",# Use CDC SUDAAN method
wide = FALSE, # Long format better for tabulating results
deff = TRUE, # Include the design effect in the output
pretty = FALSE) %>% # Do not merge VCE and 95% CI yet
# Add a design column to describe each survey design:
mutate(design = "Simple random sampling") %>%
# Update the design effect where NA to 1 (representing no effect):
mutate(deff = replace_na(deff, 1))
B - Sampling with weights:
We can compute vaccine coverage with weights by repeating the steps
in the previous chunk, but this time including the
Sample_weights
variable in the survey design. The results
can then be appended to the previous table. Here too we first define a
survey object and then feed it to sitrep::tab_survey()
.
#####################################
# 02. SAMPLING WITH WEIGHTS
#####################################
## Define a survey design object to be used for creating tables
vaccine_weighted <- vaccine %>%
# Convert vaccrec to character:
mutate(vaccrec = as.character(vaccrec)) %>%
# Filter for respondents only:
dplyr::filter(vaccrec == "Respondents") %>%
# Create survey design for sampling with weights:
srvyr::as_survey_design(ids = 1, # No clustering
weights = Sample_weights, # This time include weights
strata = NULL) # No strata
As above, we use the survey object to create a table of proportions.
# Create new table of full vaccine coverage by (b) sampling with weights:
table451b <- vaccine_weighted %>%
# Create proportions for fully vaccinated:
sitrep::tab_survey(vacful, # The variable to calculate proportions on
strata = NULL, # Weighted sampling has no strata
keep = c(1), # The value of interest (fully vaccinated)
na.rm = TRUE, # Remove missing values
wide = FALSE, # Long format better for a lot of strata
deff = TRUE, # Include the design effect in the output
method = "xlogit",# Use CDC SUDAAN method
pretty = FALSE) %>% # Do not merge VCE and 95% CI yet
# Add a design column:
mutate(design = c("Sample weights"))
C - Sampling with weights and clustering:
As before, we can repeat the same code, this time adding
school
classes as clusters to the ids
argument
in the call to survey design:
#######################################
# 03. CLUSTER SAMPLING WITH WEIGHTS
#######################################
## Define a survey design object to be used for creating tables
vaccine_cluster <- vaccine %>%
# Convert vaccrec to character:
mutate(vaccrec = as.character(vaccrec)) %>%
# Filter for respondents only:
dplyr::filter(vaccrec == "Respondents") %>%
# Create survey design for sampling with weights:
srvyr::as_survey_design(ids = school, # Clustering by school class
weights = Sample_weights, # Use sampling weights
strata = NULL) # No strata
As above, we use the survey object to create a table of proportions.
# Create new table of full vaccine coverage by (C) cluster sampling with weights:
table451c <- vaccine_cluster %>%
# Create proportions for fully vaccinated:
sitrep::tab_survey(vacful, # The variable to calculate proportions on
strata = NULL, # Weighted sampling has no strata
keep = c(1), # The value of interest (fully vaccinated)
wide = FALSE, # Long format better for a lot of strata
deff = TRUE, # Include the design effect in the output
method = "xlogit",# Use CDC SUDAAN method
pretty = FALSE) %>% # Do not merge VCE and 95% CI yet
# Add a design column:
mutate(design = c("Sample weights + clusters"))
D - Cluster sampling with weights and strata:
We can again repeat the same code, this time adding the
strata
to the survey design. We will not add strata to the
tab_survey()
command, however, as to complete table 4.5.1,
we need the total population estimate rather than the estimates for each
stratum. The stratification is however taken into account in the survey
design.
We will also round up the results to make them easier to read, and
print them to a flextable()
.
#########################################
# 04. CLUSTER SAMPLING, WEIGHTS & STRATA
#########################################
## Define a survey design object to be used for creating tables
vaccine_strata <- vaccine %>%
# Convert vaccrec to character:
mutate(vaccrec = as.character(vaccrec)) %>%
# Convert strata to character:
mutate(strata = as.character(strata)) %>%
# Filter for respondents only:
dplyr::filter(vaccrec == "Respondents") %>%
# Create survey design for sampling with weights, clusters & strata:
srvyr::as_survey_design(ids = school, # Cluster by school class
weights = Sample_weights, # Include sample weights
strata = strata) # Include strata
As above, we use the survey object to create a table of proportions.
# Create table of full vaccine coverage by (d) cluster sampling, weights & strata:
table451d <- vaccine_strata %>%
# Create proportions for fully vaccinated:
sitrep::tab_survey(vacful, # The variable to calculate proportions on
strata = NULL, # To get the total we will leave this NULL
keep = c(1), # The value of interest (fully vaccinated)
wide = FALSE, # Long format better for a lot of strata
deff = TRUE, # Include the design effect in the output
method = "xlogit",# Use CDC SUDAAN method
pretty = FALSE) %>% # Do not merge VCE and 95% CI yet
# Add a design column:
mutate(design = c("Sample weights + clusters + strata"))
We can then combine all of the tables above to create one overview
table and compare results of the different designs. We do this using the
dplyr::bind_rows()
function, which simply places on
dataframe on top of the other.
# Bind results together in one table:
table451 <- dplyr::bind_rows(table451a, # VCE with Simple random sampling
table451b, # VCE with weights
table451c, # VCE with weights + clustering
table451d) # VCE with weights + clustering + strata
Now that the table is complete, we can tidy up the decimal points and convert the estimates and confidence intervals to percentages to make them easier to read.
# Create publishable table of results:
table451 <- table451 %>%
# Now we can convert the estimates and 95% CI to percentages:
mutate(
across(
# for each column which has proportion in the name
contains("proportion"),
# convert to a percentage
~ scales::percent(.x, accuracy = 0.01))) %>%
# And finally we can trim the decimal places for other columns:
mutate(deff = round(deff, digits = 2)) %>%
# Create merged column for 95% CIs:
tidyr::unite(
# name the new column CI95
col = "CI95",
# state which two columns to combine
proportion_low, proportion_upp,
# separate the values with a dash
sep = " - ",
# drop the old columns
remove = TRUE
) %>%
# Select the columns to print:
dplyr::select(design,
deff,
n,
proportion,
CI95)
Finally, we can use flextable
as previously to make a
publishable table.
Table 4.5.1 - Vaccination coverage estimates with different survey designs
table451pub <- table451 %>%
# Lastly we can convert it to a flextable for printing:
flextable::qflextable() %>%
# Add labels:
flextable::set_header_labels(values = list(design = "Survey design",
deff = "Design effect",
n = "VCE (n)",
proportion = "VCE (%)",
CI95 = "95% CI")) %>%
# Right-align last 2 columns (weighted VCE and 95% CI):
flextable::align(j = 4:5, align = "right", part = "all") %>%
# Adjust column widths to fit text:
flextable::set_table_properties(width = 1, layout = "autofit")
##################
# Print results
#################
table451pub
Survey design | Design effect | VCE (n) | VCE (%) | 95% CI |
---|---|---|---|---|
Simple random sampling | 1.00 | 2,350.00 | 60.99% | 59.44% - 62.52% |
Sample weights | 1.41 | 55,831.77 | 62.96% | 61.17% - 64.71% |
Sample weights + clusters | 2.37 | 55,831.77 | 62.96% | 60.62% - 65.23% |
Sample weights + clusters + strata | 2.36 | 55,831.77 | 62.96% | 60.63% - 65.23% |
Interpreting the design effects:
Note that we have combined the different survey designs here for the purpose of the case study. In a real scenario you would only use the design that was used during data collection. For example if clusters and strata were used then you would present that result (and not the others).
Estimations are modified when the sampling design is taken into account:
allowing for the sample weight modifies the estimate of vaccination coverage
allowing for the clustering (or multistage design) decreases the precision of the estimate (higher variance and design effect)
allowing for the stratification improves the precision of the estimate (lower variance and design effect)
Note that the point estimate depends on the weights only, while the 95% CIs depend on everything (weights, stratification and clustering).
A design effect of 1.346 suggests that the variability (variance or the square of the standard error) of the estimate under the chosen design is 34.6% larger than that of the same-sized simple random sampling. Similarly, a design effect of 2.255 suggests that the variability of the estimate allowing for clustering, stratification and sampling weights is 125.5% larger than that would come from the same sample size if simple random sampling were used.
E - Estimating the intra-class correlation coefficient:
We can solve for \(rho\) (intra-class correlation coefficient) using the equation that was presented in task 2.2, with the following values as input:
Mean school class (cluster) size: 20
Design effect from the weighted, stratified cluster sampling: 2.36
This can be calculated as follows:
\[ rho = \frac {(deff - 1)} {(n -1)} \]
where:
\(rho\) is the intra-class correlation coefficient
\(deff\) is the design effect
\(n\) is the mean size of clusters
# Calculate ICC:
study_rho <- (2.36 - 1) / (20 - 1)
# Print result:
study_rho
## [1] 0.07157895
As mentioned earlier, \(rho\) is the proportion of the total variation in the outcome that is between clusters; this measures the degree of similarity or correlation between subjects within the same cluster. The larger \(rho\) is (the tendency for subjects within a cluster to be similar), the greater the size of the design effect and the larger the number of additional subjects required to achieve the same precision.
Note that in the sample size calculations in session 2, the design effect and \(rho\) were expected to be higher (2 and 0.05, respectively). Hence, the sample size actually achieved was slightly smaller and the estimates were a bit less precise than originally expected.
Code for table 4.5.2
In this section, we will include weights, clusters and strata in all the survey designs and estimate vaccine coverage (n and %), 95% confidence intervals and the design effect for:
Each vaccine, separately : dtp3yn, dtp4yn, dtp5yn, mmr1yn, mmr2yn, hibprmyn, hibfulyn, hbv3yn, mnc1yn, pne1yn and var1yn) ;
For DTP-3 (dtp3yn) also stratify by minority and urban/rural residence;
For complete vaccination (vacful
) also stratify by
minority and urban/rural residence;
For timeliness of vaccination (vactime
) also
stratify by minority and urban / rural residence.
We will then combine the results in a single table as for the previous section.
Most of this code is repetitive (as you repeat it for each strata).
We will first demonstrate how to use the same comands and then combine
the outputs. However we will also demonstrated how to use the
map()
function from the purrr
package to
iterate (repeat) code for each of this variables. This may seem scary at
first, but try to remember that we are just using the functions from
previous sessions and running them multiple times, once for each
variable. purrr
will store the output in a list object, but
using bind_rows()
we can easily and simply combine these in
to a single data frame.
We will first create the survey design (which is the same for the whole table):
vce_design <- vaccine %>%
# Convert vaccrec to character:
mutate(vaccrec = as.character(vaccrec)) %>%
# Convert strata to character:
mutate(strata = as.character(strata)) %>%
# Filter for respondents only:
dplyr::filter(vaccrec == "Respondents") %>%
# Create survey design for sampling with weights, clusters & strata:
srvyr::as_survey_design(ids = school,
weights = Sample_weights,
strata = strata)
Next, we will create the first results for DTP3 (3-dose anti-
diphtheria, tetanus and pertussis vaccine) stratified by minority group
and then by urban / rural area of residence using the
tab_survey()
function from the sitrep
package.
We will first demonstrate how to repeat the code for each strata and
then demonstrate using purr::map()
to loop over the two
stratifiers.
Note that strata
in the survey design refers to how the
sampling protocol was stratified - this is distinct from the
strata
argument in tab_survey()
, which will
calculate separate vaccine coverage estimates for each level of the
stratifying variable.
#############################################
# Estimate VC for DTP3 by minority and urban:
#############################################
# using the study design object
minority_table <- vce_design %>%
# Create proportions for DTP3 stratified by minority:
sitrep::tab_survey(dtp3yn,
strata = minority,
keep = c(1),
na.rm = TRUE,
method = "xlogit",
wide = FALSE,
deff = TRUE,
pretty = FALSE) %>%
# Rename minority column to stratum (for merging tables)
rename(stratum = minority) %>%
# Remove row with missing values:
filter(!is.na(stratum))
# using the study design object
urban_table <- vce_design %>%
# Create proportions for DTP3 stratified by urban/rural:
sitrep::tab_survey(dtp3yn,
strata = urban,
keep = c(1),
na.rm = TRUE,
method = "xlogit",
wide = FALSE,
deff = TRUE,
pretty = FALSE) %>%
# Rename urban column to stratum (for merging tables)
rename(stratum = urban) %>%
# Remove row with missing values:
filter(!is.na(stratum))
## combine in to a single dataframe
restab452a <- bind_rows(minority_table, urban_table)
We can create the same table using purrr::map()
, but
without having to repeat the same code for each strata.
####################################################
# Loop - Estimate VC for DTP3 by minority and urban:
####################################################
restab452a <- purrr::map(
# for each of the variables listed
.x = c("minority", "urban"),
# run the following function (replacing x with each variable one-by-one)
.f = function(x) {
# using the study design object
vce_design %>%
# Create proportions for DTP3 stratified by the current variable being run:
sitrep::tab_survey(dtp3yn,
# purrr replaces x with "minority" and then with "urban"
# x is in curly brackets so it is interpreted as a
# variable name instead of a character
strata = {x},
keep = c(1),
na.rm = TRUE,
method = "xlogit",
wide = FALSE,
deff = TRUE,
pretty = FALSE) %>%
# Rename minority column to stratum:
# Note that here we dont need curly brackets
# (because we want it to be a character)
rename(stratum = x) %>%
# Remove row with missing values:
filter(!is.na(stratum))
}
) %>%
## pull the results out of lists and in to a single dataframe
bind_rows()
For the next set of variables to tabulate, we do not require stratification. Here too, you could type out the code for each of the tables (as done above for minority and urban/rural). In this case, instead of iterating over the stratifying variable, we can iterate over each of the vaccines to get an overall estimate for each.
# Create a list of variables (vaccines) to estimate vaccine coverage with:
varlist <- c("dtp4yn",
"dtp5yn",
"mmr1yn",
"mmr2yn",
"hibprmyn",
"hibfulyn",
"hbv3yn",
"mnc1yn",
"pne1yn",
"var1yn")
restab452b <- purrr::map(
# for each of the variables listed
.x = varlist,
# run the following function (replacing x with each variable one-by-one)
.f = function(x) {
vce_design %>%
# purrr replaces x with each of the vaccines listed
# x is in curly brackets so it is interpreted as a
# variable name instead of a character
sitrep::tab_survey({x},
strata = NULL,
keep = c(1),
na.rm = TRUE,
method = "xlogit",
wide = FALSE,
deff = TRUE,
pretty = FALSE) %>%
# Create stratum column:
mutate(stratum = "Overall")
}
) %>%
bind_rows()
Next, we can look at vaccine completeness, stratified by minority
group and urban/rural residence. Here too, you could type out the code
for each of the tables (as done above for minority and urban/rural). We
repeat the above process to run each of the variables through the strata
argument of the tab_survey()
function from
sitrep
.
restab452c <- purrr::map(
.x = c("minority", "urban"),
.f = function(x) {
vce_design %>%
sitrep::tab_survey(vacful,
# purrr replaces x with "minority" and then with "urban"
# x is in curly brackets so it is interpreted as a
# variable name instead of a character
strata = {x},
keep = c(1),
na.rm = TRUE,
method = "xlogit",
wide = FALSE,
deff = TRUE,
pretty = FALSE) %>%
# Rename minority column to stratum:
rename(stratum = x) %>%
# Remove row with missing values:
filter(!is.na(stratum))
}
) %>%
bind_rows()
Finally we can repeat the same process for vaccination timeliness.
Note that we could have saved some repeating code by nesting
map()
arguments, to iterate over both vaccines and strata
groups. For simplicity, we have not done this, but if you have extra
time - try this out.
restab452d <- purrr::map(
.x = c("minority", "urban"),
.f = function(x) {
vce_design %>%
sitrep::tab_survey(vactime,
# purrr replaces x with "minority" and then with "urban"
# x is in curly brackets so it is interpreted as a
# variable name instead of a character
strata = {x},
keep = c(1),
na.rm = TRUE,
method = "xlogit",
wide = FALSE,
deff = TRUE,
pretty = FALSE) %>%
# Rename minority column to stratum:
rename(stratum = x) %>%
# Remove row with missing values:
filter(!is.na(stratum))
}
) %>%
bind_rows()
We can then combine all of the tables above again using the
dplyr::bind_rows()
function.
# combine each of the result tables in to one
restabpub <- bind_rows(restab452a,
restab452b,
restab452c,
restab452d
) %>%
# to be able to ignore the value column (which only contains 1s)
ungroup()
We can then tidy up the decimals and add percentages to make the results easier to read, as before.
# Create publishable table of results:
restabpub <- restabpub %>%
# Now we can convert the estimates and 95% CI to percentages:
mutate(
across(
# for each column which has proportion in the name
contains("proportion"),
# convert to a percentage
~ scales::percent(.x, accuracy = 0.01))) %>%
# And finally we can trim the decimal places for other columns:
mutate(across(c(n, deff),
~ round(.x, digits = 2))) %>%
# Create merged column for 95% CIs:
tidyr::unite(
# name the new column CI95
col = "CI95",
# state which two columns to combine
proportion_low, proportion_upp,
# separate the values with a dash
sep = " - ",
# drop the old columns
remove = TRUE
) %>%
# Select the columns to print:
dplyr::select(variable,
stratum,
deff,
n,
proportion,
CI95) %>%
# Recode variable names to nice labels for printing (see data dictionary):
mutate(variable = dplyr::recode(variable,
dtp3yn = "DTP: 3 doses",
dtp4yn = "DTP: 4 doses",
dtp5yn = "DTP: 5 doses",
mmr1yn = "MMR: 1 dose",
mmr2yn = "MMR: 2 doses",
hibprmyn = "HiB: partial",
hibfulyn = "HiB: full",
hbv3yn = "HBV: 3 doses",
mnc1yn = "Men C: 1 dose",
pne1yn = "Pneumo: 1 dose",
var1yn = "Varicella: 1 dose",
vacful = "Fully vaccinated",
vactime = "Timely vaccination"))
Finally, we can use flextable
as previously to make a
publishable table.
Table 4.5.2 - Vaccination coverage estimates stratified by population sub-groups
restabpub <- restabpub %>%
# Lastly we can convert it to a flextable for printing:
flextable::qflextable() %>%
# Merge cells in the vaccine colum as the values are repeated:
flextable::merge_v(j = c("variable"), target = c("variable")) %>%
# Add labels:
flextable::set_header_labels(values = list(variable = "Vaccine",
stratum = "Stratum",
deff = "Design effect",
n = "VCE (n)",
proportion = "VCE (%)",
CI95 = "95% CI")) %>%
# Right-align last 2 columns (weighted VCE and 95% CI):
flextable::align(j = 5:6, align = "right", part = "all") %>%
# Add borders under the merged cells:
flextable::hline(i = rle(cumsum(.$body$spans$columns[,1] ))$values,
border = fp_border(style = "solid", width = 1)) %>%
# Fix thickness of bottom border:
fix_border_issues() %>%
# Add bottom border back:
flextable::hline_bottom(border = fp_border(style = "solid", width = 2),
part = "body") %>%
# Adjust column widths to fit text:
flextable::set_table_properties(width = 1, layout = "autofit")
#################
# Print results
################
restabpub
Vaccine | Stratum | Design effect | VCE (n) | VCE (%) | 95% CI |
---|---|---|---|---|---|
DTP: 3 doses | General population | 2.97 | 78,513.23 | 99.52% | 99.12% - 99.74% |
Roma | 4.04 | 806.59 | 79.65% | 55.95% - 92.35% | |
Greek Muslims | 2.34 | 1,597.77 | 96.31% | 91.77% - 98.39% | |
Immigrants | 2.80 | 7,368.36 | 98.99% | 96.40% - 99.72% | |
Rural | 6.97 | 16,653.61 | 98.37% | 96.76% - 99.19% | |
Urban | 6.97 | 71,969.94 | 99.39% | 98.93% - 99.65% | |
DTP: 4 doses | Overall | 1.91 | 87,865.44 | 98.35% | 97.70% - 98.81% |
DTP: 5 doses | Overall | 1.71 | 80,472.99 | 90.07% | 88.80% - 91.22% |
MMR: 1 dose | Overall | 1.75 | 87,813.85 | 98.29% | 97.67% - 98.75% |
MMR: 2 doses | Overall | 3.10 | 67,899.19 | 76.00% | 73.60% - 78.24% |
HiB: partial | Overall | 1.45 | 80,408.32 | 90.00% | 88.83% - 91.06% |
HiB: full | Overall | 1.51 | 76,325.71 | 85.43% | 84.04% - 86.72% |
HBV: 3 doses | Overall | 1.75 | 85,147.72 | 95.30% | 94.36% - 96.10% |
Men C: 1 dose | Overall | 2.76 | 64,162.29 | 71.82% | 69.45% - 74.07% |
Pneumo: 1 dose | Overall | 1.74 | 4,095.83 | 4.58% | 3.80% - 5.52% |
Varicella: 1 dose | Overall | 2.65 | 11,595.27 | 12.98% | 11.38% - 14.76% |
Fully vaccinated | General population | 2.14 | 52,238.06 | 66.61% | 64.20% - 68.94% |
Roma | 3.40 | 245.60 | 24.25% | 9.13% - 50.52% | |
Greek Muslims | 1.85 | 849.63 | 51.22% | 40.56% - 61.76% | |
Immigrants | 1.71 | 2,197.89 | 30.31% | 25.03% - 36.18% | |
Rural | 6.38 | 9,610.30 | 57.06% | 50.56% - 63.33% | |
Urban | 6.38 | 46,221.47 | 64.34% | 61.85% - 66.75% | |
Timely vaccination | General population | 1.64 | 42,600.03 | 55.39% | 52.85% - 57.91% |
Roma | 0.80 | 18.05 | 2.15% | 0.22% - 17.75% | |
Greek Muslims | 1.44 | 491.35 | 30.52% | 23.30% - 38.85% | |
Immigrants | 1.69 | 1,232.06 | 16.96% | 12.39% - 22.77% | |
Rural | 4.94 | 7,773.62 | 47.46% | 40.45% - 54.57% | |
Urban | 4.94 | 36,766.93 | 52.12% | 49.48% - 54.74% |
Summarizing the results:
Vaccination coverage was high for vaccines included in the NVP before 1990 (i.e. DTP, Po, MMR). However, the uptake of 2 doses of MMR was much lower than the WHO target of 95%.
Vaccination coverage was satisfactory for Hib and HepB that were introduced in the NVP in 1998 and 2002, respectively.
Uptake for vaccines introduced in the NVP in 2006 ranged from high (MNC that was available in the pharmacies earlier) to very low (PCV7 and Var)
Vaccination uptake was lower among Greek Roma and to a lesser degree among immigrants
There were no substantial differences in the vaccination coverage in urban and rural areas.
Types of practices administering vaccines:
The investigators also estimated the weighted proportions of practices administering MMR-1 and other vaccines. Almost 70% of MMR-1 vaccines were administered in private practices. Approximately 90% of the first doses of MMR among Greek Roma were administered in state practices, with the corresponding proportions among immigrants, Greek Muslims and the general population being 53%, 45% and 26%, respectively. Similar patterns were observed for most, but not all, vaccines.
Background to the analytical study:
During the study, parents/ guardians of participating children were asked to complete a self-administered questionnaire, indicating their agreement with statements on perceived benefits or harms of and barriers to immunization on a four-point scale, with possible responses ranging from “completely agree” to “completely disagree”. These answers were transformed to ordinal numeric scores between 1 and 4, with the maximum score corresponding to “completely agree” for the positively worded statements or “completely disagree” for the negatively worded statements.
Definition of positive opinion towards vaccination:
A balanced positive opinion towards immunization was defined as agreement with the following two positively worded statements:
“Vaccinations are necessary for my child” and
“I keep my child’s vaccination up to date, according to scientific advice”
and disagreement with the following two negatively worded statements:
“I fear that vaccines may harm my child” and
“Natural childhood disease is certainly preferable to vaccination”
The scores from each of these statements were summed up, with higher scores denoting more positive beliefs regarding vaccination. Based on these scores, balanced positive opinion was classified as strong (score > 7.5), moderate (score between 6.0 and 7.5) and weak (score <6.0).
Uncritical positive opinion towards immunization was defined as agreement with the following two statements:
“Children should be immunized immediately with every newly licensed vaccine” and
“Vaccines are completely safe and never harm child’s health”
Other statements addressed issues regarding newly licensed vaccines, perceived safety of vaccination, accuracy of recommendations, financial profit and perceived financial, structural or health care service barriers to vaccination.
Session overview
In this session, you will describe beliefs and attitudes of parents/guardians towards immunization, their perceived barriers to vaccination and their socioeconomic characteristics. You will subsequently identify which of those factors are associated with low vaccination coverage.
This session consists of the following tasks:
Task 5.1 - Describe parental characteristics, beliefs towards vaccination and perceived barriers to immunization
Task 5.2 - Identify unadjusted (crude) associations with having a fully vaccinated child
Task 5.3 - Identify adjusted associations of parental characteristics, parental beliefs towards vaccination and perceived barriers to immunization with having a fully vaccinated child
To complete the work in this session, you will need:
The database vaccine5.dta
that is derived from the three
data sets (vaccine.dta
, school.dta
and
quest.dta
); quest.dta
contains information
from the self-administered questionnaire completed by parents/guardians
regarding beliefs and attitudes towards immunization, perceived barriers
to vaccination and parental socioeconomic characteristics. A description
of the variables of the three datasets is provided in the appendix. All
these datasets have been simplified for the purpose of this case
study.
The researchers are interested in finding out if there is any correlation between parental beliefs and the vaccination status of their child. The following table lists the variables you will be working with in this session and any recoding or transformation required:
Characteristic or belief | Variable name | Class | Transform |
---|---|---|---|
Sample weights (numeric value) | weight | Numeric | - |
Urban / rural region (numeric stratum ID) | strata | Factor | Convert to factor |
School class (numeric cluster ID) | school | Factor | Convert to factor |
Fully vaccinated status (outcome measure) | vacful | Binary (0, 1) | - |
Minority group status | minority | Factor | New ref: Roma |
Mother’s age at child’s birth | mage | Factor | <25, 25-29, >=30 |
Number of other siblings | osib1 | Factor | >=3, 1-2, 0 |
Father’s education (no. years) | educf | Factor | Convert to factor |
Balanced positive opinion on vaccination (L) | a1posg | Factor | Convert to factor |
Balanced positive opinion on vaccination (B) | a1posyn | Factor | Convert to factor |
Uncritical positive opinion on vaccination | u ncritical | Factor | Convert to factor |
Vaccination is necessary for my child | a1x1g | Factor | Convert to factor |
Vaccines may harm my child | a1x2g | Factor | Convert to factor |
Natural infection preferable to vaccination | a1x5g | Factor | Convert to factor |
Barrier: long distance to immunization site | a3x1g | Factor | Convert to factor |
Barrier: inconvenient opening hours | a3x2g | Factor | Convert to factor |
Barrier: high cost of vaccines | a3x5g | Factor | Convert to factor |
You will use te first three variables (weight
,
strata
and school
) to create the survey
design, as before. In task 5.3 where the survey design is not needed,
school
will still be used to define the clusters.
Throughout this session, you will use the binary variable
vacful
as the outcome measure. Note that this variable is
already in the correct format for inclusion in the analytic tasks (1 =
fully vaccinated).
The next four variables (minority group, mother’s age at child’s birth, number of other siblings and father’s education in years) represent key characteristics about the parents or family of the child who has been enrolled in the study. Each of these variables will require some transformation before analysis, as they will be easier to interpret as categorical factors (ordered for the last three).
Variables that are currently classified as numeric and continuous (e.g. mother’s age at child’s birth) will need to be converted to factors with each level / group defined as detailed in table 5.0 above.
For each factor, you may wish to consider how to order the factor levels and also which level to assign as the reference, to best facilitate interpretation. Specific instructions on this are provided in the relevant tasks.
The remaining 9 variables represent statements on beliefs about vaccination and barriers to vaccination. Parents were asked to indicate whether they agreed or disagreed with each statement, on a Likhert scale. Eight of these variables have been converted to binary answers (0 = no / disagree, 1 = yes / agree) for convenience. Note that the question on balanced positive opinion on vaccination is represented in two different ways; Likhert (no opinion, moderate, strong) and binary (no / yes). There is likely to be some autocorrelation or co-linearity between these two variables, which you will explore in task 5.3, when choosing what to include in the model.
As before, please refer to the following appendices for full details on how these variables are coded:
Annex 5: school
data set (contains basic demographic
variables on recruited children)
Annex 6: vaccine
data set (contains the vaccine
records of recruited children)
Annex 7: quest
data set (encoded, scaled answers to
the parental questionnaire on vaccine beliefs)
Background:
In this task, you will perform descriptive analysis and calculate the survey design-adjusted proportions of study children by certain parental characteristics;
You will present your results in a new table
(table 5.1
). Consider the implications of these proportions
on study power before estimating vaccine coverage in the next task.
Task description:
Using the data set vaccine5.dta
you will:
Create an age group variable for mother’s age (mage
) as follows: (<25, 25-29, >=30 years)
Create categories for number of other siblings in the family
(osib1
) : (>=3, 1-2, 0)
Create appropriate labels for the recoded variables
Estimate the proportions for each of the parental characteristics listed below, using a survey design that takes into account the weights, clustering and strata that were used during the sampling.
The parental characteristics to estimate proportions for in table 5.1 are as follows:
Variable description | Variable name | Comments |
---|---|---|
Minority group | minority |
Convert to factor and add factor level labels |
Mother’s age at childbirth | mage |
Create age groups: <25, 25-29, >=30 |
Number of other siblings | osib1 |
Divide into categories: >=3, 1-2, 0 |
R coding tips:
A - Data import and preparation
First, we will import the data vaccine5.dta
which you
can find in the data
folder, as before.
Because we will use the same data set for the rest of this session, we will also prepare (recode or reclassify, as necessary) the other variables required for tasks 5.2 and 5.3 in this data preparation step. The purpose of the recoding is to make the results easier to interpret in the models.
01. Reclassifying categorical variables as factors
Note that several categorical variables (such as
minority
) are currently interpreted by R as being numeric.
We can update these variables to labelled factors using the base R
function factor()
. Refer to appendices 5, 6 and 7 to obtain
the labels for the different factor levels in each variable.
02. Converting numeric variables to ordered categories
We can create ordered categorical versions of the variables
mage
(mother’s age group), osib1
(number of
siblings) and educf
(number of years of father’s education)
using the dplyr function case_when()
. We will supply labels
for each level of the new variables as character strings.
03. Reordering factor levels
Next, we will convert the two new categorical variables to an ordered
factor, by using the forcats::fct_relevel()
function. You
can check that the factor levels are in the correct order with
levels()
.
Note:
In the code below, the recoding of all the variables has been piped
together using %>%
so that the data set only has to be
imported and named once, at the top of the piping chain. This is for
convenience and is useful practice to reduce duplication in your code,
thus making it more readable.
However, if you prefer to see the impact of each step on the data set
separately, you could break the chain into separate steps (piping from
the vaccine
data set for each step). You can also highlight
the existing piped code to just before the relevant pipe operator and
press Ctrl + Enter
to run e.g. just the first step, the
first and second step or the first second and third step etc.
############################################################
# 02. IMPORT DATA, CONVERT TO FACTORS, LABEL & ORDER LEVELS
############################################################
# Import vaccine5.dta:
vaccine <- rio::import(here::here("data", "vaccine5.dta")) %>%
# Convert minority to a labelled factor:
mutate(minority = factor(minority,
levels = c(0, 1, 2, 3),
labels = c("General population",
"Roma",
"Greek Muslims",
"Immigrants"))) %>%
# Create new categorical variable (agegroup) from mother's age:
mutate(magegroup = case_when(
# Logical criteria Label
mage < 25 ~ "Under 25",
between(mage, 25, 29) ~ "25 - 29",
mage >= 30 ~ "30 or over")) %>%
# Convert to a factor and put the age groups in ascending order:
mutate(magegroup = forcats::fct_relevel(magegroup,
"Under 25",
"25 - 29",
"30 or over")) %>%
# Create new categorical variable for number of siblings:
mutate(nsibcat = case_when(
# Logical criteria Label
osib1 >= 3 ~ "3 or more",
between(osib1, 1, 2) ~ "1 - 2",
osib1 == 0 ~ "No siblings")) %>%
# Convert to a factor and put the categories in descending order:
mutate(nsibcat = forcats::fct_relevel(nsibcat,
"3 or more",
"1 - 2",
"No siblings")) %>%
# Convert school (numeric IDs) to a factor:
mutate(school = factor(school)) %>%
# Create a labeled factor from educf for father's education:
mutate(educf = factor(educf,
levels = c(2, 3, 4, 6),
labels = c("Grammar school (< 9 years)",
"High school (9 - 11 years)",
"College (12 years)",
"University (> 12 years)"))) %>%
# Create labelled factor for a1posg (balanced positive opinion, 3 categories):
mutate(a1posg = factor(a1posg,
levels = c(0, 1, 2),
labels = c("No positive opinion",
"Moderate positive opinion",
"Strong positive opinion")))
04. Bulk processing: reclassifying a list of variables as factors
Lastly, there are a number of additional variables to include in
tasks 5.2 and 5.3, where parental responses to questions about
vaccination have been recorded as binary (no = 0, yes = 1). In
preparation for the analysis, we will recode these as factors. Because
they all have the same number of factor levels (n = 2) and
because the two factor levels have the same labels (‘no’ or ‘yes’), we
can do this conversion in one go, using the dplyr convenience function
across()
with the list of variables to convert.
To facilitate this process, we will first create a list containing the names of all of the binary variables, as character strings. We will also create another list that contains both the binary variables and the other explanatory variables listed in table 5.0 that we want to iterate through in each analysis step.
We can then put the original variable names and a column of apply
human readable labels in a data frame. Later, we can use the
matchmaker::match_vec()
function to recode, and
automatically annotate results tables (so that the variable / question
label appears in tables instead of the variable name).
#############################
# 01. CREATE VARIABLE LISTS
#############################
# First we create a list of binary variables to label:
binary_vars <- c("a1posyn",
"uncritical",
"a1x1g",
"a1x2g",
"a1x5g",
"a3x1g",
"a3x2g",
"a3x5g")
# Now iterate through the list of binary variables to convert them to factors:
vaccine <- vaccine %>%
# Mutate across list of binary variables to convert all to labelled factors:
mutate(across(
.cols = c(all_of(binary_vars)), # Columns to loop through: all binary_vars
.fns = ~factor(.x, # Function to apply: convert to factor
levels = c(0, 1), # Stipulate levels from existing values
labels = c("no", "yes")))) # Label levels as no and yes
# Next, create a list of all variables to analyse:
allvars <- c("minority",
"magegroup",
"educf",
"nsibcat",
"a1posg",
binary_vars)
# Create a list of nice labels for allvars:
allvarlabs <- c("Minority group",
"Mother's age at childbirth",
"Father's education",
"Number of siblings",
"Balanced positive opinion on vaccination: likhert",
"Balanced positive opinion on vaccination: binary",
"Uncritical positive opinion on vaccination",
"Vaccination is necessary for my child",
"Vaccines may harm my child",
"Natural disease preferable to vaccines",
"Long distance to immunization site",
"Inconvenient opening hours",
"Cost of vaccines")
# Create a dataframe of original variable names and labels to use in recoding:
recoders <- cbind(allvars, allvarlabs)
B - Survey design
We will then create the survey design, including weights, clustering
and strata, using srvyr::as_survey_design()
as before. You
will need the following variables to do this:
weight
: contains sampling weights as calculated
earlier;strata
: contains the 11 strata (urban/rural areas
within the 6 regions);school
: contains the ID number for each school class
(cluster).vce_design <- vaccine %>%
# Create survey design:
srvyr::as_survey_design(ids = school, # This is the cluster variable
weights = weight, # These are the sample weights
strata = strata) # These are the 11 strata
C - Estimate stratified proportions:
Here we will calculate estimated proportions of each factor level in
minority group, mother’s age group and number of siblings (in
categories), taking into account the survey design. We will use the
sitrep::tab_survey()
function to estimate the proportions,
and purrr::map()
to loop through the three variables of
interest, minority
, magegroup
and
nsibcat
.
Notes:
Estimated proportions may be bigger or smaller than the actual proportions in the sample, because the survey design is used to estimate what the proportions in the population at large are likely to be.
In the sitrep::tab_survey()
function we will set the
argument keep
to TRUE
, since this time we
want to calculate relative proportions for all the different factor
levels in each variable. This is in contrast to when we were calculating
vaccine coverage in the previous section; where we set
keep = c(1)
to indicate that we only wanted to calculate
the proportion of vaccinated individuals (in all the vaccine variables,
vaccinated = 1
, while
not vaccinated = 0
).
# Create a list of variables:
varlist <- c("minority",
"magegroup",
"nsibcat")
# Loop over the variables:
restab452a <- purrr::map(
# For each of the variables listed:
.x = varlist,
# Run the following function
# (replacing x with each variable one-by-one)
.f = function(x) {
# Apply the study design:
vce_design %>%
# purrr replaces x with each of the variables listed
# x is in curly brackets so it is interpreted as a
# variable name instead of a character
sitrep::tab_survey({x},
strata = NULL,
keep = TRUE,
na.rm = TRUE,
method = "xlogit",
wide = FALSE,
deff = TRUE,
pretty = FALSE)
}
) %>%
bind_rows()
D - create summary table of results
In this code chunk, we will use flextable
as before, to
create a publishable summary of the results.
Table 5.1 - Counts and proportions of parental characteristics adjusted by survey design
# Create publishable table of results:
restabpub <- restab452a %>%
# Now we can convert the estimates and 95% CI to percentages:
mutate(across(contains("proportion"), ~ scales::percent(.x, accuracy = 0.01))) %>%
# And finally we can trim the decimal places for other columns:
mutate(across(c(n, deff), ~ round(.x, digits = 2))) %>%
# Create merged column for 95% CIs:
tidyr::unite(col = "CI95",
proportion_low, proportion_upp,
sep = " - ",
remove = TRUE) %>%
# Add nice names for stratifier variables to the table:
mutate(variable = case_when(
variable == "minority" ~ "Minority group",
variable == "magegroup" ~ "Mother's age at childbirth",
variable == "nsibcat" ~ "Number of other siblings")) %>%
# Select the columns to print:
dplyr::select(variable,
value,
deff,
n,
proportion,
CI95) %>%
# Lastly we can convert it to a flextable for printing:
flextable::qflextable() %>%
# Merge cells in the stratifier colum as the values are repeated:
flextable::merge_v(j = "variable", target = "variable") %>%
# Add labels:
flextable::set_header_labels(values = list(variable = "Stratifier",
value = "Stratum",
deff = "Design effect",
n = "VCE (n)",
proportion = "VCE (%)",
CI95 = "95% CI")) %>%
# Right-align last 2 columns (weighted VCE and 95% CI):
flextable::align(j = 5:6, align = "right", part = "all") %>%
# Add borders under the merged cells:
flextable::hline(i = rle(cumsum(.$body$spans$columns[,1] ))$values,
border = fp_border(style = "solid", width = 1)) %>%
# Add bottom border back:
flextable::hline_bottom(border = fp_border(style = "solid",
width = 2),
part = "body") %>%
# Fix thickness of bottom border:
fix_border_issues() %>%
# Adjust column widths to fit text:
flextable::set_table_properties(width = 1, layout = "autofit")
################
# Print results
###############
restabpub
Stratifier | Stratum | Design effect | VCE (n) | VCE (%) | 95% CI |
---|---|---|---|---|---|
Minority group | General population | 2.71 | 71,758.97 | 89.75% | 87.98% - 91.28% |
Roma | 4.48 | 613.05 | 0.77% | 0.35% - 1.69% | |
Greek Muslims | 2.08 | 1,404.21 | 1.76% | 1.23% - 2.50% | |
Immigrants | 2.57 | 6,179.91 | 7.73% | 6.43% - 9.26% | |
Mother's age at childbirth | Under 25 | 1.38 | 617.59 | 0.78% | 0.51% - 1.21% |
25 - 29 | 1.98 | 10,399.34 | 13.21% | 11.71% - 14.87% | |
30 or over | 2.03 | 67,699.67 | 86.00% | 84.28% - 87.57% | |
Number of other siblings | 3 or more | 2.07 | 5,640.21 | 7.17% | 6.03% - 8.51% |
1 - 2 | 1.51 | 60,638.37 | 77.10% | 75.34% - 78.77% | |
No siblings | 1.76 | 12,372.89 | 15.73% | 14.19% - 17.41% |
Summarising the results:
Parents/guardians indicated a high degree of perceived necessity of immunization with the vast majority agreeing that vaccination is beneficial for their children. In addition, 76% considered that vaccines are completely safe and never harm child health and 45% felt that children should be immediately immunized with every newly introduced vaccine (data not shown here), reflecting an uncritical positive opinion on vaccination. Only 3% feared that vaccines may expose their child to a substantial risk, suggesting minimal concerns over the side effects of vaccines. Nevertheless, one fifth reported that natural childhood disease may be preferable to vaccination, reflecting perceptions of less susceptibility to and severity of vaccine preventable diseases. More skepticism was observed regarding new vaccines and accuracy of recommendations with approximately one third feeling confused due to conflicting physicians’ opinions, almost half stating the need to cross-check doctors’ recommendations and about 60% believing that new vaccines are developed for financial profit (data not shown here). Despite these statements, however, parents/guardians indicated a high degree of compliance with the current official recommendations on vaccination, with almost all stating that they would keep their child’s immunization up-to-date, following the current scientific advice (data not shown).
Among all respondents, cost (including the cost of the visit to an immunization provider, the fee for vaccine administration and the cost of the vaccine itself) was the barrier most commonly identified. About one fifth of respondents cited long distance to vaccination location and/or inconvenient opening hours of the immunization sites as important barriers to immunization. Other obstacles less commonly reported (not shown here) included lack of accurate information regarding immunization, unfriendly behavior of immunization staff or poor organization of services, lack of confidence in social insurance doctors, lack of paediatricians and long waiting times for appointments.
Background:
In this task, you will continue with analysis of parental characteristics and attitudes towards vaccination and look at the associations between these factors and vaccination coverage. You will do this by calculating bivariate statistics (vaccine coverage stratified by different parental characteristics and attitudes) as well as crude (unadjusted) prevalence ratios (univariable analysis). If you have extra time, you can also perform univariable analysis to calculate crude odds ratios.
Below is a brief description of the rationale for the analytical approach used in this session. Further reading on these issues is provided at the end of this guide in the references section.
Bivariate statistics:
As a first step in the analytical study, it is important to tabulate bivariate statistics, or create a 2x2 table, which describe the proportion and number of fully vaccinated children (outcome) by each of the potential exposure variables, e.g., parental characteristics, parental beliefs/attitudes about immunization, and perceived barriers to vaccination. Similar to the previous sections, the tabulation of summary statistics for the target population use weighted proportions to account for different probabilities of selection of children into the study sample.
Study of associations:
For the analyses of associations between potential exposures and the outcome, there is (arguably) less concern about the representativeness of identified associations for the target population. In addition, it may be difficult to assume that the estimates of associations, and their corresponding uncertainty intervals, would in fact be applicable to the whole population. For these reasons, weights are not used in regression models in this case study. Nevertheless, the analysis of cluster sample surveys must take into account the clustering of observations within primary sampling units (i.e. school classes) due to the complex survey design. Multilevel models, also known as mixed-effects models, are used in these situations in order to adjust for non-independence of study observations, and depending on the research interest, quantify the extent of clustering at the contextual level (e.g., schools, hospitals, neighbourhoods). To account for the cluster sampling design in our analysis, the primary sampling unit (school class) is specified as the random effect, or cluster variable, and all potential exposure variables are specified in the multilevel model as fixed effects nested within the primary sampling unit (school class).
Measures of associations:
Logistic regression models are often used to study associations in cross-sectional studies with binary outcomes (e.g., having a fully vaccinated child or not). However, these models estimate an odds ratio as the measure of association. Odds ratios overestimate the risk ratio when an outcome is common, or roughly above 10% prevalence. The odds ratio may therefore not be interpreted as a risk of the disease outcome, which is generally easier to understand and is the preferred measure of association. In these situations, it is preferable to estimate a prevalence ratio as a better approximation of the risk ratio. This can be done using a Poisson regression model with robust error variance. For these reasons, this case study uses the prevalence ratio as the measure of association to estimate the risk of having a fully vaccinated child. As an optional activity at the end of this session, we provide commands to calculate the odds ratio using a mixed-effects logistic regression model, which still provides a valid measure of association but the coefficient (odds ratio) will be more difficult to interpret than a risk ratio.
Task description:
Tabulate bivariate statistics (create a 2x2 table) to describe
parental characteristics, parental beliefs/attitudes about immunization
and perceived barriers to vaccination in relation to the binary outcome
of having a fully vaccinated child (vacful
) or not. Use the
explanatory variables listed in table 5.0.1 and the same survey design
as for previous sections, taking into account weights
(weight
), clustering (school
) and strata
(strata
).
Calculate crude measures of association, with corresponding 95%
confidence intervals, between the above factors and having a fully
vaccinated child (using vacful
as the study outcome). Use a
Poisson regression mixed effects model with robust error variance to
calculate prevalence ratios, taking clustering into account by including
school
as the random effect, or cluster variable.
If you have time, complete the optional exercise and calculate
crude odds ratios, using vacful
as the outcome measure and
the factors listed in table 5.0.1 as explanatory variables. As with
prevalence ratios, include school
as the random effect or
cluster variable.
R coding tips:
A - Counts of fully vaccinated children:
In this section we will create a table of counts for fully vaccinated
children (vacful
), cross-tabulated by each of the variables
listed in allvars
. We can use the sitrep
function tab_linelist()
to generate the counts for each
factor level in each variable from allvars
.
We will also add human readable question labels to the variable names, which will make the final results table easier to understand. We will use this as the base table and add other results to it in the following sections.
restab52a <- vaccine %>%
# Filter for those that are fully vaccinated only:
dplyr::filter(vacful == 1) %>%
# Create summary table:
sitrep::tab_linelist(all_of(allvars)) %>%
# Drop un-weighted proportions variable
# (we will add weighted proportions later)
dplyr::select(-proportion) %>%
# We can use our dataframe of labels to recode the variable column
# Note the quotations to define which variables in recoders to use
# you could also simply remove these as the default recodes from col 1 to 2
mutate(variable = matchmaker::match_vec(variable,
dictionary = recoders,
from = "allvars", to = "allvarlabs"))
B - Vaccination coverage stratified by parental characteristics and beliefs:
Here we will repeat the same procedure using
sitrep::tab_survey()
as demonstrated in previous sections
to estimate full vaccination coverage (vacful
), taking into
account weights, strata and clustering in vce_design
and
stratifying the results by each of the variables in
allvars
.
We will loop through each of the variables using
purrr::map()
as before.
We will then append the results to the table of counts we created in
the previous section, using bind_cols()
.
#################################
# 01. Calculate VCE proportions:
#################################
restab_vce <- purrr::map(
# allvars is the list of variables to loop through that we already created:
.x = allvars,
.f = function(x) {
vce_design %>%
# VCE will be calculated for vacful
sitrep::tab_survey(vacful,
# Each variable in allvars is a stratum
strata = all_of({x}),
# We are only interested in vacful == 1
keep = c(1),
na.rm = TRUE,
method = "xlogit",
wide = FALSE,
deff = TRUE,
pretty = FALSE) %>%
# Rename factor level column to value:
dplyr::rename(response = x) %>%
# Remove rows with missing values:
filter(!is.na(response)) %>%
# Add the variable name to make merging more accurate:
mutate(variable = x)
}
) %>%
# Bind the results from each variable into a data.frame:
bind_rows() %>%
# Remove value = 1 grouping (from vacful == 1)
ungroup %>%
# Trim the decimal places for other columns:
mutate(across(c(n, deff),
~ round(.x, digits = 2))) %>%
# Convert the estimates and 95% CI to percentages:
mutate(across(contains("proportion"),
~ scales::percent(.x, # Cols to convert to %
accuracy = 0.1, # No. decimal places
suffix = ""))) %>% # Do not add % sign
# Create merged column for 95% CIs:
tidyr::unite(col = "CI95", # Name of new column in quotes
proportion_low, proportion_upp, # Two columns to combine
sep = " - ", # Separator between values
remove = TRUE) %>% # Remove old columns
# Rename proportion column to avoid confusion when merging tables:
dplyr::rename(vce_prop = proportion) %>%
# Add human readable labels to variable column:
mutate(variable = matchmaker::match_vec(variable, recoders)) %>%
# Select the columns to keep:
dplyr::select(variable, # Variable name from allvars (to group)
value = response, # Rename to match restab52a (to group)
deff, # Design effect for reference
n, # Estimated count for reference
vce_prop, # VCE (%) to add to table
CI95) # VCE 95% CIs for reference
This survey estimated that in total, 63.8% of children in Greece were fully vaccinated at the time of the survey. Among Roma, Greek Muslims and immigrant groups, the percentage of fully vaccinated children were 40.1% (95% CI: 22.7 - 60.4), 52.8% (95% CI: 40.8 - 64.5) and 29.9% (95% CI: 24.7 - 35.8), respectively. In comparison, 67.1% (95% CI: 64.5 - 69.6) of children in non-minority groups in Greece were fully vaccinated at the time of the survey.
Now we can bind the first two tables together:
# Join vaccine coverage to counts table:
restab52b <- restab52a %>%
# Merge the two tables with a left join on variable and response:
left_join(y = dplyr::select(.data = restab_vce,
variable,
value,
vce_prop),
# Join by both variable and value
# This is safer than value alone
# as yes/no answers are not unique
by = c("variable", "value")) %>%
# Add brackets around vaccine coverage estimate proportion:
# note variables are between curly brackets {} within double quotation marks
mutate(vce_prop = stringr::str_glue("({vce_prop})")) %>%
# Create combined count and estimated vaccine coverage column:
tidyr::unite(col = "n_vce",
n, vce_prop,
sep = " ",
remove = TRUE)
D - Crude vaccination prevalence ratios by parental characteristics and beliefs:
In this section, you will create a Poisson mixed effects model with
robust error variance to calculate crude prevalence ratios, using the
feglm()
function from the fixest
package. This
package was specifically designed to make it easier to replicate model
designs from STATA code in R. You can read more about the methods in the
package
vignette. There are also some helpful cheat sheets that show how to
convert model code from STATA to R using the fixest
package
(see here
and here).
Importantly for this case study, the fixest
package also
has an implementation of the Huber sandwich estimator; alternately this
method is referred to as the heteroskedasticity-consistent error
correction for robust error variance. In the fixest
package, this method can be selected by setting the vcov
argument to hetero
(the R equivalent of STATA’s
vce(robust)
). Use of this correction should result in more
accurate estimates of standard errors (and therefore confidence
intervals).
The fixest::feglm()
function is specifically designed
for generalised linear models with one or multiple fixed effects (the
explanatory variables - in this case parental characteristics). It will
also take a clustering variable (e.g. school) and use this in
combination with the selected error variance method (e.g. hetero) to
calculate the standard errors.
The function takes the following key arguments:
outcome ~ dependent
(e.g.
vacful ~ minority
)poisson(link = "log")
hetero ~ school
vaccine
Note:
First, we will regress full vaccination status (vacful
)
against minority group membership (minority
) and inspect
the results to get a sense of how the model works. We will use
fixest::etable()
to look at the overall summary figures for
the model.
We will then use the broom.mixed
package (which is part
of the tidyverse) to extract the estimates, 95% confidence intervals and
p values into a table.
Note that in R, for models where the link
is set to
“log” or “logit”, estimates will be calculated as log to the base 10, by
default. In order to obtain prevalence, risk or odds ratios, these
estimates and their 95% confidence intervals need to be exponentiated.
The tidy()
function in the broom
and
broom.mixed
packages will already do this for commonly used
models, when the argument exponentiate
is set to
TRUE
. However, this has not yet been implemented for the
fixest
package as it is still relatively new. Fortunately,
we can easily exponentiate the extracted estimates by using the base R
function exp()
.
Table 5.2.1 - Vaccination prevalence ratio by minority group
# Create the model:
model1 <- fixest::feglm(fml = vacful ~ minority, # Model formula
family = poisson(link = "log"), # Model type
vcov = hetero ~ school, # SE method & cluster
data = vaccine) # Data set
# View a summary of the results:
fixest::etable(model1)
## model1
## Dependent Var.: vacful
##
## Constant -0.4145*** (0.0134)
## minorityRoma -0.8848** (0.2847)
## minorityGreekMuslims -0.2504*** (0.0635)
## minorityImmigrants -0.7388*** (0.0953)
## ____________________ ___________________
## S.E. type Heteroskedast.-rob.
## Observations 3,404
## Squared Cor. 0.04257
## Pseudo R2 0.01039
## BIC 6,211.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Use the broom.mixed package to extract prevalence ratios and 95% CI:
model1tab <- broom.mixed::tidy(x = model1,
conf.int = TRUE,
conf.method = "Wald") %>%
# Exponentiate estimate and 95% CIs:
mutate(across(
.cols = c(estimate, starts_with("conf.")),
.fns = exp)) %>%
# Round up numeric columns to 2 decimal places:
mutate(across(
.cols = where(is.numeric),
.fns = round, digits = 2)) %>%
# Convert to flextable for printing:
flextable::qflextable()
# View the results:
model1tab
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 0.66 | 0.01 | -31.03 | 0 | 0.64 | 0.68 |
minorityRoma | 0.41 | 0.28 | -3.11 | 0 | 0.24 | 0.72 |
minorityGreek Muslims | 0.78 | 0.06 | -3.94 | 0 | 0.69 | 0.88 |
minorityImmigrants | 0.48 | 0.10 | -7.76 | 0 | 0.40 | 0.58 |
This model shows that compared to the reference level (general population), the included minority groups (Roma, Greek Muslims and Immigrants) all had significantly lower vaccination prevalence ratios (the prevalence ratios are all less than 1 and the 95% confidence intervals do not cross 1).
We can now take advantage of the convenience function
purrr::map()
, to iterate through all the variables in our
list (allvars
) and calculate vaccination prevalence ratios
for each of them. As before, this is most easily accomplished if we
first create a function to specify the arguments in the model that will
stay the same for all the variables:
# Function to create Poisson fixed effects model with robust errors:
poisson_tabulator <- function(predictor) {
# Construct model formula:
pm_formula = as.formula(stringr::str_glue("vacful ~ {predictor}"))
# Construct the poisson model clustering by school with robust errors:
modtab = fixest::feglm(fml = pm_formula,
family = poisson(link = "log"),
vcov = hetero ~ school,
data = vaccine) %>%
# Extract the results in a table:
broom.mixed::tidy(conf.int = TRUE,
conf.method = "Wald") %>%
# Exponentiate estimate and 95% CIs:
mutate(across(
.cols = c(estimate, starts_with("conf.")),
.fns = exp)) %>%
# Round up numeric columns to 2 decimal places:
mutate(across(
.cols = where(is.numeric),
.fns = round, digits = 2)) %>%
# Merge lower and upper 95% CI in a new column:
tidyr::unite(col = "95% CI",
starts_with("conf."),
sep = " - ",
remove = TRUE) %>%
# Remove intercept:
dplyr::filter(term != "(Intercept)")
# Return the results table:
return(modtab)
}
Next, we will apply the function to each of the variables in
allvars
by looping through them with
purrr::map()
.
Note that the results are identified by the variable name and factor
level concatenated together in one word, which is not easy for external
readers to interpret. To fix this, we will split the variable names and
factor levels into two separate columns, variable
and
value
using the stringr
package functions
str_extract()
and str_remove()
. The
|
symbol in R means OR
and can be used to
search for any of the variables in allvars
when pasted
between the variable names (e.g. minority|magegroup|educf
means minority
OR magegroup
OR
educf
).
# Apply the model to the list of variables called 'allvars':
poiscrudetab <- allvars %>%
# Loop them through the 'poison_tabulator' function we just created:
purrr::map(poisson_tabulator) %>%
# Bind the results into a single table:
bind_rows() %>%
# Split term into two columns (variable name and level / value):
dplyr::mutate(
# Create variable col (strings in 'term' that match any of allvars):
variable = stringr::str_extract(term,
str_c(allvars, collapse = "|")),
# Create value col (whats left after strings matching allvars are removed):
value = stringr::str_remove(term,
str_c(allvars, collapse = "|"))) %>%
# Add human readable labels to values in the variable column:
mutate(variable = matchmaker::match_vec(variable, recoders)) %>%
# Remove unnecessary columns:
dplyr::select(-term, -std.error, -statistic) %>%
# Reorder new columns:
relocate(c(variable, value), .before = estimate) %>%
# Put estimate with 95% CI:
relocate(p.value, .before = estimate)
Finally, generic dplyr
functions can be used to further
arrange the results as desired and they can be printed with the
flextable
package, as before.
Table 5.2.2 - Vaccination prevalence ratios by parental characteristics and beliefs
poiscrudetabpub <- poiscrudetab %>%
# Convert to a flextable:
flextable::qflextable() %>%
# Merge cells in the vaccine colum as the values are repeated:
flextable::merge_v(j = c("variable"), target = c("variable")) %>%
# Label column headers:
flextable::set_header_labels(values = list(
variable = "Parameter",
value = "Characteristic",
p.value = "P value",
estimate = "Prevalence ratio",
`95% CI` = "95% CI")) %>%
# Right align numeric results columns:
flextable::align(j = 3:5,
align = "right",
part = "all") %>%
# Add highlighting of significant values:
flextable::bg(i = ~ p.value < 0.05,
j = 3:5,
bg = "yellow") %>%
# Add border lines under the merged cells:
flextable::hline(i = rle(cumsum(.$body$spans$columns[,1] ))$values,
border = fp_border(style = "solid", width = 1)) %>%
# Fix thickness of bottom border:
fix_border_issues() %>%
# Add bottom border back:
flextable::hline_bottom(border = fp_border(style = "solid", width = 2),
part = "body") %>%
# Adjust column widths to fit text:
flextable::set_table_properties(width = 1, layout = "autofit")
##################
# Print the table:
poiscrudetabpub
Parameter | Characteristic | P value | Prevalence ratio | 95% CI |
---|---|---|---|---|
Minority group | Roma | 0.00 | 0.41 | 0.24 - 0.72 |
Greek Muslims | 0.00 | 0.78 | 0.69 - 0.88 | |
Immigrants | 0.00 | 0.48 | 0.4 - 0.58 | |
Mother's age at childbirth | 25 - 29 | 0.14 | 1.40 | 0.9 - 2.18 |
30 or over | 0.01 | 1.79 | 1.16 - 2.77 | |
Father's education | High school (9 - 11 years) | 0.44 | 1.04 | 0.94 - 1.16 |
College (12 years) | 0.00 | 1.15 | 1.06 - 1.26 | |
University (> 12 years) | 0.00 | 1.30 | 1.19 - 1.41 | |
Number of siblings | 1 - 2 | 0.00 | 1.30 | 1.14 - 1.47 |
No siblings | 0.00 | 1.29 | 1.12 - 1.48 | |
Balanced positive opinion on vaccination: likhert | Moderate positive opinion | 0.06 | 1.35 | 0.99 - 1.84 |
Strong positive opinion | 0.01 | 1.49 | 1.1 - 2.03 | |
Balanced positive opinion on vaccination: binary | yes | 0.01 | 1.47 | 1.08 - 1.99 |
Uncritical positive opinion on vaccination | yes | 0.24 | 0.97 | 0.92 - 1.02 |
Vaccination is necessary for my child | yes | 0.42 | 1.56 | 0.53 - 4.56 |
Vaccines may harm my child | yes | 0.00 | 0.77 | 0.65 - 0.92 |
Natural disease preferable to vaccines | yes | 0.01 | 0.90 | 0.84 - 0.97 |
Long distance to immunization site | yes | 0.00 | 0.83 | 0.77 - 0.89 |
Inconvenient opening hours | yes | 0.00 | 0.89 | 0.83 - 0.96 |
Cost of vaccines | yes | 0.42 | 1.02 | 0.97 - 1.08 |
We can now merge these results with the table containing counts and
vaccine coverage, using thevariable
and value
columns to ensure the matches are unique. Note that in the Poisson crude
prevalence ratios table, we removed the intercepts (reference values).
When the two tables are merged, the prevalence ratio for reference
values will present as missing (NA
) which we can then
recode to ref
with the tidyr::replace_na()
function.
#######################################
# Subsetting Poisson results for merge:
#######################################
# Subset the necessary columns to merge:
poiscrudetabmini <- poiscrudetab %>%
# Combine prevalence ratios and 95% CIs in one column:
mutate(pr_95 = str_glue("{estimate} ({`95% CI`})")) %>%
# Remove unnecessary columns:
dplyr::select(-estimate, -`95% CI`, -p.value)
###############################################
# Merge vaccine coverage and prevalence ratios:
###############################################
# Merge the two tables with a left join on variable and value columns:
restab52c <- restab52b %>%
# Join the counts / vaccine coverage table to the poisson table on response:
left_join(y = dplyr::select(.data = poiscrudetabmini,
# Columns to include:
variable,
value,
pr_95),
# Join by variable and value to ensure unique matches:
by = c("variable", "value")) %>%
# Encode the missing values for intercepts as 'ref'
# Otherwise leave the rest of the values as they are:
mutate(pr_95 = tidyr::replace_na(pr_95, "ref"))
E - Crude odds ratios for vaccination by parental characteristics and beliefs (optional exercise)
If you have extra time, it may be interesting to calculate crude odds ratios and compare them to the prevalence ratios that we calculated in the previous section.
This can easily be done by repeating the same steps used to calculate
prevalence ratios, but this time selecting
family = binomial(link = "logit")
instead of Poisson in the
fixest::feglm()
function.
Below is the code to do this, wrapping the model commands in a
function as before and applying it to allvars
. For
convenience, we will then add the results to the table we already
created, so that we can see them all together.
##############################################
# 01. Function to tabulate crude odds ratios
#############################################
# Function to create Poisson fixed effects model with robust errors:
or_tabulator <- function(predictor) {
# Construct model formula:
pm_formula = as.formula(str_glue("vacful ~ {predictor}"))
# Construct the poisson model clustering by school with robust errors:
modtab = fixest::feglm(fml = pm_formula,
family = binomial(link = "logit"),
vcov = hetero ~ school,
data = vaccine) %>%
# Extract the results in a table:
broom.mixed::tidy(conf.int = TRUE,
conf.method = "Wald") %>%
# Exponentiate estimate and 95% CIs:
mutate(across(
.cols = c(estimate, starts_with("conf.")),
.fns = exp)) %>%
# Round up numeric columns to 2 decimal places:
mutate(across(
.cols = where(is.numeric),
.fns = round, digits = 2)) %>%
# Merge lower and upper 95% CI in a new column:
tidyr::unite(col = "95% CI",
starts_with("conf."),
sep = " - ",
remove = TRUE) %>%
# Remove intercept:
dplyr::filter(term != "(Intercept)")
# Return the results table:
return(modtab)
}
Now we can apply the or_tabulator
function to
allvars
:
# Apply the model to the list of variables called 'allvars':
orcrudetab <- allvars %>%
# Loop them through the 'poison_tabulator' function we just created:
purrr::map(or_tabulator) %>%
# Bind the results into a single table:
bind_rows() %>%
# Split term into two columns (variable name and level / value):
dplyr::mutate(
# Create variable col (strings in 'term' that match any of allvars):
variable = stringr::str_extract(term,
str_c(allvars, collapse = "|")),
# Create value col (whats left after strings matching allvars are removed):
value = stringr::str_remove(term,
str_c(allvars, collapse = "|"))) %>%
# Add human readable labels to values in the variable column:
mutate(variable = matchmaker::match_vec(variable, recoders)) %>%
# Remove unnecessary columns:
dplyr::select(-term, -std.error, -statistic) %>%
# Reorder new columns:
relocate(c(variable, value), .before = estimate) %>%
# Put estimate with 95% CI:
relocate(p.value, .before = estimate)
We can merge the crude odds ratios with the table that now contains vaccination coverage and prevalence ratios:
#######################################
# Subsetting Poisson results for merge:
#######################################
# Subset the necessary columns to merge:
orcrudetabmini <- orcrudetab %>%
# Combine prevalence ratios and 95% CIs in one column:
mutate(or_95 = str_glue("{estimate} ({`95% CI`})")) %>%
# Remove unnecessary columns:
dplyr::select(-estimate, -`95% CI`, -p.value)
###############################################
# Merge vaccine coverage and prevalence ratios:
###############################################
# Merge the two tables with a left join on variable and value columns:
restab52d <- restab52c %>%
# Join the counts / vaccine coverage table to the poisson table on response:
left_join(y = dplyr::select(.data = orcrudetabmini,
# Columns to include:
variable,
value,
or_95),
# Join by variable and value to ensure unique matches:
by = c("variable", "value")) %>%
# Encode the missing values for intercepts as 'ref'
# Otherwise leave the rest of the values as they are:
mutate(or_95 = tidyr::replace_na(or_95, "ref"))
Finally, we can print the combined results table using
flextable
as before:
Table 5.2.3 - Crude measures of association between vaccination and parental characteristics
crudefinaltabpub <- restab52d %>%
# Convert to a flextable:
flextable::qflextable() %>%
# Merge cells in the vaccine colum as the values are repeated:
flextable::merge_v(j = c("variable"), target = c("variable")) %>%
# Label column headers:
flextable::set_header_labels(values = list(
variable = "Parameter",
value = "Characteristic",
n_vce = "N (VCE)",
pr_95 = "Prevalence ratio (95% CI)",
or_95 = "Odds ratio (95% CI)")) %>%
# Right align numeric results columns:
flextable::align(j = 3:5,
align = "right",
part = "all") %>%
# Add borders under the merged cells:
flextable::hline(i = rle(cumsum(.$body$spans$columns[,1] ))$values,
border = fp_border(style = "solid", width = 1)) %>%
# Fix thickness of bottom border:
fix_border_issues() %>%
# Add bottom border back:
flextable::hline_bottom(border = fp_border(style = "solid", width = 2),
part = "body") %>%
# Adjust column widths to fit text:
flextable::set_table_properties(width = 1, layout = "autofit")
##################
# Print the table:
crudefinaltabpub
Parameter | Characteristic | N (VCE) | Prevalence ratio (95% CI) | Odds ratio (95% CI) |
---|---|---|---|---|
Minority group | General population | 1904 (67.1) | ref | ref |
Roma | 9 (40.1) | 0.41 (0.24 - 0.72) | 0.19 (0.09 - 0.42) | |
Greek Muslims | 126 (52.8) | 0.78 (0.69 - 0.88) | 0.54 (0.42 - 0.71) | |
Immigrants | 77 (29.9) | 0.48 (0.4 - 0.58) | 0.24 (0.18 - 0.31) | |
Mother's age at childbirth | Under 25 | 13 (38.6) | ref | ref |
25 - 29 | 247 (52.8) | 1.4 (0.9 - 2.18) | 1.81 (0.9 - 3.66) | |
30 or over | 1823 (65.8) | 1.79 (1.16 - 2.77) | 3.24 (1.63 - 6.42) | |
Father's education | Grammar school (< 9 years) | 300 (55.9) | ref | ref |
High school (9 - 11 years) | 302 (58.5) | 1.04 (0.94 - 1.16) | 1.1 (0.86 - 1.4) | |
College (12 years) | 706 (64.6) | 1.15 (1.06 - 1.26) | 1.41 (1.15 - 1.74) | |
University (> 12 years) | 666 (70.5) | 1.3 (1.19 - 1.41) | 2.01 (1.62 - 2.5) | |
Number of siblings | 3 or more | 124 (48.6) | ref | ref |
1 - 2 | 1630 (65.3) | 1.3 (1.14 - 1.47) | 1.82 (1.4 - 2.35) | |
No siblings | 336 (65.4) | 1.29 (1.12 - 1.48) | 1.78 (1.31 - 2.41) | |
Balanced positive opinion on vaccination: likhert | No positive opinion | 24 (52.3) | ref | ref |
Moderate positive opinion | 330 (59.6) | 1.35 (0.99 - 1.84) | 1.83 (1.05 - 3.18) | |
Strong positive opinion | 1663 (65.5) | 1.49 (1.1 - 2.03) | 2.38 (1.39 - 4.06) | |
Balanced positive opinion on vaccination: binary | no | 24 (52.3) | ref | ref |
yes | 1993 (64.5) | 1.47 (1.08 - 1.99) | 2.26 (1.33 - 3.86) | |
Uncritical positive opinion on vaccination | no | 1068 (65.2) | ref | ref |
yes | 988 (63.1) | 0.97 (0.92 - 1.02) | 0.92 (0.8 - 1.06) | |
Vaccination is necessary for my child | no | 2 (51.1) | ref | ref |
yes | 2093 (64.0) | 1.56 (0.53 - 4.56) | 2.49 (0.41 - 14.9) | |
Vaccines may harm my child | no | 1882 (64.4) | ref | ref |
yes | 64 (54.8) | 0.77 (0.65 - 0.92) | 0.55 (0.39 - 0.79) | |
Natural disease preferable to vaccines | no | 1637 (65.6) | ref | ref |
yes | 359 (58.5) | 0.9 (0.84 - 0.97) | 0.77 (0.64 - 0.92) | |
Long distance to immunization site | no | 1700 (66.0) | ref | ref |
yes | 356 (56.0) | 0.83 (0.77 - 0.89) | 0.63 (0.53 - 0.75) | |
Inconvenient opening hours | no | 1646 (65.3) | ref | ref |
yes | 383 (59.6) | 0.89 (0.83 - 0.96) | 0.75 (0.63 - 0.89) | |
Cost of vaccines | no | 1244 (64.4) | ref | ref |
yes | 795 (63.8) | 1.02 (0.97 - 1.08) | 1.06 (0.92 - 1.23) |
Interpreting the results (Table 5.2.3):
Socioeconomic factors:
Based on crude prevalence ratios (unadjusted for other factors), children were less likely to be fully vaccinated if they belonged to a minority group, were born to a younger mother, had 3+ siblings, or if their father was less educated.
Beliefs and attitudes towards immunization:
Based on crude prevalence ratios (unadjusted for other factors), children were less likely to be fully vaccinated if the parents held negative opinions about vaccination, believed that vaccines may harm their child, or believed that natural childhood disease is preferable to vaccination compared to parents who did not hold these beliefs.
Barriers to immunization:
Based on crude prevalence ratios (unadjusted for other factors), children were less likely to be fully vaccinated if their parents believed that it was a long distance to the immunization site or if they believed the immunization site had inconvenient opening hours, compared to parents who did not hold these beliefs.
Background:
This task will build on the previous task, calculating prevalence ratios as before but in a multivariable, multilevel model, using all the same variables as for task 5.2.
Key considerations:
Because this is a clustered sampling survey, the model must take into account the non-independence of observations within the primary sampling unit (in this case school classes).
This can be achieved with mixed effect models, specifying the
primary sampling unit (school class) as the random effect, or cluster
variable and regressing the outcome (vacful
) against all
the explanatory variables (parent characteristics and attitudes towards
vaccination) as fixed effects nested within school classes.
As for unadjusted measures of association, calculating prevalence ratios from poisson regression with robust error variance is likely to give better estimates than odds ratios calculated from logistic regression, as these tend over-estimate the risk ratio when an outcome is common (~ > 10%).
Robust error variance refers to the Huber correction for calculating standard errors (the square root of the elements on the diagonal of the covariance matrix), also known as heteroskedasticity-consistent standard errors. This is sometimes referred to as the Huber sandwich estimator.
This approach (poisson mixed effects model with robust error variance) follows the strategy first proposed by Zou et al. (2004) - however for an alternative perspective favouring the log-binomial method for moderate prevalence and/or sample size, see Petersen and Deddens (2008). Note that in this data set we have already seen that the vaccination coverage estimate is quite high (mean 58.5, median 63.1 from table 5.2.3), so the Poisson approach is more appropriate.
Task description:
In this task you will need to undertake the following steps:
R coding tips:
Before proceeding, we can consider if we are happy with the reference
values that were used for each categorical variable in task 5.2. Note
that for the first variable, minority
, using
General population
as the reference value yielded a
prevalence ratio of less than 1 for all the other groups. It might be
more intuitive to use Roma
as the reference group, since
this group has the lowest vaccination coverage. We can change the
reference value by changing the order of the factor levels, using
forcats::fct_relelevel()
as shown below:
# Using the vaccine data:
vaccine <- vaccine %>%
# Change the order of the factor levels so Gen Pop comes last
# This will make Roma the first level:
dplyr::mutate(minority = forcats::fct_relevel(.f = minority,
"General population",
after = Inf))
As before, we will use the fixest::feglm()
function to
create the Poisson model with robust error variance and calculate
prevalence ratios. This time, however, instead of calculating prevalence
ratios for each explanatory variable separately, we will create a model
formula that includes all the variables, so that the adjusted estimates
will reflect the relative contribution of each variable to the
model.
In R, we can create a multivariable model and include all the
variables in allvars
as fixed effects, by separating them
with a +
sign in the model formula,
e.g. fml = vacful ~ minority + magegroup
.
As we have a lot of variables to include in the model (n =
13), it would be more convenient to pass them to the model in the list
we already made (allvars
). Fortunately, the
fixest
package has a convenient way for us to code this: we
can simply put the name of our variable list in the model formula,
enclosed in square brackets and preceded by a dot:
fml = vacful ~ .[allvars]
. The model will be built by
placing a plus sign between each of the explanatory variables in
allvars
on the right hand side of the tilde
(~
).
# Create the Poisson model with robust error variance:
modelres <- vaccine %>%
# Create fixed effects poisson model clustering by school:
fixest::feglm(fml = vacful ~ .[allvars],
family = poisson(link = "log"),
vcov = hetero ~ school)
The fixest::feglm()
function has a number of other
convenient features. You may have noticed that after running the
model:
Missing values were automatically removed (and the message tells you how many);
Where co-linearity is detected between variables, one of the variables is automatically removed.
In this case, the variable a1posyn
has been removed
because it is co-linear with a1posg
. This is expected,
given that both these variables are alternative ways of coding responses
to the same question (does the respondent have a balanced positive
opinion on vaccination); a1posyn
is the binary version
(with yes or no as answers), while a1posg
is the
categorical multi-level version (with no, moderate or strong positive
opinion as possible answers).
If you wish to inspect the co-linearity scores, you can do this by
viewing the $collin.var
element of the model results:
# View which elements of the model have been determined to be colinear:
modelres$collin.var
## [1] "a1posynyes"
# View the colinearity coefficients for each element of the model:
modelres$collin.coef
## (Intercept) minorityGreek Muslims
## -1.485094873 0.480180743
## minorityImmigrants minorityGeneral population
## -0.173606814 0.568974530
## magegroup25 - 29 magegroup30 or over
## 0.555846725 0.683856754
## educfHigh school (9 - 11 years) educfCollege (12 years)
## 0.009367583 0.061936454
## educfUniversity (> 12 years) nsibcat1 - 2
## 0.123080067 0.273198848
## nsibcatNo siblings a1posgModerate positive opinion
## 0.295597102 0.040244864
## a1posgStrong positive opinion a1posynyes
## 0.079427042 NA
## uncriticalyes a1x1gyes
## 0.056169176 -0.565787411
## a1x2gyes a1x5gyes
## -0.106672115 -0.051909100
## a3x1gyes a3x2gyes
## -0.068276759 -0.043829466
## a3x5gyes
## 0.045982099
# View a summary of the model results:
fixest::etable(modelres)
## modelres
## Dependent Var.: vacful
##
## Constant -1.485** (0.4540)
## minorityGreekMuslims 0.4802 (0.3701)
## minorityImmigrants -0.1736 (0.3848)
## minorityGeneralpopulation 0.5690 (0.3659)
## magegroup25-29 0.5558. (0.3171)
## magegroup30orover 0.6839* (0.3142)
## educfHighschool(9-11years) 0.0094 (0.0604)
## educfCollege(12years) 0.0619 (0.0524)
## educfUniversity(>12years) 0.1231* (0.0520)
## nsibcat1-2 0.2732*** (0.0773)
## nsibcatNosiblings 0.2956*** (0.0842)
## a1posgModeratepositiveopinion 0.0402 (0.2219)
## a1posgStrongpositiveopinion 0.0794 (0.2375)
## uncriticalyes 0.0562. (0.0296)
## a1x1gyes -0.5658* (0.2261)
## a1x2gyes -0.1067 (0.1176)
## a1x5gyes -0.0519 (0.0638)
## a3x1gyes -0.0683 (0.0496)
## a3x2gyes -0.0438 (0.0452)
## a3x5gyes 0.0460 (0.0308)
## _____________________________ __________________
## S.E. type Heteroskedas.-rob.
## Observations 2,517
## Squared Cor. 0.06336
## Pseudo R2 0.01445
## BIC 4,747.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can now extract the model results and put them in a table, using
broom.mixed::tidy()
as before:
# Extract the model results into a table:
modeltab <- modelres %>%
# Extract the results in a table:
broom.mixed::tidy(conf.int = TRUE,
conf.method = "Wald") %>%
# Exponentiate the estimates and 95% CI to convert to prevalence ratios:
dplyr::mutate(across(c(estimate, starts_with("conf.")), exp)) %>%
# Remove intercept (estimate for reference values):
dplyr::filter(term != "(Intercept)") %>%
# Split term into two columns (variable name and level / value):
dplyr::mutate(
# Create variable col (strings in 'term' that match any of allvars):
variable = stringr::str_extract(term,
str_c(allvars, collapse = "|")),
# Create value col (whats left after strings matching allvars are removed):
value = stringr::str_remove(term,
str_c(allvars, collapse = "|"))) %>%
# Round to 2 decimal places in numeric columns:
mutate(across(
.cols = where(is.numeric),
.fns = round, digits = 2)) %>%
# Merge lower and upper 95% CI in a new column:
tidyr::unite(col = "95% CI",
starts_with("conf."),
sep = " - ",
remove = TRUE) %>%
# Remove unnecessary columns:
dplyr::select(-term, -std.error, -statistic) %>%
# Reorder new columns:
relocate(c(variable, value), .before = estimate) %>%
# Put estimate with 95% CI:
relocate(p.value, .before = estimate)
Note that the reference levels are not included in the table; it
would be useful to add these to facilitate interpretation of the
results. In R, the reference value is always the first level in a
factor. We can identify which is the first level by using the
dplyr
function first()
and combining it with
the base R function to extract the levels,
e.g. first(levels(vaccine$minority)
.
We can do this for all the variables in allvars
by using
summarise(across(allvars, …))
. Then we can
pivot_longer()
to orientate the resultant table so that it
can be merged with the table containing the model results.
# Extract reference levels for all the variables in allvars:
reflevels <- vaccine %>%
# Get first level of each factor in allvars:
summarise(across(.cols = allvars, # Use columns in allvars
.fns = ~ first(levels(.x)))) %>% # Get first factor level
# Transpose the results:
pivot_longer(cols = everything(), # Transpose all columns from previous step
names_to = "variable", # Put the variable names in col 'variable'
values_to = "reference") # Put the ref values in col 'reference'
###############################
# Merge reference values with model results table:
modeltab <- modeltab %>%
# Join the reference values to modeltab:
left_join(y = reflevels,
by = "variable") %>%
# Move the new reference column:
relocate(reference, .before = value) %>%
# Add more human readable labels to the variable column:
mutate(variable = matchmaker::match_vec(variable, recoders))
Finally, we can present the results in a publishable table, using
flextable
.
For convenience, we have highlighted the elements of the model that
are significant according to the Wald test (which is the method we
selected to calculate p values when extracting the model
results with broom.mixed::tidy()
). If desired, you could
limit the final model to include only variables that contain levels
which are statistically significant according to the Wald test.
Table 5.3 - Adjusted measures of association between vaccination status & parental characteristics
modeltabpub <- modeltab %>%
# Convert to a flextable:
flextable::qflextable() %>%
# Merge cells in the vaccine colum as the values are repeated:
flextable::merge_v(j = c("variable"), target = c("variable", "reference")) %>%
# Label column headers:
flextable::set_header_labels(values = list(
variable = "Parameter",
reference = "Reference level",
value = "Characteristic",
p.value = "P value (Wald)",
estimate = "Prevalence ratio",
`95% CI` = "95% CI")) %>%
# Right align numeric results columns:
flextable::align(j = 4:6, align = "right", part = "all") %>%
# Add highlighting of significant values:
flextable::bg(i = ~ p.value < 0.05,
j = 4:6,
bg = "yellow") %>%
# Add borders under the merged cells:
flextable::hline(i = rle(cumsum(.$body$spans$columns[,1] ))$values,
border = fp_border(style = "solid", width = 1)) %>%
# Fix thickness of bottom border:
fix_border_issues() %>%
# Add bottom border back:
flextable::hline_bottom(border = fp_border(style = "solid", width = 2),
part = "body") %>%
# Adjust column widths to fit text:
flextable::set_table_properties(width = 1, layout = "autofit")
###############
# Print table:
##############
modeltabpub
Parameter | Reference level | Characteristic | P value (Wald) | Prevalence ratio | 95% CI |
---|---|---|---|---|---|
Minority group | Roma | Greek Muslims | 0.19 | 1.62 | 0.78 - 3.34 |
Immigrants | 0.65 | 0.84 | 0.4 - 1.79 | ||
General population | 0.12 | 1.77 | 0.86 - 3.62 | ||
Mother's age at childbirth | Under 25 | 25 - 29 | 0.08 | 1.74 | 0.94 - 3.25 |
30 or over | 0.03 | 1.98 | 1.07 - 3.67 | ||
Father's education | Grammar school (< 9 years) | High school (9 - 11 years) | 0.88 | 1.01 | 0.9 - 1.14 |
College (12 years) | 0.24 | 1.06 | 0.96 - 1.18 | ||
University (> 12 years) | 0.02 | 1.13 | 1.02 - 1.25 | ||
Number of siblings | 3 or more | 1 - 2 | 0.00 | 1.31 | 1.13 - 1.53 |
No siblings | 0.00 | 1.34 | 1.14 - 1.59 | ||
Balanced positive opinion on vaccination: likhert | No positive opinion | Moderate positive opinion | 0.86 | 1.04 | 0.67 - 1.61 |
Strong positive opinion | 0.74 | 1.08 | 0.68 - 1.72 | ||
Uncritical positive opinion on vaccination | no | yes | 0.06 | 1.06 | 1 - 1.12 |
Vaccination is necessary for my child | no | yes | 0.01 | 0.57 | 0.36 - 0.88 |
Vaccines may harm my child | no | yes | 0.36 | 0.90 | 0.71 - 1.13 |
Natural disease preferable to vaccines | no | yes | 0.42 | 0.95 | 0.84 - 1.08 |
Long distance to immunization site | no | yes | 0.17 | 0.93 | 0.85 - 1.03 |
Inconvenient opening hours | no | yes | 0.33 | 0.96 | 0.88 - 1.05 |
Cost of vaccines | no | yes | 0.14 | 1.05 | 0.99 - 1.11 |
This study identified several independent determinants of complete vaccination status among 6-year old Greek school children included in the survey. Complete vaccination coverage was significantly lower among children who belonged to a minority group, who had siblings in the household, whose mothers were <30 years, whose father had low educational levels, and whose parents/guardians did not believe that vaccination was necessary for a child. These findings indicate that socioeconomic factors may be more important determinants of immunization coverage than parental immunization beliefs and perceived barriers to vaccination.
Based on the findings of this study, the following identified groups merit increased attention in future interventions designed to improve immunization delivery in Greece:
minority groups, especially Roma and immigrants
families with many children
young mothers, and
households headed by fathers with low educational level
(possibly reflecting low socioeconomic status of the families)
Interventions aimed at those high-risk families, although difficult to deliver, may have the greatest effects on community immunization rates.
In addition, parents/guardians must be educated about the necessity of vaccinations for children’s health.
There is a need for policies to overcome structural and health care system barriers to immunization, and identify effective and comprehensive approaches for improving the immunization levels of children in high-risk groups.
Case study:
Prevalence ratio versus odds ratio:
Model- versus design-based approaches to complex survey analysis:
Forms used in the study are presented below (annex 1 - 4). In addition, the data dictionaries for the data sets that you will use in this case study can be found in annexes 5 - 7.
School registry extraction form
Field worker’s checklist
Table A5 - Data dictionary for “School” database
Variable | Type | Encoding | Definition |
---|---|---|---|
school | Numeric | School/Class | |
schid | Numeric | Student’s ID in school class | |
vaccrec | Numeric | 0:No, 1:Yes | Vaccination Booklet received |
qsrec | Numeric | 0:No, 1:Yes | Questionnaire received |
id | Numeric | (school x 100)+schid | |
region | Numeric | ||
urban | Numeric | 0: Rural, 1:Urban | Rural/Urban |
gender | Numeric | 0:Female, 1:Male, 9:Unknown | Gender (from school record) |
dob1 | Date | Date of birth (from school record) | |
country | String | Country of birth | |
minority | Numeric | Minority group | |
minoryn | Numeric | 0:No, 1:Yes, 9:Unknown | Belongs to minority group |
datecol | Date | Date of booklet collection |
Table A6 - Data dictionary for “Vaccine” database
Variable | Type | Encoding | Definition |
---|---|---|---|
id | Numeric | Unique identifier of students | |
weight | Numeric | ||
datecol | Numeric | Date of booklet collection | |
agey | Numeric | Age in years | |
dob | Date | Date of birth (from vaccination booklet) | |
dtp1 | Date | 1st dose of DTP vaccine | |
dtp2 | Date | 2nd dose of DTP vaccine | |
dtp3 | Date | 3rd dose of DTP vaccine | |
dtp4 | Date | 4th dose of DTP vaccine | |
dtp5 | Date | 5th dose of DTP vaccine | |
po1 | Date | 1st dose of Polio vaccine | |
po2 | Date | 2nd dose of Polio vaccine | |
po3 | Date | 3rd dose of Polio vaccine | |
po4 | Date | 4th dose of Polio vaccine | |
po5 | Date | 5th dose of Polio vaccine | |
potyp1 | Numeric | 1:OPV, 2:IPV, 9:Unknown | Polio 1st dose vaccine type |
potyp2 | Numeric | 1:OPV, 2:IPV, 9:Unknown | Polio 2nd dose vaccine type |
potyp3 | Numeric | 1:OPV, 2:IPV, 9:Unknown | Polio 3rd dose vaccine type |
potyp4 | Numeric | 1:OPV, 2:IPV, 9:Unknown | Polio 4th dose vaccine type |
potyp5 | Numeric | 1:OPV, 2:IPV, 9:Unknown | Polio 5th dose vaccine type |
bcg1 | Date | 1st dose of BCG vaccine | |
bcgl | Date | Last dose of BCG vaccine | |
bcgn | Numeric | Total number of BCG vaccine doses | |
msl1 | Date | 1st dose of Measles vaccine | |
msl2 | Date | 2nd dose of Measles vaccine | |
msl3 | Date | 3rd dose of Measles vaccine | |
rub1 | Date | 1st dose of Rubella vaccine | |
rub2 | Date | 2nd dose of Rubella vaccine | |
rub3 | Date | 3rd dose of Rubella vaccine | |
mu1 | Date | 1st dose of Mumps vaccine | |
mu2 | Date | 2nd dose of Mumps vaccine | |
mu3 | Date | 3rd dose of Mumps vaccine | |
mmr1 | Date | 1st dose of MMR vaccine | |
mmr2 | Date | 2nd dose of MMR vaccine | |
mmr3 | Date | 3rd dose of MMR vaccine | |
hbv1 | Date | 1st dose of HBV vaccine | |
hbv2 | Date | 2nd dose of HBV vaccine | |
hbv3 | Date | 3rd dose of HBV vaccine | |
hbv4 | Date | 4th dose of HBV vaccine | |
hav1 | Date | 1st dose of HAV vaccine | |
hav2 | Date | 2nd dose of HAV vaccine | |
hav3 | Date | 3rd dose of HAV vaccine | |
hav4 | Date | 4th dose of HAV vaccine | |
hib1 | Date | 1st dose of HIB vaccine | |
hib2 | Date | 2nd dose of HIB vaccine | |
hib3 | Date | 3rd dose of HIB vaccine | |
hib4 | Date | 4th dose of HIB vaccine | |
mnc1 | Date | 1st dose of MEN C vaccine | |
mnc2 | Date | 2nd dose of MEN C vaccine | |
mnc3 | Date | 3rd dose of MEN C vaccine | |
mnc4 | Date | 4th dose of MEN C vaccine | |
flu1 | Date | 1st dose of Influenza vaccine | |
flul | Date | Last dose of Infuenza vaccine | |
flun | Date | Total number of Influenza vaccine doses | |
var1 | Date | 1st dose of Varicella vaccine | |
var2 | Date | 2nd dose of Varicella vaccine | |
pne1 | Date | 1st dose of Pneumo vaccine | |
pne2 | Date | 2nd dose of Pneumo vaccine | |
pne3 | Date | 3rd dose of Pneumo vaccine | |
pne4 | Date | 4th dose of Pneumo vaccine | |
other | Numeric | 0:No, 1:Yes | Other vaccine YES/NO |
dtp1plc | Numeric | 0:No vaccination, 1: State, 2:Private, 3:Abroad | Place of DTP-1 administration |
dtp4plc | Numeric | 0:No vaccination, 1: State, 2:Private, 3:Abroad | Place of DTP-4 administration |
dtp5plc | Numeric | 0:No vaccination, 1: State, 2:Private, 3:Abroad | Place of DTP-5 administration |
bcg1plc | Numeric | 0:No vaccination, 1: State, 2:Private, 3:Abroad | Place of BCG-1 administration |
mmr1plc | Numeric | 0:No vaccination, 1: State, 2:Private, 3:Abroad | Place of (MMR or MCV)-1 administration |
mmr2plc | Numeric | 0:No vaccination, 1: State, 2:Private, 3:Abroad | Place of (MMR or MCV)-2 administration |
hbv1plc | Numeric | 0:No vaccination, 1: State, 2:Private, 3:Abroad | Place of HBV-1 administration |
hav1plc | Numeric | 0:No vaccination, 1: State, 2:Private, 3:Abroad | Place of HAV-1 administration |
hib1plc | Numeric | 0:No vaccination, 1: State, 2:Private, 3:Abroad | Place of HIB-1 administration |
mnc1plc | Numeric | 0:No vaccination, 1: State, 2:Private, 3:Abroad | Place of MEN C-1 administration |
pne1plc | Numeric | 0:No vaccination, 1: State, 2:Private, 3:Abroad | Place of PNU-1 administration |
var1plc | Numeric | 0:No vaccination, 1: State, 2:Private, 3:Abroad | Place of VAR-1 administration |
flu1plc | Numeric | 0:No vaccination, 1: State, 2:Private, 3:Abroad | Place of FLU-1 administration |
Table A7 - Data dictionary for “Quest” database
Variable | Type | Encoding | Definition |
---|---|---|---|
A1. Beliefs and attitudes towards immunization | |||
a1x1 | Numeric | 1:Agree, 2:Rather agree, 3:Rather disagree, 4:Disagree, 9:Don’t know | Do you believe that vaccination is necessary for your child? |
a1x2 | Numeric | 1:Agree, 2:Rather agree, 3:Rather disagree, 4:Disagree, 9:Don’t know | I’m concerned that vaccines may harm my child |
a1x3 | Numeric | 1:Agree, 2:Rather agree, 3:Rather disagree, 4:Disagree, 9:Don’t know | Children should be immunized with no delay with every new licensed vaccine |
a1x4 | Numeric | 1:Agree, 2:Rather agree, 3:Rather disagree, 4:Disagree, 9:Don’t know | I usually need to crosscheck my physician’s advice regarding the immunization for my child |
a1x5 | Numeric | 1:Agree, 2:Rather agree, 3:Rather disagree, 4:Disagree, 9:Don’t know | Natural disease is preferable to vaccination |
a1x6 | Numeric | 1:Agree, 2:Rather agree, 3:Rather disagree, 4:Disagree, 9:Don’t know | I’m confused with childhood vaccines, because of physicians’ conflicting opinions |
a1x7 | Numeric | 1:Agree, 2:Rather agree, 3:Rather disagree, 4:Disagree, 9:Don’t know | Even now as an adult I’m scared of vaccines and injections, due to bad experiences as a child |
a1x8 | Numeric | 1:Agree, 2:Rather agree, 3:Rather disagree, 4:Disagree, 9:Don’t know | The more vaccines a child gets, the better it is for his health |
a1x9 | Numeric | 1:Agree, 2:Rather agree, 3:Rather disagree, 4:Disagree, 9:Don’t know | The state provides very limited information on new vaccines to reduce funding for new immunizations |
a1x10 | Numeric | 1:Agree, 2:Rather agree, 3:Rather disagree, 4:Disagree, 9:Don’t know | I try to keep my child’s immunization up to date, according to the latest scientific advice |
a1x11 | Numeric | 1:Agree, 2:Rather agree, 3:Rather disagree, 4:Disagree, 9:Don’t know | Vaccines are completely safe and cannot harm child’s health |
a1x12 | Numeric | 1:Agree, 2:Rather agree, 3:Rather disagree, 4:Disagree, 9:Don’t know | Campaigns for new vaccines are done for financial profit |
A3. Which of the following present a problem for the immunization of your child? | |||
a3x1 | Numeric | 1:No, 2:small problem, 3:average problem, 4:big problem | Long distance to the immunization site |
a3x2 | Numeric | 1:No, 2:small problem, 3:average problem, 4:big problem | The OPENING HOURS OF the immunization site |
a3x3 | Numeric | 1:No, 2:small problem, 3:average problem, 4:big problem | Lack of accurate information on vaccines |
a3x4 | Numeric | 1:No, 2:small problem, 3:average problem, 4:big problem | High cost of the vaccines |
a3x5 | Numeric | 1:No, 2:small problem, 3:average problem, 4:big problem | High cost of the immunization visit |
a3x6 | Numeric | 1:No, 2:small problem, 3:average problem, 4:big problem | Inappropriate/unfriendly behaviour of the immunization staff |
a3x7 | Numeric | 1:No, 2:small problem, 3:average problem, 4:big problem | Anything else |
a3a1 | String | text | Please specify |
B1. Medical history and family characteristics of your child | |||
b3st | Numeric | 0:No, 1:Yes | Did your child have state insurance coverage? |
b3pr | Numeric | 0:No, 1:Yes | Did your child have private insurance coverage? |
mage | Numeric | Age of the mother | |
agef | Numeric | Age of the father | |
b5m | Numeric | 1:Kindergarten, 2:Grammar school, 3:High school, 4:College, 5:Higher education, 6:University, 7:other | Mother’s educational level (last academic year completed) |
b5f | Numeric | 1:Kindergarten, 2:Grammar school, 3:High school, 4:College, 5:Higher education, 6:University, 7:other | Father’s educational level (last academic year completed) |
b6 | Numeric | How many persons live in your household (including yourself and your child) | |
b7 | Numeric | 0:No, 1:Yes | Are there other siblings in the household? |
b7ad | Numeric | If YES, how many brother/sisters? |
Although the researchers did have clusters in their survey design (school classes), for simplicity you have just been given the sampling frames (population and number of pupils selected) for the 11 strata (urban and rural areas in each of the 6 regions). Therefore, you can undertake this task as if there were only strata and no clusters in the design.↩︎