Participants are expected to be familiar with data management and basic analysis in R and RStudio. It is advisable to have up-to-date versions of both installed. The code in this guide was updated and tested using:
This guide uses packages from the tidyverse, which are increasingly used for data analysis due to their efficient, easy-to-read syntax. The approach for setting up an RStudio project, importing and cleaning data, managing code, and creating summary tables and graphs follows that presented in the Epidemiologist R Handbook (click on the link to access it).
The packages required for this case study can be installed by running the code below. It uses the p_load() function from the pacman package, which checks the user's library for the required packages, installs any that are missing, and then loads them into the current R session. This code should therefore be rerun at the beginning of each session to ensure that all the required libraries are loaded.
# Ensures the package "pacman" is installed
if (!require("pacman")) install.packages("pacman", quiet = TRUE)
# Install and/or load required packages
pacman::p_load(
leaflet, # making interactive maps
mapedit, # interactively drawing points and polygons on maps
sf, # working with spatial data
here, # file paths relative to R project root folder
dplyr, # data management
rio # import data
)
Welcome to the step-by-step tutorial for estimating population size using the T-square and Roof-Simple-Random sampling methods.
This tutorial was originally developed by Epicentre and benefits from
work carried out by Epicentre to estimate the population size of Kario
refugee camp (Darfur, Sudan) in 2019.
Alex Spina and Alex Blake (Applied Epi) adapted the case study to be fully usable in R as a stand-alone case study, and added the T-square method. All original licenses for the Epicentre case study apply here too.
Kario refugee camp was created in 2017. In the following two years there was high population movement in and out of the camp, and in 2019 the available population figures were considered unreliable. Apart from an exhaustive census, which requires resources and time, rapid methods, such as the T-square and Quadrat methods, have been developed and used to estimate population size.
More recently, on several occasions, we have used a method that combines satellite image analysis and a field survey. The principle of this method is:
To present the precision and confidence interval of the estimate, we assumed that the number of people per roof follows a Poisson distribution. However, it is likely that many roofs identified during the scanning of the satellite image do not cover dwellings (e.g. animal shelters, shops) where nobody slept the night before the survey.
MSF teams in the field in Kario were able to mark waypoints on the perimeter of the camp, and MSF headquarters obtained access to a recent satellite image in which the refugee camp was clearly visible. These two sources of information allow us to apply the Roof-Simple-Random sampling method to estimate the population size of Kario refugee camp.
At the end of this tutorial, participants will be able to:
Make sure you are set with:
R and RStudio installed - as well as the necessary packages
The following files downloaded to your computer:
A smartphone with OsmAnd application installed or a GPS device (e.g. Garmin eTREX)
The tutorial is expected to last 1 hour (50 min scheduled time + 10 min buffer), with the main steps as follows:
The objective of the first step is to locate and geographically delimit the area of the refugee camp by drawing (or uploading) a polygon.
We use the leaflet package for interactive mapping.
Below we centre our map on the longitude and latitude of Kario refugee camp. You can find these approximate coordinates by going to Open Street Map, searching for Kario camp, navigating to the location where buildings are shown, right-clicking, and selecting show address. You can move around and zoom in to the leaflet map below with your mouse.
kario_camp <- leaflet() %>%
# start an interactive plot (of the whole world)
addTiles() %>%
# center the leaflet map on the middle of our area of interest
setView(
lng = 26.208,
lat = 11.1496,
zoom = 15) %>%
# add in background images of the area from open street map
# we use the humanitarian version for this setting as there is more info
addProviderTiles("OpenStreetMap.HOT")
You can view and interact with your map simply by calling the object you saved to, as below.
# view your map interactively in browser
kario_camp
We use the mapedit package to draw a polygon interactively. Using the leaflet map created just above, mapedit lets you “draw” your polygon over the area of interest. Click on the hexagon icon to draw a polygon. Zoom in on an area and draw a polygon around it by selecting points to connect into a shape (note that once you start selecting points you can still move the screen around by holding both the left and right mouse buttons simultaneously). When you are done, click on “Finish”, then click on “Done” on the bottom right. Have a quick look at the mapedit website if you need additional help.
# interactively draw a shape on your map
# your polygon is saved in the object "polygon" (which is a list)
# more specifically in the slot "finished" of the object polygon.
polygon <- editMap(kario_camp)
# access your polygon
polygon <- polygon$finished
This is an alternative step that replaces polygon drawing in the previous task
Instead of drawing the polygon you can read in a polygon to R from an existing shapefile.
We read in shapefiles with the sf package, which stands for simple features and makes interacting with spatial data much simpler! We navigate (as in previous case studies) to our file using the here package, and specify the .shp file. While a shapefile consists of many more files than just the .shp, if we tell R where that file is, it knows to pull in the others.
# read in your shapefile
polygon <- read_sf(here::here("data",
"SDN_Kario_17Feb2019_polygon",
"SDN_Kario_17Feb2019_polygon.shp"))
You can then add the polygon to your existing map and view it.
kario_camp <- kario_camp %>%
# plot your polygon on top of the existing tiles
addPolygons(
# specify your sf object
data = polygon,
# color the border red
color = "red",
# make the middle see-through
fillOpacity = 0)
# view your map
kario_camp
The objective of this step is to mark the geographical coordinates of all roofs by scanning the satellite image delimited by the polygon. This step is only necessary for the Roof-Simple-Random method. Also note that with R you can download polygons of buildings directly from Open Street Map using the osmdata package. We have not done this in the case study, but a chapter with a walk-through is being added to the epiRhandbook.
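As a rough sketch of what such a download could look like with osmdata (the bounding box below is an assumption chosen to roughly cover Kario camp, and an internet connection is required, so this is not run in the case study):

```r
## not run in this case study: requires internet access
## the bounding box coordinates are illustrative (xmin, ymin, xmax, ymax)
library(osmdata)
library(dplyr)

buildings <- opq(bbox = c(26.19, 11.14, 26.22, 11.16)) %>%
  ## request all features tagged as buildings
  add_osm_feature(key = "building") %>%
  ## return the result as sf objects
  osmdata_sf()

## building footprints are returned as an sf polygon layer
buildings$osm_polygons
```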
Here again we can use the mapedit package, but this time we will draw points instead of polygons. As above, we use the leaflet map created earlier, and with mapedit you can “draw” your points over each of the rooftops in the area of interest. Click on the point icon (remember that once you start selecting points you can still move the screen around by holding both the left and right mouse buttons simultaneously). When you are done, click on “Finish”, then click on “Done” on the bottom right. Have a quick look at the mapedit website if you need additional help.
# interactively draw a shape on your map
# your points are saved in the object "households" (which is a list)
# more specifically in the slot "finished" of the object households
households <- editMap(kario_camp)
## access your points
households <- households$finished
This is an alternative step that replaces marking roofs in the previous task
The full roof identification of the refugee camp was in fact carried out semi-automatically, manually validated, and saved in a shapefile. As with the polygon above, we can use the sf package to read the points into R. An additional complication here is that the points are not stored as longitude and latitude, but as x and y coordinates. For this we use the st_transform() function to re-project the data into the appropriate format. We have to specify a coordinate reference system, and WGS84 is the standard one used by most online mapping services (e.g. Open Street Map).
# read in your shapefile
households <- read_sf(here::here("data",
"SDN_Kario_17Feb2019_roofs",
"SDN_Kario_17Feb2019_roofs.shp")) %>%
# re-project x and y coordinates to be longitude and latitude.
# note that crs 4326 is a code for WGS84
st_transform(crs = 4326)
You can then add the points to your existing map and view them.
kario_camp %>%
# add circles for each point
# (zoom in to see how they fit with rooftops)
addCircleMarkers(data = households)
The objective of this step is to obtain a random selection of points (for T-Square), or roofs (for Roof-Simple-Random) that will be visited during the field survey. Sample geographical coordinates can be saved and exported in different file formats.
We can use the st_sample() function from sf to draw a list of random points from within our camp polygon.
The precision of the estimates depends on the sample size: the higher the number of sampled units, the better the precision. For the T-square method it is generally recommended to take at least 50 points. For the sake of this exercise, we ask you to sample 50 points. In the real Kario survey, 400 roofs were sampled for the Roof-Simple-Random method.
# select 50 random points from your polygon (i.e. not necessarily households)
random_points <- st_sample(polygon, 50) %>%
# change back to a simple features dataframe (sampling made it a "collection")
st_as_sf()
# Visualise the selected points
kario_camp %>%
# add circles for each point
# (zoom in to see how they fit with rooftops)
addCircleMarkers(data = random_points, color = "green")
We can also use the st_sample() function to draw a list of random points from our household points. Note that this could also be done for a polygon, and in some cases might be more appropriate. For example, if you had a multi-polygon consisting of a box around each house (e.g. from osmdata), then sampling points from that would give larger houses a higher chance of being selected, whereas sampling from points does not account for this. While it is not the case in this tutorial, in other situations you might have more than one polygon and need to sample roofs separately for each polygon.
Notice here when you visualise that the points are much closer to where the population is actually situated in the camp (because you are sampling households rather than any space within the camp boundary) - this likely makes conducting your survey logistically simpler. In addition, you only need to survey half as many houses compared to T-Square (because you don’t need to find the next closest house).
# select 50 random points from your households
random_roofs <- st_sample(households, 50) %>%
# transform back to points
# (this comes out as a multipoint object, which is too complex)
st_cast("POINT") %>%
# set the coordinate reference system for the points
# note that crs 4326 is a code for WGS84
st_set_crs(4326) %>%
# change back to a simple features dataframe (multipoint made it a "collection")
st_as_sf()
# Visualise the selected points
kario_camp %>%
# add circles for each point
# (zoom in to see how they fit with rooftops)
addCircleMarkers(data = random_roofs, color = "orange")
The objective of this step is to get your sampled points from R on to a mobile phone or GPS device to be able to do your survey.
We can use sf functions to save the sampled points in a variety of different formats, which can then be used on mobile phones with OsmAnd or on a GPS device such as a Garmin eTrex. We have only exported the random points (rather than the random roofs), but the same code can be used for either. To save our points we use the st_write() function from the sf package.
# export sampled points to gpx
random_points %>%
# gpx has a specific format which it accepts
# to meet this format we have to give a name to label each of the points
# note that this needs to be a lower case "name"
# we consecutively number the points by combining "point" with the number of the row
mutate(name = paste0("point-", 1:n())) %>%
st_write(
# define the file path and extension (gpx)
dsn = here::here("data", "random_points.gpx"),
# define which format to use (this could be guessed by the function)
driver = "GPX",
# overwrite any previously existing files.
delete_dsn = TRUE)
We could similarly use st_write() to save to a .shp or a .kml (used in Google Earth).
# export sampled points to shp
random_points %>%
# it can be useful to give a name to label each of the points
# we consecutively number the points by combining "point" with the number of the row
mutate(Name = paste0("point-", 1:n())) %>%
st_write(
# define the file path and extension (shp)
dsn = here::here("data", "random_points.shp"),
# overwrite any previously existing files.
delete_dsn = TRUE)
# export sampled points to kml
random_points %>%
# kml has a specific format which it accepts
# to meet this format we have to give a name to label each of the points
# note that for kml this needs to be a capitalised "Name"
# we consecutively number the points combining "point" with the number of the row
mutate(Name = paste0("point-", 1:n())) %>%
st_write(
# define the file path and extension (kml)
dsn = here::here("data", "random_points.kml"),
# define which format to use (this could be guessed by the function)
driver = "KML",
# overwrite any previously existing files.
delete_dsn = TRUE)
You can transfer your gpx file to OsmAnd by connecting your phone with a USB cable and then copying the file to the following folder (on an Android phone). This is probably the best option, particularly in settings with low internet connectivity (where you would be doing this type of survey).
PC>Mobile>Internal shared storage>Android>data>net.osmand>files>tracks
When in OsmAnd, navigate to My Places and then click on Tracks.
Send the file to yourself via an email that you can access on your mobile. Download the file. Go to the downloads folder of the mobile phone and open the file with the OsmAnd app.
Connect the GPS device to the computer using a USB cable. The GPS device should appear as a computer drive. Using Windows Explorer, go to the root of the computer (beside C: or D:) and browse the Garmin eTrex device to the …/GPX/… folder. Copy and paste the saved gpx file from your computer's hard drive to the GPX folder of the Garmin eTrex device.
The objective of this step is to calculate your population estimates using T-square and then Roof-Simple-Random.
We are now going to pretend that the field survey has been completed. The closest households (GPS points) and the number of people (<5 years and >= 5 years) who slept in each household the night before were recorded for the 50 sampled points.
You could have physically measured the distance between households, but it is easier to just take a GPS point for each household visited and then use R to calculate the distance.
This would normally be recorded in a data collection app (such as Kobo) or entered into some other form or database. For simplicity, we have already entered this data into an excel sheet for you to import (survey_data_t_squares.xlsx). We import (as in previous case studies) using the rio package's import function.
## import your survey data
survey_data <- rio::import(here::here("data", "survey_data_t_squares.xlsx"))
The formula for estimating population using T-Square is below. We will be walking through each of the parts needed for calculation step-by-step.
\[ N = n \times \frac {S}{\gamma} \]
Where:
N = The population estimate
n = Average number of people per household
S = Area of the region of interest
\(\gamma\) = Average area per household
Note that S over \(\gamma\) gives us the total number of households in the camp.
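As a quick sanity check of the formula with made-up numbers (purely illustrative, not the survey values): if households average 4 people, the camp covers 1,000,000 m2, and the average area per household is 2,500 m2, then:

```r
## illustrative numbers only - not the survey values
n     <- 4        # average number of people per household
S     <- 1000000  # area of the region of interest (m2)
gamma <- 2500     # average area per household (m2)

## S / gamma is the total number of households (here 400)
N <- n * (S / gamma)
N
## [1] 1600
```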
The easiest step is to calculate the average number of people per household (n). To do this we use the counts of people for all the households in our survey. We will do this once for the overall population and once for the under five year olds.
## combine the two overall household counts
overall_avg <- c(survey_data$num_ppl,
survey_data$num_ppl_h2) %>%
## return the mean
mean()
## combine the two under five household counts
under5_avg <- c(survey_data$num_ppl_under_five,
survey_data$num_ppl_under_five_h2) %>%
## calculate the mean
mean()
## view your numbers
overall_avg
## [1] 3.82
under5_avg
## [1] 2.86
Next, we need to calculate the area of the region of interest (S), in our case the area of Kario camp (in m2). We can do this using the st_area() function from sf.
## calculate the area of within the polygon and return as a numeric
## otherwise it is returned as a measurement (with m2 on the end)
camp_area <- st_area(polygon) %>%
as.numeric()
## view your numbers
camp_area
## [1] 1301761
Next, we need to work out the average area per household (\(\gamma\)) (in m2).
To do this we first need to calculate the area for each household
individually, and then work out the average. The formulas for this are
below.
\[
Area = \frac {\pi \times (\frac{d1^2 + d2^2}{2})} {2}
\] Where:
d1 = Distance between random sample point and nearest household
(in m)
d2 = Distance between first household and the next nearest
household (in m)
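For intuition, the area attributed to a single sampled household can be computed directly from its two distances (the distances below are assumed example values, not survey data):

```r
## assumed example distances for one sampled point (in m)
d1 <- 10  # random point to nearest household
d2 <- 15  # nearest household to next nearest household

## area attributed to this household (in m2)
area <- (pi * ((d1^2 + d2^2) / 2)) / 2
round(area, 1)
## [1] 255.3
```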
We can then use the areas calculated above to plug in to the formula for the average area per household (\(\gamma\)) below.
\[
\gamma = \frac {\pi \times \sum_{1}^{i} (\frac{d1^2 + d2^2}{2})} {2
\times i}
\] where:
i = The number of points (in our case 50)
Our first step will be to calculate distances between our points. We need to turn our GPS points into longitude and latitude; Kobo simply stores these as numbers separated by a comma (which is interpreted as text). As a side note: there would be ways to do this within your survey_data data frame, but because sf does not allow multiple geometry columns in one object, it becomes complex. So instead we just define three separate point objects.
## extract the first randomly sampled points
first_point <- survey_data %>%
select(gps_point) %>%
## split into a latitude and longitude column
tidyr::separate(gps_point, into = c("lat", "lon"), sep = ", ") %>%
## change columns to numeric
mutate(across(c(lat, lon), as.numeric)) %>%
## set as sf points and define the crs
## note that coords takes the x (longitude) column first, then y (latitude)
st_as_sf(coords = c("lon", "lat"), crs = 4326)
## extract the nearest household points
second_point <- survey_data %>%
select(nearest_house_gps_point) %>%
tidyr::separate(nearest_house_gps_point, into = c("lat", "lon"), sep = ", ") %>%
mutate(across(c(lat, lon), as.numeric)) %>%
st_as_sf(coords = c("lon", "lat"), crs = 4326)
## extract the second nearest household points
third_point <- survey_data %>%
select(second_house_gps_point) %>%
tidyr::separate(second_house_gps_point, into = c("lat", "lon"), sep = ", ") %>%
mutate(across(c(lat, lon), as.numeric)) %>%
st_as_sf(coords = c("lon", "lat"), crs = 4326)
Then we need to calculate the distances between our household GPS points; we can do this using the st_distance() function from sf.
## add distance columns to your dataset
survey_data <- survey_data %>%
mutate(
## calculate the distance between points
first_distance = st_distance(first_point, second_point, by_element = TRUE),
second_distance = st_distance(second_point, third_point, by_element = TRUE)
)
Next we can use these distances to calculate the average area per household. We can skip having to work out the area per household individually and just use a summarise() from dplyr with R's built-in arithmetic to calculate \(\gamma\) (note that \(\pi\) is already defined in R as pi).
avg_area <- survey_data %>%
summarise(
## use the formula to calculate average area for 50 points
avg_area = (pi * sum( (first_distance^2 + second_distance^2) / 2) ) / (2 * 50)
) %>%
## extract the result
pull() %>%
## as a number
as.numeric()
## view your numbers
avg_area
## [1] 2926.169
In our final step we can plug all of the parts calculated above into our formula for T-square and estimate the population. We do this once for the overall population, and once for the under fives. We also demonstrate how to calculate a basic confidence interval using a Poisson distribution. As stated above, a zero-inflated distribution would be more appropriate; however, there is no simple way of doing this in R as yet, so for the sake of this case study we have demonstrated the Poisson approach below. We define our own function to pull the lower and upper confidence intervals together into one column (this might seem scary at first, but it is really just the same as passing arguments to any other code! It also saves you repeating yourself).
## use the T-square formula to calculate an overall population estimate
N_overall <- overall_avg * (camp_area / avg_area)
## use the T-square formula to calculate an under 5 population estimate
N_under5 <- under5_avg * (camp_area / avg_area)
## view your numbers
N_overall
## [1] 1699.399
N_under5
## [1] 1272.325
## function to pull together confidence intervals
## this function takes an argument `x` and substitutes it below wherever x appears
## x in this case is your population estimate (number) that you have calculated
intervals <- function(x) {
## take your population estimate (x) and run a poisson test
## this returns a list and using $ we can extract only the CIs
estimates <- poisson.test(as.integer(x))$conf.int
## round to 1 decimal place
estimates <- round(estimates, digits = 1)
## combine the lower and upper interval in to one character string
## return (or print from the function)
## this way you could use the function within a dplyr::mutate to add new cols
paste0(estimates[1], " - ", estimates[2])
}
## get the poisson confidence intervals for overall population estimate
intervals(N_overall)
## [1] "1619.2 - 1781.8"
## get the poisson confidence intervals for the under five population estimate
intervals(N_under5)
## [1] "1203 - 1343.9"
We are now going to pretend that the field survey has been completed. The number of people (<5 years and >= 5 years) who slept in each household the night before was recorded for the 50 sampled points. This would normally be recorded in a data collection app (such as Kobo) or entered into some other form or database. For simplicity, we have already entered this data into an excel sheet for you to import (survey_data_roof_random.xlsx). We import (as in previous case studies) using the rio package's import function.
## import your survey data
survey_data <- rio::import(here::here("data", "survey_data_roof_random.xlsx"))
Calculating a population estimate with the Roof-Simple-Random method is, by comparison to T-square, much simpler. It only requires multiplying the average number of people per household by the total number of households in the region of interest.
\[ N = n \times H \]
Where:
N = The population estimate
n = Average number of people per household
H = Total number of households in the region of interest
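As a quick illustration with made-up numbers (not the survey values): with an average of 4 people per household and 3,000 households, the estimate would simply be:

```r
## illustrative numbers only - not the survey values
n <- 4     # average number of people per household
H <- 3000  # total number of households in the region of interest

N <- n * H
N
## [1] 12000
```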
We have all the information we need to make this straightforward calculation. We can use the summarise() function from dplyr to calculate the mean for each of the age groups (using an across() as previously). We then get the total number of households from our shapefile; we only need to count the number of rows (remember each point is a row, and so every row is a household) using the base function nrow(). Then we just multiply, and apply the same Poisson distribution function to calculate confidence intervals as above.
## calculate the average number per house of children under and over five years old
averages <- survey_data %>%
mutate(overall = under_five + five_plus) %>%
summarise(across(c(under_five, five_plus, overall), mean))
## pull the number of households from the shapefile
num_hh <- nrow(households)
## multiply the averages by the number of houses
population <- averages * num_hh
## function to pull together confidence intervals
intervals <- function(x) {
estimates <- poisson.test(as.integer(x))$conf.int
estimates <- round(estimates, digits = 1)
paste0(estimates[1], " - ", estimates[2])
}
## get the poisson confidence intervals for each of the population estimates
conf_ints <- population %>%
mutate(across(everything(), intervals))
## view your numbers
population
## under_five five_plus overall
## 1 2719.85 9161.6 11881.45
conf_ints
## under_five five_plus overall
## 1 2617.8 - 2823.2 8974.4 - 9350.5 11668.3 - 12096.6
For the next exercises you might be asked to go outside and record the boundary of an area using your mobile phone; OsmAnd then creates a .gpx file. You can read in these files with the following code.
## read in a track from osmand
polygon <- st_read(
dsn = here::here("data", "track.gpx")
)