Chapter 3 Reading Data into R
This Chapter May Be Skipped
This chapter has details on where the data came from and how I gathered them. If you are not interested in these details, it is fine to skip this chapter.
We are fortunate to be in a time where the technical infrastructure facilitates sharing data in a way that makes publicly available data useful. To newcomers it will seem like a challenging task to find the data, grab the data file and put it in a format that can be used and merged with other data to promote understanding and decision making. Also, working with such complex data at multiple levels, including country, state/province, county, as well as multiple variables, including number of positive tests, number of confirmed covid-19 deaths, and the number of covid-19-related hospitalizations, can be challenging. Then there is the issue of examining explanatory and outcome variables. What information is helpful in accounting for the variability we see across states or across counties? Is it access to health care? poverty rate? compliance to social distancing guidelines? There is also the necessary base rate information that help us understand the counts, such as populations sizes, typical number of ICU visits. This chapter will provide examples for how one goes about accessing such data.
This chapter will cover four different ways to read data into R:
executing a pull from a git repository,
scraping data from the web,
pulling data with an api from the national census, and
reading an excel spreadsheet.
The chapter will also examine some aspects of quality control checks and practices for maintaining consistency across various data sets to permit merging of multiple sources of data. For example, we need to ensure columns in the data file are labeled appropriately. We need to ensure that measurement units and missing data codes are consistent across data sets.
3.1 Pulling data from a git repository
I’ll use the data base that Johns Hopkins has been maintaining on github. It is used by news outlets such as CNN. I already cloned the git repository on my local computer so each day I merely have to issue a “git pull” command to get all new files that have been edited. This is a better process than conducting a completely new download every day of the entire repository. You can see the Johns Hopkins github site for a complete listing of sources of their data.
If you want to learn more about working with git repositories, you can do a Google search for how to clone a repository. Rstudio also has features to help clone and manage a git repository. For basic git information see Appendix A.
The script below will switch to the folder I’m keeping the git repository, execute the command “git pull” (which checks if there have been any updates on the server and, if so, downlooads those updates) and then bring me back to the original folder. To keep things organized I’m keeping the git repository in a separate folder from the R files that are creating these pages. Another way of accomplishing the same thing is to pull the git repository manually each time you want to run the most updated data set. Basically, this next line of code downloads the most recent data for me so I don’t have to do it manually.
I prefer not to include data files in the git repository as a matter of habit (e.g., data may contain private information yet the repository may be public).
We read in the two files with world and US cumulative data.
The column labels of the object datacov.World, for example, are ordered by date starting in column 5. The column names are special string variables because they start with numbers and R doesn’t like variable names to start with numbers. For example,
`1/22/20` (backticks), and you’ll see in later syntax where I need to refer to these columns using the backwards apostrophes. The other columns contain province/state and county/region labels as well as latitude and longitude of each geographic unit.
The column names though sometimes are converted by R to regular strings as in the names() command. The database seems to be updated late in the evening so if these commands are run late in the evening you may see today’s date as the most recent date.
The behavior of the variable names I am showing here follows the default of R on a Mac. It seems that R on a PC follows a different convention: rather than using backticks R adds an X to the beginning of the column name that starts with a nonstandard character a number. To make the PC behave like a Mac on this issue, just add check.names=FALSE in the call to read.csv when the data are read. This way my notes will run on both PC and Mac/linux. I bring this up because when writing code and conducting data analysis you should be mindful of reproducibility across multiple computer platforms. If you claim your code is reproducible, then you should check that it is and provide the appropriate modifications to your code so that it can run on multiple platforms.
##  "UID" "iso2" "iso3" ##  "code3" "FIPS" "Admin2" ##  "Province_State" "Country_Region" "Lat" ##  "Long_" "Combined_Key" "1/22/20" ##  "1/23/20" "1/24/20"
##  "6/19/20" "6/20/20" "6/21/20" "6/22/20" "6/23/20" ##  "6/24/20"
Now I have two datacov files, World and US, and I can use each as needed in subsequent code. The main thing left to “fix” is that the new datacov.US file is broken down by county and my code is currently written for state. Later, I’ll sum over county to get state counts, but will keep the county level information for possible use in later examples using more granular county-level information.
3.1.1 Issues around dealing with downloaded data
While it is great to have an automatic routine that downloads data on a regular basis, there is a potential risk. The data file may change as we saw in this repository on 3/23/20 (for readers following the R code along). There could be more subtle issues such as an error in reading in data that produces a column shift, or the data for one day was entered slightly late after midnight so it shows up as being the next day’s data. Just because one has automatic code, one still should double check results for any weird aspects. An example of something that happened in these data is that data for 3/22/20 was missing when the US data was released on 3/31/20 as was the data for 3/31/20 even though 3/31/20 data were included in the World data file. This was eventually fixed but I could tell something was off because there was missing US data for 3/22/20 and 3/31/20 but that wasn’t the case for the World data.
3.1.2 Issues around dealing with positive tests and death counts
A few words about the Johns Hopkins data set. All we have in the files I downloaded are the counts of confirmed cases. This data set does not include other important information like the number of tests conducted or the population size for each state or country. It makes it difficult to evaluate the total number of confirmed cases without knowing how many tests that unit conducted or its population size. When evaluating changes in counts of confirmed cases, for example, we don’t know if the total counts are increasing because more people are getting sick or if the unit is also increasing the number of tests they conducted. Further, it becomes difficult to compare counts of confirmed covid-19 cases across units without knowing whether the two units have different testing rates, different sample sizes, different population densities, etc. Basically, counts of confirmed cases are quite limited in the information they provide. Below I’ll scrape testing data by US state as well as population totals by US state, and we’ll use these numbers in subsequent analyses. Death counts also have problems and I’ll add mortality rates in a later version of these notes. As highlighted in the Introduction chapter with the London cholera epidemic, counts can still be useful.
3.2 Scraping data from the web example
I’ll illustrate how to scrape data from a web page by pulling a table from this wiki page.
This is a method using commands in the rvest library to read an html page and extract a table, followed by some string manipulations using the package stringr.
There is usually additional manipulation needed to format the table. In this case I had to rename columns, remove characters from the country names dealing with footnotes that appeared in the original page, and other little issues documented in the R code chunks.
If the table on the wiki changes, it could break this code. This already changed in that originally the table I wanted was the 4th table, then it was the 5th table in the html file, then the 6th table, and back to the 5th. I ended up pointing to the table itself rather than the URL of the wiki page.
## location cases deaths recov ## 2 United States 2,465,127 126,218 763,788 ## 3 Brazil 1,233,147 55,054 649,908 ## 4 Russia 613,994 8,605 375,164 ## 5 India 473,105 14,894 271,696 ## 6 United Kingdom 307,980 43,230 No data ## 7 Peru 268,602 8,761 156,074 ## 8 Chile 259,064 4,903 No data ## 9 Spain 247,486 28,330 150,376 ## 10 Italy 239,706 34,678 186,725 ## 11 Iran 215,096 10,130 175,103
This table is ready to use in subsequent analyses. It differs from the datacov Johns Hopkins data.frame I downloaded in the previous section because in addition to counts of confirmed cases it has number of deaths as well as number recovered. It is also not as current as the data on the Johns Hopkins site.
3.3 Using APIs
For later analyses I’ll need to know current population sizes. Here I’ll just focus on US states. The files I downloaded are covid-19 counts by state and country. They do not have population size information, which is important information to evaluate the total number of covid-19 cases in a unit.
This code makes use of the library tidycensus (and also functions in tidyverse). To access the national census data base one needs to register with the census site and receive a unique key. My unique key is not printed here to keep it private. If you want to run this code yourself, you would need to register with the national census and then enter the key you receive as an argument to the census_api_key() command below. See the documentation of the tidycensus package for more information; here is the registration page to get your own key registration
These are the first 10 rows of the state.population data.frame so you can see the result of these R commands. The last column prettyval is a label I created that has the population in millions followed by the two letter state abbreviation. For example, the population of California is 39.6 million and the abbreviation is CA so the corresponding prettyval is “39.6 CA”. This will help labeling plots where we want to identify curves not just by state but also by their respective population sizes.
## # A tibble: 10 x 5 ## NAME GEOID variable value prettyval ## <chr> <chr> <chr> <dbl> <fct> ## 1 Alabama 01 POP 4887871 4.9 AL ## 2 Alaska 02 POP 737438 0.7 AK ## 3 Arizona 04 POP 7171646 7.2 AZ ## 4 Arkansas 05 POP 3013825 3.0 AR ## 5 California 06 POP 39557045 39.6 CA ## 6 Colorado 08 POP 5695564 5.7 CO ## 7 Connecticut 09 POP 3572665 3.6 CT ## 8 Delaware 10 POP 967171 1.0 DE ## 9 Florida 12 POP 21299325 21.3 FL ## 10 Georgia 13 POP 10519475 10.5 GA
3.4 Number tested in US
Another important piece of information we need are the number of tests that have been conducted. Is the total count in a region high or low because the infection rate is high or because more or fewer tests (respectively) have been conducted?
As of 3/21/20 this information has been spotty. We can go to the CDC website to collect this information. The website lists testing data for both CDC and Public Health Labs. After scraping the data off the CDC website I had to do some cleaning, parsing data into numbers as some cells had extra characters that are the footnote symbols appearing on the website, and had to change the format of the date column to be compatible with the date format I’m using elsewhere in these notes. But the table up until 5/11/20 only reported about a tenth of the total test conducted in the US. On 5/12/20 the CDC website changed to report all viral tests and the number of positive test cases.
## Date Collected CDC Labs US Public Health Labs ## 110 6-May 59 21550 ## 111 7-May 102 19573 ## 112 8-May 50 14659 ## 113 9-May 1 3713 ## 114 10-May 10 78 ## 115 11-May 0 1
## CDC Labs US Public Health Labs ## 6275 818682
|TOTAL TESTS REPORTED||31,281,178|
|POSITIVE TESTS REPORTED||2,953,997|
|% OF POSITIVE TESTS||9%|
There is another site I found that reports positive and negative results by state. This is the COVID tracking project. I don’t know how reputable this source is but I’ll keep an eye on it. It reports daily records by state on number of positive, negative, pending, hospitalized, death, and total number of tests. This page points out specifics about each state (e.g., Maryland stopped reporting negative cases as of 3/12, MA numbers include repeated testing on the same individual, not all states include data from commercial labs) showing that it is challenging to interpret such data at the aggregate level when there is so much heterogeneity across units. See, for example, state-level details. As mentioned elsewhere in this document, positive test results need to be interpreted cautiously for several reasons including taking into account the specificity and sensitivity of the tests as well as selection bias in who receives the test. For example, the incidence rate will be biased upward if testing only occurs for people who exhibit symptoms. Death rates have analogous issues, such as some units only report hospital deaths.
It is possible to develop complex models that take into account these various factors and arrive at corrected estimates, but these additional modeling efforts require additional assumptions for which there aren’t always solid data to backing up those assumptions. This is one of many reasons why different modeling groups can arrive at such diverse estimates and forecasts. There are methods that aggregate across multiple models, creating an ensemble of models, which have been used successfully in climate science to predict hurricanes. An effort to use ensemble methods in the case of covid-19 data is underway. I see the value in aggregating over multiple models to make better predictions but a limitation of such ensemble methods is that they do not always inform about the underlying mechanisms. Prediction is one metric used in science, but there are other metrics such as the ability for a model to aid in understanding the underlying mechanisms that are also important. We sometimes want the best prediction of where the eye of a hurricane will make landfall even if we don’t necessarily understand why.
## # A tibble: 6 x 39 ## date state positive negative pending ## <dbl> <chr> <dbl> <dbl> <dbl> ## 1 2.02e7 AK 816 98636 NA ## 2 2.02e7 AL 33206 336252 NA ## 3 2.02e7 AR 18062 259318 NA ## 4 2.02e7 AS 0 696 NA ## 5 2.02e7 AZ 63030 401166 NA ## 6 2.02e7 CA 195571 3498774 NA ## # … with 34 more variables: ## # hospitalizedCurrently <dbl>, ## # hospitalizedCumulative <dbl>, ## # inIcuCurrently <dbl>, inIcuCumulative <dbl>, ## # onVentilatorCurrently <dbl>, ## # onVentilatorCumulative <dbl>, recovered <dbl>, ## # dataQualityGrade <chr>, lastUpdateEt <chr>, ## # dateModified <dttm>, checkTimeEt <chr>, ## # death <dbl>, hospitalized <dbl>, ## # dateChecked <dttm>, totalTestsViral <dbl>, ## # positiveTestsViral <dbl>, ## # negativeTestsViral <dbl>, ## # positiveCasesViral <dbl>, fips <chr>, ## # positiveIncrease <dbl>, negativeIncrease <dbl>, ## # total <dbl>, totalTestResults <dbl>, ## # totalTestResultsIncrease <dbl>, posNeg <dbl>, ## # deathIncrease <dbl>, hospitalizedIncrease <dbl>, ## # hash <chr>, commercialScore <dbl>, ## # negativeRegularScore <dbl>, negativeScore <dbl>, ## # positiveScore <dbl>, score <dbl>, grade <lgl>
3.5 State-level information
Ken Kollman, a political science colleague, pointed me to some relevant sources.
It will be helpful to download relevant data by state or county so we can understand trends we find in the Covid-19 data. I need hypotheses to guide my variable search. There are many variables I could collect on each unit such as population density, rural vs. urban, nature of public transportation offered, etc. I can also gather data around the timing of key covid-related policies such as dates each state issued shelter-in-place rulings, or dates local and state governments issued restrictions on the number of people who could gather (seems some states initially issued numbers of 250, then 100, then 25, then 10, then 2, then finally shelter-in-place). I’m not a political scientist nor a sociologist nor a public health expert nor a policy expert so I don’t have a solid scientific bases on which to generate reasonable hypotheses to test, which would guide my thinking on which data I should get. I was trained to think this way: organize your thinking and then seek the relevant data to test that thinking. The modern approach in data science though turns that upside down: don’t worry about hypotheses, just gather as much information as you can conceive of gathering, clean and organize that information, then run it through special algorithms that simplify the complexity. I’ll include some examples of machine learning approaches to give you a flavor of what this approach has to offer.
3.5.1 School Closure Data
I was able to find site that curates a data base of school closures related to covid-19, which we can use the methods presented in this chapter to gather.
## # A tibble: 6 x 10 ## State `State Abbrevia… `State Status` ## <chr> <chr> <chr> ## 1 Alab… AL Closed ## 2 Alas… AK Closed ## 3 Amer… AS Closed ## 4 Ariz… AZ Closed ## 5 Arka… AR Closed ## 6 Bure… BIE Closures Dete… ## # … with 7 more variables: `Reopening Status` <chr>, ## # `Schools Impacted` <chr>, `State Closure Start ## # Date` <chr>, `State Number of Public ## # Schools` <chr>, `State Public School ## # Enrollment` <chr>, `State Number of Private ## # Schools` <chr>, `State Private School ## # Enrollment` <chr>
3.5.2 State-level Policy Data
A few research groups have started studying the public health implications of various state-level measures around social distancing. One such team is at the University of Washington and their git repository for the most up-to-date information.
## # A tibble: 6 x 16 ## location_id StateFIPS StatePostal StateName ## <dbl> <dbl> <chr> <chr> ## 1 523 1 AL Alabama ## 2 523 1 AL Alabama ## 3 523 1 AL Alabama ## 4 523 1 AL Alabama ## 5 523 1 AL Alabama ## 6 523 1 AL Alabama ## # … with 12 more variables: StatePolicy <chr>, ## # Mandate <dbl>, StateWide <dbl>, DateIssued <dbl>, ## # DateEnacted <dbl>, DateExpiry <dbl>, ## # DateEased <dbl>, DateEnded <dbl>, ## # PolicyCodingNotes <chr>, PolicySource <chr>, ## # LastUpdated <dbl>, LastUpdatedNotes <chr>
Here is a Michigan focus to illustrate the information the file contains.
## # A tibble: 10 x 2 ## DateIssued PolicyCodingNotes ## <dbl> <chr> ## 1 20200310 Expanded to state of emergency and state… ## 2 20200312 Gatherings of more than 250 people are p… ## 3 20200312 <NA> ## 4 20200312 Exec order on April 2 orders all schools… ## 5 20200316 Closure required: casinos, bars, theater… ## 6 20200316 Limited operations required (take-out, d… ## 7 20200316 <NA> ## 8 20200321 Closure required: non-essential personal… ## 9 20200323 All non-essential services closed. Exte… ## 10 20200323 (quote) Subject to the same exceptions, …
3.5.3 State-level poverty percentage
The Department of Agriculture lists percentage of state residents living in poverty. I downloaded the data and saved to a local spreadsheet. It lists both percentage of all residents and percent of children in poverty.
## # A tibble: 6 x 7 ## state percent lower.perc upper.perc children.perf ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Alab… 16.8 16.5 17.1 23.9 ## 2 Alas… 11.1 10.5 11.7 14.5 ## 3 Ariz… 14.1 13.8 14.4 20.4 ## 4 Arka… 16.8 16.3 17.3 23.8 ## 5 Cali… 12.8 12.7 12.9 17.4 ## 6 Colo… 9.7 9.5 9.9 12.1 ## # … with 2 more variables: lower.perc.child <dbl>, ## # upper.perc.child <dbl>
University of Wisconsin’s Neighborhood Atlas area deprivation index. This index is developed at the census block level and assesses income, education, employment, and housing quality. I manually downloaded these data and stored them locally so use the usual R command read.csv() to read this file. I aggregated over block by computing medians for each county in order to connect with other data I’m using here at the state and county levels. The ADI website suggests that the block-level “is considered the closest approximation to a `neighborhood’”.
## # A tibble: 6 x 12 ## # Groups: state.county  ## gisjoin fips adi_natrank adi_staternk state county ## <chr> <chr> <dbl> <dbl> <chr> <chr> ## 1 G01000… 0100… 35 2 01 001 ## 2 G01000… 0100… 96 10 01 003 ## 3 G01000… 0100… 100 10 01 005 ## 4 G01000… 0100… 61 4 01 007 ## 5 G01000… 0100… 82 7 01 009 ## 6 G01001… 0101… 94 9 01 011 ## # … with 6 more variables: state.county <chr>, ## # adi_nat_median <dbl>, adi_st_median <dbl>, ## # state.y <chr>, state_name <chr>, county.y <chr>
3.6 Other data sources
These are other sources that came on-line after I was well into building these notes.
covdata: A git repository that includes some of the data sources I’ve already covered plus some others, such as the CDC Surveillance Network Data, which includes additional information in various tables such as total deaths (i.e., even the non-covid-19 deaths), number of pneumonia deaths, number of influenza deaths, etc., as well as data by age category.
New York City data A git repository with data tables by age, sex, zip code and other useful information.
data.world A data set that can be queried for by state, by gender and by age category; includes number of total deaths, number of covd-19 deaths, number of pneumonia deaths and number of influenza deaths. The site does not appear to be updated frequently.
Just because data files can be created does not mean those data are meaningful. There are differences across these data sources and some data are difficult to compare across states or over time (e.g., reporting of testing differs across states, different types of tests are used across states, the type of testing may have changed during the time interval). The myth is that if you have a large enough data set these kinds of issues will wash out in the aggregate. Maybe random error will wash out, but large data sets do not necessarily eliminate, by virtue of their size alone, systematic differences or biases. The size of the data set can impress and ease one into passive acceptance of a particular data pattern and interpretation. We must exercise the same skepticism we privilege small data sets, and be aware that large data sets may have additional issues not present in well-controlled small-scale studies.
A lot of code was presented in this chapter, but the code is in service of downloading the relevant up-to-date data. I wanted to automate all the steps and write code for all my data manipulations so that you could see what I did and to make the output completely reproducible. One benefit of this approach is that I can run this code every day to get the current information so that all my analyses and plots in this website automatically update. Any workflow process that would require that I manually download a spreadsheet and manually edit that file would not be easily reproducible and would create extra work for me each day when I update this website.
I usually save my workspace at this stage when all the data have been read in and processed, so that subsequent files can just load the workspace and pick up from here. This saves time because the same operations don’t have to be redone. But because these covid-19 data downloads happen often it is better to recreate the workspace to avoid issues with newer data being available. The R functions save.image() and load() usually work well when data remain static, such as after data collection is completed and no further data cleaning is needed.