Chapter 3 Reading Data into R

This Chapter May Be Skipped

This chapter has details on where the data came from and how I gathered them. If you are not interested in these details, it is fine to skip this chapter.

We are fortunate to live in a time when the technical infrastructure for sharing data makes publicly available data genuinely useful. To newcomers, it can seem daunting to find the data, grab the data file, and put it into a format that can be used and merged with other data to support understanding and decision making. Working with such complex data is also challenging because it comes at multiple levels, including country, state/province, and county, and involves multiple variables, including the number of positive tests, the number of confirmed covid-19 deaths, and the number of covid-19-related hospitalizations. Then there is the issue of examining explanatory and outcome variables. What information helps account for the variability we see across states and counties? Is it access to health care? Poverty rate? Compliance with social distancing guidelines? There is also the base rate information that helps us interpret the counts, such as population sizes and the typical number of ICU visits. This chapter provides examples of how one goes about accessing, downloading and processing such data.

This chapter will cover four different ways to read data into R:

  1. executing a pull from a git repository,

  2. scraping data from the web,

  3. pulling data with an API from the national census, and

  4. reading an Excel spreadsheet.

The chapter will also examine some aspects of quality control checks and practices for maintaining consistency across various data sets to permit merging of multiple sources of data. For example, we need to ensure columns in the data file are labeled consistently. We also need to ensure that measurement units and missing data codes are consistent across data sets.

Comments about R code

These notes serve as a tutorial for how to work with data. My goal in these notes is not to teach data science or statistics but rather to illustrate how to work with data and to highlight some pros and cons of different approaches. I will emphasize the code so that interested readers can see what I did. You are free to take this code and use it for your own purposes.

My style of writing code is idiosyncratic as I suspect it is for everyone who writes code. R is a complex language and like most languages there are various dialects. I have learned many of those dialects and use them frequently, sometimes mixing and matching the dialects to create hybrids. I’ll try to maintain some consistency but can’t promise I’ll succeed. Desiderata include

  1. use ggplot2 as much as possible for plots rather than base R plotting (though sometimes I can’t resist),

  2. write code that is robust to possible changes in the data (especially because new data come in daily), and

  3. write code that is easy to read and follow even though it may not be efficient code.

Sometimes these desiderata conflict; for example, code that is robust may not be code that is easy to follow.

3.1 Pulling data from a git repository

I’ll primarily use the database that Johns Hopkins University (JHU) has been maintaining on GitHub. It is used by news outlets such as CNN. I have already cloned the git repository to my local computer, so each day I merely issue a “git pull” command to get any files that have been added or edited since the last pull. This is a more efficient workflow than downloading the entire repository anew every day. You can see the Johns Hopkins GitHub site for a complete listing of the sources of their data.

The program git is a tool for maintaining and distributing software versions. For basic git information, see Appendix A. If you want to learn more about working with git repositories, you can do a web search for how to clone a repository; RStudio also has features to help clone and manage git repositories. Git is a better approach than placing all files in a shared folder because it provides version control and makes it relatively easy to stay up to date with changes in the software.

The script below switches to the folder where I keep the JHU data git repository, executes the command “git pull” (which checks whether there have been any updates on the server and, if so, downloads them), and then returns me to the original folder. To keep things organized, I keep the data git repository in a separate folder from the R files that create these pages. Another way to accomplish the same thing is to pull the git repository manually each time you want to run the most recent data. Basically, this next bit of code downloads the most recent data so that I don’t have to do it by hand every time I want the latest pull.
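A minimal sketch of that script, assuming the JHU repository was cloned into a sibling folder called COVID-19 (the path is my assumption; adjust it to wherever you cloned the repository):

olddir <- getwd()            # remember where we started
setwd("../COVID-19")         # folder containing the cloned JHU repository (assumed path)
system("git pull")           # fetch and merge any updates from the server
setwd(olddir)                # return to the original working directory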

As a matter of habit, I prefer not to include data files in the git repository that holds the code (e.g., data may contain private information even though the repository with the code may be public).

We read in the files with world and US cumulative data, each for cases and deaths.
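A sketch of those reads using readr, assuming the repository layout and file names in use at the time of writing (these names have changed before and may change again); the names of the two deaths objects are my own:

library(readr)

# Folder inside the cloned JHU repository (assumed path; adjust if the layout changes)
jhu.path <- "../COVID-19/csse_covid_19_data/csse_covid_19_time_series/"

datacov.World        <- read_csv(paste0(jhu.path, "time_series_covid19_confirmed_global.csv"))
datacov.World.deaths <- read_csv(paste0(jhu.path, "time_series_covid19_deaths_global.csv"))
datacov.US           <- read_csv(paste0(jhu.path, "time_series_covid19_confirmed_US.csv"))
datacov.US.deaths    <- read_csv(paste0(jhu.path, "time_series_covid19_deaths_US.csv"))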

The column labels of the object datacov.World, for example, are ordered by date starting in column 5. The column names are special strings because they start with numbers, and R does not like variable names that start with numbers. An example is `1/22/20` (note the backticks); you’ll see in later syntax that I have to refer to these columns using backticks. The other columns contain province/state and country/region labels as well as the latitude and longitude of each geographic unit.
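For example, pulling out that first date column requires the backticks:

# A column name that starts with a digit must be wrapped in backticks
head(datacov.World$`1/22/20`)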

The column names, though, are sometimes converted by R to regular strings, as in the names() output below. The database seems to be updated in the early morning hours, so if these commands are run in the evening you may not see today’s date as the most recent date.

##  [1] "UID"            "iso2"           "iso3"          
##  [4] "code3"          "FIPS"           "Admin2"        
##  [7] "Province_State" "Country_Region" "Lat"           
## [10] "Long_"          "Combined_Key"   "1/22/20"       
## [13] "1/23/20"        "1/24/20"
## [1] "4/21/21" "4/22/21" "4/23/21" "4/24/21" "4/25/21"
## [6] "4/26/21"

Now I have two datacov files, World and US, and I can use each as needed in subsequent code. The main thing left to “fix” is that the new datacov.US file is broken down by county, while my code is currently written for states. Later I’ll sum over counties to get state counts, but I will keep the county-level information for possible use in later examples that need more granular data.
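As a preview of that step, here is a minimal sketch of collapsing the county-level US counts to state totals with dplyr. It assumes the first 11 columns of datacov.US are metadata and the remaining columns are dates, which should be checked against the current file layout:

library(dplyr)

datacov.US.state <- datacov.US %>%
  select(Province_State, 12:last_col()) %>%   # keep the state label plus the date columns
  group_by(Province_State) %>%
  summarise(across(everything(), sum))        # sum the counties within each state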

3.1.1 Issues around dealing with downloaded data

While it is great to have an automatic routine that downloads data on a regular basis, there is some risk. The data file may change its structure (for readers following the R code, this happened on 3/23/20). There can also be more subtle issues, such as an error in reading the data that produces a column shift, or data for one day being entered slightly after midnight so that it shows up as the next day’s data. Even with automatic code, one should still double-check the results for anything odd. For example, when the US data were released on 3/31/20, the data for 3/22/20 were missing, as were the data for 3/31/20 itself, even though 3/31/20 data were included in the World data file. The issue was eventually fixed, but my graphs and analyses were off for a few days.

3.1.2 Issues around dealing with positive tests and death counts

A few words about the Johns Hopkins data set. All we have in the files I downloaded are counts of confirmed cases. The data set does not include other important information such as the number of tests conducted or the population size of each state or country. That makes it difficult to evaluate the total number of confirmed cases: when counts of confirmed cases increase, we don’t know whether more people are getting sick or the unit is simply conducting more tests. Further, it is difficult to compare counts of confirmed covid-19 cases across units without knowing whether the units have different testing rates, different sample sizes, different population densities, and so on. Basically, counts of confirmed cases are quite limited in the information they provide. Below, I’ll scrape testing data by US state as well as population totals by US state, and we’ll use these numbers in subsequent analyses. Death counts also have problems, and I’ll add mortality rates in a later version of these notes. As highlighted in the Introduction chapter with the London cholera epidemic, counts are useful as long as one treats them carefully.

3.2 Scraping data from the web example

I’ll illustrate how to scrape data from a web page by pulling a table from this wiki page.

This method uses commands in the rvest R package to read an HTML page and extract a table, followed by some string manipulations using the stringr package.
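The general pattern looks like the sketch below; the URL and the table index are placeholders, since (as noted next) the table’s position on the page has moved several times:

library(rvest)
library(stringr)

# Read the page and extract all of its tables (URL is a placeholder for the wiki page)
page   <- read_html("https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory")
tables <- page %>% html_nodes("table") %>% html_table(fill = TRUE)

# The table of interest has moved around on the page, so the index is an assumption
wiki.table <- tables[[5]]

# Typical string cleanup: strip footnote markers such as "[a]" from country names
wiki.table[[1]] <- str_remove_all(wiki.table[[1]], "\\[[^]]*\\]")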

There is usually additional manipulation needed to format the table. In this case I had to rename columns, remove the footnote characters attached to country names in the original page, and handle other little issues documented in the R code chunks.

If the table on the wiki page changes, it could break this code. That has already happened: the table I wanted was originally the 4th table on the page, then the 5th, then the 6th, and then back to the 5th. I ended up pointing to the table element itself rather than to the URL of the wiki page.

This table is ready to use in subsequent analyses. It differs from the datacov Johns Hopkins data.frame I downloaded in the previous section because in addition to counts of confirmed cases it has number of deaths as well as number recovered. It is also not updated as frequently as the Johns Hopkins site.

3.3 Using APIs

For later analyses I’ll need current population sizes. I’ll focus on the populations of the US states. The files I downloaded contain covid-19 counts by state and country; they do not include population sizes, which are important for evaluating the total number of covid-19 cases in a unit.

This code makes use of the tidycensus package (and also functions in the tidyverse package). To access the national census database one needs to register with the census site and receive a unique key. My key is not printed here in order to keep it private. If you want to run this code yourself, you will need to register with the national census and then enter the key you receive as an argument to the census_api_key() command below. See the documentation of the tidycensus package for more information; here is the registration page to get your own key.
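A sketch of the census pull, assuming the population estimates product in tidycensus; the key string is a placeholder, and building prettyval this way is my own shorthand for what the text describes:

library(tidycensus)
library(dplyr)

census_api_key("YOUR_KEY_HERE")   # placeholder; paste in the key the census bureau sends you

# State population estimates; the variable "POP" holds the total population
state.population <- get_estimates(geography = "state", product = "population") %>%
  filter(variable == "POP") %>%
  mutate(prettyval = sprintf("%.1f %s", value / 1e6, state.abb[match(NAME, state.name)]))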

These are the first 10 rows of the state.population data.frame so you can see the result of these R commands. The last column, prettyval, is a label I created that contains the population in millions followed by the two-letter state abbreviation. For example, the population of California is 39.6 million and the abbreviation is CA, so the corresponding prettyval is “39.6 CA”. This will help when labeling plots, where we want to identify curves not just by state but also by their respective population sizes.

## # A tibble: 10 x 5
##    NAME           GEOID variable    value prettyval
##    <chr>          <chr> <chr>       <dbl> <fct>    
##  1 Mississippi    28    POP       2976149 3.0 AL   
##  2 Missouri       29    POP       6137428 6.1 AK   
##  3 Montana        30    POP       1068778 1.1 AZ   
##  4 Nebraska       31    POP       1934408 1.9 AR   
##  5 Nevada         32    POP       3080156 3.1 CA   
##  6 New Hampshire  33    POP       1359711 1.4 CO   
##  7 New Jersey     34    POP       8882190 8.9 CT   
##  8 New Mexico     35    POP       2096829 2.1 DE   
##  9 New York       36    POP      19453561 19.5 FL  
## 10 North Carolina 37    POP      10488084 10.5 GA

3.4 Number tested in US

Another important piece of information we need is the number of tests that have been conducted. Is the total count in a region high or low because the infection rate is high, or because more or fewer tests (respectively) have been conducted?

Unfortunately, this information has been spotty. We can go to the CDC website to collect it. The website lists testing data for both CDC and Public Health Labs. After scraping the data off the CDC website I had to do some cleaning: parsing cells into numbers because some contained extra characters (the footnote symbols that appear on the website) and changing the format of the date column to be compatible with the date format I’m using elsewhere in these notes. Even so, until 5/11/20 the CDC reported only about a tenth of the total tests conducted in the US, and on 5/12/20 this CDC website stopped updating the counts.
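The cleanup amounted to something like the sketch below; the object and column names are placeholders for whatever the scraped table actually uses:

library(dplyr)
library(readr)   # parse_number() strips stray footnote characters

cdc.tests <- cdc.tests %>%
  mutate(
    `CDC Labs`              = parse_number(as.character(`CDC Labs`)),
    `US Public Health Labs` = parse_number(as.character(`US Public Health Labs`)),
    # reformat "6-May" style dates to Date objects (assumes the year is 2020)
    date = as.Date(paste0(`Date Collected`, "-2020"), format = "%d-%b-%Y")
  )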

Date Collected    CDC Labs    US Public Health Labs
6-May                   59                    21550
7-May                  102                    19573
8-May                   50                    14659
9-May                    1                     3713
10-May                  10                       78
11-May                   0                        1
##              CDC Labs US Public Health Labs 
##                  6275                818682

As of November 2020, the CDC website has changed to a different format. I have decided not to use the CDC website given the uncertainties in its reporting and the constant changes to the website structure, which mean I would need to keep editing my code to accommodate those changes.

There is another site I found that reports positive and negative results by state: the COVID Tracking Project. I don’t know how reputable this source is, but I’ll keep an eye on it. It reports daily records by state on the numbers of positive, negative, pending, hospitalized, and death cases, as well as the total number of tests.
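Something like the following read pulls their daily state-level file. The URL is the CSV endpoint the project documented; since the project stopped collecting data in March 2021, treat it as an assumption that may now only serve archived values:

library(readr)

# Daily state-level records from the COVID Tracking Project (endpoint is an assumption)
covid.tracking <- read_csv("https://api.covidtracking.com/v1/states/daily.csv")
head(covid.tracking)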

## # A tibble: 6 x 56
##      date state positive probableCases negative pending
##     <dbl> <chr>    <dbl>         <dbl>    <dbl>   <dbl>
## 1  2.02e7 AK       56886            NA       NA      NA
## 2  2.02e7 AL      499819        107742  1931711      NA
## 3  2.02e7 AR      324818         69092  2480716      NA
## 4  2.02e7 AS           0            NA     2140      NA
## 5  2.02e7 AZ      826454         56519  3073010      NA
## 6  2.02e7 CA     3501394            NA       NA      NA
## # … with 50 more variables:
## #   totalTestResultsSource <chr>,
## #   totalTestResults <dbl>,
## #   hospitalizedCurrently <dbl>,
## #   hospitalizedCumulative <dbl>,
## #   inIcuCurrently <dbl>, inIcuCumulative <dbl>,
## #   onVentilatorCurrently <dbl>,
## #   onVentilatorCumulative <dbl>, recovered <dbl>,
## #   lastUpdateEt <chr>, dateModified <dttm>,
## #   checkTimeEt <chr>, death <dbl>,
## #   hospitalized <dbl>, hospitalizedDischarged <dbl>,
## #   dateChecked <dttm>, totalTestsViral <dbl>,
## #   positiveTestsViral <dbl>,
## #   negativeTestsViral <dbl>,
## #   positiveCasesViral <dbl>, deathConfirmed <dbl>,
## #   deathProbable <dbl>,
## #   totalTestEncountersViral <dbl>,
## #   totalTestsPeopleViral <dbl>,
## #   totalTestsAntibody <dbl>,
## #   positiveTestsAntibody <dbl>,
## #   negativeTestsAntibody <dbl>,
## #   totalTestsPeopleAntibody <dbl>,
## #   positiveTestsPeopleAntibody <dbl>,
## #   negativeTestsPeopleAntibody <dbl>,
## #   totalTestsPeopleAntigen <dbl>,
## #   positiveTestsPeopleAntigen <dbl>,
## #   totalTestsAntigen <dbl>,
## #   positiveTestsAntigen <dbl>, fips <chr>,
## #   positiveIncrease <dbl>, negativeIncrease <dbl>,
## #   total <dbl>, totalTestResultsIncrease <dbl>,
## #   posNeg <dbl>, dataQualityGrade <lgl>,
## #   deathIncrease <dbl>, hospitalizedIncrease <dbl>,
## #   hash <chr>, commercialScore <dbl>,
## #   negativeRegularScore <dbl>, negativeScore <dbl>,
## #   positiveScore <dbl>, score <dbl>, grade <lgl>

This page points out specifics about each state (e.g., Maryland stopped reporting negative cases as of 3/12, MA numbers include repeated testing of the same individuals, not all states include data from commercial labs), showing how challenging it is to interpret such data at the aggregate level when there is so much heterogeneity in reporting strategies across units. See, for example, state-level details. As mentioned elsewhere in this document, positive test results need to be interpreted cautiously for several reasons, including the specificity and sensitivity of the tests and the selection bias in who receives the test. For example, the incidence rate will be biased upward if testing only occurs for people who exhibit symptoms. Death rates have analogous issues; for instance, some units report only hospital deaths and not deaths that occur at home.

It is possible to develop complex models that take into account these various factors and arrive at corrected parameter estimates, but these additional modeling efforts require additional assumptions for which there aren’t always solid data to back up those assumptions. This is one of many reasons why different modeling groups can arrive at such diverse estimates and forecasts. There are methods that aggregate across multiple models, creating an ensemble of models, which have been used successfully in climate science to predict hurricanes. An effort to use ensemble methods in the case of covid-19 data is underway. I acknowledge the value in aggregating over multiple models to make better predictions but a limitation of such ensemble methods is that they do not always inform about the underlying mechanisms.

Prediction is one metric used in science, but other metrics are also important, such as a model’s ability to provide understanding of the underlying mechanisms. We may want the best prediction of where and when the eye of a hurricane will make landfall even if we don’t necessarily understand how the model works; to get the right people to evacuate at the right time and stay out of harm’s way, all we need are good predictions of the hurricane’s path. But in other settings, if we want to implement a successful intervention, such as developing a vaccine, we need more than prediction. We need models that provide understanding of the underlying processes.

3.5 State-level information

Ken Kollman, a political science colleague, pointed me to some relevant state-level datasets.

It will be helpful to download relevant data by state or county so we can understand the trends we find in the covid-19 data. I need hypotheses to guide my variable search. There are many variables I could collect for each unit, such as population density, rural vs. urban makeup, the nature of public transportation offered, etc. I could also gather data on the timing of key covid-related policies, such as the dates on which states issued shelter-in-place rulings, or the dates local and state governments restricted the number of people who could gather (it seems some states initially set limits of 250, then 100, then 25, then 10, then 2, then finally ordered shelter-in-place). I’m not a political scientist nor a sociologist nor a public health expert nor a policy expert, so I don’t have a solid scientific basis on which to generate reasonable hypotheses to test, which would guide my thinking about which data to get. I was trained to think this way: organize your thinking and then seek the relevant data to test the implications of that thinking. The modern approach in data science turns that upside down: don’t worry about hypotheses, just gather as much information as you can conceive of gathering, clean and organize that information, then run it through special algorithms that simplify the complexity. I’ll include some examples of machine learning approaches to give you a flavor of what this approach has to offer. But the desideratum of being guided by hypotheses to test and models to compare has been imprinted so deeply in my mental model of the scientific approach that I can’t shake it.

Here is a complete list of state quarantine and isolation statutes by the National Conference of State Legislatures.

3.5.1 School Closure Data

I was able to find a site that curates a database of school closures related to covid-19. Unfortunately, this site stopped updating the data in May 2020.

## # A tibble: 6 x 10
##   State       `State Abbreviat… `State Status`         
##   <chr>       <chr>             <chr>                  
## 1 Alabama     AL                Closed                 
## 2 Alaska      AK                Closed                 
## 3 American S… AS                Closed                 
## 4 Arizona     AZ                Closed                 
## 5 Arkansas    AR                Closed                 
## 6 Bureau of … BIE               Closures Determined at…
## # … with 7 more variables: Reopening Status <chr>,
## #   Schools Impacted <chr>,
## #   State Closure Start Date <chr>,
## #   State Number of Public Schools <chr>,
## #   State Public School Enrollment <chr>,
## #   State Number of Private Schools <chr>,
## #   State Private School Enrollment <chr>

3.5.2 State-level Policy Data

A few research groups have started studying the public health implications of various state-level measures around social distancing. One such team is at the University of Washington; see their git repository for the most up-to-date information.
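A sketch of reading their policy file, assuming the repository has been cloned locally; the folder and file name below are assumptions, so check the repository for the current layout:

library(readr)

# State social-distancing policy data from the cloned repository (assumed path and file name)
state.policies <- read_csv("../SocialDistancing/data/USstatesCov19distancingpolicy.csv")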

## # A tibble: 6 x 30
##   location_id StateFIPS StatePostal StateName
##         <dbl>     <dbl> <chr>       <chr>    
## 1         523         1 AL          Alabama  
## 2         523         1 AL          Alabama  
## 3         523         1 AL          Alabama  
## 4         523         1 AL          Alabama  
## 5         523         1 AL          Alabama  
## 6         523         1 AL          Alabama  
## # … with 26 more variables: StatePolicy <chr>,
## #   Mandate <dbl>, StateWide <dbl>, Curfew <dbl>,
## #   CurfewStart <chr>, CurfewEnd <chr>,
## #   VaccineExempt <dbl>, InGathLim <dbl>,
## #   OutGathLim <dbl>, InGathLimReligious <dbl>,
## #   OutGathLimReligious <dbl>,
## #   BusinessRestrictLevel <chr>,
## #   SchoolRestrictLevel <chr>, PublicMaskLevel <chr>,
## #   DateIssued <dbl>, DateEnacted <dbl>,
## #   DateExpiry <dbl>, DateEased <dbl>,
## #   DateEnded <dbl>, DateReexpanded1 <dbl>,
## #   DateReeased1 <dbl>, PolicyCodingNotes <chr>,
## #   PolicySource <chr>, LastUpdated <dbl>,
## #   LastUpdatedNotes <chr>, ReReviewed <dbl>

Here is a Michigan subset to illustrate the kind of information the file contains.
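A sketch of that subset, assuming the policy data were read into an object called state.policies as in the sketch above:

library(dplyr)

# Michigan rows only, keeping the issue date and the coder's notes
state.policies %>%
  filter(StateName == "Michigan") %>%
  select(DateIssued, PolicyCodingNotes) %>%
  head(10)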

## # A tibble: 10 x 2
##    DateIssued PolicyCodingNotes                        
##         <dbl> <chr>                                    
##  1   20200310 Expanded to state of emergency and state…
##  2   20200312 Gatherings of more than 250 people are p…
##  3   20200312 Schools are to remain closed through 202…
##  4   20200316 Limited operations required (take-out, d…
##  5   20200316 Closure required: theaters, cinemas, and…
##  6   20200316 Limited operations required (take-out, d…
##  7   20200316 Expands gathering restriction to a limit…
##  8   20200321 Expands required business closures to in…
##  9   20200323 (quote) This order must be construed bro…
## 10   20200323 (quote) Subject to the same exceptions, …

3.5.3 State-level poverty percentage

The Department of Agriculture lists the percentage of state residents living in poverty. I downloaded the data and saved them to a local spreadsheet. The table lists both the percentage of all residents and the percentage of children living in poverty.
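Reading a local spreadsheet like that is a one-liner with the readxl package; the file name below is a placeholder for wherever you saved the download:

library(readxl)

# Local copy of the USDA poverty table (file name is a placeholder)
poverty <- read_excel("data/state_poverty.xlsx")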

## # A tibble: 6 x 7
##   state     percent lower.perc upper.perc children.perf
##   <chr>       <dbl>      <dbl>      <dbl>         <dbl>
## 1 Alabama      16.8       16.5       17.1          23.9
## 2 Alaska       11.1       10.5       11.7          14.5
## 3 Arizona      14.1       13.8       14.4          20.4
## 4 Arkansas     16.8       16.3       17.3          23.8
## 5 Californ…    12.8       12.7       12.9          17.4
## 6 Colorado      9.7        9.5        9.9          12.1
## # … with 2 more variables: lower.perc.child <dbl>,
## #   upper.perc.child <dbl>

3.5.4 ADI

The University of Wisconsin’s Neighborhood Atlas provides an area deprivation index (ADI). The index is developed at the census block group level and assesses income, education, employment, and housing quality. I manually downloaded these data and stored them locally, so I use the usual R command read.csv() to read the file. I aggregated over block groups by computing medians for each county in order to connect with the other data I’m using here at the state and county levels. The ADI website suggests that the block group level “is considered the closest approximation to a ‘neighborhood’”.
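A simplified sketch of that aggregation, assuming the downloaded ADI file is a CSV with the fips and rank columns shown in the output below (the file name is a placeholder):

library(dplyr)

# Local copy of the Neighborhood Atlas ADI download (file name is a placeholder)
adi <- read.csv("data/adi_blockgroup.csv", colClasses = "character")

# Aggregate up to the county level by taking medians of the national and state ranks
adi.county <- adi %>%
  mutate(state.county = substr(fips, 1, 5)) %>%       # first 5 digits of FIPS = state + county
  group_by(state.county) %>%
  summarise(
    adi_nat_median = median(as.numeric(adi_natrank), na.rm = TRUE),
    adi_st_median  = median(as.numeric(adi_staternk), na.rm = TRUE)
  )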

## # A tibble: 6 x 12
## # Groups:   state.county [6]
##   gisjoin   fips  adi_natrank adi_staternk state county
##   <chr>     <chr>       <dbl>        <dbl> <chr> <chr> 
## 1 G0100010… 0100…          35            2 01    001   
## 2 G0100030… 0100…          96           10 01    003   
## 3 G0100050… 0100…         100           10 01    005   
## 4 G0100070… 0100…          61            4 01    007   
## 5 G0100090… 0100…          82            7 01    009   
## 6 G0100110… 0101…          94            9 01    011   
## # … with 6 more variables: state.county <chr>,
## #   adi_nat_median <dbl>, adi_st_median <dbl>,
## #   state.y <chr>, state_name <chr>, county.y <chr>

3.6 Other data sources

These are other sources that came on-line after I was well into building these notes.

  1. covdata: A git repository that includes some of the data sources I’ve already covered plus some others, such as the CDC Surveillance Network Data, which includes additional information in various tables such as total deaths (i.e., even the non-covid-19 deaths), number of pneumonia deaths, number of influenza deaths, etc., as well as data by age category. The dataset also includes mobility data collected from Apple and Google.

  2. New York City data A git repository with data tables by age, sex, zip code and other useful information.

  3. data.world A data set that can be queried by state, gender, and age category; it includes the number of total deaths, covid-19 deaths, pneumonia deaths, and influenza deaths. The site does not appear to be updated frequently and requires a login.

3.7 Summary

Just because data files can be created does not mean the data are useful or meaningful. There are differences across these data sources, and some data are difficult to compare across states or over time (e.g., reporting of testing differs across states, different types of tests are used across states, and the type of testing may have changed during the time interval). The myth is that with a large enough data set these kinds of issues wash out in the aggregate. Maybe random error washes out, but large data sets do not, by virtue of their size alone, eliminate systematic differences or biases. The size of a data set can impress and lull one into passive acceptance of a particular data pattern and interpretation. We must exercise the same skepticism we apply to small data sets, and be aware that large data sets may have additional issues not present in well-controlled small-scale studies.

A lot of code was presented in this chapter, but the code is in service of downloading the relevant up-to-date data. I wanted to automate the steps and write code for all my data manipulations so that you could see what I did and to make the output completely reproducible. One benefit of this approach is that I can run this code every day to get the current information so that all my analyses and plots in this book automatically update. A workflow process that requires I manually download a spreadsheet and edit that file would not be easily reproducible and would create extra work for me every day when I update this website.

R Notes

I usually save my workspace at this stage, when all the data have been read in and processed, so that subsequent files can simply load the workspace and pick up from here. This saves time because the same operations don’t have to be redone. But because these covid-19 data downloads happen daily, it is better to recreate the workspace each time to avoid issues with newer data being available. The R functions save.image() and load() work well when the data remain static, such as after data collection is complete and no further data cleaning is needed.
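For reference, the pattern is simply (the file name is a placeholder):

# Save everything currently in the workspace ...
save.image(file = "chapter3_data.RData")

# ... and restore it in a later session or script
load("chapter3_data.RData")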