Chapter 2 Preliminaries

I want these notes to serve as an example of how to keep your analyses organized and reproducible. Analysis projects have many common high-level elements, which include gathering data, cleaning and organizing data, preparing descriptive summaries, testing hypotheses, writing reports, and dissemination. The end product should be a complete pipeline that takes you from soup to nuts with clear documentation. These notes provide one example of how to prepare a workflow. Such workflows typically make use of many software tools, and I would like to highlight some of the tools I used. All the tools used to produce this book were open-source.

These notes are written completely in Rmarkdown, a scripting language that allows one to weave text, R statistical commands, and R output to create a document. They are organized across several Rmd files (like this one) that are executed in a particular order. For example, the file that reads the data and cleans the data is sourced before the file that produces the descriptive summaries of the data. See Appendix D to learn some basic details about R and find links to tutorials. The formatting of the files into both html and pdf was enabled by the bookdown R package, which makes use of html constructs as well as LaTeX document formatting. See Appendix E for some basic information about the bookdown package. I also made use of the RStudio integrated editor that allowed me to type all these words and commands in a way that enabled debugging of code and natural weaving in of statistical output and figures.

I can reproduce all my analyses and so can you. I put the files for this project (except for the data downloads) in a git repository that you can clone and follow along. I tried to be diligent in using git commits and writing clear comments so you can see how these notes developed; edits I made to the text; code I edited, de-bugged and re-edited; turns I took in my analyses that were dead ends so I didn’t pursue them; etc. This is all part of the regular scientific process. In the “old days” scientists maintained lab notebooks where they recorded everything they did. A git repository serves the analogous role of an electronic library when conducting analyses. It keeps track of everything I did for this project. You can read the final product (the notes you are currently reading) but you are free to see my thought process, see how I edited these documents, and see the order in which I actually wrote things rather than the order they are presented. All you have to do is visit the github site for this project. To learn about git see Appendix A.

2.1 Libraries and Setup

I need to set up some housekeeping commands. This includes formatting the R code so that it doesn’t spill off the margin of the page and caching some time consuming computations. There are some analyses in this book that can take a few hours to run. Rather than running them every time I recompile these notes, I cache those results in a folder so the results are re-used and only re-run those analyses periodically.

Rmarkdown automatically detects when data have changed and those time consuming analyses need to be rerun. I prefer to exert manual control on the caching process and explicitly say cache=TRUE or cache=FALSE.

A variable that I set up in the index.Rmd file is longform. This is set either to TRUE or FALSE depending whether I want the “long version” of the book with all the R code and some additional explanations tied to R or the “short version” that omits the printing of the R code and some text specifically related to R. For example, this paragraph is part of an Rmarkdown chunk that sets the echo argument to longform. If longform is TRUE this paragraph is printed, whereas if longform is FALSE the paragraph is omited. This allows me to embed conditional statements and easily print two versions of the same document (i.e., I don’t need to include additional R code explanations in the version that does not include the R code). I use this feature in my own data analysis depending on whether I want to discuss all the analytic details with my colleagues or focus on the end product of the analyses.

# assumes knitr is installed in R
library(knitr)

htmllinewidth <- 55
pdflinewidth <- 55

opts_chunk$set(tidy.opts=list(width.cutoff=pdflinewidth),
  tidy=TRUE, echo=longform, warning=FALSE, message=FALSE)

options(htmltools.dir.version = FALSE, formatR.indent = 2, 
        width = htmllinewidth, digits = 3)


# coupled with linewidth=xx argument to chunk will make R code fit 
# within the width of the page
hook_output = knit_hooks$get('output')
knit_hooks$set(output = function(x, options) {
  # this hook is used only when the linewidth option is not NULL
  if (!is.null(n <- options$linewidth)) {
    x = knitr:::split_lines(x)
    # any lines wider than n should be wrapped
    if (any(nchar(x) > n)) x = strwrap(x, width = n)
    x = paste(x, collapse = '\n')
  }
  hook_output(x, options)
})

# cache=F to redo the lengthy analyses for new data, 
# otherwise set to T for daily runs
cache <- T

I prefer to organize in one place all the R packages that are used in these notes so they can be easily installed. Typically, R packages are loaded with the library() command, which I’ll use here. The library() command assumes the R package is already installed.

There can be issues that arise when loading multiple packages, as I do in this project, such as conflicts across packages that may use the same name for different functions. For example, the package dplyr contains a function called select(), but there is also a select() function in another package I am using. I need to be careful I’m using the correct select() command and one why to accomplish that is by using the double colon construction. The command dplyr::select() instructs R to use the select() command that is contained in the package dplyr.

library(bookdown)
library(readr)
library(tidyverse)
library(rvest)
library(magrittr)
library(stringr)
library(tidycensus)
library(ggplot2)
# gganimate needs gifski to render animations;
# dependency needs to be installed manually
library(gganimate)
library(ggeffects)
library(scales)
library(maps)
library(ggthemes)
library(viridis)
library(ggrepel)
library(sjPlot)
library(sjmisc)
library(sjlabelled)
library(lme4)
library(nlme)
library(merTools)
library(Hmisc)
library(readxl)
library(rnn)
library(stargazer)
library(texreg)
library(gamm4)
library(splines)
library(mgcv)
library(nlrx)
library(pander)
# use this package just for sech
library(pracma)
library(ggforce)
library(RColorBrewer)
library(investr)
library(forecast)
library(hts)
library(ggfortify)
library(brms)
library(bayesplot)
library(tidybayes)
library(tibbletime)
library(NeuralNetTools)
library(nnet)
if (!("devtools" %in% installed.packages()[, "Package"])) {
  install.packages("devtools")
}

install.github.libs <- F
if (install.github.libs) devtools::install_github("wmurphyrd/fiftystater")
library(fiftystater)

if (install.github.libs) devtools::install_github("UrbanInstitute/urbnmapr")
library(urbnmapr)

if (install.github.libs) devtools::install_github("davidgohel/ggiraph")
library(ggiraph)

if (install.github.libs) remotes::install_github("CollinErickson/GauPro")
library(GauPro)

if (install.github.libs) devtools::install_github("nelsonroque/tsfeaturex")
library(tsfeaturex)

For ease of use it would be better to use the pacman package because it also checks if the libraries are installed on your computer (if not, the package automatically downloads the missing libraries along with all their dependencies). Finally, pacman initiates the library() call for each library in the list leaving you ready to proceed with the analyses. For illustration, here is the pacman code you would use. This facilitates reproducibility because it is easier to recreate the working environment with all the required packages.

# if pacman isn't installed, then install it and issue
# library(pacman)
if (!("pacman" %in% installed.packages()[, "Package"])) {
  install.packages("pacman")
}
library(pacman)
# double check all the libraries listed above are listed
# here too
p_load("bookdown", "readr", "tidyverse", "rvest", "magrittr", 
  "stringr", "tidycensus", "ggplot2", "gganimate", "ggeffects", 
  "scales", "maps", "ggthemes", "viridis", "ggrepel", 
  "sjPlot", "sjmisc", "sjlabelled", "lme4", "nlme", "merTools", 
  "Hmisc", "readxl", "rnn", "stargazer", "texreg", "gamm4", 
  "splines", "mgcv", "nlrx", "pander", "pracma", "ggforce", 
  "RColorBrewer", "investr", "forecast", "hts", "ggfortify", 
  "brms", "bayesplot", "tidybayes", "tibbletime", "NeuralNetTools", 
  "nnet")

2.2 Utility functions

Here are some functions I wrote that I will use later. I put them here to improve flow and readability of the key content.

2.2.1 Multiple Plots

This function puts multiple ggplots on the same page. There are now packages that can do this as well, like patchwork, gridExtra and gtable, but I’ve been using this simple function for years even before such packages existed. I offer this function, which I got off someone’s website years ago, as an example of how relatively simple some R functions can be, even when performing complex tasks like arranging multiple plots on a page. I believe this multiplot() function is now embedded in a larger package.

# Multiple plot function ggplot objects can be passed in
# ..., or to plotlist (as a list of ggplot objects) -
# cols: Number of columns in layout - layout: A matrix
# specifying the layout. If present, 'cols' is ignored.
# If the layout is something like matrix(c(1,2,3,3),
# nrow=2, byrow=TRUE), then plot 1 will go in the upper
# left, 2 will go in the upper right, and 3 will go all
# the way across the bottom.
multiplot <- function(..., plotlist = NULL, file, cols = 1, 
  layout = NULL) {
  library(grid)
  
  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)
  
  numPlots = length(plots)
  
  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel ncol: Number of columns of plots nrow:
    # Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)), 
      ncol = cols, nrow = ceiling(numPlots/cols))
  }
  
  if (numPlots == 1) {
    print(plots[[1]])
    
  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), 
      ncol(layout))))
    
    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j positions of the regions that contain this
      # subplot
      matchidx <- as.data.frame(which(layout == i, 
        arr.ind = TRUE))
      
      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row, 
        layout.pos.col = matchidx$col))
    }
  }
}

2.2.2 Caterpillar Plot

Useful plot function from this link that I’ll use to plot random effects. I won’t print it here but if you are interested you can refer to the github page for this book.

R Notes

For better reproducibility it is good to use a package like renv that saves the current versions of all your packages used in your pipeline. This way if a package is updated tomorrow and the new version breaks your code, you at least have the earlier version of the package available locally on your drive to continue running the same code. Another approach would be to put the entire project in a Docker container, which is a complete self-contained running environment include operating system and all relevant executables.