Chapter 2 Preliminaries

I want these notes to serve as an example of how to keep your analyses organized and reproducible. Analysis projects share many high-level elements: gathering data, cleaning and organizing data, preparing descriptive summaries, testing hypotheses, writing reports, and disseminating results. The end product should be a complete pipeline that takes you from soup to nuts with clear documentation. These notes provide one example of how to prepare such a workflow. Workflows like this typically make use of many software tools, and I would like to highlight some of the tools I used. All the tools used to produce this book are open source.

These notes are written completely in R Markdown, a document format that allows one to weave together text, R statistical commands, and R output to create a document. They are organized across several Rmd files (like this one) that are executed in a particular order. For example, the file that reads and cleans the data is run before the file that produces the descriptive summaries of the data. See Appendix D to learn some basic details about R and find links to tutorials. The formatting of the files into both HTML and PDF was enabled by the bookdown R package, which makes use of HTML constructs as well as LaTeX document formatting. See Appendix E for some basic information about the bookdown package. I also made use of the RStudio integrated development environment (IDE), which allowed me to type all these words and commands in a way that enabled debugging of code and natural weaving in of statistical output and figures.

I can reproduce all my analyses and so can you. I put the files for this project (except for the data downloads) in a git repository that you can clone and follow along. I tried to be diligent in using git commits and writing clear commit messages so you can see how these notes developed: edits I made to the text; code I edited, debugged and re-edited; turns I took in my analyses that were dead ends so I didn’t pursue them; etc. This is all part of the regular scientific process. In the “old days” scientists maintained lab notebooks where they recorded everything they did. A git repository plays the analogous role of an electronic lab notebook when conducting analyses. It keeps track of everything I did for this project. You can read the final product (the notes you are currently reading), but you are also free to see my thought process, see how I edited these documents, and see the order in which I actually wrote things rather than the order in which they are presented. All you have to do is visit the GitHub site for this project. To learn about git see Appendix A.

2.1 Libraries and Setup

I need to set up some housekeeping commands. These include formatting the R code so that it doesn’t spill past the margin of the page and caching some time-consuming computations. There are some analyses in this book that can take a few hours to run. Rather than running them every time I recompile these notes, I cache their results in a folder so the results are re-used, and I re-run those analyses only periodically.
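As a rough illustration of this kind of housekeeping, the sketch below sets global knitr chunk options for caching and code wrapping. The specific values (the cache folder name, the width cutoff, the console width) are assumptions chosen for illustration and are not necessarily the settings used in these notes. The tidy option relies on the formatR package being installed.

```r
# Illustrative setup chunk; all values here are assumptions, not the
# settings actually used to build these notes.
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::opts_chunk$set(
    cache = TRUE,                         # reuse stored results of unchanged chunks
    cache.path = "cache/",                # hypothetical folder holding cached results
    tidy = TRUE,                          # reformat displayed code (needs formatR)
    tidy.opts = list(width.cutoff = 60)   # wrap displayed code near 60 characters
  )
}

# Keep printed R output from running past the page margin.
options(width = 70)
```

With cache = TRUE, knitr re-runs a chunk only when its code or chunk options change, so deleting the cache folder is one simple way to force a full re-run.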

I prefer to organize in one place all the R packages that are used in these notes so they can be easily installed. Typically, R packages are loaded with the library() command, which I’ll use here. The library() command assumes the R package is already installed and stops with an error if it is not.
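One way to keep the package list in a single place is sketched below. The package names are placeholders (two packages that ship with R), not the list used in these notes, and the variable names are hypothetical. The sketch checks which packages are installed before calling library(), so a missing package produces a helpful message rather than an abrupt error.

```r
# Hypothetical package list; replace with the packages your project needs.
pkgs <- c("stats", "utils")

# requireNamespace() returns FALSE (rather than stopping) when a package
# is not installed, so it can be used to find what is missing.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
not_installed <- pkgs[!installed]

if (length(not_installed) > 0) {
  message("Please install first: ", paste(not_installed, collapse = ", "))
}

# Attach the packages that are available.
for (p in pkgs[installed]) {
  library(p, character.only = TRUE)
}
```

The character.only = TRUE argument is needed because library() otherwise treats its argument as a literal package name rather than as a variable holding the name.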

R Notes

For better reproducibility it is good to use a package like renv that records the current versions of all the packages used in your pipeline. This way, if a package is updated tomorrow and the new version breaks your code, you at least have the earlier version of the package available locally on your drive to continue running the same code. Another approach would be to put the entire project in a Docker container, which is a complete, self-contained running environment including the operating system and all relevant executables.
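A typical renv workflow can be sketched as follows. The use_renv switch is a hypothetical guard added here so that the block does not modify a project by accident; in practice these commands are usually run interactively from the project root. The last few lines show a lightweight base R alternative that simply records package versions, again with a placeholder package list.

```r
# Hypothetical switch: set to TRUE to actually run the renv steps.
use_renv <- FALSE

if (use_renv && requireNamespace("renv", quietly = TRUE)) {
  renv::init()       # once per project: create a project library and renv.lock
  renv::snapshot()   # after installing or updating packages: record versions
  # renv::restore()  # later, or on another machine: reinstall recorded versions
}

# Without renv, base R can at least record which versions were in use.
pkgs <- c("stats", "utils")  # placeholder package list
versions <- vapply(pkgs, function(p) as.character(packageVersion(p)),
                   character(1))
print(versions)
```

Recording versions this way documents what was used but does not reinstall anything; renv's lockfile is what makes it possible to restore the same package versions later.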