Chapter 4 Descriptive Plots
This chapter presents several descriptive plots of the raw data. No statistical analyses, no standard errors, no parameters to estimate, no hypotheses to test. Just raw data presented in different forms to gain insight into the patterns that are in the data. I’ll create a few plots at the world level, and then switch to the state and county levels for US data. You are free to take my code and edit it to produce additional plots for other countries and their states or provinces. Throughout these notes I’ll focus on the counts and rates of positive tests. I’ll defer death counts until the chapter on process models.
As stated in a few places in these notes, we need to be careful in how we interpret the counts of people who test positive. There are many issues: countries or states that test more frequently may see higher counts simply because they do more testing; counts can vary across countries or states because decisions about who is tested may vary; the quality of the tests used may vary across units; and countless other differences make it extremely difficult to generate clear explanations for patterns we may see. Moving to other variables such as the number of deaths doesn’t completely solve the problem either because, among other things, countries and states may vary in how they count deaths (e.g., death in a hospital, death among people who tested positive vs. deaths that are “presumed to be covid-19 related”).
4.1 World Map
Some code I copied from here.
I make use of the datacov.World file I read in from the Johns Hopkins git repository.
You’ll notice that some countries like Canada, UK, Australia have multiple points but most countries are represented by only 1 point. This is because the data set we are using has some inconsistencies in reporting either country totals or state/province totals. This would require further cleaning (e.g., computing country totals for those countries) but I left this in here as a teaching moment: one needs to be careful when working with data sets and consistently double check assumptions about the structure of the data set being used. The structure may also change over time so one needs to monitor carefully.
Even more sophisticated is the interactive map developed at Johns Hopkins (same group that provides the key data we downloaded in Chapter 3), where you can zoom the map in and out, click on different countries (left panel) and see data for that country on the right panel. A small interactive display of that website is displayed below.
4.1.1 World Map with Time Animation
I’ll expand this example with some of my code to create an animation so we can see the total cases in the world map change by day.
While counts are useful (see the Introduction for a discussion of London’s cholera epidemic) they have their limitations. To help address issues with different countries having different populations, it is common to normalize counts by the population and report numbers per capita (e.g., 62 cases out of 100,000). One would have to track down the populations of each country, download the relevant population data using methods described in Chapter 3 and merge that information into this animation. Later in this chapter, I develop the analogous animation for the US and, because I already downloaded the state populations, I will illustrate the difference between counts and per capita rates in that animation. However, raw counts still have a role in helping to inform the impact of covid-19. If a city has 1500 available hospital beds (and presumably the staff and supplies to provide care for those beds), but there are 2000 people in need of hospitalization, then there is a public health issue and the “per capita” concern becomes moot.
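As a minimal illustration of the per capita normalization described above, here is a short Python sketch (an illustration only, not the code behind the figures in this chapter). The unit names, case counts and populations are hypothetical, and `per_100k` is a helper name introduced for this example.

```python
def per_100k(cases, population):
    """Return the number of cases per 100,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    return cases * 100_000 / population

# Hypothetical counts and populations, for illustration only.
units = {
    "A": {"cases": 6200, "population": 10_000_000},
    "B": {"cases": 3100, "population": 1_000_000},
}

rates = {name: per_100k(u["cases"], u["population"]) for name, u in units.items()}

for name, rate in rates.items():
    print(f"{name}: {units[name]['cases']} cases, {rate:.0f} per 100,000")
```

Unit A has twice the raw count of unit B, but unit B has the higher per capita rate, which is the kind of reversal that normalization is meant to reveal.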
For more information on working with maps see Gimond.
4.2 Compare World Counts to US
It would be helpful to see the pattern over time of the world counts of confirmed covid-19 cases relative to the US counts. Here I create a subset of the datacov data.frame that includes just the 50 states (I dropped DC and the US territories for this analysis).
It is helpful to place a horizontal line in the cumulative US plot that corresponds to the US fraction of the world population. If covid-19 cases were randomly distributed throughout the world, we would expect a country to have a fraction of cases consistent with the fraction of that country’s population relative to the world population. The US is not doing well, as our total number of cases is far above the red line that corresponds to the proportion of cases we would expect given our fraction of the global population. Of course, we don’t know how well other countries are reporting their positive cases, how many tests they have administered, and other relevant information, but this descriptive plot suggests the rate of positive cases may be greater than expected so more investigation is needed.
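The comparison behind the reference line can be sketched in a few lines of Python. The populations below are rough 2020 values and the case counts are hypothetical placeholders, so the printed numbers illustrate the calculation only and are not results from the data used in this chapter.

```python
def share(part, whole):
    """Fraction of the whole accounted for by the part."""
    return part / whole

# Approximate 2020 populations; the case counts are hypothetical.
us_population = 330e6
world_population = 7.8e9
us_cases = 1_000_000
world_cases = 3_000_000

# Height of the horizontal reference line: the share of cases expected
# if cases were spread in proportion to population.
expected_share = share(us_population, world_population)

# Share of world cases actually observed in the US (hypothetical here).
observed_share = share(us_cases, world_cases)

ratio = observed_share / expected_share
print(f"expected {expected_share:.3f}, observed {observed_share:.3f}, ratio {ratio:.1f}")
```

A ratio above 1 corresponds to a curve sitting above the reference line.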
4.3 US State-Level Plots
I decided to plot the percentage relative to population (i.e., counts/population * 100). The numbers are small. I’ve seen people report this as cases per 100,000, but I decided to stick with cases per 100 to maintain the percentage interpretation. This is just a scale issue and doesn’t affect the plots or the analyses. The labels in the plots, like “7.5 WA,” mean Washington state with a population of 7.5 million.
Let’s take the same plot but rescale the vertical axis on the log (base 10) scale. An exponential process becomes linear when taking logs so we would expect to see a pattern that closely resembles straight lines if the count of confirmed cases follows an exponential form. Each state can have a different growth rate, which will show up as different slopes, and different starting points, which will show up as different intercepts.
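To see why an exponential process becomes a straight line on the log scale: if the count on day t is a(1 + g)^t, then log10 of the count is log10(a) + t log10(1 + g), which is linear in t with slope log10(1 + g). The Python sketch below checks this numerically; the starting value and growth rate are hypothetical.

```python
import math

def exponential_counts(initial, daily_growth, days):
    """Counts that grow by a fixed proportion each day."""
    return [initial * (1 + daily_growth) ** t for t in range(days)]

# Hypothetical starting count and 25% daily growth.
counts = exponential_counts(initial=5, daily_growth=0.25, days=10)
log_counts = [math.log10(c) for c in counts]

# Day-to-day change on the log10 scale; constant for an exponential process.
slopes = [b - a for a, b in zip(log_counts, log_counts[1:])]

print([round(s, 4) for s in slopes])
print(round(math.log10(1.25), 4))
```

Every day-to-day change on the log scale equals log10(1.25), so the points fall on a line. A different growth rate changes the slope and a different starting count changes the intercept.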
A straight line seems to approximate the pattern but there is a suggestion that some curvature remains, which implies these may not follow an exponential pattern. West Virginia (WV) didn’t show its first case until 3/17/20 so that is why the green curve (lowest) looks different than the rest.
To help see the state by state structure in these curves (e.g., whether or not they are linear in the log scale), I partitioned the 50 states by population into four panels of roughly equal size (12 or 13 states per panel). The states with larger populations (two lower panels) show relatively linear patterns. The states with smaller population sizes (the two upper panels) show patterns suggesting the per capita rate may still be increasing, whereas states with larger populations may be leveling off. We would need statistical analyses to verify the observations in these descriptive plots.
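One way to form such panels is to sort the units by population and cut the sorted list into four nearly equal groups. The Python sketch below shows the idea with 50 made-up units; the names and populations are hypothetical, and the chapter's own partition may have been built differently.

```python
def split_into_panels(populations, n_panels=4):
    """Sort units by population and cut them into nearly equal groups."""
    ordered = sorted(populations, key=populations.get)
    base, extra = divmod(len(ordered), n_panels)
    panels, start = [], 0
    for p in range(n_panels):
        # The first `extra` panels receive one additional unit.
        size = base + (1 if p < extra else 0)
        panels.append(ordered[start:start + size])
        start += size
    return panels

# Hypothetical populations (in millions) for 50 made-up units.
populations = {f"S{i:02d}": float(i) for i in range(1, 51)}

panels = split_into_panels(populations)
sizes = [len(p) for p in panels]
print(sizes)
```

With 50 units and four panels, two panels hold 13 units and two hold 12.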
I want to make a further improvement in this plot where I set 0 to missing value (NA) so that the curves begin where there are nonzero points. This modification to the plot makes it easier to detect linearity as we don’t have the artificial jump in the curve from 0. But, while it makes the plots a little easier to understand, it drops the important information about liftoff, the point at which the count moves from zero to nonzero. This is something that could be modeled with a parameter in the structural model that I will develop in later chapters and so, in principle, one could examine which factors affect the liftoff. Another issue is that log(0) is undefined so to avoid 0s you’ll see some websites and research papers showing such graphs requiring a minimum number of cases (e.g., the day at which the state reached 10 cases) before they start plotting the curve for that state. In the spirit of transparency one should report all these graphing decisions and ideally provide the code that produced the figures provided in a paper. Actually, complete reporting would include not only the code that produced the figure but also all code from the code that downloaded the data to the code that put the data in the format needed to produce the figure or table.
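The two graphing decisions discussed above, treating zeros as missing and requiring a minimum number of cases, can be sketched in Python as follows. The cumulative series is hypothetical, `None` stands in for a missing value (NA), and the function names are introduced only for this example.

```python
import math

def log10_or_missing(counts, min_cases=1):
    """log10 of each count, with None wherever the count is below min_cases."""
    return [math.log10(c) if c >= min_cases else None for c in counts]

def liftoff_index(counts):
    """Index of the first nonzero count, or None if the series never lifts off."""
    for i, c in enumerate(counts):
        if c > 0:
            return i
    return None

# Hypothetical cumulative counts for one unit.
counts = [0, 0, 1, 3, 10, 40]

from_first_case = log10_or_missing(counts)              # drop zeros only
from_ten_cases = log10_or_missing(counts, min_cases=10)  # start at 10 cases
liftoff = liftoff_index(counts)

print(from_first_case)
print(from_ten_cases)
print(liftoff)
```

Keeping the liftoff index alongside the masked series preserves the information about when the count first became nonzero, even though the plotted curve no longer shows it.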
This pattern is remarkable. The states appear to have similar slopes on this log plot. They vary in intercept, but that reflects when the state started reporting positive cases. It seems states are on a similar growth trajectory, but some states are further along (higher intercepts) than other states. That the log transformation of the Y axis converted these curves into approximate lines is consistent with the underlying pattern of these data following an exponential model. But we could argue that the slight curvature away from a straight line goes against an exponential growth process. I’ll say more about this in the next chapter where I’ll cover the exponential more directly, both in its log linear form as in these graphs as well as through nonlinear regression models.
4.3.1 Animated Map of US
Here is the animation of the US. This map needs work as Alaska and Hawaii outlines are not printed (their data points appear on the left side roughly where the states would be on the map). I’ll need to do some coding to create an inset for Alaska and Hawaii.
Because I have the state-level population sizes I can redo the animation using a per capita normalization out of 100. I modified the breaks and labels manually but there may be a way to do that automatically with ggplot2 features.
I don’t like these plots because they use latitude and longitude to place a point on the graph to represent the entire state. It may be better to do this plot so that the entire area of the state is colored in according to the positive count rather than representing count as a single point. Here is one example using the count of positive tests per state.
4.3.2 US county-level plots
We can move to a different level of resolution with county-level data. I used the log10 scale on the number of positive tests for each US county to achieve a smoother transition across the color range. White represents both counts of 0 and counts of 1. This plot uses count and not the per capita count.
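A short Python sketch shows one way counts of 0 and 1 can end up sharing a color. It assumes zeros are clamped to 1 before taking the log; this is an assumption made for illustration, and the chapter's plotting code may handle zeros differently. The county counts are hypothetical.

```python
import math

def log10_color_value(count):
    """Map a nonnegative count to the log10 scale, treating 0 like 1."""
    if count < 0:
        raise ValueError("count must be nonnegative")
    # log10(0) is undefined, so clamp to 1; log10(1) = 0 is the low end of the scale.
    return math.log10(max(count, 1))

# Hypothetical county counts spanning several orders of magnitude.
counts = [0, 1, 10, 100, 1000, 25000]
values = [log10_color_value(c) for c in counts]

for c, v in zip(counts, values):
    print(f"{c:>6} -> {v:.2f}")
```

On this scale each tenfold increase in count moves the color value by one unit, which is what smooths the transition across the color range.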
The county-level plot suggests some hypotheses to test. For example, in May there has been an increase in the number of protests over state social distancing restrictions and business closures. One could examine associations between the number of protests and properties of the states such as the political party affiliation of the governor. But with respect to county-level data, it appears in skimming the map of the US that some states have quite a bit of variability across counties in the positive test counts, whereas other states have relatively little variability with most counties having relatively high or relatively low positive test counts. One hypothesis to check is whether protests are more common in states with greater variability across counties (that is, some counties not experiencing as many cases as other counties in the state).
4.4 Incidence Plots
With time series data it is common to work with what are called first order differences. Rather than examine cumulative counts you look at day to day differences in the cumulative counts, or equivalently, the number of new cases each day. This is what gives rise to the “curves” when people talk about “flattening the curve”. Think of these as daily counts that we merely display as a histogram. This section was adapted from Tim Churches’ blog.
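First order differences are simple to compute. The Python sketch below uses a hypothetical cumulative series and a helper name introduced for this example.

```python
def daily_new_cases(cumulative):
    """First order differences: today's cumulative count minus yesterday's."""
    return [today - yesterday for yesterday, today in zip(cumulative, cumulative[1:])]

# Hypothetical cumulative counts over five days.
cumulative = [1, 3, 7, 15, 20]
new_cases = daily_new_cases(cumulative)

print(new_cases)
```

The differenced series is one element shorter than the cumulative series, and adding the differences back to the first cumulative value recovers the last one.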
Here is the daily incidence rate using the same US data starting on 3/10/20. This plot does not include DC and the territories. I also include a 7 day moving average (blue curve). This is the curve traced by computing the mean of the 7 previous days and using that value for the day’s count, then tracing those points with a curve. In the month of May many news outlets began including such moving average summaries rather than the raw data. They make the data look less variable than they actually are, which is not a good thing when modeling data. We want our models to take into account the variability in the data rather than mask the variability.
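A trailing moving average can be sketched in Python as below. The window here covers the current day and the six days before it, which is an assumption; shifting the window back by one day would match a strict reading of "the 7 previous days". The daily counts are hypothetical.

```python
def trailing_mean(values, window=7):
    """Mean of the most recent `window` values, None until enough days exist."""
    smoothed = []
    for i in range(len(values)):
        if i + 1 < window:
            smoothed.append(None)  # not enough history yet
        else:
            chunk = values[i + 1 - window : i + 1]
            smoothed.append(sum(chunk) / window)
    return smoothed

# Hypothetical daily counts with a visible dip on the seventh day.
daily = [10, 12, 8, 15, 20, 18, 5, 11, 13, 9]
smoothed = trailing_mean(daily)

defined = [s for s in smoothed if s is not None]
print([round(s, 2) for s in defined])
print(max(daily) - min(daily), round(max(defined) - min(defined), 2))
```

The spread of the smoothed values is much smaller than the spread of the raw counts, which is the masking of variability described above.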
We can add interactive capabilities to the ggplot by using the ggiraph package.
The html version of this book allows popups when the mouse hovers over the bars. You will notice that the number of positive tests has a weekly pattern emerging on Monday April 13, 2020. There is a trough that occurs weekly on Mondays through at least May 11. I will monitor this in subsequent weeks. This is an important part of the variability these data exhibit, which is lost if we “smooth” the data by computing moving averages as shown in the blue curve of the previous graph.
The weekly pattern is interesting and could be explained, for example, by fewer tests conducted on weekends with a delay in the reporting of test results. We can check this by examining another data set that reports the total number of tests conducted on a day, which is the sum of the number of positive tests, the number of negative tests and the number of pending tests. Of course, the number of positive tests will be correlated with this total measure, and this definition of total introduces some additional daily correlations due to some double counting (i.e., a pending today may show up as a negative tomorrow once the results are known). Further, this highlights the difficulty of interpreting such data, for many reasons, including the issue that tests vary in the length of time needed to achieve a result, so if states vary in which tests they use or there are changes over time in which tests are used, we would see such patterns reflected in these counts. Such changes may say more about the decisions made on which tests to use than on the changing properties of the distribution of positive and negative test results.
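The check described above can be sketched in Python: compute the daily total as positive plus negative plus pending, then average the totals by day of the week. The 14 daily records are hypothetical and were constructed to contain a Monday dip, so the output illustrates the method rather than any finding.

```python
from collections import defaultdict
from datetime import date, timedelta

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def total_tests(positive, negative, pending):
    """Daily total as defined in the text: positive + negative + pending."""
    return positive + negative + pending

def mean_by_weekday(start, daily_totals):
    """Average the daily totals separately for each day of the week."""
    sums, counts = defaultdict(float), defaultdict(int)
    for offset, total in enumerate(daily_totals):
        name = WEEKDAYS[(start + timedelta(days=offset)).weekday()]
        sums[name] += total
        counts[name] += 1
    return {name: sums[name] / counts[name] for name in sums}

# Hypothetical (positive, negative, pending) records for 14 consecutive days.
records = [
    (20, 150, 10), (30, 240, 20), (32, 250, 18), (33, 260, 17),
    (31, 255, 19), (27, 220, 13), (25, 200, 15),
    (22, 170, 8), (34, 260, 16), (35, 270, 15), (36, 280, 14),
    (34, 275, 16), (29, 235, 16), (27, 215, 18),
]

start = date(2020, 4, 13)  # a Monday
totals = [total_tests(*r) for r in records]
means = mean_by_weekday(start, totals)
lowest_day = min(means, key=means.get)

for name in WEEKDAYS:
    print(f"{name:<9} {means[name]:.0f}")
```

A weekday with a consistently lower mean than its neighbors points to a reporting cycle rather than a change in the underlying process.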
The bar plot of the total test results is presented in the next figure. While the number of tests administered is increasing over the weeks, there is a slight pattern that fewer tests are reported on Mondays relative to surrounding days (e.g., April 27, May 4, May 11). The dip at the end of May can be attributed to the Memorial Day weekend and the resulting delay in reporting.
Overall, we see that the number of daily tests has been increasing over recent weeks. This adds an important caveat moving forward as we interpret positive test results in the remainder of this book. We may see an increase in positive tests not because virus contamination is increasing but because there are more tests being conducted. Of course, if the number of positive cases decreases in spite of the number of tests increasing, this suggests the virus rate is decreasing. Such a conclusion assumes the characteristics of the tests remain the same. If over time, the quality of the tests changes in terms of false positives and false negatives, that would add an additional layer of complexity in interpreting such testing data.
4.5 Using PCA to check for outliers
Principal components analysis (PCA) has many uses. An interesting use of PCA with time series data is to detect outliers in a complicated data set. Let’s take the 50 states and the time series from 3/09/20 to present, to create a 50 by day data matrix of positive counts. Then compute a PCA of the correlation matrix between states. This correlation matrix is 50 x 50 and represents how similar each state’s cumulative trajectory is to another state’s cumulative trajectory for all possible pairs of states. Plot the factor scores of the 50 states on the first two PCs. The majority of the states will cluster together. The outlier states, those with very different trajectories, appear far away from the primary cluster. Four candidate outliers are the 4 states with early covid-19 cases: NY, WA, NJ and CA. Recall that the numbers in front of the state abbreviation correspond to the state population in millions. This type of PCA is commonly done in biological modeling to check for outliers, sometimes not on the raw data like I did here, but on the set of parameters that are estimated for each unit. For example, if you run regressions for each state, you can gather the betas from those regressions as data (such as 8 betas per state if you had 8 state-level predictors), compute a correlation matrix across states, and then run a PCA on that correlation matrix to detect states with an outlier pattern across their betas relative to other states.
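The outlier check can be sketched on a toy example in Python. This sketch uses six hypothetical units instead of 50 states, and it swaps in power iteration with deflation for a library PCA routine so that it runs with the standard library alone. Each unit's coordinates are its entries in the first two eigenvectors of the between-unit correlation matrix, scaled by the square root of the eigenvalue. All names and series are made up for illustration.

```python
import math

def correlation(x, y):
    """Pearson correlation between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def top_eigenpair(m, iterations=500):
    """Largest eigenvalue and its eigenvector by power iteration."""
    v = [1.0] * len(m)
    for _ in range(iterations):
        w = mat_vec(m, v)
        norm = math.sqrt(sum(a * a for a in w))
        v = [a / norm for a in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(m, v)))
    return eigenvalue, v

def first_two_pc_coordinates(series_by_unit):
    """Coordinates of each unit on the first two PCs of the correlation matrix."""
    names = list(series_by_unit)
    k = len(names)
    r = [[correlation(series_by_unit[a], series_by_unit[b]) for b in names] for a in names]
    l1, v1 = top_eigenpair(r)
    # Deflation: remove the first component, then repeat for the second.
    deflated = [[r[i][j] - l1 * v1[i] * v1[j] for j in range(k)] for i in range(k)]
    l2, v2 = top_eigenpair(deflated)
    s1, s2 = math.sqrt(l1), math.sqrt(max(l2, 0.0))
    return {name: (s1 * v1[i], s2 * v2[i]) for i, name in enumerate(names)}

def farthest_from_median(coords):
    """Unit whose coordinates lie farthest from the component-wise median."""
    xs = sorted(c[0] for c in coords.values())
    ys = sorted(c[1] for c in coords.values())
    mx, my = xs[len(xs) // 2], ys[len(ys) // 2]
    return max(coords, key=lambda u: math.hypot(coords[u][0] - mx, coords[u][1] - my))

# Hypothetical cumulative trajectories: five units still growing, one that plateaus early.
days = range(1, 31)
series = {
    "A": [1.0 * t ** 2.0 for t in days],
    "B": [3.0 * t ** 1.8 for t in days],
    "C": [0.5 * t ** 2.2 for t in days],
    "D": [2.0 * t ** 2.1 for t in days],
    "E": [4.0 * t ** 1.9 for t in days],
    "Z": [1000.0 * (1 - math.exp(-t / 3.0)) for t in days],
}

coords = first_two_pc_coordinates(series)
outlier = farthest_from_median(coords)
print(outlier)
```

The five growing units have nearly identical coordinates because their trajectories are almost perfectly correlated, while the unit that plateaus early sits apart. Distance from the median is only one possible rule for flagging a point; in practice the plot itself is inspected.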
In this chapter I focused on country, state and county-level analyses. Of course, finer resolution could be achieved by zip code, or any other sensible way of partitioning the map. A different type of partition could be in how hospitals are structured in an organized system across the US. There are about 340 “hospital referral regions.” These are regions where there is a primary hospital that can handle specialized cardiovascular procedures and several other hospitals, perhaps part of different systems, that refer patients needing these specialized services to the primary hospital. Or even more fine-grained, there are “hospital service areas”, which are made up of the zip codes that are serviced by a particular hospital. Here are such maps for covid. When one starts adding predictors or outcomes of these trajectories, then the level of resolution becomes critical. If one wanted to examine the contributors to health disparities around covid-19 and how they impact the properties of the trajectories, then one should consider matching the level of analysis of the trajectory (state, county, zip code, hospital service area, population density, etc.) with corresponding predictors at similar levels such as predictors relevant to government expenditures, measures of income inequality, indices of urbanization, unemployment rates, etc. (that is, each of those predictors would be evaluated at the same level as the trajectory data).