# Chapter 7 Deeper Understanding

In this chapter I consider some hypotheses that can be tested with relatively simple methods.

## 7.1 Regression Discontinuity

This subsection will show a fun test that we can examine in real time. Some governors of US states have decided to reopen businesses much sooner than recommended by public health officials. For example, Georgia has decided to reopen gyms, massage therapy, tattoo shops, barbers and hair stylists on April 24th, with restaurants, theaters and private social clubs opening up on the 27th. It will be difficult to practice social distancing while getting a tattoo or a body piercing.

Pending: I’ll plot the data for Georgia by county 3 weeks before the 24th and 3 weeks after. It will take a few days, possibly up to 14 days, to see the effects on the positive testing rate due to a higher rate of social contact, and maybe up to two weeks to see any effects on the death rate. I’ll examine a similar plot for an adjacent state that has not relaxed their social distancing mandate for comparison.

This example will illustrate the use of the regression discontinuity (RD) design, though before the causality police get on my case for not clarifying my assumptions I’ll point out that I won’t be using this to claim I have estimated the causal effect of social distancing. I’ll illustrate the idea of this design, its characteristic plot and point to papers where the reader can learn about the various approaches to extracting additional information from this type of research design.
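To make the idea of the design concrete, here is a minimal sketch on simulated data. It is not the analysis of the Georgia counts: the variable names, the size of the jump and the noise level are all assumptions chosen for illustration, and time is used as the running variable with the cutoff at day 0. The model fits a separate line on each side of the cutoff, and the coefficient on the indicator `post` estimates the jump at the cutoff.

```r
# Sketch of a sharp regression discontinuity with time as the running variable.
# All data are simulated; nothing here comes from the Georgia or Alabama counts.
set.seed(1)

day       <- -21:21                  # days relative to the cutoff (day 0)
post      <- as.numeric(day >= 0)    # 1 on and after the cutoff, 0 before
true_jump <- 15                      # hypothetical jump built into the simulation
cases     <- 100 + 2 * day + true_jump * post + rnorm(length(day), sd = 2)

rd_data <- data.frame(day = day, post = post, cases = cases)

# Separate intercepts and slopes on each side of the cutoff.
# The coefficient on `post` estimates the jump at day 0.
rd_fit   <- lm(cases ~ post * day, data = rd_data)
jump_hat <- unname(coef(rd_fit)["post"])

# Characteristic RD plot: points, the cutoff, and one fitted line per side.
plot(rd_data$day, rd_data$cases,
     xlab = "Days from cutoff", ylab = "Simulated daily cases")
abline(v = 0, col = "red")
for (side in 0:1) {
  idx <- rd_data$post == side
  lines(rd_data$day[idx], fitted(rd_fit)[idx], col = "blue")
}
```

With real county data the jump is rarely this clean, and the estimate depends on the window around the cutoff and on the form chosen for the trend on each side.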

The city of Atlanta is in Fulton county, which is the top curve.

For comparison, I’ll pick an adjacent state that has not relaxed its restrictions. Alabama seems a safe comparison as both South Carolina and Florida have begun partial openings (e.g., Florida opened beaches prior to April 24). One could also examine Tennessee and North Carolina, which also border Georgia.

While I have a preference for examining incidence plots, the county-level data are too noisy to make incidence plots useful. But, at least, they do highlight the noise, unlike the cumulative counts, which act as a visual smoother on the curves. Here is the incidence plot for Fulton county in Georgia. The red vertical line marks April 24, 2020, when many restrictions in Georgia were lifted. For comparison, I also show Montgomery county in Alabama. We see that the trend remains the same for Fulton county, Georgia after the restrictions were lifted, but the comparison county in Alabama exhibits a higher rate of Covid-19 cases. This is the opposite of the hypothesis.

The good news is that the study found that patient-reported physical activity and quality of life status are associated with fewer re-admissions. So if we can get these covid-19-related ARDS survivors walking and exercising we may have some hope. Unfortunately, many ARDS survivors have trouble merely standing up from a chair, let alone going for a walk around the block wearing a mask.

More on this to come. Here are some initial plots, but the data are suspect, with various missing days and implausible values such as New York reporting 5016 on 9 consecutive days, which suggests the counts were not updated daily.
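One simple way to screen for this kind of stale reporting is to look for long runs of identical consecutive values. The sketch below uses base R's `rle()` on a made-up series; only the run of nine 5016s mirrors the New York example, and the other values, the name `reported_count` and the threshold `min_run` are assumptions for illustration.

```r
# Flag runs of identical consecutive values in a daily series.
# `reported_count` is a made-up vector; only the nine 5016s echo the text.
reported_count <- c(4890, 4950, rep(5016, 9), 5100, 5075)

runs <- rle(reported_count)
run_table <- data.frame(
  value  = runs$values,
  length = runs$lengths,
  start  = cumsum(runs$lengths) - runs$lengths + 1   # index where each run begins
)

# Keep only runs long enough to look like a missed update (threshold is arbitrary).
min_run <- 3
suspect_runs <- run_table[run_table$length >= min_run, ]
print(suspect_runs)
```

A flagged run is only a prompt to look closer, since a genuinely flat count over a few days is possible for small numbers.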

## 7.3 Deaths

The SIR model can be explored in more detail and its implications can be examined. Here I’ll focus on the simple SIR model without any of the bells-and-whistles I discussed in Chapter 6. The three primary equations of the SIR model are not easy to work with directly and they require numerical methods to approximate.

I’ll pursue one approach that avoids numerical methods by making some assumptions that simplify these expressions. As you will see, this simplification provides a reasonable fit to the data, but its primary importance is in providing intuition into how to interpret these equations.

Following Keeling and Rohani (2008), if we assume that the $$R_0$$ is relatively small and that people interact randomly, then the incidence curve from the SIR model can be approximated with this form

$a \;\;\mbox{sech}^2 (\kappa_0 + \kappa_1 t)$

where sech is the hyperbolic secant, its argument is a linear transformation of time t with an intercept and slope denoted by the $$\kappa$$s, and the parameter $$a$$ is a multiplier. These parameters are each functions of the parameters of the SIR model (the $$\beta_1$$ and $$\beta_2$$ from Chapter 6) as well as the starting count at time 0. The hyperbolic secant can be re-expressed in terms of exponentials: $$\mbox{sech} (x) = \frac{2}{e^{x} + e^{-x}}$$. To arrive at this expression one could work with the SIR equations presented in Chapter 6 directly, but there is a simple way to get this form by assuming a Poisson distribution on the counts and showing that the probability that a randomly selected individual is not infected when the epidemic has an $$R_0$$ is $$e^{\frac{-R_0}{N}}$$ (i.e., a Poisson with k=0 and a rate of $$R_0/N$$); see Keeling and Rohani, 2008, Ch 2.
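As a small sketch of this functional form in R, the code below writes the hyperbolic secant with exponentials and evaluates the curve on a grid of days. The function name `sech2_curve` and the parameter values are assumptions picked for illustration, not estimates from the data.

```r
# Hyperbolic secant written with exponentials, as in the text.
sech <- function(x) 2 / (exp(x) + exp(-x))

# Approximate SIR incidence curve: a * sech^2(k0 + k1 * t).
# `sech2_curve` is a name chosen for this sketch.
sech2_curve <- function(t, a, k0, k1) a * sech(k0 + k1 * t)^2

# Illustrative parameter values only.
a  <- 800
k0 <- -1.9
k1 <- 0.045

t         <- 0:120
incidence <- sech2_curve(t, a, k0, k1)

# The curve peaks where the argument is zero, i.e. at t = -k0 / k1,
# and its height at the peak is a.
peak_day <- -k0 / k1
```

Because the curve depends on time only through the squared hyperbolic secant, it is symmetric around `peak_day`, which is worth keeping in mind when comparing it with data that decline more slowly than they rose.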

Let’s perform a direct fit to the daily death count, since death is a more appropriate outcome for the SIR model. For this section I’ll use the covid tracker data (see Chapter 3). We haven’t worked much with that data yet so I’ll also need to do some data cleaning and formatting to be consistent with the other analyses conducted in this book.

## [1] "File name associated with this run is v.95-111-gbfd5aae-dataworkspace.RData"

FIX TO PERCAPITA DEATH

We see that this function (red curve) does a reasonable job of fitting the daily death counts (black points) in New York. There is noise in these data that this form of the SIR model cannot easily pick up. I show an additional estimation using only the data up through day 42 (in green). This curve shows that the functional form fits reasonably well up through the peak but does not capture the asymmetry, that is, the slow decline in the daily counts after the peak. This suggests there are additional processes that this simple implementation of the SIR model does not capture, and provides a clue for additional modeling changes one could implement.
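To make the fitting step concrete, here is a minimal sketch that fits the same three-parameter form by nonlinear least squares with `nls()`. It runs on simulated data, not the New York counts, and it is not the code behind the figure: the true parameter values, the noise level and the starting values are all assumptions.

```r
set.seed(42)
sech <- function(x) 2 / (exp(x) + exp(-x))

# Simulated daily deaths; the "true" values used here are arbitrary.
day    <- 1:100
deaths <- 800 * sech(-1.9 + 0.045 * day)^2 + rnorm(length(day), sd = 30)
sim    <- data.frame(day = day, deaths = deaths)

# Nonlinear least squares fit of a * sech^2(k0 + k1 * day).
# Starting values are rough guesses: a near the observed peak height,
# and -k0 / k1 near the day on which the peak occurs.
fit_all <- nls(deaths ~ a * sech(k0 + k1 * day)^2,
               data  = sim,
               start = list(a = 700, k0 = -1.7, k1 = 0.04))

est <- coef(fit_all)
print(est)

# Estimated day of the peak implied by the fit.
peak_hat <- unname(-est["k0"] / est["k1"])
```

Since the squared hyperbolic secant is an even function, flipping the signs of both `k0` and `k1` gives the same curve, so the signs of the starting values determine which of the two equivalent solutions is reported.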

## Nonlinear mixed-effects model fit by maximum likelihood
##   Model: deathIncrease ~ richfu.sir(day.numeric, a, b0, b1)
##  Data: covid.tracking.new
##     AIC   BIC logLik
##   50192 50237 -25089
##
## Random effects:
##  Formula: list(a ~ 1, b1 ~ 1)
##  Level: state
##  Structure: General positive-definite, Log-Cholesky parametrization
##          StdDev   Corr
## a        1.02e+02 a
## b1       5.96e-03 0.649
## Residual 3.72e+01
##
## Fixed effects: a + b0 + b1 ~ 1
##    Value Std.Error   DF t-value p-value
## a   46.1     14.39 4898     3.2  0.0014
## b0  -1.9      0.03 4898   -61.1  0.0000
## b1   0.0      0.00 4898    25.2  0.0000
##  Correlation:
##    a      b0
## b0 -0.022
## b1  0.446 -0.388
##
## Standardized Within-Group Residuals:
##      Min       Q1      Med       Q3      Max
## -8.13483 -0.07805 -0.00752  0.08741 50.22713
##
## Number of Observations: 4950
## Number of Groups: 50

The death data are quite noisy. For example, here are plots for both New Jersey and Michigan.

This large daily variability is troubling. One possible source is the way some states report deaths. In Michigan, if a death is determined to be covid-related, it is added to the count for the day of that determination even if the actual death occurred days earlier. To illustrate, on Saturday May 9, 2020, Michigan reported 133 deaths, but 67 of those “deaths occurred in recent days” and were added to the Saturday total after being determined to be covid-related. New York City, on the other hand, counts a death as a covid death when there is a positive test result and as “probable” if the death certificate lists covid-19 as the cause of death (even if not verified with a known positive lab test). You may also be wondering about the single outlier for New York on May 6, 2020. That was due to how the state reconciled deaths from nursing homes. Determining the number of deaths is not an easy task, as this New York Times article shows. This also points out how we must examine data very carefully; our models and inferences will suffer greatly if we are too careless.

To be fair, these three plots (New York, New Jersey and Michigan) have different scales for the Y-axes, which may exaggerate the variability. I’ll redo the three plots with the same scale on the Y axis to check that possibility.

But a more interesting result emerges by examining this approximate functional form. We can integrate this approximation over t to see what form emerges for the cumulative death count. In this context, integration is analogous to summing the daily counts to create a cumulative count but we do this symbolically on the function rather than the data in order to learn the form of the function on the cumulative count.

The general form is

$\bigg[\frac{a}{\kappa_1 (1+e^{2(\kappa_0 + \kappa_1 t)} )}\bigg] \;\;\; \bigg[(e^{2(\kappa_0 + \kappa_1 t)} - 1)\bigg] + C$

The left square bracket term has the form of the logistic growth function and the right square bracket term has the form of an exponential growth function. The term $$C$$ is the integration constant that we will set so that the cumulative sum starts at 0. Thus, this simplification of the SIR model yields a product of the two functions we explored in Chapter 5. It isn’t one or the other as we explored earlier, but this model suggests the curves approximately follow a combination of both forms.

This is an example of the advantage of doing some mathematical modeling to be able to examine the implications of your model. In the earlier chapter we considered exponential and logistic growth models without careful attention to why we chose those functions. This type of mathematical modeling allows you to justify the functional forms you want to test based on implications from your hypotheses. Here I’m referring to the change equations of the SIR model as a set of hypotheses that govern how change occurs among the susceptibles, the infecteds and recovered. Those equations (with some additional assumptions and approximations) imply this product of both forms, which we can test and examine with data. Not counting the integration constant, this approximation also has 3 parameters, as we had in both the exponential form and the logistic growth form in Chapter 5.
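The closed form can be checked numerically. The sketch below, using illustrative parameter values that are assumptions rather than estimates, compares the expression above with a numerical integral of the daily curve from `integrate()`. It also uses the identity $$\tanh(x) = \frac{e^{2x} - 1}{e^{2x} + 1}$$, under which the product of the two bracketed terms equals $$\frac{a}{\kappa_1}\tanh(\kappa_0 + \kappa_1 t)$$.

```r
sech <- function(x) 2 / (exp(x) + exp(-x))

# Illustrative parameter values only.
a  <- 800
k0 <- -1.9
k1 <- 0.045

# Daily curve: a * sech^2(k0 + k1 * t).
daily <- function(t) a * sech(k0 + k1 * t)^2

# Closed form from the text, before adding the constant C.
closed_form <- function(t) {
  e2 <- exp(2 * (k0 + k1 * t))
  (a / (k1 * (1 + e2))) * (e2 - 1)
}

# Choose C so that the cumulative curve is 0 at t = 0.
C <- -closed_form(0)
cumulative <- function(t) closed_form(t) + C

# Numerical integral of the daily curve from day 0 to day 60 for comparison.
numeric_area <- integrate(daily, lower = 0, upper = 60, rel.tol = 1e-10)$value

# The same quantity written with tanh.
tanh_area <- (a / k1) * (tanh(k0 + k1 * 60) - tanh(k0))

print(c(closed = cumulative(60), numeric = numeric_area, tanh = tanh_area))
```

All three numbers should agree up to numerical error, which is a quick way to catch a sign or factor slip when deriving a cumulative form by hand.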

The fit is ok given the simplifying assumptions that were made to derive this form, such as assuming that $$R_0$$ is small, independence in transmission and some of the approximations that were used to derive this form (see Keeling & Rohani, 2008). We can reject a Poisson outright because the $$R_0$$ value likely changed over time with different levels of social distancing.

It is helpful to estimate the logistic function directly on the cumulative death count and superimpose that curve (blue) on the same graph. We see that the pure logistic provides a better fit to these data. This shouldn’t be too surprising since I fit the logistic form directly to the data, whereas the red curve corresponding to the approximate SIR model takes into account other processes and relies on additional simplifying assumptions and approximations. Nonetheless, we can use the data and the fit of the models to decide whether to reject the SIR model because it implies a functional form that fits worse than a simpler logistic growth model, to relax those simplifying assumptions and derive a new form, or to keep using the SIR model because the fit to the data is “close enough.” Of course, what counts as close enough depends on the application: one may want to describe a bunch of points with a simple curve, or one may want to use the model to make forecasts. In the case of forecasting to inform public policy, “close enough” better be pretty darn close because many people’s lives will be affected. Often, the disagreements between modelers about their predictions boil down to details of the assumptions each modeler makes and the kinds of simplifications they use.


Compare AIC

##                    df   AIC
## out.logistic.death 10 63082
## out.nlme.deaths     7 50192

Add phase plot of this model

## 7.4 Functional Equations

Need to develop this section. Currently just a few plots.