Kurt Schulzke, JD, CPA, CFE

I. Introduction

What would the Securities & Exchange Commission do if Home Depot reported as its 2020 “net income” only its revenues (not expenses), from only its best eight months, in only its best 18 states? What if we replace “Home Depot” with “CDC”? Which organization should be held to higher accounting and disclosure standards? Which has more power to damage the United States with misleading information?

In 2020, the CDC’s budget ballooned from $8 billion to roughly $46 billion, all in response to perceived Covid-19 mortality. How many deaths has Covid-19 caused? Causal analysis is always a tricky business,1 but pinpointing cause of death is notoriously error-prone,2 more so when a reported cause is used to solicit billions in funding. Indeed, the CDC emphasizes the funding connection twice on the first page of its 65-page death certificate handbook.3 In support of its remarkable budget increase, and claiming concern that death-certificate-based counts were under-counting Covid-19 deaths, in fall 2020, the CDC began estimating “excess” deaths as a proxy for Covid-19 deaths.

To be clear, the CDC’s estimate is not data. It is an estimate, which means it is one judgmental interpretation of data among multiple possible interpretations. Is the CDC’s estimate realistic? To find out, we will explore the CDC’s data and evaluate the model they used to make their estimate.

The key high-level questions are these: What is the plausible range of “excess” deaths in a relevant place over a relevant time period? What share of excess deaths might be fairly attributed to Covid-19? And how transparently does the CDC present the range and share to readers? We offer some answers to these questions in Part V. We’ll keep things simple and (mostly) visual, avoiding complex statistical tools.

At the outset, we note that the SEC’s fundamental “anti-fraud” standard, Rule 10b-5, declares it illegal, in buying or selling stocks, to (a) employ any device, scheme, or artifice to defraud, (b) mislead readers by statements or omissions, or (c) engage in any act, practice, or course of business which operates as a fraud or deceit upon anyone.

Let’s see how the CDC measures up to stock market disclosure requirements. In doing so, we’ll generally follow the AICPA standard for auditing management estimates, SAS 143, which admonishes that the importance of professional skepticism:

…increases when accounting estimates are subject to a greater degree of estimation uncertainty or are affected to a greater degree by complexity, subjectivity, or other inherent risk factors. Similarly, the exercise of professional skepticism is important when there is greater susceptibility to misstatement due to management bias or fraud.4

In this connection, we should also consider the “fraud triangle,” which holds that opportunities, incentives, and easily accessible rationalizations heighten fraud risk. All three legs of the triangle are abundantly evident here.

Before we engage the CDC’s model, we’ll peek at the “raw” data (scaled to deaths per thousand population, or DPT) while keeping in mind that they are curated by the NCHS, a division of the CDC. We’ll assume that the data are what the CDC says they are: unvarnished, reported deaths data. An audit would be required to verify the validity of this assumption. For context, the U.S. crude annual mortality rate has recently averaged 8-9 DPT nationally, ranging between 12.5 (WV) and 9 (Alaska). And, according to the CDC, the 1918 flu killed 675,000 out of 103 million Americans (6.6 DPT).

Keep in mind that we are looking at total deaths data, not Covid-19 deaths data. While the popular narrative makes it difficult to remember this, even the CDC admits in their Technical Notes that they do not know what share of “excess” deaths can be blamed on Covid-19. This means that they cannot say how many people have truly died from Covid-19. No one can. Not the WHO, not Johns Hopkins U., not Emory U., not Georgia Tech. No one knows. For all we know, the Covid-19 narrative may be masking other, more deadly causes.

Figure 1 gets us started. It visualizes weekly total U.S. DPT from all causes and all jurisdictions since 2014. The CDC says that the data for pre-2020 years is “final” and that 2020-21 data are “provisional,” though the “all cause” deaths figures (from most states) appear pretty stable after about four weeks.

ByStates1420(c("United States"), c("Fig. 1.1"), c("2014-01-04"), c("2020-12-20")) +
geom_hline(yintercept = 0.2071215, size = .6, color = "green4", alpha = .5) +
  labs(title = "Weekly All-Cause Deaths per Thousand", subtitle = "Jan 2014 - Dec 19, 2020", y = "Deaths per Thousand Pop", x = "Month & Year", caption = "Source: CDC/NCHS; U.S. Census. \n 2020 population is assumed equal to 2019 U.S. census estimate.", tag = c("Fig. 1"))

Even at this altitude, the 2020 data paint a strange picture compared to prior years. The horizontal green line marks the high-point of the 2018 flu season. The y-axis scale runs from 0.10 to 0.95 deaths per thousand (DPT) to accommodate spring 2020 spikes in the Tri-state area (NY, CT, NJ). These will appear later. The 2019 and 2020 highs stand out, respectively, as low and high, but why and by how much? We take a closer look in Figure 2.1, which shrinks the y-axis (vertical) range and drops the 2014 data. This brings smaller patterns into view.

Fig_2.1 <- suppressMessages(print(ByStates1420(c("United States"), c("Fig. 2.1"), c("2014-12-31"), c("2020-12-20")) +
  scale_y_continuous(breaks = seq(0.10,0.25,0.025),limits = c(0.10,0.25)) + 
  scale_x_date(date_breaks = "2 months", date_labels = "%b %y") +
  geom_hline(yintercept = 0.2071215, size = .5, color = "green4", alpha = .5) +
  labs(title = "Weekly All-Cause Deaths per Thousand - U.S.", subtitle = "Oct 2014 - Dec 19 2020", y = "Deaths per Thousand Pop", x = "Month & Year", caption = "Source: CDC/NCHS; U.S. Census. \n 2020 population is assumed equal to 2019 U.S. census estimate.", tag = c("Fig. 2.1"))))

Figure 2.1 shows a spring 2020 spike in pink. At its apex, it rises by roughly 0.025 DPT above the 2018 peak (blue). A 4-week-long peak of 0.025 DPT adds 0.10 DPT, or about 1 percent of the annual total of 8-9 DPT. Spikes in 2015 and 2018 were followed by down years. If this pattern holds, we’ll see the 2020 peak followed by lows in 2021-22. That is, unless – as cancer diagnosis data suggest – Covid-19 interventions drive up deaths from other causes.
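The peak arithmetic above can be checked with a few lines of base R. This is only a back-of-envelope sketch: the variable names are ours, and the inputs are the rounded figures quoted in the text, not values pulled from the CDC tables.

```r
# Back-of-envelope check of the spring 2020 peak arithmetic.
# All inputs are rounded figures quoted in the text (assumptions, not raw data).
peak_height_dpt <- 0.025                 # weekly DPT above the 2018 peak
peak_weeks      <- 4                     # approximate length of the spike
annual_dpt      <- c(low = 8, high = 9)  # typical annual crude mortality, DPT

# Deaths per thousand added by the spike
added_dpt <- peak_height_dpt * peak_weeks

# Spike as a percentage of a typical year's deaths
share_pct <- 100 * added_dpt / annual_dpt

print(added_dpt)
print(round(share_pct, 2))
```

Depending on which end of the 8-9 DPT range is used, the share lands a little above 1 percent, which is consistent with the "about 1 percent" stated above.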

In addition, 2019 missed the typical January - March spike. Instead, deaths plateaued. Later, after the summer 2019 trough, they climbed Aug to Jan, then plateaued again – reminiscent of 2016 – in January and February 2020, before spiking well above the 2018 peak between March 28 and April 25. Then they dropped, with an after-shock in mid July to early August, before falling again until mid-October. In short, lots of ups and downs but no 2020 summer trough. This is all at the national level.

How does the picture change if we exclude the Tri-state area? Figure 2.2 tells the tail, as it were. Focus on the flat green line (representing the 2018 peak in flu deaths) and the squiggly purple and pink ones.

Fig_2.2 <- suppressMessages(print(ByStates1420nin(c("United States", "New York City", "New York", "Connecticut", "New Jersey"), c("2014-01-04"), c("2020-12-20")) +
  scale_y_continuous(breaks = seq(0.10,0.25,0.025),limits = c(0.10,0.25)) + 
  scale_x_date(date_breaks = "2 months", date_labels = "%b %y") +
  geom_hline(yintercept = 0.2071215, size = .5, color = "green4", alpha = .5) +
  labs(title = "Weekly All-Cause DPT - U.S. less Tri-state area", subtitle = "Oct 2014 - Dec 19 2020", y = "Deaths per Thousand Pop", x = "Month & Year", caption = "Source: CDC/NCHS; U.S. Census. \n 2020 population is assumed equal to 2019 U.S. census estimate.", tag = c("Fig. 2.2"))))

Removing the Tri-state drops the national spring 2020 peak below January 2018. The line then morphs into a serpentine spring-summer-fall that would make Hogwarts and Loch Ness proud. It’s almost as if the peak that no-showed in January 2019 was “apparated” into spring and summer 2020. Maybe, outside of the first-hit Tri-state – which spiked the 2020 national high over the (green) 2018 flu line – the spring 2020 school and business closures that occurred in many states (e.g., Georgia) merely smoothed deaths into the summer. We can search for the serpent in the state data, though this won’t guarantee that the data can be trusted. First, let’s talk modeling excess deaths.

Excess deaths are estimated, not counted. They are a function of both the estimator’s subjective judgments and objective data. It is widely accepted that estimated excess deaths equal reported deaths from all causes in excess of estimated “expected” deaths over a given time period in a chosen geographical area. In theory, excess deaths may result from some novel cause or causes such as Covid-19 and societal responses to it. However, the CDC states that Covid-19 may not be the cause of any of the CDC’s estimated excess deaths.5 Similarly, an excess of zero or less is some evidence that no novel cause is killing large numbers of people who would not die anyway. So the debate centers on the meaning and estimation of the plausible range of expected deaths, the proper time and geography subsets to use for comparing expected to actual deaths, how to present the comparison, and what share of the excess, if any, to attribute to particular plausible causes (e.g., Covid-19).

In its October 23, 2020 Morbidity and Mortality Weekly Report (MMWR), the CDC published 299,028 as its estimate of nationwide excess deaths, from Feb 1 to Oct 3, 2020. It attributed ~2/3 of these (198,081) to Covid-19.6 This 198,081 number, apparently based largely on PCR test positives, may be significantly overstated, particularly in light of the WHO’s Jan 20, 2021 Notice warning, contrary to longstanding CDC Covid-19 dogma, that a single PCR test positive alone does not justify a “Covid-19” diagnosis.7

That said, in the MMWR, the CDC states that excess deaths “have been used to estimate the impact of public health pandemics or disasters, particularly when there are questions about underascertainment of deaths directly attributable to a given event or cause.” This implies, misleadingly, that excess deaths are not also useful for exposing suspected over-counts.

In any case, after updating for data available through mid-January 2021, the CDC’s gross estimated excess for the Feb 1 - Oct 3, 2020 period is either 311K or 206K, depending on which expected death benchmark (mean or upper-bound) is used. If we assume that the CDC’s 2/3 multiplier holds for all numbers of excess deaths, these would translate to 208K or 137K gross estimated Covid-19 deaths.

# Gross excess over the mean expected count, Feb 1 - Oct 3, 2020, scaled by the CDC's ~2/3 Covid-19 share
filter(Wtd_CDC_C19_Ex, Code != "US" & `Week Ending Date` > "2020-01-31" & `Week Ending Date` < "2020-10-04") %>%
  ungroup() %>%
  summarise(Excess = sum(`Excess Higher Estimate`)*2/3)

# Same window and multiplier, measured against the upper-bound expected count
filter(Wtd_CDC_C19_Ex, Code != "US" & `Week Ending Date` > "2020-01-31" & `Week Ending Date` < "2020-10-04") %>%
  ungroup() %>%
  summarise(Excess = sum(`Excess Lower Estimate`)*2/3)

But this 8-month, gross national estimate, based on multiple by-state-by-week statistical models,8 is misleading. This is partly because it arbitrarily cherry-picks eight months instead of twelve. And not just any eight months. Starting the count just as deaths were about to spike in the Tri-state is like Home Depot reporting profits for only its best eight months. It also implies that a nationwide estimate is meaningful when, because of diversity among states and regions, it is not.

Yet questions regarding the CDC estimate – involving binning, netting, algorithm choice, estimation uncertainty, and training data – run much deeper. And the net impact of Covid-19 where you live is probably much less than the CDC’s national estimate suggests. In most states, according to the CDC’s own “provisional” numbers, the net impact of Covid-19 for the year ended September 30, 2020 was zero or less, if we use the lower side of the CDC’s estimate. What? The CDC’s estimate is two-sided, you ask? Yes, of course. More than two-sided. This is true of every estimate. Numbers can be spun in many directions. Estimates and their interpretation are as much art as science, and estimates cover a range of possibilities. In the CDC’s case, the range is very wide. Let’s consider how and why.

II. Questions & choices

The CDC’s model relies in large part on five subjective choices: 1. time and geography segments or “bins” for prediction and reporting, 2. whether to ignore deficits or net them against surpluses, 3. what to say about estimation uncertainty (e.g., report high and low point estimates or probability distributions over the plausible range), 4. which algorithm or other tool to use in predicting (i.e., “modeling”) expected deaths, and 5. which historical data to use for “training” the model. These choices significantly impact outcomes.

Overshadowing these modeling choices are visceral value-based questions: What is the appropriate “loss function” for evaluating possible regulatory responses to novel causes of mortality like Covid-19? In other words, what should society be trying to maximize or minimize with regulatory choices? Do we want to maximize the joys of association with other human beings or minimize exposure to a single risk (e.g., contracting Covid-19)?

We’ll comment briefly on each of these issues in the sections that follow.

A. Netting & binning

The CDC’s excess deaths model (we’ll call them “surplus” deaths) predicts how many people will die in each week of the year in each state. These weekly-by-state predictions are then compared to reported deaths. If the reported deaths exceed predicted deaths in a given week, that excess counts toward surplus deaths. However, any deficit (i.e., where predicted deaths exceed reported) is counted as zero.9 The CDC then reports the sum of these weekly, gross surpluses as U.S. excess deaths.
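To make the difference between gross and net accounting concrete, here is a minimal base R sketch. The weekly counts are hypothetical numbers invented for illustration (they are not CDC data), and the variable names are ours. The gross figure floors each weekly deficit at zero, as the CDC method described above does, while the net figure lets deficits offset surpluses.

```r
# Hypothetical weekly counts for a single state (illustrative only, NOT CDC data)
reported <- c(1500, 1650, 1400, 1720, 1380)
expected <- c(1550, 1500, 1500, 1600, 1450)

# Weekly difference: positive = surplus, negative = deficit
weekly_diff <- reported - expected

# Gross surplus: every deficit week is counted as zero before summing
gross_surplus <- sum(pmax(weekly_diff, 0))

# Net surplus: deficit weeks are allowed to offset surplus weeks
net_surplus <- sum(weekly_diff)

print(weekly_diff)
print(c(gross = gross_surplus, net = net_surplus))
```

In this toy series the gross total is several times the net total, even though both are computed from the same five weeks. The gap grows with the number and depth of deficit weeks in the window.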

Let’s use Georgia as an example. Figure 3, copied from the CDC’s website, illustrates the impact of the CDC’s netting and binning choices on Georgia’s surplus deaths. The blue vertical bars show reported deaths adjusted for the CDC’s assumptions about incomplete reporting. The brown line visualizes the mean CDC expected deaths produced by the model. The difference between the brown line and the top of the blue bars–in yellow highlight–equals the deficit or surplus deaths per week, but only if we use the mean expected-death threshold. If the CDC’s dashboard allowed us to use the upper-bound expected deaths, the brown line would rise and the yellow zone would expand considerably.

Fig. 3

Source: <https://www.cdc.gov/nchs/nvss/vsrr/covid19/excess_deaths.htm>

The mean-based deficits highlighted in yellow were ignored by the CDC’s national estimate of 198K Covid-19 deaths. Ignoring the yellow zone inflates estimated Covid-19 deaths, just as if Home Depot were to report its gross revenues (without deducting expenses) as net income. To illustrate the impact of this one-sided accounting, gross Georgia excess deaths (ignoring the yellow zone) for the two years, 2019-2020, using the CDC’s mean expected deaths would be 14,144, while the net (counting the yellow zone) would be 2,156 less, or 11,988:

# Gross: sum of weekly surpluses over the mean expected count (deficit weeks count as zero)
filter(Wtd_CDC_C19_Ex, State %in% c("Georgia") & (Year == "2020" | Year == "2019")) %>%
  ungroup() %>%
  summarise(sum(`Excess Higher Estimate`))

# Net: weekly deficits are allowed to offset weekly surpluses
filter(Wtd_CDC_C19_Ex, State %in% c("Georgia") & (Year == "2020" | Year == "2019")) %>%
  ungroup() %>%
  summarise(sum(`Observed Number` - `Average Expected Count`))

By contrast, if measured against the CDC’s upper bound expected deaths, the CDC’s estimate of Georgia’s gross surplus (excess) deaths would be 8,997, with a net deficit of 185:

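# Gross: sum of weekly surpluses over the upper-bound expected count (deficit weeks count as zero)
filter(Wtd_CDC_C19_Ex, State %in% c("Georgia") & (Year == "2020" | Year == "2019")) %>%
  ungroup() %>%
  summarise(sum(`Excess Lower Estimate`))

# Net: observed minus upper-bound threshold, deficits included
filter(Wtd_CDC_C19_Ex, State %in% c("Georgia") & (Year == "2020" | Year == "2019")) %>%
  ungroup() %>%
  summarise(sum(`Observed Number` - `Upper Bound Threshold`))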

The CDC attempts to rationalize this surplus-only accounting by arguing that some weekly deficits in some states may eventually flip to surpluses because of “incomplete” reporting. This argument does not pass the straight-face test. Reporting is always incomplete and every organization has accounting cutoffs. Furthermore, data scientists routinely model missing or incomplete data, especially in the health sciences. It’s data science 101, as the CDC well knows. The CDC’s own model “weights” weekly by-state reported deaths by their predicted incompleteness. That the CDC models these reporting lags means they can also model the deficits.

And let’s not forget that the numbers used by the CDC to recommend nationwide lockdowns in the spring of 2020 were beyond “incomplete”: they were largely imaginary. So while deficits for some slow jurisdictions (e.g., North Carolina) might warrant “deferred netting,” permanently hiding the deficits finds no scientific justification.

At the national level, by netting estimated deficits against estimated surpluses10 and reporting the entire Sep 30, 2020 year (instead of just eight months), the CDC’s 299K mean-based estimate drops to about 60K.11 Why didn’t the CDC bin annually? One possible motive is suggested by the R package – surveillance – used by the CDC to develop its model.12 But annual binning is not the only option. Figure 2 suggests that total deaths are seasonal, cyclical, and gradually trending upward, naturally cycling through ever-higher highs every 2-3 years. Thus, biennial or triennial bins might be more informative. Whatever the binning window, deficits must be counted.

Beyond netting and time binning, geographical binning also affects the reported surplus. It does so at two levels: prediction and reporting. First, the CDC’s state-based prediction model produces surpluses that are 25% higher than they would be if the model were run at the national level. Second, putting the entire U.S. into a single geographic bucket obscures important regional and state-level differences. For example, using annual binning (9-30 YE) and dropping the Tri-state area from the national estimate reduces national surplus deaths to ~13K. Tri-state added ~41K surplus deaths to the national total. We’ll look more closely at state and regional differences in Part IV.

B. Uncertainty

Two issues are of concern here. First, the CDC MMWR claims an impossible degree of estimating precision, reporting its national (and state) estimates to single deaths, e.g., “299,028,” instead of rounding, e.g., “299K,” or stating a range, “50K-299K.” Given the many subjective assumptions involved, such precision is misleading. While the CDC explains most of its assumptions and notes that its model produces “expected” and “upper bound” point estimates,13 it nevertheless promotes the illusion of precision through its public-facing media like the MMWR. This conveys the false impression that the estimate is more certain than it is. Modeling is a mix of subjective judgment and objective science. The CDC should make this clear.

Second, aside from inflated claims of precision, point estimates – which is all the CDC offers on its dashboard – are inadequate to support responsible policy. Are 10-20K excess deaths more or less probable than 190-200K? While the notes refer to upper-bound and mean expected deaths, the dashboard (see Fig. 3, brown line) shows only the mean value and prevents readers from accessing the upper-bound.

The notes say that the model relies on the historical mean (the “expected value,” in statistical terms) for each MMWR week to establish the expectation for the current week’s deaths. Thus, every week in every state gets its own little model. It’s not clear how many data points feed into each weekly model, but – with only five years to pick from – there might be only five (2015-2019). For illustrative purposes, let’s look at those five (and their means) for New York City and Nebraska in week 15.

# Create overlaid histograms with means
# Pull the data
pdata <- filter(pcap_long, Juris %in% c("New York City", "Nebraska") & Cause %in% c("All") & `MMWR Week` %in% c("15") & Year %nin% c("2014","2020", "2021")) %>% 
  select(Juris, Year, Deaths, DeathsPM)
pdata

pdata %>% 
  group_by(Juris) %>% 
  summarise(Mean = mean(Deaths),
            DPTmu = mean(DeathsPM))

Of these numbers, which deserves to be the chosen, “expected” one in Nebraska and NYC in 2020 Week 15? Statistically speaking, none of them. That’s right: None of these numbers was individually probable at all because probability attaches only to ranges, not to point-estimates. For the record, here are the reported Week 15 results:

filter(pcap_long, Juris %in% c("New York City", "Nebraska") & Cause %in% c("All") & `MMWR Week` %in% c("15") & Year %in% c("2020")) %>% 
  select(Juris, Year, Deaths, DeathsPM)

NYC’s reported 2020 total of 7,860 deaths (0.942 DPT) was 7.4 times the 2015-19 mean, 1,060 (0.126 DPT), but Nebraska’s reported 343 (0.177 DPT) came remarkably close to its mean, 335 (0.176 DPT). Yet, prospectively, none of the historical numbers or their mean should carry any more weight than the others in establishing the expectation that defines an “excess.” And why limit the benchmarking period to the past five years? Why not ten or twenty? Why not go back to the 1918 flu, which killed 6.6 DPT? There’s no science to the choice; it is completely arbitrary and should be decided through open debate among elected representatives, not by cloistered, unaccountable data modelers.
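As a quick sanity check on the ratios just quoted, the following base R sketch recomputes them from the rounded counts in the text. The data frame and its column names are ours, introduced only for illustration.

```r
# Week 15 counts as quoted in the text (rounded; not pulled from pcap_long)
week15 <- data.frame(
  juris    = c("New York City", "Nebraska"),
  reported = c(7860, 343),   # 2020 reported deaths, week 15
  mean5yr  = c(1060, 335)    # 2015-19 mean deaths, week 15
)

# Ratio of 2020 reported deaths to the five-year mean
week15$ratio <- round(week15$reported / week15$mean5yr, 2)

print(week15)
```

The point of the comparison is that the same week, measured against the same kind of benchmark, can look like a catastrophe in one jurisdiction and like an ordinary week in another.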

For a different perspective on the relative plausibility of the CDC’s mean and upper-bound, consider Figure 4.

ggplot(data = filter(GA_CDCvDPH, `MMWR Week` %in% c(1:52)), aes(x = `MMWR Week`, y = Deaths/10600, group = Source, color = Source)) +
  geom_line() + 
  scale_x_discrete(breaks = seq(0,52,2)) +
  scale_y_continuous(breaks = seq(-.10,.10,.01)) +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 7, angle = 0, hjust = 0), plot.caption = element_text(size = 7), plot.tag = element_text(size = 9), plot.margin = margin(.3, .3, .3, .3, "cm")) +
  labs(title = "Georgia Covid-19 vs. CDC upper-bound excess (DPT)", subtitle = "2020 MMWR Weeks 1-52", y = "Deaths Per Thousand", x = "MMWR Week", tag = "Fig. 4", caption = "Sources: CDC; Georgia DPH; www2.census.gov \n 'GADPH' = Deaths reported by GA DPH as Covid-19 deaths \n 'CDC' = Excess of deaths from all causes over CDC upper-bound expected deaths")

Figure 4 shows a close correlation, in weeks 5-48, between Georgia’s reported Covid-19 deaths (classified as such based primarily on positive PCR test results) and the CDC’s modeled, low-end excess deaths estimate (based on the CDC’s upper-bound expected deaths). If we assume the accuracy of Georgia’s PCR-identified Covid-19 deaths and that all excess deaths were caused by Covid-19,14 this correlation suggests that the CDC’s upper-bound expectation is much closer to reality than the mean shown on the CDC dashboard. This points to net excess deaths (from all causes, for the 9-30 year) centered on 19K and 60K for the U.S. without and with Tri-state, respectively.


# Without Tri-state
filter(Ex20wt, Code %nin% c("US", "NYC", "NJ", "NY", "CT")) %>% 
  group_by() %>% 
summarise(Up_Ex_noTri = sum(SmallEx),
          Up_Ex_noTri_DPT = sum(SmallEx)/328e3)

# With Tri-state
filter(Ex20wt, Code %nin% c("US")) %>% 
  group_by() %>% 
summarise(Up_Ex_All = sum(SmallEx),
          Up_Ex_All_DPT = sum(SmallEx)/328e3)

In summary, the inherent uncertainty of these estimates demands more clarity and transparency. We need the full distribution of probabilities over plausible values and the assumptions that support them. We also need an open debate on what they all mean. Modern data science offers multiple tools to make this happen. One of the best is Bayesian networks, about which I have previously written in the Covid-19 context. When so much is at stake for so many, there is every reason to treat uncertainty with the highest degree of data professionalism and transparency. Most importantly, those who make this fateful decision should be held fully accountable, in every sense of the term, to the stakeholders affected by it. This is corporate governance 101.

C. Loss function

Whatever the model, the user must decide how many deaths are too many. What number of deaths requires policy interventions like business closures, social distancing, masks, or vaccination? The CDC’s own excess deaths model points to ten deaths per state per week as the lower bound. The Technical Notes state that weekly counts “between 1 and 9 are suppressed.”15 In other words, nine or fewer deaths are not tracked at all by the CDC. This suggests that the CDC believes that fewer than ten deaths per state per week should not impact public policy. Another benchmark might be the 1918-19 flu pandemic, with an estimated toll of 6.6 DPT. Through September 30, 2020, the CDC’s own 200K gross, mean-based estimate equated to only 0.60 DPT. After netting, the upper-bound based estimate was 0.18 (0.06) with (without) Tri-state.
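The conversion from death counts to deaths per thousand is simple enough to sketch in base R. The helper `dpt()` is a hypothetical name of ours, the 328 million population is the denominator used in the code above, and the death counts are the rounded figures quoted in the text, so results may differ from the text in the last digit.

```r
# U.S. population in thousands (~328 million), as used elsewhere in this article
us_pop_thousands <- 328e3

# Hypothetical helper: convert a death count to deaths per thousand (DPT)
dpt <- function(deaths, pop_thousands = us_pop_thousands) {
  deaths / pop_thousands
}

# Rounded 2020 estimates quoted in the text
est_2020 <- c(gross_mean        = 200e3,  # gross, mean-based
              net_upper_tri     = 60e3,   # net, upper-bound, with Tri-state
              net_upper_no_tri  = 19e3)   # net, upper-bound, without Tri-state
dpt_2020 <- dpt(est_2020)

# 1918 flu: 675,000 deaths among 103 million Americans
dpt_1918 <- dpt(675e3, pop_thousands = 103e3)

print(round(dpt_2020, 2))
print(round(dpt_1918, 1))
```

Putting both eras on the same per-thousand scale is what allows the 1918 benchmark and the 2020 estimates to be compared directly.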

This is not to say that ten is certainly the right threshold but merely that there is a number, ten or higher per week per state, below which government intervention is overkill. The threshold is a value judgment related to an equally judgmental loss function that should balance or “net” costs and benefits.16 Isolating a nursing home patient may prevent a Covid-19 infection at the expense of inducing severe depression or suicide. Shuttering restaurants and concerts may reduce Covid-19 deaths while killing restaurant operators and musicians by starvation and dramatically reducing the quality of life for millions. Oxford epidemiologist Sunetra Gupta calls this a violation of the “social contract:”17

Why don’t we do this for the flu? What’s the contract here? … We want schools to remain open, we want people to flourish, we want inequality not to get any worse. We want the arts… What are we alive for? We put all of these things on a plane … and we say, ‘Okay, we’re going to tolerate this level of disease in the population. It will kill some people. We will try and protect them.’ We just do our best within that context, within those boundaries to try and make it all work. I think that’s the social contract … and that’s what’s being ignored.

How do 0.06 to 0.85 deaths per thousand fit into the social contract? Where do you draw the line?

D. Algorithm

The CDC uses the Farrington algorithm to model expected deaths but does not identify alternatives or compare Farrington to them.18 Schumacher et al, cited in the CDC’s Technical Notes, say that Farrington “only performs one time-point detection” of outbreaks.19 In order to fully and fairly evaluate the model, we need to see the code. The CDC should publish its R code and explain why the Farrington method is suitable at all and why it is better than alternatives.

E. Training Data

Like all models, the CDC’s excess deaths model was “trained” on historical data. However, it’s unclear which data the CDC used. The Technical Notes say both “2013-present” and “2015-2019” but provide no link to the data itself. We need access to the data. We also need access to data going back at least to 1918 to cross-check the appropriateness of limiting the model’s historical knowledge to just the last five years.

III. Data & Code

The Part IV analysis20 – primarily in the form of visualizations and commentary – focuses on three bodies of data: 1. U.S. mortality from all causes by age, time, and place (NCHS),21 2. Excess Deaths Associated with Covid-19 (NCHS),22 and 3. a Covid-19 tracking spreadsheet produced by the Georgia Department of Public Health. Population denominators for per-thousand figures and, for cross-validation, annual deaths, are from the U.S. Census.23

The code used to wrangle and transform the data appears in the Appendix. All coding was done in RStudio using tidyverse24 and other R packages identified in the first code chunk. The code can be hidden or revealed by toggling the Code button in the top right corner of this html document and before each code chunk. The full RStudio .Rmd file can also be downloaded through the same button. I provide the code, subject to the MIT open source license, to encourage others to extend the analysis.

IV. Analysis

In Part IV, we’ll look more closely at deaths by time period and jurisdiction using simple visualizations. New York City (NYC) is treated as a separate state, so New York state (NY) excludes NYC.

Parts IV.A-B deal with total deaths per thousand, without regard to modeling the excess. Part IV.C digs directly into the CDC excess deaths model and data table. In Parts A-C, we see that 2020 mortality in the U.S. varied by state, region, and time period mostly inside narrow bands, with the Tri-state area being the far-out exception. In most places, in most weeks, 2020 deaths from all causes differed from 2014-2019 by less than 0.10 DPT. Within this range, several regional patterns are distinguishable, with the northeastern United States clearly standing out in terms of maximum weekly DPT. The table below sorts 2020 reported weekly DPT (which the table calls “DeathsPM”) from high to low. Only four of the top 50 weekly scores – in MS, LA, and TN – are located south of Washington, D.C. (38 deg. north latitude).

filter(pcap_long, Cause == "All" & !Juris == "United States" & Year == "2020") %>%
  select(c(1,4,7,10)) %>% 
  arrange(-DeathsPM)