Updated: Sep 16, 2020
Understanding Covid by looking at Excess Deaths
I've been writing a series of articles that explain common elements of Covid models and how various characteristics impact Covid outcomes. In spite of my best efforts, there continues to be confusion around the true cause of death. Is Covid really that bad? Isn't it 'just like the flu?' Were pre-existing conditions the reason the individual died?
I believe people also struggle with the 'low' numbers and probabilities. You often see statements on social media that 98% of the people getting Covid survive, so what's the big deal? As humans, we generally struggle with scale. It's easy to appreciate the impact when someone close to you dies from Covid, but without a human connection, they're just numbers. When given without context, they can seem less important.
If I told you that a disease infected one out of ten people that you came into contact with and nearly 98% of those infected survived, it probably wouldn't seem like much to worry about. If I told you a disease killed 675,000 Americans in one year and lowered life expectancy by a decade, it would probably get your attention. Both examples come from the Influenza epidemic of 1918. The percentages on their own don't seem like much, but when scaled across the entire nation, they add up to a very big deal. Covid percentages and probabilities might not sound too worrisome, but when scaled across the country, they become quite concerning.
We're going to try to understand the scale and risk of Covid by examining something called 'excess deaths.' Think of expected deaths as the range of average deaths over the past few years. There's a bit more to it than that, but it's enough to understand the concept. Deaths above this range are 'excess deaths.' We're effectively looking for deaths that are outside 'normal' for a typical year.
The CDC publishes a dataset of expected deaths and actual deaths by week and state going back to January 2017. We can use this data to do some exploratory data analysis (EDA) and determine how much, if any, Covid-19 has added to deaths in the United States.
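As a rough illustration of the idea, excess deaths for a week are simply the observed count minus a baseline, either the average expected count or the upper bound of the normal range. The sketch below uses made-up numbers and illustrative field names (`observed`, `expected`, `upper_bound`), not the CDC's actual column names or figures. Counting weeks below the baseline as zero is an assumption of this sketch.

```python
from typing import NamedTuple


class Week(NamedTuple):
    week_ending: str   # ISO date of the week's last day
    observed: int      # deaths actually reported for the week
    expected: int      # average expected count for the week
    upper_bound: int   # upper edge of the normal range for the week


def excess(observed: int, baseline: int) -> int:
    """Deaths above the baseline; weeks at or below it count as zero (an assumption)."""
    return max(0, observed - baseline)


# Made-up numbers for illustration only, NOT CDC figures.
weeks = [
    Week("2020-03-07", 58_000, 57_000, 59_000),
    Week("2020-04-11", 79_000, 56_000, 58_000),
    Week("2020-05-09", 66_000, 54_000, 56_000),
]

# Two estimates: one against the average, one against the top of the range.
excess_vs_expected = sum(excess(w.observed, w.expected) for w in weeks)
excess_vs_upper = sum(excess(w.observed, w.upper_bound) for w in weeks)

print(f"Excess over expected count: {excess_vs_expected:,}")
print(f"Excess over upper bound:    {excess_vs_upper:,}")
```

Measuring against the upper bound always gives the smaller, more conservative number, because it only counts deaths that fall outside the normal range altogether.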
To solve this problem, we're going to apply a simplified version of the scientific method. I have a hypothesis: Covid-19 has led to a higher than normal death rate. So we can grab some data and start to prove that hypothesis, right? Nope. The scientific method is about reducing bias. In order to reduce bias, we must try to prove the negative of our hypothesis. This is called the null hypothesis. For our example, the null hypothesis is that Covid-19 hasn't led to more deaths. We are trying to prove that deaths since Covid are within the normal range of expected deaths. If we succeed, then we accept the null hypothesis (strictly speaking, we fail to reject it) and set aside the hypothesis that Covid has led to higher than normal deaths. If we fail, then we reject the null hypothesis and accept that Covid has led to additional deaths.
That was a lot, so let's get started.
When looking at statistical data, you'll often hear references to the standard deviation. Standard deviation is just a fancy way of referring to the expected range. It's convenient, because it can define what we mean by 'normal'. Instead of standard deviation, let's refer to it as the range. For data that follows a bell curve, the first level (one standard deviation on either side of the average) encompasses 68% of all outcomes. That means that if I filled a lotto spinner full of numbered balls and started pulling them out, 68% of the numbers I pulled out would be in the first range. If I double the range (two standard deviations), then 95% of all the numbers I pulled out would fall inside the doubled range. If I triple the range, 99.7% of all the numbers would fall inside the tripled range. In most cases, anything outside of the double range (two standard deviations) is considered statistically significant. So, if I pull a number out, and it's not within the range defined by two standard deviations, then it's 'abnormal.' This is often referred to as the confidence interval. The CDC has conveniently calculated the upper bound values for us, so we can produce a graph of expected deaths, based on the weekly expected count and a 95% confidence interval. (We drop the last few weeks due to reporting issues.)
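The 68%, 95%, and 99.7% figures can be checked directly. This is a minimal sketch that assumes the data follows a normal (bell-shaped) distribution, in which case the share of outcomes within k standard deviations of the average is erf(k / sqrt(2)). The function name `coverage` is just an illustrative choice.

```python
import math


def coverage(k: float) -> float:
    """Share of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"{k} standard deviation(s): {coverage(k):.1%}")
# 1 standard deviation(s): 68.3%
# 2 standard deviation(s): 95.4%
# 3 standard deviation(s): 99.7%
```

The familiar '95%' is a rounding. Exactly 95% of a normal distribution falls within about 1.96 standard deviations, which is why two standard deviations is used as a convenient shorthand.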
The graph has a shaded area representing the normal range (two standard deviations). We expect the actual count to fall within the range, with an occasional exception. Think of the gray area as a river, and the line of actual deaths a boat traveling through the river. Boats belong in the river. If the boat leaves the river longer than a week or two, it's not 'normal.' It's what a statistician would call statistically significant.
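The 'boat leaving the river' idea can be sketched as a count of how many weeks in a row the actual deaths sit above the upper bound. The numbers below are invented (in thousands) purely to show the mechanics, and the function name `longest_run_above` is an illustrative choice, not something from the CDC data.

```python
def longest_run_above(observed, upper_bound):
    """Longest streak of consecutive weeks where observed deaths exceed the upper bound."""
    longest = current = 0
    for obs, bound in zip(observed, upper_bound):
        # Extend the streak while the boat is out of the river, otherwise reset it.
        current = current + 1 if obs > bound else 0
        longest = max(longest, current)
    return longest


# Invented weekly counts in thousands, NOT real data.
observed = [55, 57, 61, 62, 56, 54, 70, 75, 78, 74, 69]
upper_bound = [58] * len(observed)

print(longest_run_above(observed, upper_bound))  # 5
```

A streak of one or two weeks is the occasional exception we expect. A streak that keeps going is the signal worth testing.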
There's a flu epidemic that lasted about four weeks in January of 2018. The boat leaves the river, and then comes back in shortly after. Then there's the start of the Covid-19 epidemic around March of 2020. The boat doesn't just leave the river. It drives into town and starts hanging out at the bar. It's still there! We're going to apply statistical tests against these values, but the graph doesn't bode well for our null hypothesis.
Notice that deaths spike in winter and decline in summer. This is referred to as seasonality. Statisticians call these types of datasets 'non-stationary,' which just means that they have a trend or a pattern. Most data models are built on the assumption that data is random, so these types of patterns usually require special treatment. We're not building predictive models or doing a research project. We just want a high-level understanding of relationships and concepts, so we'll ignore the peaks and troughs for now.
Our null hypothesis is that Covid deaths are within the normal range. Clearly, that's not the case with our chart. But could the chart be a fluke? Is that odd aberration the mortality equivalent of a 100-year flood? We want to know if this type of spike could occur naturally, given enough years of data. If somebody flips enough coins, that person will occasionally get ten heads in a row. Could a person occasionally see this high of a death count, if they stuck around for, say, 10,000 years?
We can't wait 10,000 years to find out, but we can create a simulation of what would happen, should we live through 10,000 years of this (Heaven forbid). There are two ways we can do this. We can create a 'lotto bucket' of possible outcomes and randomly select from them 10,000 times. The alternative is to shift the mean, or the average. To shift the mean, we 'slide' the Covid data over the actual data, so that we can compare them, and then test to see if they could be the same.
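One common way to build the 'lotto bucket' is a permutation test: pour the pre-Covid and Covid-period values into one bucket, shuffle, deal them back out into two groups, and see how often the shuffled groups differ as much as the real ones. The sketch below assumes that reading of the approach. The data is synthetic and the function name `permutation_p_value` is illustrative.

```python
import random


def avg(values):
    return sum(values) / len(values)


def permutation_p_value(before, during, n_samples=10_000, seed=0):
    """Share of shuffles whose gap in average excess is at least the real gap."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    observed_gap = avg(during) - avg(before)
    bucket = list(before) + list(during)
    hits = 0
    for _ in range(n_samples):
        rng.shuffle(bucket)            # mix the two periods together
        fake_during = bucket[:len(during)]
        fake_before = bucket[len(during):]
        if avg(fake_during) - avg(fake_before) >= observed_gap:
            hits += 1
    return hits / n_samples


# Synthetic weekly excess deaths, NOT real data.
before = [-1500 + 100 * i for i in range(30)]   # hovers around zero
during = [5000 + 1000 * i for i in range(15)]   # far above normal

p_value = permutation_p_value(before, during)
print(f"p-value: {p_value}")
```

A p-value near zero says that random reshuffling almost never reproduces a gap as large as the real one, which is evidence against the null hypothesis.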
In the graphic, the solid blue dots represent excess deaths during the Covid period. The solid red dots represent excess deaths before Covid. The shaded dots (which look purple when they overlap) are 50 of our 10,000 samples. The important thing to notice is that none of the solid blue dots overlap with the shaded samples. The lack of overlap is telling us that the higher death counts that occur during Covid never occur in the random samples. If they did, they would overlap. The p-value of the simulation, which tells us what percentage of deaths during Covid occur in normal times, is zero. The shifted means test comes out the same and also has a p-value of zero. The death counts we see during Covid never occur in our 10,000 simulations. Based on the results of these two tests, we have to reject the null hypothesis that there is no difference between deaths in the past and death counts during Covid. We must accept the original hypothesis that the deaths during Covid are higher than normal.
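The shifted means test can be sketched in a similar way. One common version, assumed here, slides both groups so they share the same average (which is what the null hypothesis claims), then repeatedly resamples each group with replacement and checks how often the resampled gap is as large as the real one. The data is synthetic and the function name `shifted_mean_p_value` is illustrative.

```python
import random


def avg(values):
    return sum(values) / len(values)


def shifted_mean_p_value(before, during, n_samples=10_000, seed=0):
    """Share of resamples, drawn from mean-shifted data, whose gap reaches the real gap."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    observed_gap = avg(during) - avg(before)
    shared_mean = avg(list(before) + list(during))
    # Slide each group so both are centered on the shared mean.
    before_shifted = [x - avg(before) + shared_mean for x in before]
    during_shifted = [x - avg(during) + shared_mean for x in during]
    hits = 0
    for _ in range(n_samples):
        # Draw with replacement, keeping each group's original size.
        b = rng.choices(before_shifted, k=len(before_shifted))
        d = rng.choices(during_shifted, k=len(during_shifted))
        if avg(d) - avg(b) >= observed_gap:
            hits += 1
    return hits / n_samples


# Synthetic weekly excess deaths, NOT real data.
before = [-1500 + 100 * i for i in range(30)]
during = [5000 + 1000 * i for i in range(15)]

p_value = shifted_mean_p_value(before, during)
print(f"p-value: {p_value}")
```

With 10,000 samples, a reported p-value of zero really means 'less than 1 in 10,000.' The simulation cannot resolve anything smaller.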
Just because two things happen at the same time doesn't mean that one causes the other. For example, when ice cream sales increase, shark attacks increase. This doesn't mean that ice cream makes swimmers taste better to sharks. It likely means that more people eat ice cream on hot days, and more people swim on hot days. More swimmers mean more opportunities for shark attacks. Eating ice cream doesn't cause shark attacks.
Can we test to see if Covid is the cause of the extra deaths? Unfortunately, there isn't a specific test for causality. Instead, scientists rely on basic principles. First, is there a strong relationship? For purposes of this article, we're going to accept the strong relationship. As the graph below shows, Covid deaths increase shortly after Covid cases increase, and deaths trend downward shortly after cases trend downward. (The larger gap is likely due to improved testing. In the early days of the virus, only the very sick could be tested.) There are tests that can be done to verify this, but they're beyond the scope of this article. Also, there aren't other explanations. Pre-existing conditions existed in prior years too, so if they alone were driving the excess, we would have seen similar spikes before. There's nothing else happening that can account for hundreds of thousands of excess deaths, and the problem is global. Since the problem is global, it also meets the requirement of consistency: different locations must produce similar effects.
The relationship also needs to be temporal, meaning that the cause must precede the effect. We can accept this, since the chart shows case increases preceding deaths. Likewise, the reverse must be true. When cases go down, deaths must go down. As the chart shows, deaths go down shortly after cases go down. Other requirements include coherence with known facts, biological plausibility, and higher exposure leading to a higher proportion of people being affected. I think we can accept these criteria as well. Obviously, in a real study, these requirements would be tested. In the interest of wrapping up the article, though, I'll leave that as an exercise for the reader. Since Covid meets the principles of causality, and there aren't any alternative explanations, we have to accept that Covid-19 is causing the extra deaths. (By the way, the chart multiplies deaths by 10, so you can compare them side-by-side. Without multiplying them by 10, the Covid deaths would just be a straight line along the bottom.)
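For readers who want to try the exercise, one rough way to check the 'cases first, deaths later' pattern is to slide the deaths series back by a few weeks and see which delay lines it up best with cases. The sketch below uses invented weekly numbers in which deaths are built to trail cases by three weeks, so it only demonstrates the mechanics. The function names `pearson` and `best_lag` are illustrative, and a strong lagged correlation supports the temporal requirement without proving causality on its own.

```python
def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_lag(cases, deaths, max_lag):
    """Delay (in weeks) at which deaths track earlier cases most closely."""
    scores = {}
    for lag in range(max_lag + 1):
        # Pair cases in week t with deaths in week t + lag.
        scores[lag] = pearson(cases[:len(cases) - lag], deaths[lag:])
    return max(scores, key=scores.get), scores


# Invented weekly case counts (thousands), NOT real data.
cases = [10, 20, 40, 80, 120, 150, 130, 100, 70, 50,
         40, 60, 90, 130, 160, 140, 110, 80, 60, 45]
# Deaths constructed to follow cases by 3 weeks at a 2% rate.
deaths = [0.2, 0.3, 0.4] + [0.02 * c for c in cases[:-3]]

lag, scores = best_lag(cases, deaths, max_lag=6)
print(f"Deaths line up best with cases from {lag} weeks earlier")
```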
Subtracting the average expected count from the observed number, there are 225,995 more deaths than normal between the beginning of March and the first week of August. If we use the upper bound, there are 179,098 more deaths. The official death count on August 9th was 162,425. Comparing death counts, we can estimate with confidence that from 179,000 to 225,000 deaths are directly attributable to Covid. Our analysis strongly suggests that official counts underestimate actual Covid deaths by roughly 17,000 to 64,000 people.
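The comparison above is plain subtraction. This short sketch uses only the three figures quoted in the paragraph and prints the exact gaps, which the text rounds. The variable names are illustrative.

```python
# Figures quoted in the article (March through the first week of August 2020).
excess_vs_expected = 225_995   # observed minus average expected count
excess_vs_upper = 179_098      # observed minus upper bound
official_count = 162_425       # official Covid death count on August 9th

# How far each excess-death estimate sits above the official count.
gap_low = excess_vs_upper - official_count
gap_high = excess_vs_expected - official_count

print(f"Low-end gap:  {gap_low:,}")    # Low-end gap:  16,673
print(f"High-end gap: {gap_high:,}")   # High-end gap: 63,570
```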
As the very first chart showed, Covid has taken a dramatic toll on our country. Other than a short-lived flu epidemic in 2018 that resulted in 15,000 to 20,000 deaths above normal, death counts are pretty consistent. Even though the percentage of people impacted by Covid or dying from Covid might seem small, the effects are dramatic. It is clearly not 'like the flu.' Flu deaths are included in the counts from prior years, and they don't show the spikes we see from Covid. Likewise, pre-existing health issues exist in the prior years, and they don't create massive spikes. The extra deaths cannot be accounted for by suggesting that pre-existing conditions caused the deaths. It might seem like health officials are overreacting or dramatizing the effects of Covid, but the data clearly shows that Covid is a serious health threat. Health officials are justified in urging caution and recommending safety measures.