Updated: Sep 8, 2020
There has been a post floating around social media claiming that the CDC quietly updated its COVID numbers and admitted that only 6% of all recorded deaths are due to COVID. The post suggests that COVID deaths involving comorbidities are not truly caused by COVID. It also carries a conspiratorial tone, suggesting the CDC changed the data in secret. Even the President retweeted this post, and Twitter, correctly, removed it as false information. Eventually, Dr. Anthony Fauci weighed in to debunk it.
There are a lot of things wrong with this post. It contains false information; it misleads people about what these numbers actually mean; and it implies that there is some conspiracy to hide this information, even though the data has been available for months.
Within my own social media circle, the post has generated a lot of discussion and questions about how to interpret the data. I thought it would be helpful to walk through building a model that examines the impact of medical conditions on COVID outcomes. It’s a way to educate others about influential variables and how they are used in the Data Science field.
I don’t have case-by-case data for the individual comorbidities, but I did find case-by-case data that includes medical conditions as one of its variables. It also lists sex, race, and age, so we can work through those as well. To keep things simple, I eliminated records where data was missing or unknown. I also limited the data to positive test results, ignoring probable cases. We’re trying to understand the relationship between the variables and how they predict outcomes, and missing or unknown data would just complicate the interpretation.
This will take some time, so I’ll publish it as a series of articles. The dataset I’m using comes from the CDC. I’ll post the code on GitHub once the articles are completed.
What’s a data science article without a Princess Bride reference?
When I first thought about this project, I was thinking about linear regression, which can tell you which variables contribute to an outcome and how much of that outcome they explain. For example, I can use linear regression to predict the value of a home based on number of rooms, square footage, amenities, etc. However, the data that we’re using only has two outcomes: dead or not dead. There isn’t an almost dead, somewhat dead, or mostly dead all day, so we have to use a different model. The model we’ll use is called logistic regression. It expresses its results in terms of odds, so we need to define some terms.
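To make the "two outcomes" point concrete, here is a minimal sketch of the logistic function that logistic regression is built on. It takes any real number (the natural log of the odds) and squeezes it into a probability between 0 and 1, which is why it suits a dead-or-not-dead outcome. The function name and the input values are my own illustration; they are not coefficients fitted to the CDC data.

```python
import math


def logistic(log_odds: float) -> float:
    """Map a log-odds value (any real number) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Illustrative inputs only; nothing here is estimated from the dataset.
for z in (-4, -2, 0, 2, 4):
    print(f"log-odds {z:+d} -> probability {logistic(z):.3f}")
```

A log-odds of 0 corresponds to even odds, so it maps to a probability of exactly 0.5; large negative or positive values approach 0 and 1 without ever reaching them.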
There are odds, odds ratios (OR), and probabilities. These concepts are related, and if you know one you can figure out the others, but people often get them confused. Let’s say someone tells you that the odds of winning are 3 to 2 (3:2). This means that for every five outcomes, you should expect three wins and two losses (WWWLL). Put another way, for every three wins we can expect two losses. Odds can also be written as a single number by dividing the two: 3/2 = 1.5, so the odds are 1.5 to 1. This is just a fraction. Like any fraction, if I multiply the top and the bottom by the same number, it doesn’t change the value. So, if I multiply 1.5/1 by 2/2, I get back to 3/2. An odds ratio, strictly speaking, is one set of odds divided by another, such as the odds of an outcome in one group divided by the odds of the same outcome in a different group.
Probability is the likelihood that an event occurs considering all the outcomes. In this example, there are five outcomes (WWWLL). Three outcomes are wins. The probability, then, is 3/5, or 60%. You can expect to win 60% of the time and lose 40% of the time. If you’re curious, you can get from probability to odds by dividing the probability by one minus the probability. So, 0.60/(1 - 0.60) is 0.6/0.4, which is 1.5. To go back from odds to probability, I divide the odds by one plus the odds. So, 1.5/(1 + 1.5) = 1.5/2.5 = 0.60. This WikiHow article provides a good visual explanation. The logistic regression model provides answers in terms of odds, so you need to understand odds and probabilities to interpret the results.
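The two conversions described above can be sketched in a few lines of Python. The function names are my own, chosen for illustration, and I use exact fractions so the 3:2 example round-trips without floating-point noise.

```python
from fractions import Fraction


def odds_from_probability(p):
    """Odds = probability / (1 - probability)."""
    return p / (1 - p)


def probability_from_odds(odds):
    """Probability = odds / (1 + odds)."""
    return odds / (1 + odds)


# The 3-to-2 example: three wins and two losses out of five outcomes.
wins, losses = 3, 2
odds = Fraction(wins, losses)                # 3/2
probability = Fraction(wins, wins + losses)  # 3/5

print(float(odds))         # 1.5
print(float(probability))  # 0.6

# Each conversion recovers the other value exactly.
print(odds_from_probability(probability) == odds)  # True
print(probability_from_odds(odds) == probability)  # True
```

The same two functions work with ordinary floats, for example `odds_from_probability(0.6)`, with the usual small rounding error.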
Odds are wins to losses. Probability is wins divided by all events.
As an example from the dataset, after clearing out the missing and unknown values, we are left with 160,517 records. Of these, 22,034 resulted in death, so 138,483 people survived. For this dataset, the probability of dying is the number of deaths divided by total positive tests: 22,034/160,517, which is 13.73%. We are not calculating mortality rates. This probability only applies to this data. We excluded a lot of records that contained incomplete information or lacked a positive test result. We’re working towards understanding the relationships between variables, so don’t get hung up on whether or not this probability matches the probabilities for total cases seen in Worldometers or some other source. We’re solving a different problem.
Now that we know the probability, we can calculate the odds. Using our formula, we end up with odds of 0.159 to 1. So, for every 0.159 people who pass away, 1 person survives. This is not a particularly helpful way of thinking about things, since most of us don’t know any partial people. Remember, though, these are just fractions. I can multiply the top and bottom by 1,000 and not change the fraction. Now the odds are 159 to 1,000. For every 159 people in this dataset who die, 1,000 people live. That’s a little bit easier to wrap our collective heads around. These are the odds for one specific dataset. What we’re working towards is an equation that predicts the likelihood of passing away and explains which variables have an impact and how much impact they have. The probabilities and odds we calculate with the model won’t match these numbers exactly, but the two should be broadly consistent, which gives us a sanity check.
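For anyone who wants to check the arithmetic, here is a short sketch that reproduces the probability and odds from the record counts quoted in this article. The counts come from the text; the variable names are my own, and this is not the project code.

```python
# Record counts quoted in the article, after removing missing/unknown values.
total_records = 160_517
deaths = 22_034
survived = total_records - deaths  # 138,483

# Probability: deaths divided by all positive tests.
probability = deaths / total_records

# Odds: probability divided by (1 - probability),
# which is the same as deaths divided by survivors.
odds = probability / (1 - probability)

print(f"Probability of dying: {probability:.2%}")           # 13.73%
print(f"Odds of dying: {odds:.3f} to 1")                    # 0.159 to 1
print(f"Scaled odds: {round(odds * 1000)} to 1000")         # 159 to 1000
```

Dividing deaths directly by survivors (22,034/138,483) gives the same odds, which is a quick way to confirm the formula.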
I’ll stop here and devote the next article to adding medical conditions into the model. We’ll examine how existing medical conditions increase or decrease the likelihood of dying from COVID and what that means in reference to the comorbidity tweet.