Predicting the spread of COVID-19 Coronavirus in the US (live updates)

Published in

Analytics Vidhya

9 min read · Mar 18, 2020

UPDATE 3/27/2020

After 3 days of quite accurate predictions, everything went wrong yesterday: a complete divergence from the model, caused mostly by a surge in cases reported by NY State. Interestingly, the 4-day prediction from 3/22 was much more accurate. Also, for a while I’ve said we’ll pass 80k cases on March 26.

At this point I again have low confidence in the current formula and will need to wait a few days to see where it’s going, but just in case, here are the upcoming predictions for the US:

03/27: 97,470
03/28: 111,968
03/29: 125,293
Peak day 3/26, Max cases 178,622

UPDATE 3/26/2020

Things are looking way too good from a statistical point of view now. On March 25, total US confirmed cases were 65,778, which is 1.8% more than I predicted the day before, and 2.9% less than the 2-day prediction. The model points to a peak day sometime between the 24th and 25th, so I’m still waiting to see if the number of new cases starts decreasing.

Compared to yesterday, the estimates are just a bit higher but the logistic function equation doesn’t change much. Here are today’s predictions:

03/26: 76,503
03/27: 86,392
03/28: 94,812
Peak Day 3/24, Max cases 120,621

UPDATE 3/25/2020

Again a pretty small error yesterday: 53,740 cases vs. the 55,285 predicted (-2.8%). I’m getting confident that the logistic function parameters are correct (especially as the error goes down), and that there won’t be any more surprises from expanded testing, as it appears the US has been testing pretty well. Data coming in slightly under the predicted values means peak day might be sooner, as early as today or tomorrow, and then the curve starts flattening (as long as we keep fighting the virus; that is what Verhulst’s theory is about).

Today’s model shows we’re right at peak day for new cases reported (though the exact peak is hard to define and it could be plus/minus 2 days), and the forecast for the next few days is:

03/25: 64,602 (real: 65,778, error: +1.8%)
03/26: 74,668
03/27: 83,536
Peak Day 3/24, Max cases 111,967

It’s very important to see what happens the next couple of days, and whether the predictive model still holds. If it does, then it’s very good news.

UPDATE 3/24/2020

Well, the one day when I said I’m not confident in the model prediction, it was right on point! 43,667 confirmed cases in America, for a tiny error of just 0.4%! Looking back at the 3 possible scenarios I mentioned yesterday, this points to #2: The growth curve is close to peaking and will start to flatten a lot sooner. Right now we’re into a very good pattern and the function’s error is decreasing significantly.

Therefore the current predictions are:

03/24: 55,285 (real: 53,740, error: -2.8%)
03/25: 67,760 (real: 65,778, error: -2.9%) — predicted peak day
03/26: 80,334 (real: 83,836, error: +4.3%)
Peak Day 3/25, Max 138,493

If this holds for another day or two, it means we’re on our way out of this. True, the total number of US cases will more than double but the rate of growth will start fading away after March 25. Peaking on Day 35 is also consistent with what happened in other countries. This doesn’t mean this weekend we should run out and hug everybody. We have to keep doing what we’ve been doing until the sigmoid function makes that hard turn towards flattening which, in this case, means until around April 10th.
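The "more than double" remark follows from the symmetry of the logistic curve: on the peak day the cumulative total is exactly half of the maximum. Here is a minimal sketch of that property, using the Max value and Day 35 from this update; the steepness value is made up for illustration, since the fitted one isn't listed here.

```python
import numpy as np

def f_sigmoid(x, a, b, c):
    # a = midpoint (peak day), b = steepness, c = max value
    return c / (1 + np.exp(-b * (x - a)))

c = 138_493   # "Max" from this update's forecast
a = 35.0      # peak on Day 35, as mentioned above
b = 0.3       # hypothetical steepness, for illustration only

total_at_peak = f_sigmoid(a, a, b, c)
print(total_at_peak)       # half of c, whatever value b takes
print(c / total_at_peak)   # 2.0: the total doubles after the peak day
```

That half-of-max value, about 69,000, is in the same ballpark as the 67,760 predicted for 3/25.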

UPDATE 3/23/2020

The number of cases came in suspiciously lower than the prediction, 6.7% less, so there are 3 possible explanations:
1. Testing kinda slacked on the weekend and that’s why there are fewer reported cases.
2. The growth curve is starting to flatten and soon new cases will start going down.
3. Growth is still in its early stages and will accelerate soon.
I’m not making any confident predictions today, just waiting for tomorrow’s update. For the record, the number the model shows is 43,870.

UPDATE 3/22/2020

First of all, thank you Analytics Vidhya for picking up my story! Now… Yesterday I had to make a few changes to the bounding parameters of the SciPy curve fitting function, and got a much more accurate prediction! The estimate for March 21 was well within my desired range (of +/- 5%). The US had 25,493 confirmed cases versus my prediction of 26,505, which is 3.8% less. The model now predicts slightly lower values since we came in under the estimate yesterday. The total number of cases stays about the same but peak day is pushed to March 29.

03/22: 35,676
03/23: 49,013
03/24: 66,561
Peak day 3/29, Max cases 463,772

UPDATE 3/21/2020

I couldn’t make any confident US predictions for a few days because the data was not very consistent (mostly caused by increased testing but also a few data reporting irregularities). By now I think it’s in much better shape. I’m also using a different feed for the Johns Hopkins data, which is more frequently updated. So with data through March 20, confirmed cases predictions for US are:

03/21: 26,505 (real: 25,493, error -3.8%)
03/22: 37,068
03/23: 51,348
Peak day 3/28, Max cases 464,773

Based on increased testing, I wouldn’t be surprised if the actual number were 2–5% higher than the model predicts.

UPDATE 3/18/2020

After the first predictions came out quite accurate (within 2% of the ground truth), increased testing in the US shifted the confirmed number of cases and affected prior predictions. I’ll need a few days so the skewed data forms a new pattern, but using data up to 3/17/2020, the model predicts the following:

Wednesday, March 18: 8,271 confirmed cases
Thursday, March 19: 10,612 cases

Data Overview

I’m going to try to make this short and leave room for daily updates as data keeps coming in. It all started with me looking at the COVID-19 charts put together by Worldometer. The data is still very limited, as the epidemic only started about 2 months ago. The first 2 countries that seem to have gone through the entire evolution cycle are China and South Korea, and they are the only two datasets that can be considered training data. That’s not even enough for a machine learning algorithm, only for plain old visual reasoning. Here are their raw data for daily new cases:

See the pattern? Keep in mind that China has that Feb 13 anomaly when previously unreported data from one region were all dumped the same day so that spike would have to be distributed over the previous days.
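One way to distribute such a spike over the previous days is sketched below. This is only an illustration with made-up numbers, not real China data, and the helper name `spread_spike` and the proportional rule are my assumptions; the logistic fit used later in this story doesn't need this step.

```python
import numpy as np

def spread_spike(daily, spike_idx, window):
    """Spread the excess of a one-day reporting dump over the previous `window` days."""
    daily = np.asarray(daily, dtype=float).copy()
    prev = daily[spike_idx - window:spike_idx].copy()
    # Treat the average of the preceding days as the spike day's "normal" level
    baseline = prev.mean()
    excess = daily[spike_idx] - baseline
    # Hand the excess back in proportion to what each previous day reported
    daily[spike_idx - window:spike_idx] += excess * prev / prev.sum()
    daily[spike_idx] = baseline
    return daily

# Made-up daily new cases with a data dump on the last day
raw = [2000, 2500, 3000, 2500, 15000]
smoothed = spread_spike(raw, spike_idx=4, window=4)
print(smoothed)
print(smoothed.sum() == sum(raw))  # the overall total is preserved
```

The total stays at 25,000 cases; only the day on which they are counted changes.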

Fitting the Curve on the Logistic Function

But it all looked to me as if the total could be approximated with the logistic function, a sigmoid function used to distribute probabilities in logistic regression. The function is defined by this formula:
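Written with the same parameter names as the f_sigmoid function in the code further down (a for the midpoint, b for the steepness, c for the maximum value), the logistic function is:

```latex
f(x) = \frac{c}{1 + e^{-b\,(x - a)}}
```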

So then I tried to fit it for the total cases of Coronavirus in China and South Korea. I used the dataset available on Kaggle, which is sourced from the official data published by Johns Hopkins University. And guess what? It’s almost a perfect fit!

As you can see, it even smooths over that China anomaly. Why does it fit so well? It’s because the logistic function is a common model of population growth, originally formulated by Pierre-François Verhulst in 1838 to describe the self-limiting growth of a biological population. The virus IS a biological population, and it’s limited by the human body’s reaction to it, and the community’s fight against the epidemic. In other words, the illness runs its course.
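To make the link to Verhulst concrete: his model says a population grows at a rate proportional both to its current size and to the room it has left, dN/dt = r·N·(1 − N/K). The sketch below uses made-up values for r, K and the starting size. It integrates that equation in small steps and compares the result with the closed-form sigmoid.

```python
import numpy as np

r, K, N0 = 0.3, 100_000.0, 10.0   # hypothetical growth rate, cap, and initial cases
days = 80
dt = 0.01                          # small Euler step, in days

# Step-by-step integration of dN/dt = r * N * (1 - N / K)
N = N0
steps_per_day = int(round(1 / dt))
numeric = [N0]
for _ in range(days):
    for _ in range(steps_per_day):
        N += dt * r * N * (1 - N / K)
    numeric.append(N)
numeric = np.array(numeric)

# Closed-form solution: a sigmoid with max K, steepness r, midpoint ln(K/N0 - 1) / r
t = np.arange(days + 1)
midpoint = np.log(K / N0 - 1) / r
closed_form = K / (1 + np.exp(-r * (t - midpoint)))

max_rel_diff = np.max(np.abs(numeric - closed_form) / closed_form)
print("midpoint day: %.1f, max relative difference: %.4f" % (midpoint, max_rel_diff))
```

The two curves should agree to within a few percent; the remaining gap comes from the crude step-by-step integration, not from the model.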

Initial Predictions on March 11, 2020

I first took a stab at this on March 11th, and here are the predictions I got for US cases. See how well it fit the Covid19 data up to that point:

I’ll consider this Prediction Zero, the first time I ran the program. Was it right? Wrong? Here’s how those first USA Coronavirus predictions turned out. The first number is the program’s prediction, the second is the real number, followed by the error:

3/12: 1639 | 1663(+1.4%)
3/13: 2129 | 2179(+2.3%)
3/14: 2731 | 2726(-0.2%)
3/15: 3447 | 3499(+1.5%)
Peak day: 3/18, Max 12809
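The error column is the signed difference between the real count and the prediction, as a percentage. The helper below is an illustration only: `pct_error` is a name I made up, and dividing by the real count is my guess at how these figures were computed.

```python
def pct_error(predicted, real):
    """Signed difference between real and predicted, as a percentage of the real count."""
    return round((real - predicted) / real * 100, 1)

# (predicted, real) pairs from the list above
pairs = {
    "3/12": (1639, 1663),
    "3/13": (2129, 2179),
    "3/14": (2731, 2726),
    "3/15": (3447, 3499),
}
for day, (predicted, real) in pairs.items():
    print("%s: %+.1f%%" % (day, pct_error(predicted, real)))
```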

Now there are big issues with this simulation. The dataset for America is still very small. A lot of people haven’t been tested so the number of confirmed cases might spike once testing gets more widespread. That would cause an anomaly similar to the Feb 13 one in China. But the logistic function will still apply to it, only it will be pushed higher and longer in time.

That’s why I’ll keep running this every day and post the results here, so we can track the accuracy and the estimated shape of the curve. If you want to run it at home, here’s my code snippet in Python, using Matplotlib for graphics and SciPy for curve fitting. I recommend putting this into a Jupyter Notebook as I did, for faster and easier visualizations.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from scipy.optimize import curve_fit
from datetime import datetime, timedelta

data=pd.read_csv("covid_19_data.csv")
data=data.drop('Last Update', axis=1)
data=data.drop("SNo",axis=1)
data=data.rename(columns={"ObservationDate": "date", "Country/Region": "country", "Province/State": "state","Confirmed":"confirm","Deaths": "death","Recovered":"recover"})

def plot_predict(country, stat, future_days):
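    def avg_err(pcov):
        # mean standard error of the fitted parameters (square roots of the covariance diagonal)
        return np.round(np.sqrt(np.diag(pcov)).mean(), 2)

    # logistic function to be fitted; curve_fit minimizes its squared error against the data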
    def f_sigmoid(x, a, b, c):
        # a = sigmoid midpoint
        # b = curve steepness (logistic growth)
        # c = max value
        return (c / (1 + np.exp(-b*(x-a))))
  
    inception = 0
    # hardcoding day 0 for several countries based on observations
    if country=="South Korea": inception = 8
    if country=="US": inception = 28
    if country=="Italy": inception = 20
    country_data = data[data["country"]==country].iloc[: , [0, 2, 3 ,4, 5]].copy()
    country_graph = country_data.groupby("date")[['confirm', 'death', 'recover']].sum().reset_index()[inception:]
    y = country_graph[stat]
    x = np.arange(len(y))
    
    # fitting the data on the logistic function
    popt_sig, pcov_sig = curve_fit(f_sigmoid, x, y, method='dogbox', bounds=([12., 0.001, y.mean()],[60., 2.5, 10*y.max()]))
    print(popt_sig)
    peakday = datetime.strftime(datetime.strptime(country_graph["date"][inception], "%m/%d/%Y")+timedelta(days=int(popt_sig[0])), "%m/%d/%Y")
    plt.figure(figsize=(16,8))
    
    x_m = np.arange(len(y)+future_days)
    y_m = f_sigmoid(x_m, *popt_sig)

    print("Predictions:")
    for i in range(1,5):
        pday = datetime.strftime(datetime.strptime(country_graph["date"][inception], "%m/%d/%Y")+timedelta(days=len(y)+i-1), "%m/%d/%Y")
        print("%s: %d" % (pday, y_m[len(y)+i-1]))
    #print(country_graph)
    
    # creating the matplotlib visualization
    plt.plot(x_m, y_m, c='k', marker="x", label="sigmoid | error: "+str(avg_err(pcov_sig))) 
    plt.text(x_m[-1]+.5, y_m[-1], str(int(y_m[-1])), size = 10)
    
    plt.plot(x, y, c='r', marker="o", label = stat)

    plt.xlabel("Days")
    plt.ylabel("Total Infected")
    plt.legend(prop={'size': 15})
    plt.title(country+"'s Data", size=15)
    plt.axvline(x[-1])
    plt.text(x[-1]-.5, y_m[-1], str(country_graph["date"][len(y)+inception-1]), size = 10)
    plt.axvline(int(popt_sig[0]))
    plt.text(int(popt_sig[0]), 1, "peak: day " + str(int(popt_sig[0])) + " (" + peakday + ")", size = 10)
    plt.show()

# See the results for different countries
#plot_predict("Mainland China", "confirm", 10)
#plot_predict("South Korea", "confirm", 10)
plot_predict("US", "confirm", 30)

The way it works is by minimizing the error of the standard sigmoid function, while using a few constraints learned from the China and South Korea data (defined in the bounds parameter of curve_fit). I also set the inception day of the epidemic in each country so the curve fits better (by inspecting the day-by-day cases to see where it starts growing).
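To see the bounded fit in isolation, here is a minimal sketch on synthetic data generated from known parameters, so the result can be checked. The parameter values and noise level are made up. Unlike plot_predict, I also pass an explicit starting guess (p0) to keep this toy example stable.

```python
import numpy as np
from scipy.optimize import curve_fit

def f_sigmoid(x, a, b, c):
    # a = midpoint, b = steepness, c = max value
    return c / (1 + np.exp(-b * (x - a)))

# Synthetic cumulative counts from known (made-up) parameters, with 1% noise
true_a, true_b, true_c = 30.0, 0.25, 100_000.0
x = np.arange(45)
rng = np.random.default_rng(0)
y = f_sigmoid(x, true_a, true_b, true_c) * rng.normal(1.0, 0.01, size=x.size)

# Same style of box constraints as in plot_predict: lower bounds, then upper bounds
lower = [12.0, 0.001, y.mean()]
upper = [60.0, 2.5, 10 * y.max()]
p0 = [25.0, 0.2, y.max()]   # starting guess; an addition for this sketch

popt, pcov = curve_fit(f_sigmoid, x, y, p0=p0, method="dogbox", bounds=(lower, upper))
a_fit, b_fit, c_fit = popt
print("midpoint %.1f, steepness %.3f, max %.0f" % (a_fit, b_fit, c_fit))
print("parameter standard errors:", np.sqrt(np.diag(pcov)))
```

The fitted values should land close to the true ones because the synthetic data extends past the midpoint. With data that stops before the peak, as the US data did in mid-March, the maximum is much less constrained, which is why the bounds matter.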

Conclusion

While I’m not a medical expert, just a data scientist and software developer, if the math holds, then the long-term threat of the virus is not as bad as the media would portray it. In China it peaked after 17 days (which is probably skewed by the reporting anomaly) and in South Korea after 31 days. The Coronavirus growth followed the logistic model, with near-exponential growth up to the peak day and a symmetric flattening after that. If that holds for the US as well, we should be good by mid-April. I’m not sure how much enforcing the isolation measures affects the timing of the flattening of the curve (as implied by Verhulst’s theory).

The big test for USA will be around March 18–20, to see if indeed the maximum logistic growth happens and the curve turns towards flattening. This could be affected by the expansion of testing, but I believe the pattern will hold.

Keep checking back on this story daily, as I’ll try to update my predictions as often as Johns Hopkins publishes its latest data.

Written by Chris Fotache