Covid Tracker: Personalized alert system for COVID-19 risk

Written by Alvin, Kuan, Sheldon Cooper — Feb 08, 2021

Alvin Yang
The Startup
8 min read · Feb 7, 2021

--

Introduction

Since the beginning of the COVID-19 (SARS-CoV-2) pandemic, data analytics teams have rushed to create data visualization tools, most notably the Johns Hopkins University tracker. However, these tools have not delivered adequate information at the United States county level, which makes it difficult for individuals to make day-to-day decisions.

This article describes a proof-of-concept COVID-19 tracker written in Python, which could be developed into a mobile application in the future. The app can send an alert to the user if the risk of a COVID-19 spike over the next 7 days in a specified US county is above some threshold. How we define this risk is discussed later.

Prototype of the app

The app shall have 3 main features:

  1. A jumbotron showing the predicted percentage increase or decrease in COVID-19 cases for the next 7 days relative to the previous 7 days.
  2. A search bar that the user can use to find a county they wish to track. If a spike in COVID-19 cases is predicted for that county, the app will alert the user.
  3. An interactive heat map of COVID-19 risks in all counties of the state to which the tracked county belongs. The user can click on a county to open another page displaying various COVID-19 statistics (e.g. the number of new cases in the previous week).

Our COVID-19 tracker aims to bridge this information gap by delivering US county-level COVID-19 data in a digestible and informative manner. The tracker takes per-county COVID-19 case data from The New York Times's GitHub repository. It fits a forecasting model with Facebook Prophet on this open-source data to predict the change in the number of COVID-19 cases over the next 7 days in the tracked county. Next, the tracker aggregates the predicted risks of the tracked county and its neighboring counties to compute the overall risk of the tracked county over the next 7 days. The motivation here is to model human flow between neighboring counties.

Forecasting COVID-19 spike in a county

First, the COVID-19 data was scraped from the NYT GitHub repository. The tabular data records daily COVID-19 statistics for each county over a period of time, as in the example below.

Statistics for Nassau county

We then created a new feature called cases_delta, which is the daily change in the number of cases. It is computed as the difference between the number of cases on a given day and the number on the day before.
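The computation of cases_delta can be sketched with pandas as follows. This is a minimal illustration, assuming the column layout of the NYT us-counties.csv file (date, county, state, cases, with cases being a cumulative count); the rows below are made-up numbers, not real data.

```python
import pandas as pd

# Toy rows in the assumed NYT layout; the case counts are invented for illustration.
df = pd.DataFrame({
    "date": pd.to_datetime(["2021-02-01", "2021-02-02", "2021-02-03",
                            "2021-02-01", "2021-02-02", "2021-02-03"]),
    "county": ["Fulton"] * 3 + ["Cobb"] * 3,
    "state": ["Georgia"] * 6,
    "cases": [100, 130, 145, 50, 50, 58],
})

# Sort so that each county's rows are in chronological order.
df = df.sort_values(["state", "county", "date"])

# Difference between each day's count and the previous day's, computed
# within each county so that one county never borrows another's value.
df["cases_delta"] = df.groupby(["state", "county"])["cases"].diff()

print(df[["date", "county", "cases", "cases_delta"]])
```

The first row of each county has no previous day, so its cases_delta is missing and would need to be dropped or filled before fitting a model.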

Timeseries of cases_delta for Fulton county

We defined the risk of a COVID-19 spike over the next 7 days as the ratio between the expected number of new cases in the next 7 days and the number of new cases in the previous 7 days. Both sliding windows are 7 days long because this period coincides with a week: each window covers every day of the week exactly once, so weekly seasonality does not distort the ratio. If the ratio is above some threshold, the app will send an alert to the user, who can then check the updated percentage ratio and other relevant statistics within the app and plan their week ahead.
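Written out, the definition above takes the following form. The notation is introduced here purely for illustration and is an assumption, including the choice that the previous window ends on the last observed day t.

```latex
r_c(t) = \frac{\sum_{i=1}^{7} \hat{y}_c(t+i)}{\sum_{i=0}^{6} y_c(t-i)}
```

Here y_c(d) is the observed cases_delta for county c on day d, and \hat{y}_c(d) is the forecast value for that day. A value of 1 means no expected change, and a value of 1.2 corresponds to an expected 20% increase in new cases.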

To predict the number of new cases in the next 7 days, we built a time series regression model. The model consists of two main components: trend and seasonality. With time as the regressor, it fits several linear and non-linear functions of time as components.

In other words, the procedure’s equation can be written as:
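A sketch of that equation, assuming the standard additive decomposition used by Prophet and keeping only the two components named above:

```latex
y(t) = g(t) + s(t) + \varepsilon_t
```

Here g(t) is the trend, s(t) is the seasonality, and \varepsilon_t is the error term. Prophet's general form also includes a holiday term h(t), which is left out here because the text does not mention it.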

For this purpose, we used the Facebook Prophet library for the time series regression model. The Prophet model was fitted on the historical data (i.e. cases_delta) with a daily timestamp. We adopted a piecewise linear trend and left it to Prophet to find the change points automatically. Note that a change point is a point where two linear segments of the trend are joined. The yearly_seasonality option was set to false because our data does not span enough time to capture annual seasonality. However, we did consider monthly and quarterly seasonality.
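The configuration described above might look like the following minimal sketch. The seasonality periods (in days), the Fourier orders, and the helper names to_prophet_frame and fit_and_forecast are assumptions for illustration, not the authors' settings. The import path is prophet in current releases and fbprophet in older ones.

```python
import pandas as pd

# Extra seasonalities named in the text. The periods and Fourier orders
# are illustrative assumptions.
EXTRA_SEASONALITIES = [
    {"name": "monthly", "period": 30.5, "fourier_order": 5},
    {"name": "quarterly", "period": 91.25, "fourier_order": 5},
]


def to_prophet_frame(county_df):
    """Rename columns to the ds/y names that Prophet expects and drop missing targets."""
    out = county_df[["date", "cases_delta"]].rename(
        columns={"date": "ds", "cases_delta": "y"}
    )
    out["ds"] = pd.to_datetime(out["ds"])
    return out.dropna(subset=["y"]).reset_index(drop=True)


def fit_and_forecast(history, horizon_days=30):
    """Fit a piecewise linear Prophet model and forecast horizon_days ahead."""
    # Imported lazily so the helper above can be used without Prophet installed.
    from prophet import Prophet  # older releases: from fbprophet import Prophet

    # Linear growth with automatic change point detection (Prophet's default);
    # yearly seasonality is disabled because the history is too short.
    model = Prophet(growth="linear", yearly_seasonality=False)
    for seasonality in EXTRA_SEASONALITIES:
        model.add_seasonality(**seasonality)

    model.fit(history)
    future = model.make_future_dataframe(periods=horizon_days)
    return model, model.predict(future)
```

The forecast frame returned by predict contains a yhat column, whose last 7 rows of a 7-day horizon would supply the expected new cases for the risk ratio.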

A plot of the forecast is shown below, where the y-axis is cases_delta and the x-axis is the timestamp. The forecast period is set to 30 days from 02/06/21. The red vertical lines mark the change points computed automatically by Prophet.

Forecast of cases_delta over the next 30 days from 02/06/2021

As stated previously, the model is used to predict the risk of COVID-19 in a given county for the next 7 days. The risk is quantified as the ratio of the sum of new cases in the next 7 days to the sum of new cases in the previous 7 days. A ratio greater than 1 indicates a greater risk to the user, since it implies that the number of new cases is expected to increase. If this ratio exceeds some threshold, the app shall send an alert to the user. Keep in mind that the forecast is refreshed daily, so the user can expect to receive at most one alert per day.
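The alert logic can be sketched in a few lines of plain Python. The function names, the threshold of 1.2, and the sample numbers are hypothetical; the source does not state which threshold the app would use.

```python
def spike_risk(predicted_next_7, actual_prev_7):
    """Ratio of forecast new cases (next 7 days) to observed new cases (previous 7 days)."""
    if len(predicted_next_7) != 7 or len(actual_prev_7) != 7:
        raise ValueError("both windows must cover exactly 7 days")
    previous_total = sum(actual_prev_7)
    if previous_total <= 0:
        return None  # the ratio is undefined when no new cases were recorded
    return sum(predicted_next_7) / previous_total


def percent_change(risk):
    """Convert the ratio into the percentage change shown in the jumbotron."""
    return (risk - 1.0) * 100.0


def should_alert(risk, threshold=1.2):
    """Alert only when the ratio is defined and exceeds the (assumed) threshold."""
    return risk is not None and risk > threshold


# Made-up daily values for illustration.
risk = spike_risk([12, 15, 14, 16, 18, 17, 20], [10, 11, 9, 12, 10, 13, 15])
if should_alert(risk):
    print(f"Alert: new cases expected to change by {percent_change(risk):+.0f}%")
```

Returning None for an empty previous window is a design choice made here to avoid division by zero; a production app would need its own rule for counties with no recent cases.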

Predicted risks in counties of Georgia state over the next 7 days period from 02/07/2021 based solely on the data of individual counties without accounting for interactions between neighboring counties

Estimating risk from neighboring counties

Now we have a forecast model that can be used to quantify the risk to the user over the next 7 days based on data from just the single county they are tracking. However, a county does not exist in isolation. There are human flows between it and neighboring counties, which means that the risk of a COVID-19 spike in a given county is correlated with the risks in its neighboring counties.

Color map displaying the correlations between Fulton county and neighboring counties in terms of predicted risks over the next 7 days from 02/07/2021
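Correlations of this kind can be computed with pandas, as in the sketch below. It assumes Pearson correlation (the source does not say which measure was used) and uses made-up predicted risks for three counties.

```python
import pandas as pd

# Hypothetical predicted risks, one row per forecast date and one column per county.
risks = pd.DataFrame({
    "Fulton": [1.10, 1.05, 0.98, 0.90, 0.95],
    "Cobb": [1.12, 1.06, 0.99, 0.92, 0.97],
    "Coweta": [0.90, 0.95, 1.02, 1.08, 1.01],
})

# Pearson correlation of every county's risk series with Fulton's.
corr_with_fulton = risks.corr()["Fulton"].drop("Fulton")
print(corr_with_fulton)
```

The resulting series, indexed by county name, is what a color map would then shade on the map.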

The role of human flows in spreading the disease is quite complex, so we attempted to capture it by training a machine learning model. In other words, the predicted risk for a given county is a function of its own risk and the risks of its neighboring counties, as output by the Prophet models. This function is the machine learning model. We considered a few machine learning models and selected the most appropriate one.

For a given tracked county, we consider all neighboring counties in the same state as the tracked county. We then let the machine learning model decide the impact each county has on the tracked county in terms of the forecasted risks. This way, the model implicitly learns how trustworthy the fitted Prophet model for a particular county is.

For example, the county of Fulton in Georgia has 10 neighbors, namely: ‘Carroll’, ‘Cobb’, ‘Coweta’, ‘Douglas’, ‘Fayette’, ‘Clayton’, ‘DeKalb’, ‘Gwinnett’, ‘Cherokee’, and ‘Forsyth’. We generated a training matrix where the 11 features are the predicted risks of all 10 neighbors and the county of Fulton itself. Recall that a predicted risk for a given county is defined as a ratio of the predicted number of new cases over the next 7 days and the number of new cases over the previous 7 days, where the prediction is made using the Prophet forecasting model.

Each row of the training matrix represents the predicted risks of neighboring counties and Fulton county and the actual risk for Fulton county over a sliding window of 7 days. For instance, row 13 below represents the forecasted risks of Fulton and its 10 neighbors (i.e. fg, fg_0, fg_1, fg_2, fg_3, fg_4, fg_5, fg_6, fg_7, fg_8, fg_9, respectively) for the next 7 days starting from 06/20/2020. The actual risk recorded in the NYT database over this period is the value of the “ag” feature.
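One row of such a matrix could be assembled as in the sketch below. The helper build_row is hypothetical, the mapping of fg_0 through fg_9 to neighbors in the order listed above is an assumption, and the risk values are made up.

```python
TRACKED = "Fulton"
NEIGHBORS = ["Carroll", "Cobb", "Coweta", "Douglas", "Fayette",
             "Clayton", "DeKalb", "Gwinnett", "Cherokee", "Forsyth"]


def build_row(ds, predicted_risk, actual_risk):
    """Assemble one training row: fg for the tracked county,
    fg_0..fg_9 for its neighbors, and ag as the target."""
    row = {"ds": ds, "fg": predicted_risk[TRACKED]}
    for i, county in enumerate(NEIGHBORS):
        row[f"fg_{i}"] = predicted_risk[county]
    row["ag"] = actual_risk
    return row


# Made-up Prophet-based risks for one 7-day window.
predicted = {county: 1.0 for county in NEIGHBORS}
predicted["Cobb"] = 1.3
predicted[TRACKED] = 1.1

row = build_row("2020-06-20", predicted, actual_risk=1.2)
print(row)
```

Repeating this for every start date of the sliding window, then stacking the rows, would give the full training matrix.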

Predicted and actual risks of Fulton county and neighbors

Now we are ready to train a machine learning model where the X matrix consists of all the features except for “ds” (timestamp) and “ag” (actual risk), while the Y matrix consists of just the “ag” feature.

First, we investigated linear regression and a multilayer perceptron (MLP). We evaluated their performance using mean squared error (MSE) and found it unsatisfactory, so we trained a ridge regression model instead. The optimal hyperparameter setting was obtained through grid search with cross-validation.

Best hyperparameters setting: {'alpha': 0.1}
MSE of the best hyperparameters: -0.009720928990920035
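A grid search of this kind can be sketched with scikit-learn, which is an assumption here since the source does not name the library. scikit-learn's neg_mean_squared_error scorer reports the MSE with its sign flipped so that higher is better, which would explain the negative value above. The training matrix below is synthetic, and the alpha grid and fold count are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
feature_cols = ["fg"] + [f"fg_{i}" for i in range(10)]

# Synthetic stand-in for the training matrix (not real data).
n_rows = 60
train = pd.DataFrame(rng.uniform(0.7, 1.3, size=(n_rows, 11)), columns=feature_cols)
train.insert(0, "ds", pd.date_range("2020-06-01", periods=n_rows, freq="D"))
train["ag"] = 0.6 * train["fg"] + 0.4 * train["fg_1"] + rng.normal(0, 0.05, n_rows)

# X holds every feature except the timestamp and the target; y is the actual risk.
X = train.drop(columns=["ds", "ag"])
y = train["ag"]

search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},  # assumed grid
    scoring="neg_mean_squared_error",               # negated MSE, so higher is better
    cv=5,
)
search.fit(X, y)

print("Best hyperparameters setting:", search.best_params_)
print("Negated MSE of the best setting:", search.best_score_)

# Predict the net risk for the most recent window with the refitted best model.
predicted_risk = search.predict(X.iloc[[-1]])[0]
print("Predicted:", predicted_risk)
```

Because the rows are ordered in time, a time-aware splitter such as scikit-learn's TimeSeriesSplit could be passed as cv instead of plain 5-fold splitting; which scheme the authors used is not stated.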

We can now use the trained ridge regression model to predict the risk over the next 7 days for Fulton county while accounting for the predicted risks in neighboring counties.

Predicted: 1.0987040020810637, Actual: 1.222059866605881

Now we are ready to apply the model to a subregion of Georgia. Recall that for the tracked county and each of its neighboring counties, we use the Prophet models to predict their risks over the next 7 days. These predicted risks are then fed into the ridge regression model, which was trained specifically for the tracked county to account for human flows, to predict the net risk for the tracked county over the next 7 days.

Predicted risks for counties in Georgia over the next 7 days from 02/07/2021

Conclusion

We developed a proof-of-concept COVID-19 tracker that could be further developed into a mobile application. The tracker forecasts the risk of a COVID-19 surge in the next 7 days for the tracked county, which the user sets in the app. This risk is defined as the ratio of the predicted number of new cases in the next 7 days to the actual number of new cases in the previous 7 days. The number of cases was forecast with the Facebook Prophet library. A model was fitted for each county, accounting for the trend and seasonality of that county's past daily case counts.

We found that the predicted risk of each county is correlated with the predicted risks in its neighboring counties. This suggests that human flows between counties affect the predicted risk in the tracked county. To account for this, we trained a ridge regression model for the tracked county so that it learns the influence of neighboring counties on the predicted risk. The goal is for the ridge regression model to learn the trustworthiness of the fitted curve from Prophet for each neighboring county.

Finally, we combined these components to predict the net risks over the next 7 days for a subset of counties in Georgia.
