An overview of SIR model
Have you wondered how our fellow data scientists are able to forecast the peak of the pandemic? Have you wondered how they are forecasting the confirmed cases with a MAPE as low as 0.1%?
Forecasts could be made using various time series algorithms like ARIMA, Exponential Smoothening, etc. But do these algorithms capture the real behavior of the disease? Do they take into account the growth of the recovered population?
That is where SIR model comes into picture. To capture the real behavior of a pandemic situation.
A brief introduction about SIR model:
SIR stands for Susceptibles, Infected and Removed.
Consider a closed room scenario of N people and let’s divide them into 3 categories/buckets.
The first category is the Susceptibles S(t)- all of the people who are capable of becoming sick from an infection. We assume that all the people in the room are initially susceptible to the infection.
The second category is the Infected I(t)- people come into Infected bucket leaving the susceptibles bucket assuming that once a person gets infected, he isn’t going to get infected again.
The third category is the Removed R(t)- people that either got recovered from the disease or died due to the infection come under this bucket.
S(t) + I(t) + R(t) = N
At the beginning,
S(0) = So (Entire population — infected people)
I(0) = Io (very small number of infected people, probably 1)
R(0) = 0 (there would be no one in the removed bucket as we are at the beginning)
So the rate at which people are transferred from one bucket to another are explained by a system of 3 differential equations
Let us go one by one
dS(t)/d(t) is the rate at which people move out from susceptibles bucket to the infected bucket. It is a function of suscpetibles and infected people.
dI(t)/d(t) is the rate at which people move out to removed bucket and move in from susceptibles bucket. It is a function of suscpetibles and infected people.
dR(t)/d(t) is the rate at which people move into removed bucket from the infected bucket. It is a function of only infected people.
Note: People moving out from a bucket is denoted as negative while people moving into the bucket is denoted as positive.
Let me explain what beta and gamma are. But before we go ahead, we should know the terms infection rate, interaction rate and recovery rate.
Infection rate: It is the probability of Infection, given contact between susceptible and infected individual
Interaction rate: It is how often does a susceptible individual and an infected individual come in contact
Recovery rate: It is basically the inverse of the time required for an individual to recover from the disease
Now,
beta is the effective contact rate which is the product of interaction rate and infection rate while gamma is the recovery rate.
What we essentially do is find out the parameters beta and gamma to forecast the numbers of the consecutive day
Forecasts for the next day are given by
Assumptions of the SIR model:
Birth rate and death rate are almost negligible and are not considered.
Entire population is susceptible to be affected by the virus.
At any given point of time, the individuals present in all the 3 buckets sums up to the entire population.
Let’s get into the coding part
Firstly, create an analytical dataset of the following form
and then define your equations and optimization function
### Defining the SIR model equations
def ds_dt(S,I,beta,N):
return - beta * S * I / Ndef di_dt(S,I,beta,gamma,N):
return (beta * S * I / N) - (gamma * I)def dr_dt(I,gamma):
return gamma * Idef SIR_model(t,y,beta,gamma,N):
S,I,R = y
s_out = ds_dt(S,I,beta,N)
i_out = di_dt(S,I,beta,gamma,N)
r_out = dr_dt(I,gamma)
return [s_out,i_out,r_out]
where s_out, i_out and r_out are forecasted values of susceptibles, infected and removed.
Next, define the initial set of values of susceptibles, infected and removed and then predict the values for the next 7 days using solve_ivp
def eval_model_const(params, data, population, return_solution=False):
beta, gamma = params
N = population
n_infected = data['Infected'].iloc[0]
n_susceptibles = data['Susceptibles'].iloc[0]
n_removed = data['Removed'].iloc[0]
max_days = len(data)
initial_state = [n_susceptibles,n_infected,n_removed]
args = (beta,gamma,N)
sol = solve_ivp(SIR_model, [0, max_days], initial_state, args=args, t_eval=np.arange(0, max_days))
sus, inf, rem = sol.y
y_pred_inf = np.clip(inf,0,np.inf)
y_pred_rem = np.clip(rem,0,np.inf)
y_true_inf = data['Infected'].values
y_true_rem = data['Removed'].values
optim_days = 7 # Days to optimise for
msle_cases = np.square(y_pred_inf - y_true_inf) + np.square(y_pred_rem - y_true_rem)
msle_final = np.sqrt(np.mean(msle_cases[-optim_days:]))
if return_solution:
return msle_final, sol
else:
return msle_final
solve_ivp: For a given set of initial values and next ’n’ periods, it forecasts the values of susceptibles, infected and removed by solving for the rates of susceptibles, infected and removed.
You can go through the following link to learn more about solve_ivp “https://docs.scipy.org/doc/scipy/reference/generated/scipy.integrate.solve_ivp.html”
Note that the error here is determined by square_root{mean[square(predicted_infected — true_infected) + square(predicted_removed — true_removed)]} of the past 7 days.
Final output is the error in predicted vs forecasted values for the past 7 days.
Next, fit the model
from scipy.optimize import LinearConstraint
linear_constraint = LinearConstraint([[1,-10]],[-np.inf],[0])def fit_model(df_ADS_all,Population = 2500000):
# Get last 8 rows for optimization
df_data = df_ADS_all[-8:]
df_data = df_data.reset_index(drop = True)
# Optimize for params
OPTIM_DAYS = 7
res = minimize(eval_model_const,
x0=[0.01,0.011],
bounds=((0,1),(0.01,0.1)),
args=(df_data,Population,False),
method='trust-constr',
constraints = [linear_constraint],
options={'disp':True})
return res.x,df_ADS_all,Population
Optimize the parameters by minimising the objective function that is returned in the eval_model_const function to get the best values of beta and gamma.
params,df_ADS_all,Population = fit_model(df_ADS,6000000)
params gives an array containing the values beta and gamma where params[0] is the beta or effective contact rate and param[1] is the gamma or the recovery rate.
Finally forecasting the values
def fcast(interaction_rate,infection_rate,recovery_rate,df_data,nperiod,Population):
in_susceptible = df_data.Susceptibles.iloc[-1]
in_infected = df_data.Infected.iloc[-1]
in_removed = df_data.Removed.iloc[-1]
initial_state = [in_susceptible,in_infected,in_removed]
max_days = nperiods + 1
beta,gamma = interaction_rate * infection_rate,recovery_rate
args = (beta,gamma,Population)
sol_forecast = solve_ivp(SIR_model, [0, max_days], initial_state, args=args, t_eval=np.arange(max_days))
forecast_date = []
initial_date = df_data.Date.iloc[-1]
for i in range(max_days - 1):
next_date = initial_date + datetime.timedelta(days = i+1)
forecast_date.append(next_date)
df_forecast = pd.DataFrame({'Susceptibles':sol_forecast.y[0][1:],
'Infected':sol_forecast.y[1][1:],
'Removed':sol_forecast.y[2][1:],
'Date':forecast_date})
df_forecast['Susceptibles'] = np.round(df_forecast['Susceptibles'])
df_forecast['Infected'] = np.round(df_forecast['Infected'])
df_forecast['Removed'] = np.round(df_forecast['Removed'])
df_forecast['Confirmed'] = df_forecast.Removed + df_forecast.Infected
return df_forecast[['Date','Confirmed','Susceptibles','Infected','Removed']]
Get the initial values of susceptibles, infected and removed and forecast the values for the next n_periods
df_forecasts = fcast(params[0]/0.5,0.5,params[1],df_ADS_all,65,Population)df_forecasts.head(5)
Defining population: In the case of a country, we cannot assume that the entire population is susceptible because there might be some states/districts where there are no infected people. So, we decide the parameter “population” which is done on an iterative basis. We will choose different population scenarios and based on the results that we have for each of those scenarios in terms of MAPE and when the disease is expected to peak, we will choose a population.
There are a bunch of different variants of the standard SIR model. One such variant is the SEIR model. I’ll not go into the details but let me you give a brief introduction.
There is a fourth category in SEIR model which is exposed category. The reason for this category is when you first get infected with a disease, often you may not immediately have symptoms and may not be actually spreading the disease to others. You may not currently be infectious. Nevertheless you have actually received it. It might be considered as a period where the infection is getting developed within somebody. After some days, these exposed people would then transition to be infectious where they are able to infect other people and then recover after that.
References: