Coronavirus model using R — Colombia

Daniel Pena Chaves
4 min readFeb 27, 2020

--

This morning I saw a post by Learning Machines where I learned to build a simple model for epidemics such as Coronavirus( 2019-nCoV). I decided to replicate this model for a smaller population — Colombia, using the data gathered by Rami Krispin. These are my findings.

Load libraries

library(ggplot2)
library(deSolve)
library(tidyverse)

Data

I downloaded the data <coronavirus> from RamiKrispin’s Github

Managing Data

In order to be able to use this data for our model, I added all the data points so that we had a number of total cases for each given day. I took out Feb-13 because it added more cases due to new identification techniques, and this may affect the way the model fits. I am yet to try to work out the model with this info, I will try it again someday.

c1 <- coronavirus %>%
select(date,type,cases) %>%
group_by(date,type) %>%
summarise(total_cases = sum(cases))
c2 <- c1[which(c1$type=="confirmed" & c1$date != "2020-02-13" ),]c2$cumtotal_cases <- cumsum(c2$total_cases)ggplot(data= c2 ,aes(date,cumtotal_cases, colour = type)) + geom_line()
Data gathered from Rami Krispin’s GitHub

Model

This model is adapted from a Learning Machine’s post-

I set the population = 50,000,000 which is a good enough estimate for Colombia right now. Also, I created a vector Infected which takes in the info from the <coronavirus> data.

SIR models, the dynamics of the outbreak and uses three differential equations, in order to do so:

dS, dI, and dR, you can think of them as the rate of change for Susceptibles, Infected and Recovered for a given t.

Beta is the parameter that controls the transition between Susceptibles and Infected and Gamma controls the transition between Infected and Recovered:

pop <-50000000Infected <- as.integer(c2$cumtotal_cases)Dia <- 1:length(Infected)SIR <- function(time, state, parameters) {
par <- as.list(c(state, parameters))
with(par, {
dS <- -beta/pop * I * S
dI <- beta/pop * I * S - gamma * I
dR <- gamma * I
list(c(dS, dI, dR))
})
}

Sum of Squared Differences

RSS takes the info given and minimizes the Sum of Squared Difference, between the number of infected -I(t)- and the corresponding number of predicted cases by our model I*(t)

RSS(β,γ)=∑ (I(t)−I*(t))²

init <- c(S = pop-Infected[1], I = Infected[1], R = 0)
RSS <- function(parameters) {
names(parameters) <- c("beta", "gamma")
out <- ode(y = init, times = Dia, func = SIR, parms = parameters)
fit <- out[ , 3]
sum((Infected - fit)^2)
}

Opt <- optim(c(0.5, 0.5), RSS, method = "L-BFGS-B", lower = c(0, 0), upper = c(1, 1)) # optimize with some sensible conditions
Opt_par <- setNames(Opt$par, c("beta", "gamma"))
Opt_par
## beta gamma
## 0.5770345 0.4229655
t <- 1:100 # time in days
fit <- data.frame(ode(y = init, times = t, func = SIR, parms = Opt_par))

Interesting Stats

In the learning machine’s post R0 is calculated at 2.073224, we have a much lower value at 1.364, this R0 refers to the number of healthy people that get infected per number of infected people. I believe it is much lower because the data that I am using here shows a far slower expansion, also maybe the population difference might affect this indicator.

The height for the pandemic in our model is at day 64 with almost 2 million people infected and a maximum of 40,000 dead. Compared to Learning Machines height at day 50 with over 200 million infected and 4 million dead.

par(old)

R0 <- setNames(Opt_par["beta"] / Opt_par["gamma"], "R0")
Height <- fit[fit$I == max(fit$I), "I", drop = FALSE] # height of pandemicMax_Dead <- max(fit$I) * 0.02 # max deaths with supposed 2% mortality rateR0 ## 1.364259Height ## 64 - 1966588Max_Dead ## 39331.75

Graphs

I like nice graphs, so the code is long, therefore I will just leave the first one.

graph11 <- fit %>% gather(key, value, -time)bigplot <- ggplot(graph11, mapping = aes(x = time, y = value, color = key) ) +   
geom_line(size =1.25)+
scale_color_manual( values =
c("red1", "green3","gray1"))+

theme(
plot.title = element_text(size = 12, face = "bold",hjust = 0.5),
plot.caption = element_text(size = 8, face = "italic"),
legend.position="top",
legend.title = element_blank(),
legend.box = "horizontal" ,
legend.text=element_text(size=8.5),
panel.grid.minor = element_blank(),
panel.grid.major = element_line(color = "gray50", size = 0.5),
panel.grid.major.x = element_blank(),
panel.background = element_blank(),
line = element_blank(),
axis.ticks.length = unit(.15, "cm"),
axis.ticks.y = element_blank(),
axis.title.x = element_text(color="black",
size=12),
axis.title.y = element_text(color="black",
size=10,
face="italic"))+

scale_y_continuous(expand = c(0, 0),
limits=c(0.0,50000000),
breaks=seq(0.0,50000000,10000000),
name = "Number of subjects")+

scale_x_continuous(expand = c(0, 0),
name = "Days")+

labs(title = "SIR Model 2019-nCov Colombia",
caption = "Info taken from RamiKrispin. Adapted model from Learning Machines.")
bigplot

Final Thoughts

This model is very simple and there is a huge number of other variables that need to be considered, such as density, climate, government response … etc.

It was fun to build. Many thanks to RamiKrispin for the data and to Learning Machine (Holger Von Jouanne) for the tutorial.

--

--

Daniel Pena Chaves

Trying to learn Data Science one day at a time. Follow me for easy R projects.