Modelling Virus’s Spread Based on Geolocation Data

Published in Inloco Tech Blog, May 14, 2020

Thanks to Afonso Delgado and Abel Borges for their huge contribution to this project.

Pandemics happen because we live in a connected world. There are many challenges associated with the current COVID-19 pandemic, mainly due to how difficult it is to identify and treat the infected population. It is therefore crucial to understand how and why the virus reaches new locations. This can help with monitoring hospital capacity levels, e.g. to anticipate when the health system may collapse. It also helps to estimate when, during the outbreak, infections will peak. A localized understanding of the pandemic’s phases empowers people and health authorities to act accordingly.

We at Inloco have been developing a unique geolocation technology over the years with the vision of making people’s lives more practical through ubiquitous technologies. When COVID-19 arrived in Brazil, we looked for ways to use this potential in initiatives that would help combat the disease and provide inputs for understanding the scenario. During that time we created a series of metrics, such as a social distancing index, and developed research projects that use our aggregated, anonymized data to model aspects of the virus’s dissemination.

In epidemiology, the basic reproductive number, Ro, is the average number of secondary infections that one infected person would produce in a fully susceptible population over the entire infectious period.

For most models, Ro provides a threshold condition for the stability of the disease-free equilibrium point:

The disease-free equilibrium point is locally asymptotically stable when Ro < 1: the disease dies out;
The disease-free equilibrium point is unstable when Ro > 1: the disease establishes itself in the population or an epidemic occurs;
For a given model, Ro is fixed over all time.

Ro offers an intuitive way to understand why social distancing and other non-pharmaceutical interventions are helpful and the impact they can have on reducing virus transmission. This article from the Journal of Travel Medicine provides an estimate of Ro for COVID-19. Later, we will see how movement information may be related to the spread of the virus.

The SEIR model

Compartmental models are a common approach to understanding how epidemics evolve. Basically, these models divide a population into different groups, e.g. Susceptible, Exposed, Infectious, and Recovered. There are also models that include Maternally-derived-immunity and Dead groups, among others. Models like these are all inspired by the work in A.

Over time, people will transition from one group to another.

The model can be mathematically described by the following differential equations:

where S counts the number of susceptible individuals, E the exposed ones, I the infectious population, and R the recovered population. At all time instants, the total population is assumed to remain constant (i.e. no deaths) at N = S + E + I + R. As stated above, people transition from one group to another over time; the parameters beta, sigma and gamma are the transition rates, which indicate the fraction of individuals that moves from one group to the next per unit of time. For this model, the basic reproductive number is Ro = beta/gamma.
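The compartments and rates described above match the standard SEIR system. As a sketch, assuming the common form in which new exposures scale with S·I/N (this scaling is our assumption, chosen because it is consistent with Ro = beta/gamma):

```latex
\begin{aligned}
\frac{dS}{dt} &= -\beta \frac{S I}{N} \\
\frac{dE}{dt} &= \beta \frac{S I}{N} - \sigma E \\
\frac{dI}{dt} &= \sigma E - \gamma I \\
\frac{dR}{dt} &= \gamma I
\end{aligned}
```

Summing the four right-hand sides gives zero, which is exactly the constant-population condition N = S + E + I + R.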

Below is an example plot of the SEIR model showing how each group behaves over the course of the spread. At the beginning of the contamination, that is, when the horizontal axis is at 0, the entire population is considered susceptible, assuming that there was no previous exposure to the virus. As time goes by, the number of individuals in the exposed group increases and, consequently, the number of infected people increases as well. The infectious curve shows the distribution of cases over time and gives a general sense of the outbreak’s magnitude, which is why it gets the most attention from the media during the first stages of the outbreak. In addition, the infectious curve directly represents the percentage of the population that is infected at each moment.

The behaviour of each group during the outbreak
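Curves like these can be reproduced with a few lines of code. The following is a minimal sketch, not our production code: it integrates the standard SEIR equations with a simple forward-Euler scheme. The function name `simulate_seir`, the step size, the population of one million, and the parameter values are all illustrative assumptions.

```python
def simulate_seir(beta, sigma, gamma, population, initial_infected, days,
                  steps_per_day=10):
    """Integrate a standard SEIR model with forward Euler; one record per day."""
    dt = 1.0 / steps_per_day
    s = float(population - initial_infected)
    e, i, r = 0.0, float(initial_infected), 0.0
    history = [(s, e, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_exposed = beta * s * i / population * dt   # S -> E
            new_infectious = sigma * e * dt                # E -> I
            new_recovered = gamma * i * dt                 # I -> R
            s -= new_exposed
            e += new_exposed - new_infectious
            i += new_infectious - new_recovered
            r += new_recovered
        history.append((s, e, i, r))
    return history


# Hypothetical scenario: one million people, 10 initial cases.
POPULATION = 1_000_000
history = simulate_seir(beta=0.75, sigma=0.2, gamma=0.3,
                        population=POPULATION, initial_infected=10, days=200)

# Day on which the infectious compartment is largest, and its share of N.
peak_day = max(range(len(history)), key=lambda day: history[day][2])
peak_share = history[peak_day][2] / POPULATION
print(f"peak on day {peak_day} with {peak_share:.1%} of the population infectious")
```

Plotting the four columns of `history` against the day index gives the familiar S, E, I and R curves.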

Taking geolocation data into account

Suppose that, for a given state, we simulated one SEIR dynamic for each city independently. We would not be considering travel from one city to another, which can be responsible for spreading the virus more rapidly throughout the state. This scenario is a good example of why social distancing measures are important and how they may be related to the basic reproductive number, Ro: the more people move between different locations, the faster the spread.

But how can we add movement flows to the model? One possible way is to use origin-destination matrices, or simply O-D matrices. Each cell of an O-D matrix holds the number of trips from the origin (row) to the destination (column).

An O-D matrix example by I Ekowicaksono et al 2016 IOP Conf
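To make the row/column convention concrete, here is a small sketch that counts trips into an O-D matrix. The location names, the trip list and the helper `build_od_matrix` are hypothetical and only serve as an illustration; they are not our actual pipeline.

```python
def build_od_matrix(trips, locations):
    """Count trips into a matrix where row = origin and column = destination."""
    index = {name: position for position, name in enumerate(locations)}
    matrix = [[0] * len(locations) for _ in locations]
    for origin, destination in trips:
        matrix[index[origin]][index[destination]] += 1
    return matrix


# Hypothetical locations and observed trips (origin, destination).
locations = ["A", "B", "C"]
trips = [("A", "B"), ("A", "B"), ("B", "A"), ("C", "A"), ("B", "C")]

od = build_od_matrix(trips, locations)
for name, row in zip(locations, od):
    print(name, row)
```

Here `od[0][1]` is the number of trips from A to B, while `od[1][0]` counts trips in the opposite direction, so the matrix is in general not symmetric.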

There are several ways to estimate the O-D matrix. In general, counting how many people went from one point to another is done through surveys, which can be expensive to keep up to date. Our technology makes it possible to obtain rough O-D matrix estimates in a faster, more flexible and more scalable way. But how exactly will the SEIR model use this matrix? There are some approaches based on the Nature articles 1 and 2. Here we consider that the number of newly exposed individuals at location j and time t+1 evolves according to the following equation:

where we inserted two new components, in addition to those previously defined in the equations:

Denotes the proportion of infected population at location k at time t

The number of people travelling from location k to location j in one unit of time

We see now that the updated number of exposed individuals equals those from the previous generation plus those who arrived from different locations.
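The idea can be sketched in code as a discrete-time update. This is one plausible form under stated assumptions, not necessarily the exact equation we used: we assume that infected travellers arriving at location j expose local susceptibles at the same rate beta as local infectious people, and that travellers return home, so each location's total population stays constant. The names `od_seir_step` and `simulate`, and all numbers, are hypothetical.

```python
def od_seir_step(state, od, beta, sigma, gamma):
    """Advance every location by one time unit.

    state: list of (S, E, I, R) tuples, one per location.
    od[k][j]: number of people travelling from location k to j per time unit.
    """
    n_locations = len(state)
    totals = [sum(compartments) for compartments in state]
    # Proportion of infected population at each location k at time t.
    x = [state[k][2] / totals[k] for k in range(n_locations)]
    new_state = []
    for j, (s, e, i, r) in enumerate(state):
        local = beta * s * i / totals[j]
        # Assumption: expected number of infected people arriving at j.
        arriving_infected = sum(od[k][j] * x[k]
                                for k in range(n_locations) if k != j)
        imported = beta * s * arriving_infected / totals[j]
        new_exposed = min(s, local + imported)  # cannot expose more than S
        new_state.append((s - new_exposed,
                          e + new_exposed - sigma * e,
                          i + sigma * e - gamma * i,
                          r + gamma * i))
    return new_state


def simulate(state, od, steps, beta=0.75, sigma=0.2, gamma=0.3):
    for _ in range(steps):
        state = od_seir_step(state, od, beta, sigma, gamma)
    return state


# Two hypothetical cities; only the first one starts with cases.
initial = [(999_990.0, 0.0, 10.0, 0.0), (100_000.0, 0.0, 0.0, 0.0)]
connected = simulate(initial, [[0, 5000], [5000, 0]], steps=30)
isolated = simulate(initial, [[0, 0], [0, 0]], steps=30)
print("city 2 infectious, connected:", round(connected[1][2], 2))
print("city 2 infectious, isolated: ", isolated[1][2])
```

With an all-zero O-D matrix the second city never leaves the fully susceptible state, which is the city-independent scenario; with travel, cases appear there without any local seed.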

With this SEIR model adapted with the O-D matrices, we ran some simulations to understand how the virus would spread in the state of São Paulo, Brazil. To build the O-D matrices, we observed the movement between the cities of the state in periods with and without restrictive measures and created one matrix for each day of the week, so as to capture the different weekly patterns. For the SEIR part of the model, we fixed the parameters at beta = 0.75, gamma = 0.3 and sigma = 0.2, which implies Ro = 0.75/0.3 = 2.5. These values were chosen to be aligned with the estimates of the Journal of Travel Medicine’s study. This assumes that the parameters are the same for every city considered, which in practice is not necessarily true. In addition, we do not allow the parameters to change over time. We then start from 10 initial cases in the capital and simulate how the dissemination would happen.

In the graph on the left, we have the curves of the susceptible, infected and recovered groups under location-aware SEIR dynamics driven by the city-level O-D matrix. Here we consider the period from 05/17 to 05/31 to build the O-D matrices. We can see that the peak of infections occurs 77 days after the first infection in the simulation, with 6.1% of the population infected. Another important point is that the infection curve has another peak after day 77, which can be read as a kind of second wave of infection. Because we use the movement data of the region, we can capture scenarios like this from the flow of people between cities, which spreads the virus in a non-uniform way across locations.

To measure how well the O-D-augmented SEIR dynamics predict the spread of the virus throughout the state of São Paulo, we ranked the cities by the date of the first case as predicted by the model and compared this ranking with the one given by the actually observed date of the first case, using data available here.
In total, 411 cities (out of 645, 63.72%) were included in the analysis; that amounts to 94.92% of the population of the entire state.

The Pearson correlation between these two rankings, i.e. Spearman’s rank correlation, is 67.62%, with a (frequentist) 95% confidence interval of 61.99%-72.54%. The scatter plot below shows that the variance in the relationship is higher at the tails, which makes sense under the argument that (1) the top connected cities are easier to identify and (2) are also more likely to be bigger and therefore to have first cases earlier than other clusters of cities. In fact, the rank correlation between the simulated date of first case and the city population is 78.51% (73.44% for the actual date of first case versus population).
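The statement that Spearman’s correlation is the Pearson correlation of the ranks can be checked directly. The sketch below uses made-up first-case days for six cities (hypothetical data, not our results). It also computes a confidence interval with the Fisher z-transform; that this is the method behind the interval quoted above is an assumption on our part, although plugging in r = 0.6762 and n = 411 gives an interval close to it.

```python
import math


def ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            result[idx] = average_rank
        pos = end + 1
    return result


def pearson(xs, ys):
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman's correlation is Pearson's correlation applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))


def fisher_interval(r, n, z_crit=1.959964):
    """Approximate 95% confidence interval via the Fisher z-transform."""
    center = math.atanh(r)
    half_width = z_crit / math.sqrt(n - 3)
    return math.tanh(center - half_width), math.tanh(center + half_width)


# Hypothetical day of first case per city: simulated vs. observed.
simulated_day = [0, 12, 20, 25, 31, 40]
observed_day = [0, 15, 18, 35, 28, 44]
rho = spearman(simulated_day, observed_day)
low, high = fisher_interval(0.6762, 411)
print(f"rho = {rho:.4f}; interval for r = 0.6762, n = 411: {low:.4f} to {high:.4f}")
```

Only the ordering of the dates matters here, which is why the comparison is robust to the simulation getting the absolute dates wrong.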

It’s important to note that the only information the simulated dynamics have about how connected the cities are comes from the O-D matrix. The analysis suggests that there’s some (potentially useful) predictive power in such information, even though it’s not enough to explain most of the variance in the observed COVID-19 spread throughout the state (as measured by these rankings): the adjusted R² of a linear model estimates the explained variability as 45.59%.
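The adjusted R² can be related to the rank correlation reported earlier. As a sketch, assuming the linear model regresses one ranking on the other with a single predictor over the 411 cities (the helper name `adjusted_r_squared` is ours):

```python
def adjusted_r_squared(r, n_observations, n_predictors=1):
    """Adjusted R^2 of a simple linear model, given the correlation r."""
    r_squared = r ** 2
    penalty = (n_observations - 1) / (n_observations - n_predictors - 1)
    return 1 - (1 - r_squared) * penalty


adjusted = adjusted_r_squared(0.6762, 411)
print(f"{adjusted:.2%}")  # prints 45.59%
```

In other words, the explained variability is essentially the squared rank correlation, slightly penalized for the number of predictors.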

As mentioned, when we consider the information from the O-D matrices, we add to the model the fact that people coming from other places can influence the spread of the virus in a given region; in this way, the virus spreads more quickly than predicted by multiple city-independent SEIR dynamics. Below, an animation shows the spatial spread over the territory of the state of São Paulo, considering the capital as “area 0”.

Summarizing

We presented some of our best efforts to understand the COVID-19 pandemic, in particular how location data may be useful to make sense of its dynamics;
We applied a modification of the SEIR model proposed in 1 and 2 to take into account daily travel between cities within the state of São Paulo, as estimated by our technology in the form of O-D matrices;
The order in which the virus reaches new cities in the O-D-augmented SEIR simulation is in reasonable accordance with what is observed in official reports, even though much of the dynamics remains unexplained. We remark that no effort was made to choose SEIR parameters that best reflect the reality observed in São Paulo.

We’re already in touch with many researchers. If you believe that this data can help you, feel free to reach out to us. We have also organized all the development discussed here in an internal project, and we plan to make it open source in the future.

Written by Gabriel Teotonio