Visits variations analysis from the visits detection capability perspective

Hector Pinheiro
Published in Incognia Tech Blog
10 min read · Nov 7, 2018

The detection of visits that occurred in a specific place is at the core of In Loco’s technology, and our teams are continuously working to improve the detection capability, that is, the ability to detect visits whenever they occur. The task is performed on mobile devices by partner applications that use our SDK (Software Development Kit).

Anonymous data are collected not just to detect visits but also for later extraction of contextual information associated with them (such as the place and the intention of the visitor). Since operating systems impose several restrictions on data collection routines, and device resources (such as power consumption) must be preserved, improving the detection capability is not an easy task. It demands intelligent algorithms that identify the best moments to collect relevant data.

Our detection capability is profoundly impacted by the number of mobile application partners and by our algorithms (which are executed within them). Updates to our SDK or the arrival of new partners directly affect the number of observed visits in our database, causing variations that make it difficult to compare different time periods. Therefore, an essential step in the analysis is to estimate the influence caused by the variation of the detection capability.

Note

As Medium does not support math expressions naturally in the text, most of them will be cited with regular alphabet letters rather than the expression itself.

For a particular location, given two different time periods i and j, and the number of visits observed in them (Vi and Vj), the relationship between them can be expressed as

$$V_j = C_{ij} \, V_i \tag{1}$$

where Cij defines the visits variation coefficient between the time periods.

Table 1 presents the visits and variation coefficients observed in São Paulo for different months. One can notice the increase in the number of visits from Aug 2017 to Oct 2017. A variation coefficient of 1.86 was observed in Sep 2017, which represents an 86% increase in the number of visits. It is natural to believe that most of this increase was caused by our ability to detect more visits.
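As a minimal sketch of how such coefficients can be computed, the snippet below derives the variation coefficient of each period relative to the previous one, so that the visits of a period equal the coefficient times the visits of the period before it. The visit counts are hypothetical placeholders, not the values of Table 1, and `variation_coefficients` is an illustrative helper name.

```python
def variation_coefficients(visits):
    """Return the coefficient C of each period relative to the previous one (V_j = C * V_i)."""
    coefficients = []
    for previous, current in zip(visits, visits[1:]):
        if previous <= 0:
            raise ValueError("the reference period must have observed visits")
        coefficients.append(current / previous)
    return coefficients


# Hypothetical monthly visit counts (illustrative only, not the data of Table 1).
monthly_visits = [1000, 1860, 2046]
coefficients = variation_coefficients(monthly_visits)

for c in coefficients:
    # A coefficient of 1.86 corresponds to an 86% increase over the previous period.
    print(f"coefficient={c:.2f}, change={(c - 1) * 100:+.0f}%")
```

A coefficient above 1 indicates growth and one below 1 indicates a drop, regardless of whether the cause is real flow or a change in detection.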

Here, we describe our approach to estimating the influence of our visits detection capability on the variation coefficients. It consists of properly modeling the visits variation coefficients and decomposing them into components with specific causes of variation. By identifying which components are impacted by changes in the detection capability, we can estimate them and remove them from the observed visits variations.

Table 1 — Visits and variation coefficients observed in São Paulo (Brazil) for different time periods. The variation coefficients are computed by comparing the visit amount with the previous month.

Variation coefficients modeling

The detection capability relates the number of observed visits to the number of visits that actually happened. For a given place and time period i, the number of observed visits is

$$V_i = \mathrm{Det}_i \, \hat{V}_i \tag{2}$$

where V^i is the actual number of visits and Deti is the visits detection capability, which can be seen as the proportion of actual visits that is observed by our technology. Furthermore, given the detection capabilities in two different time periods, we may define their relationship as

$$\mathrm{Det}_j = C^{\mathrm{det}}_{ij} \, \mathrm{Det}_i \tag{3}$$

where Cdetij defines a variation coefficient between them.

Although it is difficult to define the detection capability itself, estimating its variation through time is more feasible. Our goal here is to model the variations observed in places in order to estimate the variations of the actual visits, removing the effect of the detection capability variations.

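Combining Equations 1–3,

$$\mathrm{Det}_j \, \hat{V}_j = C_{ij} \, \mathrm{Det}_i \, \hat{V}_i \tag{4}$$

$$C^{\mathrm{det}}_{ij} \, \mathrm{Det}_i \, \hat{V}_j = C_{ij} \, \mathrm{Det}_i \, \hat{V}_i \tag{5}$$

$$\hat{V}_j = \frac{C_{ij}}{C^{\mathrm{det}}_{ij}} \, \hat{V}_i \tag{6}$$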

By defining

$$\hat{C}_{ij} = \frac{C_{ij}}{C^{\mathrm{det}}_{ij}} \tag{7}$$

we obtain a variation coefficient that relates the estimated actual visits at a given place in two different time periods. The next step is the estimation of Cdetij. Observe that estimating the variations in detection capability yields an estimate of the visits variations, but not of the number of visits itself. The latter would only be possible if the actual visits were known in at least one time period (Equation 6).
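The normalization step can be sketched in a few lines, assuming the normalized coefficient is the observed coefficient Cij divided by the detection capability variation Cdetij. The function name `normalize_variation` and the detection variation of 1.5 are made up for illustration; only the 1.86 figure comes from the text.

```python
def normalize_variation(observed_coefficient, detection_coefficient):
    """Remove the detection capability variation from an observed visits variation."""
    if detection_coefficient <= 0:
        raise ValueError("the detection capability variation must be positive")
    return observed_coefficient / detection_coefficient


observed = 1.86   # observed variation coefficient quoted in the text
detection = 1.5   # hypothetical detection capability variation (assumption)

normalized = normalize_variation(observed, detection)
# With these inputs, the estimated variation of actual visits is about 1.24.
print(f"normalized coefficient: {normalized:.2f}")
```

Under these assumed numbers, most of the 86% observed increase would be attributed to better detection, and the estimated increase in actual visits would be about 24%.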

Depending on the changes that occur in our visits detection capability, different impacts are seen in the number of observed visits. For instance, an increase in partner applications increases the number of devices producing visits, while improvements in our detection algorithms may increase the number of visits observed for each device. Taking these aspects into account, we may define the visits observed in a time period i as:

$$V_i = D_i \, R_i \tag{8}$$

where Di is the number of devices that produced visits and Ri is the ratio between the number of visits and the number of devices. Similarly, the actual visits are defined as

$$\hat{V}_i = \hat{D}_i \, \hat{R}_i \tag{9}$$

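Combining Equations 2, 8 and 9, we can decompose the detection capability into two different components:

$$\mathrm{Det}_i = \frac{V_i}{\hat{V}_i} = \frac{D_i}{\hat{D}_i} \cdot \frac{R_i}{\hat{R}_i} = \mathrm{Det}^{D}_i \, \mathrm{Det}^{R}_i \tag{10}$$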

where DetDi defines the proportion of devices that are observed and DetRi defines the proportion of visits that are observed for each device. Furthermore, the detection capability variation can also be decomposed into these components:

$$C^{\mathrm{det}}_{ij} = C^{\mathrm{det},D}_{ij} \, C^{\mathrm{det},R}_{ij} \tag{11}$$

and the detection capability variation can be estimated by estimating its D and R components.
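To make the decomposition concrete, the sketch below splits the observed visits of a period into the number of distinct devices D and the visits-per-device ratio R, then checks that the product of the two component variations reproduces the overall visits variation. The visit logs and the helper `decompose_visits` are hypothetical. The sketch works on the observed quantities only, since the actual ones are unknown.

```python
def decompose_visits(visit_device_ids):
    """Split a period's visits into D (distinct devices) and R (visits per device)."""
    visits = len(visit_device_ids)
    devices = len(set(visit_device_ids))
    if devices == 0:
        raise ValueError("no visits observed in this period")
    return devices, visits / devices


# Hypothetical visit logs: one anonymous device id per observed visit.
period_i = ["a", "a", "b", "c"]                       # 4 visits, 3 devices
period_j = ["a", "a", "a", "b", "c", "d", "d", "e"]   # 8 visits, 5 devices

d_i, r_i = decompose_visits(period_i)
d_j, r_j = decompose_visits(period_j)

device_variation = d_j / d_i                           # variation of the D component
ratio_variation = r_j / r_i                            # variation of the R component
visits_variation = len(period_j) / len(period_i)       # variation of V = D * R

# The product of the component variations equals the overall visits variation.
print(device_variation * ratio_variation, visits_variation)
```

Separating the two components shows whether a change comes from more devices being observed (for example, new partner applications) or from more visits per device (for example, algorithm improvements).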

Estimation of variation coefficients

It is not possible to estimate the detection capability variation in a particular place by observing its visits alone. Fortunately, variations in the detection capability have a global effect on the visits variations, which allows us to estimate Cdetij of a specific place using the variations observed in other regions.

However, the effect caused by detection capability variations may vary significantly from region to region. Therefore, it is necessary to estimate the variations for different areas and combine them in the estimation for a particular place.

Regions of influence

The basic idea of our approach is to estimate the detection capability components of the visits variations of a place (or region), referred to as the “origin”, based on the so-called “regions of influence”. These geographic regions are chosen based on their influence on the visits that occurred in the origin region. The variations observed in the regions of influence are combined to estimate the variation components of the origin. In this way, we use the global effect of the detection capability variations to estimate their local influence.

We will now describe how we find the regions of influence of a given origin and how we define weights for them. The weights are used to combine the regions’ variations according to their importance in the observed variations. Figure 1 presents a diagram of our approach.

The first step consists in identifying the devices that visited the origin region in the period of analysis. These devices are then tracked, and all their visits are used to define the regions of influence. The assumption here is that only the regions visited by origin’s visitors would strongly influence future visits to the origin region.
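A minimal sketch of this first step, assuming each visit is available as a hypothetical `(device_id, region)` pair: the devices seen at the origin are collected first, and then every visit made by those devices is kept, wherever it occurred.

```python
def visits_of_origin_visitors(visits, origin_region):
    """Return all visits made by devices seen at least once in origin_region."""
    origin_devices = {device for device, region in visits if region == origin_region}
    return [(device, region) for device, region in visits if device in origin_devices]


# Hypothetical anonymous visit records for the period of analysis.
visits = [
    ("d1", "origin"), ("d1", "mall"),
    ("d2", "mall"),
    ("d3", "origin"), ("d3", "park"), ("d3", "mall"),
]

tracked = visits_of_origin_visitors(visits, "origin")
# Device d2 never visited the origin, so its visits are left out.
print(tracked)
```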

The regions of influence are then defined by spatially clustering the visits’ locations. This step can be performed by any clustering algorithm. A simple way of achieving this is to convert the geographic coordinates of the visits to geohashes with an appropriate precision, which amounts to a geohash aggregation of the locations. In our tests, we used geohashes with 6-character precision, which produce rectangular regions of approximately 1.2 km by 0.6 km. To simplify the analysis, regions with low visit counts were not considered.
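The geohash aggregation can be sketched as follows. The encoder is the standard geohash algorithm written out so the example is self-contained. The coordinates and the `min_visits` threshold are assumptions for illustration, since the actual cut-off for low visit counts is not stated here.

```python
from collections import Counter

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(latitude, longitude, precision=6):
    """Encode a coordinate as a geohash string with `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars, bits, value = [], 0, 0
    use_longitude = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_longitude:
            mid = (lon_lo + lon_hi) / 2
            bit = 1 if longitude >= mid else 0
            lon_lo, lon_hi = (mid, lon_hi) if bit else (lon_lo, mid)
        else:
            mid = (lat_lo + lat_hi) / 2
            bit = 1 if latitude >= mid else 0
            lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
        value = value * 2 + bit
        use_longitude = not use_longitude
        bits += 1
        if bits == 5:  # every 5 bits form one base-32 character
            chars.append(_BASE32[value])
            bits, value = 0, 0
    return "".join(chars)


def regions_of_influence(locations, precision=6, min_visits=2):
    """Group visit locations by geohash cell and drop cells with few visits."""
    counts = Counter(geohash_encode(lat, lon, precision) for lat, lon in locations)
    return {cell: n for cell, n in counts.items() if n >= min_visits}


# Hypothetical visit coordinates in São Paulo (latitude, longitude).
locations = [(-23.5614, -46.6559), (-23.5615, -46.6560), (-23.5505, -46.6333)]
regions = regions_of_influence(locations)
print(regions)
```

Geohashing trades flexibility for simplicity: cells have a fixed grid and size, but no clustering parameters need to be tuned and the cell identifier can be used directly as a grouping key.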

Figure 1 — Diagram of the visits variation coefficients estimation of an origin region.

Once the regions of influence of a given place are fixed, we associate each of them with a weight that is proportional to its importance. Here we consider a region important if it is frequented by many visitors from the origin.

For each region of influence RIi of a given origin, with RI = {RI1, RI2, …, RIN}, a weight wi is computed and associated with it. The weights are normalized to sum to one:

$$\sum_{i=1}^{N} w_i = 1 \tag{12}$$

Given the number of the origin’s visitors, Ki, that also visited the region of influence RIi, the associated weight is defined as

$$w_i = \frac{K_i}{\sum_{m=1}^{N} K_m} \tag{13}$$

Figure 2 shows the regions of influence for two different origins located in São Paulo. Observe that the number of regions of influence can vary dramatically depending on the origin.

Figure 2 — Regions of influence for two different origins located in São Paulo (Brazil). This image was generated using Uber’s Kepler geospatial analysis tool.
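The weighting of the regions of influence can be sketched as below, assuming the weight of a region is its count of distinct origin visitors divided by the total of these counts over all regions. The records, region names, and helper names are hypothetical.

```python
from collections import defaultdict


def count_origin_visitors(tracked_visits):
    """Count the distinct origin visitors (K) seen in each region of influence."""
    devices_by_region = defaultdict(set)
    for device_id, region in tracked_visits:
        devices_by_region[region].add(device_id)
    return {region: len(devices) for region, devices in devices_by_region.items()}


def influence_weights(visitor_counts):
    """Normalize the counts K so that the resulting weights sum to one."""
    total = sum(visitor_counts.values())
    if total == 0:
        raise ValueError("no visitors found in any region of influence")
    return {region: count / total for region, count in visitor_counts.items()}


# Hypothetical visits of the origin's visitors: (device_id, region) pairs.
tracked_visits = [
    ("d1", "region_a"), ("d1", "region_a"), ("d1", "region_b"),
    ("d2", "region_a"), ("d3", "region_a"), ("d3", "region_c"),
]

visitor_counts = count_origin_visitors(tracked_visits)
weights = influence_weights(visitor_counts)
print(weights)
```

Counting distinct devices rather than visits keeps a single very active device from dominating the weight of a region.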

Coefficients combination

As the detection capability variation of the origin is estimated by combining the variations estimated in the regions of influence, we first need to define Cdetij for each of them. We could estimate Cdetij by computing the visits variation in the region. However, as mentioned before, depending on the changes that occur in our detection capability, different aspects of the variation may change. As expressed in Equation 11, the visits variation may come from a difference in the detection of devices or from a variation in the visits produced by each device. For this reason, a better estimate of Cdetij is obtained by estimating Cdet,Dij and Cdet,Rij independently.

Given the number of devices observed in different time periods i and j, Di and Dj, the variation in the device detection capability can be defined as:

$$C^{\mathrm{det},D}_{ij} = \frac{D_j}{D_i} \tag{14}$$

Similarly, given the observed ratios between the number of visits and the number of devices, Ri and Rj, the variation in the visits-per-device ratio is defined as

$$C^{\mathrm{det},R}_{ij} = \frac{R_j}{R_i} \tag{15}$$

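Then, for each of the origin’s regions of influence, RIk, 1 ≤ k ≤ N, its estimated detection capability variation, Cdet,kij, is defined as

$$C^{\mathrm{det},k}_{ij} = C^{\mathrm{det},D,k}_{ij} \, C^{\mathrm{det},R,k}_{ij} \tag{16}$$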

Finally, we can define the origin’s detection variation as the linear combination of the variations estimated for the regions of influence using their weights:

$$C^{\mathrm{det}}_{ij} = \sum_{k=1}^{N} w_k \, C^{\mathrm{det},k}_{ij} \tag{17}$$

Observe that by using the weights to combine the detections variations, we assign the participation of the region of influence in the estimated value according to its importance for the visits that occur at the origin.
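The per-region estimation and the weighted combination can be sketched together. The sketch assumes the device and ratio variations of a region are the ratios Dj/Di and Rj/Ri of the observed values, and that the origin's estimate is the weighted sum over regions. All numbers and names below are hypothetical.

```python
def region_detection_variation(devices_i, ratio_i, devices_j, ratio_j):
    """Estimate a region's detection variation as the product of its D and R variations."""
    device_variation = devices_j / devices_i   # variation in observed devices
    ratio_variation = ratio_j / ratio_i        # variation in visits per device
    return device_variation * ratio_variation


def combine_variations(weights, variations):
    """Linear combination of the regions' variations using their weights."""
    return sum(weight * variations[region] for region, weight in weights.items())


# Hypothetical observations per region of influence for periods i and j.
observations = {
    "region_a": {"devices_i": 100, "ratio_i": 2.0, "devices_j": 150, "ratio_j": 2.4},
    "region_b": {"devices_i": 50, "ratio_i": 4.0, "devices_j": 60, "ratio_j": 5.0},
}
weights = {"region_a": 0.75, "region_b": 0.25}  # assumed to sum to one

variations = {
    region: region_detection_variation(**values)
    for region, values in observations.items()
}
origin_detection_variation = combine_variations(weights, variations)
print(variations, origin_detection_variation)
```

With these inputs, `region_a` contributes a variation of about 1.8 and `region_b` about 1.5, giving a combined estimate of about 1.725 that sits closer to the more heavily weighted region.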

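In this model, the combination of the detection capability variations of the regions of influence is the same as the combination of their visits variations. Combining Equations 14, 15 and 17:

$$C^{\mathrm{det}}_{ij} = \sum_{k=1}^{N} w_k \, \frac{D^{k}_{j}}{D^{k}_{i}} \cdot \frac{R^{k}_{j}}{R^{k}_{i}} \tag{18}$$

$$C^{\mathrm{det}}_{ij} = \sum_{k=1}^{N} w_k \, \frac{V^{k}_{j}}{V^{k}_{i}} \tag{19}$$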

where Vki and Vkj are the visits observed in the region of influence k, in times i and j, respectively.

Equation 19 requires visits to be observed in all the regions of influence in time i; otherwise, a term divides by zero. Since these regions are not necessarily fixed and may change over time, we need a model that remains well defined when a region has no observed visits in time i. For this reason, instead of directly combining the visits variations of the regions of influence, we combine the visits observed in times i and j and then compute the variation observed in the combined visits:

$$C^{\mathrm{det}}_{ij} = \frac{\sum_{k=1}^{N} w_k \, V^{k}_{j}}{\sum_{k=1}^{N} w_k \, V^{k}_{i}} \tag{20}$$

As presented by Equation 7, the estimated variation coefficient can be used to normalize the variation observed in the visits of a specific place. Instead of trying to estimate the detection capability itself, the normalized coefficient can be used to analyze the variations observed in the visits regardless of the detection capability variations. This kind of analysis removes the effects of changes that occurred in the visits detection process, allowing a more accurate analysis of the observed flow in places and regions.
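The combined-visits estimate and the final normalization can be sketched as follows, assuming the detection variation is the ratio between the weighted sums of the visits observed in the regions of influence in the two periods. The weights, visit counts, and the origin's observed coefficient are hypothetical values.

```python
def combined_detection_variation(weights, visits_i, visits_j):
    """Ratio of the weighted visits in period j to the weighted visits in period i."""
    weighted_i = sum(w * visits_i.get(region, 0) for region, w in weights.items())
    weighted_j = sum(w * visits_j.get(region, 0) for region, w in weights.items())
    if weighted_i == 0:
        raise ValueError("no weighted visits observed in the reference period")
    return weighted_j / weighted_i


weights = {"region_a": 0.5, "region_b": 0.3, "region_c": 0.2}
visits_i = {"region_a": 200, "region_b": 100}                  # region_c: no visits in period i
visits_j = {"region_a": 300, "region_b": 150, "region_c": 50}

# A per-region ratio would divide by zero for region_c; the combined form does not.
detection_variation = combined_detection_variation(weights, visits_i, visits_j)

origin_observed_variation = 1.86  # hypothetical observed coefficient of the origin
normalized_variation = origin_observed_variation / detection_variation
print(f"detection={detection_variation:.3f}, normalized={normalized_variation:.3f}")
```

The combined form only fails when every region of influence has zero weighted visits in the reference period, which is a much weaker requirement than observing visits in each region individually.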

To validate the proposed normalization method, we used reliable data about people’s flow in some commercial places. These data were provided by partners who are able to accurately measure the visits that occurred in their facilities. The visits variations observed in these data were used as ground truth.

Figure 3 presents an example of the comparison between the standard and normalized variation measures observed in a place and the corresponding ground truth. The variations were defined on a daily basis by computing the visits variations relative to a fixed initial date (as in Table 1). In this case, the initial date was fixed at April 2, 2018 (the first Monday of April).

Figure 3 — Example of normalization of the visits observed in a certain place.

Note that an increase in visits was observed at the end of July (black curve). Similarly to the data presented in Table 1, this increase in the visits numbers comes from the increase in the visits detection capability. This is evidenced by the variations observed in the ground truth (blue curve), which did not show this behavior.

The normalization method is able to correct this by removing the detection capability variations components of the observed visits. The resulting normalized variations (red curve) are close to the ground truth, showing that the flow behavior was preserved regardless of the detection capability variation.

Are you interested?

If you are interested in building context-aware products through location, check out our career page. Also, we’d love to hear from you. Leave a comment.

Hector Pinheiro: Researcher at Incognia, PhD - Machine Learning and Signal Processing.