In computer science in general and in data science in particular dealing with continuous values is sometimes tricky. There are plenty of ways to understand a given dataset and many more of extracting the value of it. However, in this entry, we will focus on an application of using discretization and clustering with spatial data. More precisely, we will use the DBSCAN algorithm and the MGRS representation with a small GPS dataset.
But, wait!! DBSCAN and what else ?
Well, let me introduce you some concepts first (skip until the following header otherwise). Discretization and clustering are well-known procedures exported from the data mining field to many others. A formal definition of discretization might be “the reduction of value numbers in a continuous variable by grouping them into intervals” a further explanation may be found at [Web]. Regarding clustering, a formal definition might be “detection of registry groups that share similar properties” more details can be read from [Web]. Their purpose is to reduce the amount of data into smaller subsets that contain more relevant information.
MGRS is a coordinate system based on the UTM and used by the NATO militaries. In the figure 1.2 can be seen the grid and the code of the cursor in the right bottom corner. As a grid, it divides the map several times depending on the chosen resolution. For instance, using 4 digits we can obtain a precision up to 10 meters. A code example can be seen in the next figure:
Figure 1.2: Visualization of MGRS in the area of Spain.
The main reason behind using MGRS is because it allows discretizing data by defining a specific level or resolution. There is an arguable loss in the process although the noise and redundant information removed make very advantageous to use it.
DBSCAN was presented back in 1996 in the University of Munich, Germany [Web]. DBSCAN is a density-based clustering algorithm which resulting clusters depends on the minimum number of points and the distance among them. In contrast with other clustering algorithms, DBSCAN only takes three parameters as input: the number minimum points, the distance, and the distance function. In the figure 1.3 can be seen the different output produced by some clustering algorithms compared to DBSCAN.
Its authors stated that the algorithm had an average run time of O(n∗logn) way better than the spatial clustering algorithms at that time of O(n 2 ). However, this complexity can be achieved under special conditions by using R-trees, which are hierarchical data structures used for the organization of geometric objects within a subset of d-dimension rectangles. As a result, it has been some discussion about its performance as can be read in the articles. The lower bound was proved to be O(n 4/3 ) as can be read in the proof wrote in article.
The details of these processes are more complicated, and the overall result might be expressed in terms of entropy. If you are curious about entropy, a further analysis into the entropy-based discretization or the always exciting information theory field might be worthy to consider [Web].
From this point, we will take the benefits of these techniques as granted and we will focus on its impact.
Figure 1.3: Comparison of DBSCAN with K-Means family. Extracted from
Ok, let’s come back to our main objective…
As we have said before we deal with geolocated data and more precisely with GPS data. In this case, we have access to a reduced dataset of events that contains the position and time. (Please note the data provided in this example was manually generated and any possible similarity is mere coincidence).
The raw GPS data contains the user, location and time each event was taken. Analyzing the raw data we might find out that it came from a young geoblinker that lives in Torrejon. A screenshot of qgis can be seen in the figure 1.4.
Looking at the image we can observe two events that have been highlighted, they are nearby as might be some other events. Therefore, why do not we apply the MGRS to reduce the number of very close points?
In order to appreciate the difference in precision, we will visualize at 10 meters and 100 meters of resolution. An MGRS cell is represented as a shapefile or box in the map, although we will use its centroids as it easier to understand. As we can observe in the figure 1.5, the 10m resolution is a balanced trade-off between error and reduction. The 100m resolution is too big to be able to represent properly the locations without produce a notable error. This small experiment contains just a 3 few events, nevertheless, the proportion of reduction is about 42% with the 10 meters resolution and 65% with 100 meters resolution. The reduction depends not only on the quality and quantity of the events but also on others factors such as the trip nature.
Figure 1.4: Visualization of raw gps data. Projection performed with qgis
Figure 1.5: Different resolutions in the MGRS representation. raw(left) 10m(center) 100m(right)
On the other hand, we may use the DBSCAN to obtain the stay locations of the events. As we said previously, DBSCAN needs three parameters which in our case will be 20 Meters as epsilon (transforming the distance), 8 minimum points and haversine distance function.
Figure 1.6: Resulting clusters with 8 minimum points and 20 meters. raw(left) dbscan(center) centroids(right)
Looking at the result, that can be observed in the figure 1.6, two registries have been obtained. One logical step might be aggregate the sequential timestamps to obtain the time spent in each location as well as the rest of the information in the centroids. Reviewing the new features of the centroids, we could state that the young geoblinker was in two places during the same day. Going into detail, our friend was in the morning in a residential area which likely means he or she was at home. In the afternoon the geoblinker went to the shopping mall most likely to the car garage to check the car.
Concluding we have reviewed some concepts and techniques to approach data mining to spatial data. In this case, we face an easy scenario which was enough to show the strengths and weaknesses of each purposed method. DBSCAN is a powerful algorithm that allows us to retrieve easily the stay location choosing the appropriated parameters, the downside is the lack of good libraries and the penalty in the performance which is not linear as can be obtained using a solution based on MGRS.
The takeaway here is reminding the golden rule in Data Science: there is not a holy grail and the application of one method or another depends on each case… so we even might use several if it is required!