Data Science: Stay Point Detection 101

Published in

tb.lx insider

11 min readJul 17, 2024

Where and when you stop says a lot about you.

Modern cars and trucks are packed with sensors to measure the state of their most important variables: location and speed. Location and speed data usually originate from the GPS receiver, while speed data may also come from the vehicle’s speedometer. These vehicles generate many more signals, such as the fuel tank capacity, the high-voltage battery charge for electric vehicles, the vehicle’s mileage, and an assortment of other specialized signals to assess the vehicle’s health and status.

It is now quite common for vehicles to stream this data over the air to the brand or owner’s cloud servers for intelligence extraction. This is a well-known fact of life; if you own a new car, this is your daily reality. But have you ever wondered how your vehicle’s data becomes valuable knowledge and how to use it?

In this article, we’ll investigate a specific method of extracting useful information from vehicle telematics: the sensor data streamed from your vehicle to the cloud. You will see how some simple data transformation steps can provide powerful insights for drivers, fleet managers, and fleet owners.

Trajectories

Moving objects, like cars and trucks, leave a trace in space and time that we call a trajectory. While we can think of a trajectory as a continuous line in the mathematical sense, we never use it as such. Instead, we use points sampled from the trajectory line as a discrete representation of it. With this sampling, we can more efficiently represent a trajectory in the digital realm, both in computer memory and in external databases.

A spatial trajectory is a trace generated by a moving object in geographical spaces, usually represented by a series of chronologically ordered points, e.g., 𝑝1 →𝑝2 →⋯→𝑝𝑛, where each point consists of a geospatial coordinate set and a timestamp such as 𝑝=(𝑥,𝑦,𝑡).

Zheng, Yu. Trajectory Data Mining: An Overview, ACM Trans. On Intelligent Systems and Technology, Vol. 6, №3, Article 1, Pub. date: Sept. 2015.

According to a reference paper on the subject [1], we can divide a trajectory into three significant parts: the trajectory endpoints, moves, and stops. A move is a contiguous sequence of points from the trajectory where the vehicle speed is faster than a given value, say 3 km/h. This criterion helps to filter out the well-known GPS drift that occurs when the vehicle is stationary. When not moving, the car is in a stop state. Although trivial, this concept packs a punch.

This is an image showing that a trajectory contains sequences of moves and stops. This is shown with GPS points on a trajectory. — **Figure 1** — A trajectory contains sequences of moves and stops.

The picture above illustrates the concepts we have just discussed. Following the temporal line below, we see the distribution of GPS samples in space. Note how the sequence of samples during the stop clumps together with some variation due to the GPS drift.

Segmentation is splitting a trajectory into sequences of moves and stops. After cleaning the data, this segmentation process is vital to identifying the stops or stay points, the cornerstone of the method described herein.

Stay points and their sequencing convey a significant amount of helpful information on vehicle and fleet behavior.

What is a stay point?

But what is a stay point? A stay point is a geospatial region where vehicles frequently stop nearby for a minimum amount of time. Trucks stop at these locations for various specific reasons: refueling, rest, meals, work, or even due to heavy traffic conditions (although these should be rare).

We detect stay points using data generated by the trip segmentation process. A density-based clustering algorithm picks the “stop” data and uses it to create geographic clusters, discarding all other points as noise. The cluster definition is translatable to a geo-fence polygon that defines a classification criterion: points inside the geo-fence belong to the cluster while all the others don’t.

The number of visits to a specific place measures its importance and relevance to the fleet.

Why do we need stay points?

While the trip defines how the truck goes somewhere, the stay point helps explain what it is doing there. Stay points define the trip’s semantics and meaning.

Understanding the nature of the stop location can be relevant. Usually, these depend on the circumstances. For instance, a fuel station can serve several purposes, such as refueling, resting, or eating. We can recognize a fuel station (or a recharging facility) by inspecting the variation of the fuel tank level (or the battery state of charge). The vehicle weight variation may also help to identify locations where the truck loads or delivers cargo.

Identifying other stop classes is possible by using some inference from existing information. For instance, fuel stations can double as leisure locations because drivers also use them for meals and to rest. Determining this requires detecting a stop in a known fuel station where the fuel tank level does not change.

Inference and Prediction

We can measure several relevant features from a popular stop that will help with assumptions and predictions later. Timing information is appropriate when characterizing specific stop locations. Frequently occurring entry and exit times may help detect abnormal or unexpected events. It is also necessary to measure the permanence duration, as it can help discover leisure or rest-related stops. We must also consider fuel level changes or consumption patterns during the stay. During the visit, the driver may keep the engine running for comfort purposes (air conditioning).

What do we use stay points for?

Stay points are the limits of trips, where they begin and end. By uniquely identifying each stay point, we can also name the trips that start and end at the same stay points. Thus, all trips starting at stay point A and ending at B can be grouped and studied. Assuming these trips follow similar trajectories, they should also have similar behaviors, and if not, they will quickly become visible.

This is an image showing that all trips between between two stay points (A and B) are identifiable by their endpoint’s names. — **Figure 2** — All trips between stay points A and B are identifiable by their endpoint’s names.

By aggregating the variables we want to study over the trips with similar endpoints, we can collect valuable statistics that allow us to make predictions and evaluate the exceptionality of a given trip.

We can also study the sequences of stay points along the trips to uncover behavioral commonalities. Sometimes, vehicles will perform predictable sequences of stops due to business practices. Any departure from these known and repeatable sequences quickly becomes noticeable. The following image depicts a frequent behavior in the vehicle’s routes, a cycle (ABEG).

This is an image showing a trip with various end points, with letters A-F, showing what a loop (ABEG) looks like. — **Figure 3** — The trip graph contains a loop (ABEG).

Anomaly Detection

It is possible to detect anomalies in the known use pattern for stop locations. For instance, all timing-related information is helpful for an anomaly detection system that flags unusual entry and exit times and unexpected durations.

Recommendation

Knowledge of place semantics enables the implementation of recommendation use cases. For example, we can conceive of a scenario where an automated system recommends fuel stations, resting facilities, or even meals to drivers on the road. For such a recommendation system to work, it would need to know the vehicle’s destination, current location, and some contextual information such as fuel consumption and fuel level.

Prediction

We can use the frequency of stay point sequences to predict destinations or assess the frequency of a given sequence. This last case helps detect anomalies on frequent routes, like when an expected stop is missed or a sequence is reversed. The picture below illustrates the process of determining the frequency of stop sequences.

This image llustrates the process of determining the frequency of stop sequences. — **Figure 4** — We can calculate the corresponding probabilities, predict destinations, and detect abnormal behaviors by computing the triplet sequence frequencies.

To determine these frequencies, we need only count the number of three-stop sequences, as depicted above. We group the sequences using the first two stops and the third as the destination. In the above case, the A-B-C sequence is the most popular, so when a vehicle travels from A to B, you can infer with an 80% probability that it will go to C next. If, on the other hand, the vehicle travels to D, you know that this is a novelty and probably needs further inquiry.

We can also combine the expected frequencies to predict the probability of a more extensive sequence. For instance, we can combine the A-B-C probability with B-C-* and determine the most likely stop location after C through a probability combination (multiplication). You would be on the right track if this reminds you of a Markov process. The article below explores this technique using the digital map nodes recovered from map-matching.

Map-Matching for Trajectory Prediction
Where are you going? Should you be going that way?towardsdatascience.com

How to detect Stay Points?

As expected, the detection process starts with the telematics data. Our first job is to make a low-confidence guess where the stop locations might be. We do this by using simple signal guesses that the vehicle is stopped. These guesses usually include the vehicle speed being below a set threshold for a given period and possibly composed of other signals such as the parking brake or the ignition. When we finish this first step, we end up with sequences of data points for each candidate stop.

This image contains a map that shows all the stop candidates as clusters of blue dots. The least dense areas will later be removed. — **Figure 5** — The map shows all the stop candidates as clusters of blue dots. The least dense areas will later be removed.

Once we have this first set of guesses, we proceed to a second step, which involves grouping the stop locations using a density-based clustering algorithm: HDBSCAN.

Clustering

For each guessed stop sequence of the first step, we only keep the first data point to avoid biasing the clustering algorithm into giving more importance to infrequent long stops. We want to determine where vehicles stop frequently and nearby, and each stop sequence must count as an individual occurrence.

The HDBSCAN algorithm is a hierarchical version of the original DBSCAN algorithm that suffered from the limitation of requiring a minimum proximity distance parameter provided as an input. With the hierarchical version, we only need a minimum number of observations per cluster.

Geographic Clustering with HDBSCAN
How to explore geographic data with HDBSCAN, H3, graph theory, and OSM.towardsdatascience.com

Both algorithms have the nice feature of not needing the expected number of clusters; they determine that automatically. Both algorithms also partition the input space into individual clusters and noise. Although we can discard the noise points immediately, we must still fine-tune the clusters as they may also include points that we should discard. The latest version of HDBSCAN outputs an estimate for each cluster point of how much it belongs to the cluster, expressed as a probability. We then decide what to keep based on this estimate.

The following image illustrates the output of a sample cluster with the location probabilities displayed as colors, where red is the lowest.

This image contains a map that is displaying a cluster with all the retained points classified by color according to their cluster-belonging probability. Red dots have the lowest probability and should be removed. — **Figure 6** — The map displays a cluster with all the retained points classified by color according to their cluster-belonging probability. Red dots have the lowest probability and should be removed.

Cluster Geofencing

The cluster outputs from the HDBSCAN algorithm are just sets of data points grouped by assigning a unique cluster identifier (an integer). As we have seen from the previous picture, there is no defined shape around these points. If a truck stops in this area, how can we know if it is inside the cluster? The algorithm does not provide a shape for us to use; it merely estimates the probability of the new location belonging to the cluster. Unfortunately, this requires keeping the model fresh and in memory, which might not be suitable for production use.

The solution we present here is to infer a shape for the cluster and use it as a discriminator. If the new location lies inside the shape, then it belongs to the cluster, and we can assume that the vehicle is stopped there for the same reasons all the others did.

A geofence is a polygon limiting a region of interest, our cluster in this case. We now examine a method for creating the geofence from the cluster locations.

The image below shows a large cluster with an irregular shape.

This image shows a collection of close locations in a cluster (represented as red dots) before a geofence has been drawn around them. — **Figure 7** — A cluster is defined by a set of close locations. How can we turn this cloud of points into a geofence?

The first step is to place a tight concave hull polygon around it. You can read more about what a concave hull is and how to calculate it in the article below:

The Concave Hull
Creating a cluster border using a K-nearest neighbors approachtowardsdatascience.com

This image shows what a concave hull geofence looks like around the cluster of close points. In this case, it sits tightly on the external locations shown on this map. — **Figure 8** — The concave hull polygon sits tightly on the external locations. But is this what we want? The polygon is complex, costly to store and query.

The concave hull polygon fits tightly around the cluster’s most exterior points, but this is not ideal for two reasons. First, we should expect some “buffer” outside the cluster’s geofence to account for some variability in the cluster inclusion detection. A new location one millimeter outside the cluster would be classified as not belonging to the cluster, which does not seem fair. Secondly, count the number of vertices of the above polygon. To detect inclusions at scale with such complex shapes, we require specialized databases and algorithms that add to the overall solution’s cost.

This image shows what happens when you inflate the concave hull, which is not a positive experience. Inflating it means that there are even more polygon points to store and that one must determine the length by which to inflate the polygon. — **Figure 9** — Inflating the polygon only makes matters worse. We get even more polygon points to store and must, somehow, determine the length by which to inflate.

We can still “inflate” the concave hull with an arbitrary length, but this also has some issues. Due to the rounder shapes, we are increasing the number of polygon edges, and we must somehow find a good value for the inflated length. See the article below for a bit more detail on this topic.

A Metric for HDBSCAN-Generated Clusters
How can we determine the equivalent DBSCAN ε parameter for HDBSCAN-generated clusters?towardsdatascience.com

Our solution is to use the H3 geospatial partitioning and indexing tool to convert the concave hull polygon into a small set of hexagons.

Geospatial Indexing with Uber’s H3
Hexagon power!towardsdatascience.com

H3 is very convenient as it converts the complex polygon into a very small set of hexagons of arbitrary size (you can choose from 16 levels) with the benefit that each hexagon maps to a single sixty-four-bit integer.

This image shows what happens when you convert a concave hull polygon into the overlapping H3 hexagons of a chosen dimension. Each hexagon corresponds to a unique 64-bit integer, making storage and querying extremely inexpensive. These hexagons are shown on the map in a light blue color. — **Figure 10** — The final solution is to convert the concave hull polygon to the overlapping H3 hexagons of a chosen dimension. Each hexagon corresponds to a unique 64-bit integer, making storage and querying extremely inexpensive.

Our example cluster gets converted to an array of forty integers, which we can neatly store in any database, even without support for geospatial computing. To detect if a truck entered the cluster, we need to convert the truck’s location to the corresponding H3 value and search for it in the database. If not found, the truck is outside any cluster. If found, you got your cluster!

Conclusion

This article revealed the importance of knowing your fleet’s stop locations or stay points. These are the markers on your operational map, and their sequences and frequencies speak volumes about how your business behaves.

Detection starts with the truck’s trajectories, then a simple stop detection algorithm, density-based clustering, and geofencing. In this case, we used H3 as a post-processing step to dramatically reduce the cost and complexity of cluster storage and geospatial detection.

This article was written by João Paulo Figueira, Data Scientist at tb.lx 🚛🌿

References

[1] — Spaccapietra, S., Parent, C., Damiani, M. L., De Macedo, J. A., Porto, F., & Vangenot, C. (2008). A conceptual view on trajectories. Data & Knowledge Engineering, 65(1), 126–146. https://doi.org/10.1016/j.datak.2007.10.008

[2] — Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD’96). AAAI Press, 226–231. https://dl.acm.org/doi/10.5555/3001460.3001507

🚛🌿 If you’d like to know more about how we work at tb.lx, our company culture, work methodologies, tech stack, and products you can check our website and join our journey in creating the transportation solutions of tomorrow through our social media accounts: LinkedIn, Instagram, Youtube, Twitter/X, Facebook. 💻 🔋