Our framework for real-time filtering of location streams
One of the biggest challenges with continuous location tracking is dealing with volatile quality of smartphone’s GPS readings. Numerous factors affect GPS accuracy such as:
- Quality of GPS receiver
- Source of signal (GPS, WiFi, cell tower triangulation)
- Environment (weather, skyline visibility, enclosed spaces, multipath reception)
- Device state (low power mode, flight mode, initial fix)
Due to the error introduced by all these factors it becomes essential to carefully process the location stream in order to accurately predict the path taken by a driver.
At HyperTrack, we have developed half a dozen filters through which a driver’s location stream is processed before our business users visualize it on a dashboard and end customers visualize it within their product experience. In this blog, I want to brief you on how we designed our filtering pipeline and how we built a validation framework around it.
At a high level we have two categories of filters. Binary filters (yes/no) to remove noisy GPS readings and adjustment filters to fine-tune GPS readings for smoother polylines and accurate distance calculation. Binary filters look at attributes of the GPS reading (source, speed, bearing, accuracy, timestamp) and apply heuristics around distance, acceleration, clustering, stationary vs moving, etc. to weed out inaccurate GPS readings. Adjustment filters are more algorithmic & resource intensive in nature. While each of those are worthy of a blog post on their own (coming soon), I will provide a brief overview of our adjustment filters:
- Kalman Filter: This uses the past location stream to predict the next GPS coordinate, which is then weighted-averaged (using noise) against actual GPS reading to derive the final coordinates. This is compute intensive and requires past readings to build historical models. We find it effective in correcting errors in noisy data.
- OSM Filter: This uses publicly available OpenStreetMap road data and maps our GPS traces to nearest possible roads. This is the last filter in our pipeline and takes in GPS traces that have been filtered by all the other filters, and snap them onto a road. This is data intensive and requires lot of DB querying but we have optimized it to run in tens of milliseconds. Following screenshot shows how we take raw GPS readings (red markers) and map them onto roads (blue markers).
Let us look at some of our design constraints. We have hundreds of drivers sending thousands of GPS readings through our driver SDK. It is imperative to process this incoming location stream quickly and efficiently so we can display near real time status of these drivers. Every non-essential task (including filtering) needs to be performed asynchronously to have minimum turn around time in order to conserve battery on driver’s phone. We support offline mode where drivers need to continue operating their apps through patchy or no network areas for extended periods of time. When the network is back up, their accumulated locations are sent to the server in parallel batch calls. Hence the server can’t rely on receiving messages in the order they were recorded and should be able to process out of order location streams.
To satisfy these constraints, we needed an in-memory datastore which can maintain ordering of messages. We chose a combination of Redis SortedSet & Hashes to store hot data i.e. last X hours of GPS readings for every driver we are tracking. SortedSet supports ordering by timestamps and has an interface for retrieving keys before and after a given key (context is required for filtering). Hashes are great to store actual objects for quick lookups. Finally, we are using Celery with CloudAMQP for asynchronous processing.
Everytime GPS readings are posted to the server, we write them to Redis and put a filtering task on Celery. The task gets picked up by a worker which builds the filtering context and runs through all the filters in our pipeline. Ultimately the result is updated in Redis and our persistent store PostgresDB. We care about performance and have optimized this workflow to run in hundreds of milliseconds end-to-end.
Look at the example below where a driver’s smartphone is sending us noisy GPS readings (on the left) and how we filtered it (on the right) to make sense out of his trip.
We have also built a validation framework to benchmark effectiveness of our filters. We chose a test-suite of ~10k trips and compared how far our filtered points are from where they ideally should be. Keeping this as a benchmark we always run the suite before deploying a substantial change, and only go ahead when the results are trending into positive direction. We also use the test-suite to fine tune the configs on existing filters or tweak our algorithms. It helps us iterate rapidly in that we can build a new filter, tweak it, measure the impact and make deploy/discard decision in a short amount of time. We recommend this data oriented approach while optimizing for a fuzzy problem.
We are having fun building the location tracking platform, hope you will enjoy using it as well. Sign up with HyperTrack to try out our APIs. And stay tuned to learn more about our internals!