Geospatial Fun with Trip Data and Postgres, Part 1

Sergei Kozyrenko
aiincube-engineering
5 min readMay 1, 2020

A big part of Parknav’s operations consists of collecting parking information and validating our AI street parking predictions in the real world. Members of our data collection team arm themselves with our data collection app and go out into the city on foot, on bicycles and by car. The app continuously tracks and records their GPS location, matches it to the street they are on and allows entering different types of information associated with their current location — from parking restrictions to number of available parking spots on the street.

Data collection can be a lot of fun

As a result, we end up with huge number of GPS locations in our Postgres database, each one representing a point in the path that our collector took while navigating through the city. Over time, we have come up with numerous queries that allow us to quickly validate and analyze these resulting paths, and we think they will be valuable to others working with the similar datasets. Same analysis would also apply to data coming from turn-by-turn navigation apps, FCD (floating car data), your running / biking app and many others.

Assume your geospatial data on recorded collector paths is stored in a table like this:

create table gpspoint
(
id uuid constraint gpspoint_pkey primary key,
date_time bigint not null,
geom geometry(POINT, 3857) not null,
path_id uuid not null,
collector_id uuid not null,
...
);

where date_time is in millisecond and represents the date and time when point was recorded, geom is a Cartesian geometry with EPSG:3857 projection, and path_id is the same ID for all points in the same collector path (trip).

We typically use geometry, rather than geography for storing geospatial coordinates — Cartesian math is significantly faster for our use cases. This means that latitude and longitude end up being projected onto a plane, which causes some distortion depending on the chosen projection. We use a separate database for each city we operate in, and this keeps the distortion between any two points within the database pretty small, especially when using a “local” projection tied to a specific municipality or a state, rather than a world-wide projection like ESPG:3857 (Web Mercator). This does, however, make the queries a bit more complex as you’ll see below.

Looking at time and spatial distance between consecutive GPS points lets us know which paths might have had issues with GPS recording — for example, GPS signal was lost, data collector was going through a tunnel and so on. In a perfect recording, both spatial and time differences will be small, and the longer they get, the more GPS issues were encountered. Our goal is to find paths that have larger than normal distances or time differences between any two consecutive GPS points.

Typical recording of map-matched GPS points
Slightly “jumpy” recording

Window functions in Postgres help tremendously here. They allow us to operate on a partition of table rows (similar to group by), yet retain all rows in the output instead of getting back one aggregated output row. At the same time, we can rank the elements within the partition and get a hold of neighboring rows for every row in the partition. The partition for us is natural — by path_id, so that each collector path ends up in a separate partition.

We can analyze the entire table and sort the paths by either largest spatial or time difference between consecutive points like this:

select
min(obs_time)::timestamp(0) as started_on,
max(obs_time)::timestamp(0) as finished_on,
round(max(distance_delta)) as max_distance_delta,
max(time_delta)/1000 as max_time_delta,
path_id from
(select
path_id
,
to_timestamp(date_time/1000) as obs_time,
ST_Distance(geom, lag(geom) OVER w) as distance_delta,
date_time - lag(date_time) OVER w as time_delta
from gpspoint
window w as (partition by path_id order by date_time asc)
) deltas
where distance_delta > 0 group by path_id order by max_distance_delta desc;

The inner SELECT statement creates a WINDOW w of all points within the same path_id — essentially, all points belonging to the same path in the same order they were recorded. This is achieved via (partition by path_id order by date_time asc). We then use the lag() function to get a preceding row for calculating the spatial and time differences between each consecutive pair of rows. Finally, we group the results by path_id again and output the time when path recording was started, when it ended, and max spatial and time deltas between any two consecutive GPS points. This immediately lets our data analysts see whether any paths had any significant issues.

Keeping spatial distance small

The resulting max_distance_delta here will be in EPSG:3857 “meters”. A lot of times it’s good enough for internal analysis, but for cases where we do need to report distances in real meters, we have to figure out the rate of distortion between the Cartesian projection and ESPG:4326 (latitude/longitude). This can be done by taking a random point out of our table of GPS points (this works because we keep a separate database per city so points are not spread out all over the world), moving it by one meter in EPSG:4326, and then seeing what the difference was in ESPG:3857. Finally, we’d apply the ratio to the distances reported by the ESPG:3857 queries to get the final result.

The entire query now looks like this:

with distortion as (
select ST_Distance(geom,
st_transform(
st_project(st_transform(geom, 4326), 1, 0)::geometry, 3857))
as meter_distance from gpspoint limit 1
)
select
min(obs_time)::timestamp(0) as started_on,
max(obs_time)::timestamp(0) as finished_on,
round(max(distance_delta / meter_distance)) as max_distance_delta,
max(time_delta)/1000 as max_time_delta,
path_id from
(select
path_id
,
to_timestamp(date_time/1000) as obs_time,
ST_Distance(geom, lag(geom) OVER w) as distance_delta,
date_time - lag(date_time) OVER w as time_delta
from gpspoint
window w as (partition by path_id order by date_time asc)
) deltas, distortion
where distance_delta > 0 group by path_id order by max_distance_delta desc;

Till next time!

--

--