How to leverage Geopandas for faster snapping of points to lines
Needing to correctly align one set of data with another is a common task in many geospatial analyses. One example of this is snapping a set of locations to a linear network. This is usually an essential step in network analyses because you need to know the exact location of that thing on a given network. For instance, in order to calculate the travel distance to the nearest electric scooter, you need to know where it is within a city’s grid network of streets. If you need to know how many miles of river are upstream of a dam, you need to know where that dam is located on the river network.
In both cases, there is usually some measurement error in the location of the scooter or the dam, and it may not correctly align with the network. Or the location is correct, but the data source for the network has measurement error. Or both are wrong, which is often the case. Take your pick.
To get around this issue, you need to “snap” the locations to the nearest point on the network. This involves a few steps:
- Find the nearest line to the original location
- Find the nearest point on that line to the original location
- Move the original location to that point
- Proceed with the network analysis
This is relatively straightforward for a few points and lines but can be a slow operation if you have to do this with lots of data. In my case, I needed a reasonably fast way to snap the locations of hundreds of thousands of dams and culverts against aquatic networks with millions of line segments. I have been working with the Southeast Aquatic Resources Partnership to help analyze the relationships between aquatic networks and aquatic barriers like dams as part of building the Southeast Aquatic Barrier Prioritization Tool. In order to analyze aquatic connectivity, we first needed to overcome the fact that the locations of many dams were not actually on the latest available aquatic network data from the National Hydrography Dataset. You would think everything would be in the correct place, right? Nope.
Using Geopandas to snap points to linear networks
My tool of choice for processing these data is Geopandas, which provides a very nice interface for working with geometric data in Pandas in Python. Geopandas represents every feature — such as a dam — as a row with both attribute data, like the name of the dam, and the associated geometry representing its location.
One of the most powerful features is the ability to apply
an operation to each row in the dataset. If done properly, this is significantly faster than looping over each row and performing a similar operation. Sofia Heisler does a great job of describing loops vs apply
. However, I’ve found that correctly leveraging apply
and related approaches involves thinking about the problem inside out — which usually means I spend brain hours trying to figure out how to reduce CPU hours.
1. Use with a spatial index and find nearby lines
In order to find the nearest lines on a network to a given location, you need a spatial index. Otherwise, you would have to compare every location to every line — even the lines far away — and waste precious time calculating distances that are not useful. A spatial index provides you with the ability to quickly select only those features that overlap a certain bounding box, which greatly cuts down on the number of comparisons you need to make.
In our example, we need a spatial index on the lines we are snapping to. In Geopandas, this is created using the sindex
properly of a GeoDataFrame
called lines
:
# this creates and also provides us access to the spatial indexlines.sindex
Now we can query this spatial index using a bounding box to get a list of the lines that overlap that bounding box. Great!
Oh wait — we’re dealing with points — and points don’t have bounding boxes! What are we to do?
We can generate bounding boxes by applying an offset
to the bounds
property of a GeoDataFrame
called points
:
offset = 100bbox = points.bounds + [-offset, -offset, offset, offset]
Let me break this down:
points.bounds
is the GeoPandas accessor for the bounds of a geometry — even points — which gives back 4 columns:
minx miny maxx maxy
0 1394202.875 165652.562500 1394202.875 165652.562500
1 1393760.375 165302.625000 1393760.375 165302.625000
...
- adding a list of values to it that is the same number of columns allows us to quickly add or subtract values from each column. In this case, we subtract
tolerance
from theminx
andminy
columns, and add tolerance to themaxx
andmaxy
columns. Now we have values that will work for a bounding box that is +/-tolerance
around our points. - we are using
offset
as the distance within which we are willing to snap points to a line, and beyond which we consider them too far away and drop them from the analysis.
For a tolerance of 100 (meters, in the spatial projection of our data), this gives us:
minx miny maxx maxy
0 1394102.875 165552.562500 1394302.875 165752.562500
1 1393660.375 165202.625000 1393860.375 165402.625000
...
Now, we can apply
an operation to this bbox
to find get a list of the lines that overlap:
hits = bbox.apply(lambda row: list(lines.sindex.intersection(row)), axis=1)
Getting into the full details of how apply
works is pretty involved, so I’ll keep this short:
lambda row: …
is an inline function that is called for every row inbbox
axis=1
means apply this operation based on the values in the columns of each row, in this case[minx, miny, maxx, maxy
]lines.sindex.intersection()
queries the spatial index on the lines and gives back a list of line indexes. The spatial index itself provides a generator, so we use alist
to extract those values all out at once. NOTE: it does not give back theindex
on the linesGeoDataFrame
, if you have one defined, it gives back the ordinal index — position in the list — of each line. This will make a bit more sense below.
This gives us:
...
3 []
4 [71]
5 [71, 42, 61, 38]
...
This shows that the 5th point has 4 lines that might be within tolerance
distance to that point. This still isn’t in the most usable structure yet, so we need to transform the data into a longer set of rows:
import pandas as pd
import numpy as nptmp = pd.DataFrame({
# index of points table
"pt_idx": np.repeat(hits.index, hits.apply(len)), # ordinal position of line - access via iloc later
"line_i": np.concatenate(hits.values)
})
This gives us a new, longer DataFrame
that has one row for every element in the list of line indexes we got back from the spatial index. It implicitly drops those that didn’t have any lines within their window.
We use np.repeat
to repeat the index
value of the points
dataset based on the length of the number of hits for that point. We want the index
so we can join back to points
later. We use np.concatenate
to combine the lists of hits into a single column. values
gives us the numpy array of the values in the hits
Series, which is what numpy wants.
Now we have:
pt_idx line_i
0 4 71
1 5 71
2 5 42
...
Before we can do any spatial operations, we need to join back in the geometries:
# Join back to the lines on line_i; we use reset_index() to
# give us the ordinal position of each linetmp = tmp.join(lines.reset_index(drop=True), on="line_i")# Join back to the original points to get their geometry
# rename the point geometry as "point"tmp = tmp.join(points.geometry.rename("point"), on="pt_idx")# Convert back to a GeoDataFrame, so we can do spatial opstmp = gp.GeoDataFrame(tmp, geometry="geometry", crs=points.crs)
2. Find the closest line to each point
Now we have a GeoDataFrame
again and we can calculate the distance between each point and each of its associated lines to find the closest one:
tmp["snap_dist"] = tmp.geometry.distance(gp.GeoSeries(tmp.point))
We use the builtin distance
method on the line geometry of each row, and pass in a GeoSeries
of the point geometries. This is actually doing a vectorized operation against the entire set of rows, so it is reasonably fast (much faster than looping).
Now that we have the distance, we can sort on it — while discarding any :
# Discard any lines that are greater than tolerance from points
tmp = tmp.loc[tmp.snap_dist <= tolerance]# Sort on ascending snap distance, so that closest goes to top
tmp = tmp.sort_values(by=["snap_dist"])
To find the closest line for each point, we can use groupby
:
# group by the index of the points and take the first, which is the
# closest line closest = tmp.groupby("pt_idx").first()
# construct a GeoDataFrame of the closest linesclosest = gp.GeoDataFrame(closest, geometry="geometry")
Whew! Now we have a GeoDataFrame
that has the closest line for every point. But we’re not done yet! We have only found the closest line — we haven’t found the correct position on that line.
3. Find the point on the line that is closest to the original point
To find out the nearest position on a line, we use two functions from GeoPandas (from shapely, which provides the spatial objects and operations for GeoPandas):
# Position of nearest point from start of the linepos = closest.geometry.project(gp.GeoSeries(closest.point))
# Get new point location geometrynew_pts = closest.geometry.interpolate(pos)
project
gives us the length from the start of the line to the closest point on that line to our original location. You can also use this for handy things like figuring out where to cut a line by a point.interpolate
takes that distance and gives us back a new point at the right location
Then it is just a couple more steps to join this back to our original points:
#Identify the columns we want to copy from the closest line to the point, such as a line ID.line_columns = [... ]# Create a new GeoDataFrame from the columns from the closest line and new point geometries (which will be called "geometries")snapped = gp.GeoDataFrame(
closest[line_columns],geometry=snapped_pts)
# Join back to the original points:updated_points = points.drop(columns=["geometry"]).join(snapped)# You may want to drop any that didn't snap, if so:updated_points = updated_ponts.dropna(subset=["geometry"])
Important concepts
- I didn’t mention it above, but it is important that both the lines and the points be in the same spatial projection, and that this spatial projection supports appropriate linear distance calculations.
- It may not be obvious, but each of the main operations above works in a roughly vectorized manner (not all are actually vectorized, but that is a detail best left for another time). This means that we process all rows through the same operation at the same time and produce a
Series
orGeoSeries
as output, rather than performing that operation individually for each row in a loop. This makes it much faster. - Making the above operations vectorized takes a bit of digging through the documentation for GeoPandas and shapely. You will likely encounter examples of performing the operations individually, or outside of GeoPandas in a regular Python loop. From there, you then need to think about how to adapt that same operation to create a
Series
or aGeoSeries
as output instead of a single shapely object. - There are probably even faster ways of performing the above snapping operations.
If you made it this far, you have now seen how to successfully snap points to the nearest lines from a linear network. From here, you can now do other network analyses. Go forth and analyze!
Wait! Your title mentioned faster — what is this faster compared to?
The first time I implemented this and got it successfully working, I used an apply
operation on the points dataset, and did everything else in the function that was passed to apply
for each point. Instead of a series of vectorized operations like I described above, this basically worked more like a for
loop. It worked reasonably well, and we ran it dozens of times, but it always bothered me that it was a little slower than I wanted — especially for something that we needed to update on every update of the input data (which has been getting more frequent).
In the original single apply
function, we:
- extracted x,y coordinates of the point
- calculated the bbox as a Python list
- used that bbox to query the spatial index for the list of line indexes
- queried the lines
GeoDataFrame
for those line indexes - calculated the distance between that point and each of the lines in returned above
- queried the closest line using the
nsmallest()
function - constructed a new
GeoSeries
to return from theapply
operation, which allowed me to return several new columns from that operation
Following the approach described in this article, we improved performance by about 3.5–4x.
Related posts:
- Quickly reading and writing intermediate files needed for snapping was an issue for me, so I wrote a tiny library and announced it here.
- Sofia Heisler provides a great explanation of the different ways of working through operations from loops to
apply.
- Markus Konrad provides a good walk-through of loops versus vectorized operations.