How to leverage Geopandas for faster snapping of points to lines

10 min readOct 6, 2019

Photo by Bence Balla-Schottner on Unsplash

Needing to correctly align one set of data with another is a common task in many geospatial analyses. One example of this is snapping a set of locations to a linear network. This is usually an essential step in network analyses because you need to know the exact location of that thing on a given network. For instance, in order to calculate the travel distance to the nearest electric scooter, you need to know where it is within a city’s grid network of streets. If you need to know how many miles of river are upstream of a dam, you need to know where that dam is located on the river network.

In both cases, there is usually some measurement error in the location of the scooter or the dam, and it may not correctly align with the network. Or the location is correct, but the data source for the network has measurement error. Or both are wrong, which is often the case. Take your pick.

To get around this issue, you need to “snap” the locations to the nearest point on the network. This involves a few steps:

Find the nearest line to the original location
Find the nearest point on that line to the original location
Move the original location to that point
Proceed with the network analysis

This is relatively straightforward for a few points and lines but can be a slow operation if you have to do this with lots of data. In my case, I needed a reasonably fast way to snap the locations of hundreds of thousands of dams and culverts against aquatic networks with millions of line segments. I have been working with the Southeast Aquatic Resources Partnership to help analyze the relationships between aquatic networks and aquatic barriers like dams as part of building the Southeast Aquatic Barrier Prioritization Tool. In order to analyze aquatic connectivity, we first needed to overcome the fact that the locations of many dams were not actually on the latest available aquatic network data from the National Hydrography Dataset. You would think everything would be in the correct place, right? Nope.

Hoosier Dam (prior to removal). Photo: Emily Wells, USFWS, 2017. Removed in 2018 to improve fish habitat.

Using Geopandas to snap points to linear networks

My tool of choice for processing these data is Geopandas, which provides a very nice interface for working with geometric data in Pandas in Python. Geopandas represents every feature — such as a dam — as a row with both attribute data, like the name of the dam, and the associated geometry representing its location.

One of the most powerful features is the ability to apply an operation to each row in the dataset. If done properly, this is significantly faster than looping over each row and performing a similar operation. Sofia Heisler does a great job of describing loops vs apply. However, I’ve found that correctly leveraging apply and related approaches involves thinking about the problem inside out — which usually means I spend brain hours trying to figure out how to reduce CPU hours.

1. Use with a spatial index and find nearby lines

In order to find the nearest lines on a network to a given location, you need a spatial index. Otherwise, you would have to compare every location to every line — even the lines far away — and waste precious time calculating distances that are not useful. A spatial index provides you with the ability to quickly select only those features that overlap a certain bounding box, which greatly cuts down on the number of comparisons you need to make.

In our example, we need a spatial index on the lines we are snapping to. In Geopandas, this is created using the sindex properly of a GeoDataFrame called lines:

# this creates and also provides us access to the spatial indexlines.sindex

Now we can query this spatial index using a bounding box to get a list of the lines that overlap that bounding box. Great!

Oh wait — we’re dealing with points — and points don’t have bounding boxes! What are we to do?

We can generate bounding boxes by applying an offset to the bounds property of a GeoDataFrame called points:

offset = 100bbox = points.bounds + [-offset, -offset, offset, offset]

Let me break this down:

points.bounds is the GeoPandas accessor for the bounds of a geometry — even points — which gives back 4 columns:

    minx         miny           maxx         maxy
0   1394202.875  165652.562500  1394202.875  165652.562500
1   1393760.375  165302.625000  1393760.375  165302.625000
...

adding a list of values to it that is the same number of columns allows us to quickly add or subtract values from each column. In this case, we subtract tolerance from the minx and miny columns, and add tolerance to the maxx and maxy columns. Now we have values that will work for a bounding box that is +/- tolerance around our points.
we are using offset as the distance within which we are willing to snap points to a line, and beyond which we consider them too far away and drop them from the analysis.

For a tolerance of 100 (meters, in the spatial projection of our data), this gives us:

    minx         miny           maxx         maxy
0   1394102.875  165552.562500  1394302.875  165752.562500
1   1393660.375  165202.625000  1393860.375  165402.625000
...

Now, we can apply an operation to this bbox to find get a list of the lines that overlap:

hits = bbox.apply(lambda row: list(lines.sindex.intersection(row)), axis=1)

Getting into the full details of how apply works is pretty involved, so I’ll keep this short:

lambda row: … is an inline function that is called for every row in bbox
axis=1 means apply this operation based on the values in the columns of each row, in this case [minx, miny, maxx, maxy]
lines.sindex.intersection() queries the spatial index on the lines and gives back a list of line indexes. The spatial index itself provides a generator, so we use a list to extract those values all out at once. NOTE: it does not give back the index on the lines GeoDataFrame, if you have one defined, it gives back the ordinal index — position in the list — of each line. This will make a bit more sense below.

This gives us:

...
3                   []
4                 [71]
5     [71, 42, 61, 38]
...

This shows that the 5th point has 4 lines that might be within tolerance distance to that point. This still isn’t in the most usable structure yet, so we need to transform the data into a longer set of rows:

import pandas as pd
import numpy as nptmp = pd.DataFrame({
    # index of points table
    "pt_idx": np.repeat(hits.index, hits.apply(len)),    # ordinal position of line - access via iloc later
    "line_i": np.concatenate(hits.values)
})

This gives us a new, longer DataFrame that has one row for every element in the list of line indexes we got back from the spatial index. It implicitly drops those that didn’t have any lines within their window.

We use np.repeat to repeat the index value of the points dataset based on the length of the number of hits for that point. We want the index so we can join back to points later. We use np.concatenate to combine the lists of hits into a single column. values gives us the numpy array of the values in the hits Series, which is what numpy wants.

Now we have:

      pt_idx  line_i
0        4    71
1        5    71
2        5    42
...

Before we can do any spatial operations, we need to join back in the geometries:

# Join back to the lines on line_i; we use reset_index() to 
# give us the ordinal position of each linetmp = tmp.join(lines.reset_index(drop=True), on="line_i")# Join back to the original points to get their geometry
# rename the point geometry as "point"tmp = tmp.join(points.geometry.rename("point"), on="pt_idx")# Convert back to a GeoDataFrame, so we can do spatial opstmp = gp.GeoDataFrame(tmp, geometry="geometry", crs=points.crs)

2. Find the closest line to each point

Now we have a GeoDataFrame again and we can calculate the distance between each point and each of its associated lines to find the closest one:

tmp["snap_dist"] = tmp.geometry.distance(gp.GeoSeries(tmp.point))

We use the builtin distance method on the line geometry of each row, and pass in a GeoSeries of the point geometries. This is actually doing a vectorized operation against the entire set of rows, so it is reasonably fast (much faster than looping).

Now that we have the distance, we can sort on it — while discarding any :

# Discard any lines that are greater than tolerance from points
tmp = tmp.loc[tmp.snap_dist <= tolerance]# Sort on ascending snap distance, so that closest goes to top
tmp = tmp.sort_values(by=["snap_dist"])

To find the closest line for each point, we can use groupby:

# group by the index of the points and take the first, which is the
# closest line closest = tmp.groupby("pt_idx").first()
# construct a GeoDataFrame of the closest linesclosest = gp.GeoDataFrame(closest, geometry="geometry")

Whew! Now we have a GeoDataFrame that has the closest line for every point. But we’re not done yet! We have only found the closest line — we haven’t found the correct position on that line.

3. Find the point on the line that is closest to the original point

To find out the nearest position on a line, we use two functions from GeoPandas (from shapely, which provides the spatial objects and operations for GeoPandas):

# Position of nearest point from start of the linepos = closest.geometry.project(gp.GeoSeries(closest.point))
# Get new point location geometrynew_pts = closest.geometry.interpolate(pos)

project gives us the length from the start of the line to the closest point on that line to our original location. You can also use this for handy things like figuring out where to cut a line by a point.
interpolate takes that distance and gives us back a new point at the right location

Then it is just a couple more steps to join this back to our original points:

#Identify the columns we want to copy from the closest line to the point, such as a line ID.line_columns = [... ]# Create a new GeoDataFrame from the columns from the closest line and new point geometries (which will be called "geometries")snapped = gp.GeoDataFrame(
closest[line_columns],geometry=snapped_pts)
# Join back to the original points:updated_points = points.drop(columns=["geometry"]).join(snapped)# You may want to drop any that didn't snap, if so:updated_points = updated_ponts.dropna(subset=["geometry"])

Important concepts

I didn’t mention it above, but it is important that both the lines and the points be in the same spatial projection, and that this spatial projection supports appropriate linear distance calculations.
It may not be obvious, but each of the main operations above works in a roughly vectorized manner (not all are actually vectorized, but that is a detail best left for another time). This means that we process all rows through the same operation at the same time and produce a Series or GeoSeries as output, rather than performing that operation individually for each row in a loop. This makes it much faster.
Making the above operations vectorized takes a bit of digging through the documentation for GeoPandas and shapely. You will likely encounter examples of performing the operations individually, or outside of GeoPandas in a regular Python loop. From there, you then need to think about how to adapt that same operation to create a Series or a GeoSeries as output instead of a single shapely object.
There are probably even faster ways of performing the above snapping operations.

If you made it this far, you have now seen how to successfully snap points to the nearest lines from a linear network. From here, you can now do other network analyses. Go forth and analyze!

Wait! Your title mentioned faster — what is this faster compared to?

The first time I implemented this and got it successfully working, I used an apply operation on the points dataset, and did everything else in the function that was passed to apply for each point. Instead of a series of vectorized operations like I described above, this basically worked more like a for loop. It worked reasonably well, and we ran it dozens of times, but it always bothered me that it was a little slower than I wanted — especially for something that we needed to update on every update of the input data (which has been getting more frequent).

In the original single apply function, we:

extracted x,y coordinates of the point
calculated the bbox as a Python list
used that bbox to query the spatial index for the list of line indexes
queried the lines GeoDataFrame for those line indexes
calculated the distance between that point and each of the lines in returned above
queried the closest line using the nsmallest() function
constructed a new GeoSeries to return from the apply operation, which allowed me to return several new columns from that operation

Following the approach described in this article, we improved performance by about 3.5–4x.

Quickly reading and writing intermediate files needed for snapping was an issue for me, so I wrote a tiny library and announced it here.
Sofia Heisler provides a great explanation of the different ways of working through operations from loops to apply.
Markus Konrad provides a good walk-through of loops versus vectorized operations.