# How to leverage Geopandas for faster snapping of points to lines

Needing to correctly align one set of data with another is a common task in many geospatial analyses. One example of this is snapping a set of locations** **to a linear network. This is usually an essential step in network analyses because you need to know the exact location of that thing on a given network. For instance, in order to calculate the travel distance to the nearest electric scooter, you need to know where it is within a city’s grid network of streets. If you need to know how many miles of river are upstream of a dam, you need to know where that dam is located on the river network.

In both cases, there is usually some measurement error in the location of the scooter or the dam, and it may not correctly align with the network. Or the location is correct, but the data source for the network has measurement error. Or both are wrong, which is often the case. Take your pick.

To get around this issue, you need to “snap” the locations to the nearest point on the network. This involves a few steps:

- Find the nearest line to the original location
- Find the nearest point on that line to the original location
- Move the original location to that point
- Proceed with the network analysis

This is relatively straightforward for a few points and lines but can be a slow operation if you have to do this with lots of data. In my case, I needed a reasonably fast way to snap the locations of hundreds of thousands of dams and culverts against aquatic networks with millions of line segments. I have been working with the Southeast Aquatic Resources Partnership to help analyze the relationships between aquatic networks and aquatic barriers like dams as part of building the Southeast Aquatic Barrier Prioritization Tool. In order to analyze aquatic connectivity, we first needed to overcome the fact that the locations of many dams were not actually on the latest available aquatic network data from the National Hydrography Dataset. You would think everything would be in the correct place, right? Nope.

# Using Geopandas to snap points to linear networks

My tool of choice for processing these data is Geopandas, which provides a very nice interface for working with geometric data in Pandas in Python. Geopandas represents every feature — such as a dam — as a row with both attribute data, like the name of the dam, and the associated geometry representing its location.

One of the most powerful features is the ability to `apply`

an operation to each row in the dataset. If done properly, this is significantly faster than looping over each row and performing a similar operation. Sofia Heisler does a great job of describing loops vs `apply`

. However, I’ve found that correctly leveraging `apply`

and related approaches involves thinking about the problem inside out — which usually means I spend brain hours trying to figure out how to reduce CPU hours.

# 1. Use with a spatial index and find nearby lines

In order to find the nearest lines on a network to a given location, you need a spatial index. Otherwise, you would have to compare every location to every line — even the lines far away — and waste precious time calculating distances that are not useful. A spatial index provides you with the ability to quickly select only those features that overlap a certain bounding box, which greatly cuts down on the number of comparisons you need to make.

In our example, we need a spatial index on the lines we are snapping to. In Geopandas, this is created using the `sindex`

properly of a `GeoDataFrame`

called `lines`

:

# this creates and also provides us access to the spatial indexlines.sindex

Now we can query this spatial index using a bounding box to get a list of the lines that overlap that bounding box. Great!

Oh wait — we’re dealing with points — and points don’t have bounding boxes! What are we to do?

We can generate bounding boxes by applying an `offset`

to the `bounds`

property of a `GeoDataFrame`

called `points`

:

offset = 100bbox = points.bounds + [-offset, -offset, offset, offset]

Let me break this down:

`points.bounds`

is the GeoPandas accessor for the bounds of a geometry — even points — which gives back 4 columns:

` minx miny maxx maxy`

0 1394202.875 165652.562500 1394202.875 165652.562500

1 1393760.375 165302.625000 1393760.375 165302.625000

...

- adding a list of values to it that is the same number of columns allows us to quickly add or subtract values from each column. In this case, we subtract
`tolerance`

from the`minx`

and`miny`

columns, and add tolerance to the`maxx`

and`maxy`

columns. Now we have values that will work for a bounding box that is +/-`tolerance`

around our points. - we are using
`offset`

as the distance within which we are willing to snap points to a line, and beyond which we consider them too far away and drop them from the analysis.

For a tolerance of 100 (meters, in the spatial projection of our data), this gives us:

` minx miny maxx maxy`

0 1394102.875 165552.562500 1394302.875 165752.562500

1 1393660.375 165202.625000 1393860.375 165402.625000

...

Now, we can `apply`

an operation to this `bbox`

to find get a list of the lines that overlap:

`hits = bbox.apply(lambda row: list(lines.sindex.intersection(row)), axis=1)`

Getting into the full details of how `apply`

works is pretty involved, so I’ll keep this short:

`lambda row: …`

is an inline function that is called for every row in`bbox`

`axis=1`

means apply this operation based on the values in the columns of each row, in this case`[minx, miny, maxx, maxy`

]`lines.sindex.intersection()`

queries the spatial index on the lines and gives back a list of line indexes. The spatial index itself provides a generator, so we use a`list`

to extract those values all out at once. NOTE: it does not give back the`index`

on the lines`GeoDataFrame`

, if you have one defined, it gives back the ordinal index — position in the list — of each line. This will make a bit more sense below.

This gives us:

`...`

3 []

4 [71]

5 [71, 42, 61, 38]

...

This shows that the 5th point has 4 lines that might be within `tolerance`

distance to that point. This still isn’t in the most usable structure yet, so we need to transform the data into a longer set of rows:

import pandas as pd

import numpy as nptmp = pd.DataFrame({

# index of points table

"pt_idx": np.repeat(hits.index, hits.apply(len)), # ordinal position of line - access via iloc later

"line_i": np.concatenate(hits.values)

})

This gives us a new, longer `DataFrame`

that has one row for every element in the list of line indexes we got back from the spatial index. It implicitly drops those that didn’t have any lines within their window.

We use `np.repeat`

to repeat the `index`

value of the `points`

dataset based on the length of the number of hits for that point. We want the `index`

so we can join back to `points`

later. We use `np.concatenate`

to combine the lists of hits into a single column. `values`

gives us the numpy array of the values in the `hits`

Series, which is what numpy wants.

Now we have:

` pt_idx line_i`

0 4 71

1 5 71

2 5 42

...

Before we can do any spatial operations, we need to join back in the geometries:

# Join back to the lines on line_i; we use reset_index() to

# give us the ordinal position of each linetmp = tmp.join(lines.reset_index(drop=True), on="line_i")# Join back to the original points to get their geometry

# rename the point geometry as "point"tmp = tmp.join(points.geometry.rename("point"), on="pt_idx")# Convert back to a GeoDataFrame, so we can do spatial opstmp = gp.GeoDataFrame(tmp, geometry="geometry", crs=points.crs)

# 2. Find the closest line to each point

Now we have a `GeoDataFrame`

again and we can calculate the distance between each point and each of its associated lines to find the closest one:

`tmp["snap_dist"] = tmp.geometry.distance(gp.GeoSeries(tmp.point))`

We use the builtin `distance`

method on the line geometry of each row, and pass in a `GeoSeries`

of the point geometries. This is actually doing a vectorized operation against the entire set of rows, so it is reasonably fast (much faster than looping).

Now that we have the distance, we can sort on it — while discarding any :

# Discard any lines that are greater than tolerance from points

tmp = tmp.loc[tmp.snap_dist <= tolerance]# Sort on ascending snap distance, so that closest goes to top

tmp = tmp.sort_values(by=["snap_dist"])

To find the closest line for each point, we can use `groupby`

:

# group by the index of the points and take the first, which is the

# closest line closest = tmp.groupby("pt_idx").first()

# construct a GeoDataFrame of the closest linesclosest = gp.GeoDataFrame(closest, geometry="geometry")

Whew! Now we have a `GeoDataFrame`

that has the closest line for every point. But we’re not done yet! We have only found the closest line — we haven’t found the correct position on that line.

# 3. Find the point on the line that is closest to the original point

To find out the nearest position on a line, we use two functions from GeoPandas (from shapely, which provides the spatial objects and operations for GeoPandas):

# Position of nearest point from start of the linepos = closest.geometry.project(gp.GeoSeries(closest.point))

# Get new point location geometrynew_pts = closest.geometry.interpolate(pos)

`project`

gives us the length from the start of the line to the closest point on that line to our original location. You can also use this for handy things like figuring out where to cut a line by a point.`interpolate`

takes that distance and gives us back a new point at the right location

Then it is just a couple more steps to join this back to our original points:

#Identify the columns we want to copy from the closest line to the point, such as a line ID.line_columns = [... ]# Create a new GeoDataFrame from the columns from the closest line and new point geometries (which will be called "geometries")snapped = gp.GeoDataFrame(

closest[line_columns],geometry=snapped_pts)

# Join back to the original points:updated_points = points.drop(columns=["geometry"]).join(snapped)# You may want to drop any that didn't snap, if so:updated_points = updated_ponts.dropna(subset=["geometry"])

# Important concepts

- I didn’t mention it above, but it is important that both the lines and the points be in the same spatial projection, and that this spatial projection supports appropriate linear distance calculations.
- It may not be obvious, but each of the main operations above works in a roughly vectorized manner (not all are actually vectorized, but that is a detail best left for another time). This means that we process all rows through the same operation at the same time and produce a
`Series`

or`GeoSeries`

as output, rather than performing that operation individually for each row in a loop. This makes it much faster. - Making the above operations vectorized takes a bit of digging through the documentation for GeoPandas and shapely. You will likely encounter examples of performing the operations individually, or outside of GeoPandas in a regular Python loop. From there, you then need to think about how to adapt that same operation to create a
`Series`

or a`GeoSeries`

as output instead of a single shapely object. - There are probably even faster ways of performing the above snapping operations.

If you made it this far, you have now seen how to successfully snap points to the nearest lines from a linear network. From here, you can now do other network analyses. Go forth and analyze!

## Wait! Your title mentioned faster — what is this faster compared to?

The first time I implemented this and got it successfully working, I used an `apply`

operation on the points dataset, and did everything else in the function that was passed to `apply`

for each point. Instead of a series of vectorized operations like I described above, this basically worked more like a `for`

loop. It worked reasonably well, and we ran it dozens of times, but it always bothered me that it was a little slower than I wanted — especially for something that we needed to update on every update of the input data (which has been getting more frequent).

In the original single `apply`

function, we:

- extracted x,y coordinates of the point
- calculated the bbox as a Python list
- used that bbox to query the spatial index for the list of line indexes
- queried the lines
`GeoDataFrame`

for those line indexes - calculated the distance between that point and each of the lines in returned above
- queried the closest line using the
`nsmallest()`

function - constructed a new
`GeoSeries`

to return from the`apply`

operation, which allowed me to return several new columns from that operation

Following the approach described in this article, we improved performance by about 3.5–4x.

## Related posts:

- Quickly reading and writing intermediate files needed for snapping was an issue for me, so I wrote a tiny library and announced it here.
- Sofia Heisler provides a great explanation of the different ways of working through operations from loops to
`apply.`

- Markus Konrad provides a good walk-through of loops versus vectorized operations.