Geopandas makes it pretty easy to work with geospatial data in Python. One of its most powerful features is that it allows you to work with geospatial data using a similar approach to working with row-level data in Pandas, which is one of the fundamental data science libraries in Python. For example, you can apply an operation to every row, and often get major performance speedups compared to looping over those same rows.
However, one of the challenges I’ve bumped into with Geopandas is how long it takes to read or write standard spatial formats, especially ArcGIS File Geodatabases and Shapefiles. This can be especially problematic in projects that need to perform several sequential processing steps and save intermediate files to disk.
For example, as part of building the Southeast Aquatic Barrier Prioritization Tool for the Southeast Aquatic Resources Partnership, I have to read and process millions of lines (rivers and streams) from the National Hydrography Dataset.
These data are delivered as ArcGIS File Geodatabases for each of the 75 hydrologic sub-regions across the Southeast US. While I consolidate these into a few much larger files for the network analyses across the region, I still need to read and write the data a few times during the entire processing chain. I discovered early on that reading and writing shapefiles as intermediate files was unbearably slow, and involved a few undesirable data transformation steps due to limited support for data types and column names in that format. In addition to challenging one’s patience, slow I/O times also affected our iteration and troubleshooting process, as any change to the code could require a long time waiting on file I/O to find out if it worked properly.
After trying to find alternative ways of optimizing these intermediates, including storing attributes in CSV files and joining them back to the geospatial data in Geopandas, I came across the feather format. Feather is an extremely fast and efficient file format for Pandas DataFrames, and has great support for all the data types I needed. It is faster to read and write than CSV files, and it preserves the data types.
Unfortunately, feather doesn’t support geospatial objects out of the box. In Geopandas, these objects are shapely objects, which are Python wrappers around the GEOS library, the library that performs geospatial operations such as intersection. However, shapely objects can be converted to and from Well Known Binary (WKB), which is a common binary representation of points, lines, and polygons.
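To make the WKB idea concrete, here is a minimal sketch that encodes and decodes a single 2D point by hand using only the standard library. The function names are hypothetical and exist only for illustration; in practice shapely does this for you through the wkb property and shapely.wkb.loads, and it handles lines and polygons as well.

```python
import struct


def point_to_wkb(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB (illustrative helper)."""
    # 1 byte:   byte order flag (1 = little-endian)
    # 4 bytes:  geometry type as an unsigned integer (1 = Point)
    # 16 bytes: x and y as 64-bit floats
    return struct.pack("<BIdd", 1, 1, x, y)


def wkb_to_point(data: bytes) -> tuple:
    """Decode a 2D point from WKB written in either byte order."""
    endian = "<" if data[0] == 1 else ">"
    geom_type, x, y = struct.unpack(endian + "Idd", data[1:21])
    if geom_type != 1:
        raise ValueError("this sketch only handles Point geometries")
    return x, y


wkb = point_to_wkb(-83.5, 34.25)
print(len(wkb))           # 21
print(wkb_to_point(wkb))  # (-83.5, 34.25)
```

Because the result is plain bytes, it can sit in an ordinary binary column, which is what makes it a good fit for a format that knows nothing about geometry.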
I found that all I needed to do to store the geometry data in feather was to convert the shapely objects to WKB:
# Given a GeoDataFrame df and a geometry field "geometry"
# Add a column containing the WKB
df["wkb"] = df.geometry.apply(lambda g: g.wkb)
# Drop the original geometry column since we can't write it
df = df.drop(columns=["geometry"])
# Use the GeoDataFrame's to_feather() function to write to a file
# (the file name here is just an example)
df.to_feather('my-awesome-data.feather')
At this point, you might think we are done. However, because the world is not flat, we also need to store information about the spatial reference system that is used for these data. In Geopandas, this information is stored in the crs property of a GeoDataFrame as either Proj4 strings or Python dictionaries. All we need to do is store this in a JSON-formatted file alongside the feather file created above.
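The sidecar idea can be sketched with the standard library alone. The helper names and the ".crs" suffix below are assumptions made for illustration, not necessarily what any particular library uses; the only point is that both a Proj4 string and a dictionary survive a trip through JSON.

```python
import json
import os
import tempfile


def write_crs(feather_path, crs):
    """Write the CRS as JSON next to the feather file (hypothetical '.crs' suffix)."""
    crs_path = feather_path + ".crs"
    with open(crs_path, "w") as out:
        json.dump(crs, out)
    return crs_path


def read_crs(feather_path):
    """Read the CRS back from the sidecar file written by write_crs."""
    with open(feather_path + ".crs") as src:
        return json.load(src)


# Use a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "my-awesome-data.feather")
    write_crs(path, {"init": "epsg:4326"})
    restored = read_crs(path)
    print(restored)  # {'init': 'epsg:4326'}
```

When reading, the restored value would then be assigned back to the crs property of the rebuilt GeoDataFrame.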
The geofeather library is a very lightweight wrapper that provides the functionality described above. Install it with pip:
pip install geofeather
There are only two primary functions:
from geofeather import to_geofeather, from_geofeather

# Write a GeoDataFrame df to disk
to_geofeather(df, 'my-awesome-data.feather')
# Read it from disk
df = from_geofeather('my-awesome-data.feather')
Is it for you?
This format is intended entirely for internal use, as fast read / write intermediate files in larger geospatial processing chains. It is not intended as a general-purpose spatial format for interoperability with the various spatial platforms out there. For those, you should use the existing read / write functionality available in Geopandas, and use file formats that are supported in the spatial platform you are targeting.
What about performance?
I have not performed exhaustive benchmarks, but according to the super-simple benchmarks in our test suite, I am seeing about 1.5–2x speedups compared to reading shapefiles with Geopandas, and about 5–7x speedups compared to writing shapefiles with Geopandas.
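If you want to check the numbers against your own data, a small timing harness along these lines is one way to do it. This is a sketch, not the benchmark from the test suite: the helper names are hypothetical, and the two stand-in functions simply sleep so the example runs without any spatial files. In a real comparison you would replace them with calls that read the same dataset from a shapefile and from a feather file.

```python
import time


def best_time(func, repeats=3):
    """Return the fastest wall-clock time in seconds over several calls of func."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


def speedup(baseline, candidate, repeats=3):
    """How many times faster candidate is than baseline."""
    return best_time(baseline, repeats) / best_time(candidate, repeats)


# Stand-in workloads; replace with your own read or write calls
def slow_io():
    time.sleep(0.05)


def fast_io():
    time.sleep(0.005)


print(round(speedup(slow_io, fast_io), 1))
```

Taking the minimum over a few repeats reduces noise from disk caching and other processes, although the first read of a file will usually still be the slowest.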
I have not yet compared performance against other spatial formats.
What about file size?
This depends heavily on your data types since feather will use the number of bytes required by the data type of each column (e.g., uint8 requires 1 byte per value). These data types may be represented with more or less space depending on the spatial format you are comparing to. Compared to the shapefiles I have tested with, my geofeather files are about 75% of the size of the shapefiles for the same data.
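As a rough illustration of how data types drive size, the sketch below estimates the space taken by fixed-width attribute columns as rows times bytes per value. The column names and the helper are hypothetical, and the estimate deliberately ignores the geometry column (WKB is variable-length), file metadata, and anything the format does for missing values.

```python
import struct

# Bytes per value for a few fixed-width types, taken from struct's standard sizes
BYTES_PER_VALUE = {
    "uint8": struct.calcsize("<B"),
    "int32": struct.calcsize("<i"),
    "int64": struct.calcsize("<q"),
    "float64": struct.calcsize("<d"),
}


def estimate_bytes(column_types, n_rows):
    """Rough size of fixed-width columns: rows x bytes per value, summed over columns."""
    return n_rows * sum(BYTES_PER_VALUE[t] for t in column_types.values())


# Hypothetical attribute columns for a table of stream segments
columns = {"stream_order": "uint8", "length_m": "float64", "id": "int64"}
print(estimate_bytes(columns, 1_000_000))  # 17000000
```

Downcasting a column that only holds small integers from int64 to uint8 cuts its share of the file by a factor of eight, which is why choosing compact data types before writing can matter as much as the choice of format.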