How to extract every feature from an ESRI map service using Python

Rid yourself of server throttling woes!

Jesse Nestler
4 min read · Feb 9, 2024
A mess of incidents: all of Boulder, CO’s emergency 911 call locations since 2016. Image by the author.

Over the last year, I have engaged in many projects that required extracting geospatial data from ESRI REST endpoints. While geopandas makes it exceptionally easy to read data directly from a URL, it can only extract a finite number of features as defined by the server’s query properties.

Translation: if the data you’re querying has more features than the server allows from a single request, you have to get creative.

And creative we shall get! In this post, I’ll show you how to extract every last feature from an ESRI map service, with a bonus function leveraging concurrency 👀.

Before diving into a solution, let's understand how geopandas currently handles queries to a map service. Let's say we navigate to an Open Data portal that exposes an ESRI map service for queries. We copy the GeoJSON query URL and plop it into geopandas' handy read_file function:

import geopandas as gpd

QUERY_URL = ("https://maps.bouldercounty.org/arcgisadmin/rest/services/"
             "Transportation/STREETS/MapServer/0/query?"
             "outFields=*&where=1%3D1&f=geojson")

gdf = gpd.read_file(QUERY_URL)

This works, but we quickly realize that the resulting dataframe has (in most cases) 1,000 records, even though the dataset itself contains close to 32,000. The reason is that the server caps the number of records returned per query, and at the time of writing, geopandas doesn't handle pagination.
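
You can see the cap for yourself by comparing the length of that dataframe against a count-only query on the same layer (returnCountOnly is a standard ESRI query parameter; the exact cap varies by server, and 1,000 is a common default):

import geopandas as gpd
import requests

# Reusing QUERY_URL from the snippet above
gdf = gpd.read_file(QUERY_URL)
print(len(gdf))  # capped by the server, typically 1,000 records

# Ask the server how many records actually match the where clause
count_url = ("https://maps.bouldercounty.org/arcgisadmin/rest/services/"
             "Transportation/STREETS/MapServer/0/query"
             "?where=1%3D1&returnCountOnly=true&f=json")
print(requests.get(count_url).json()["count"])  # closer to 32,000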

Before querying a map service, we first have to establish whether the service allows queries at all. Sounds silly, but this is a setting that server administrators can toggle, and if they decide to disable query capabilities, our extraction is dead in the water. For the purposes of this article, we'll use a service that supports querying.
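
A quick way to check is to peek at the capabilities field in the service's JSON metadata; as far as I can tell it's a comma-separated string like "Map,Query,Data" (a minimal sketch against the same Boulder County layer used below):

import requests

SERVICE_URL = ("https://maps.bouldercounty.org/arcgisadmin/rest/services/"
               "Transportation/STREETS/MapServer/0")

# The layer metadata advertises its capabilities; no "Query" means no extraction
service_json = requests.get(SERVICE_URL, params={"f": "json"}).json()
print("Query" in service_json.get("capabilities", "").split(","))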

The next two pieces of information we need are the total count of records in the service, and the query result limitation, aka the step size, aka the maxRecordCount.

import requests

# Service endpoint for Boulder County street centerlines
SERVICE_URL = ("https://maps.bouldercounty.org/arcgisadmin/rest/services/"
               "Transportation/STREETS/MapServer/0")
QUERY_URL = f"{SERVICE_URL}/query"

# Streets in the City of Boulder
where = "LCOM = 'BOULDER' AND RCOM = 'BOULDER'"

# Determine the total number of records for the given where clause
count_params = {
    "where": where,
    "returnCountOnly": True,
    "f": "json"
}
query_json = requests.get(QUERY_URL, params=count_params).json()
tot_records = query_json["count"]

# Determine the step size for pages
service_json = requests.get(SERVICE_URL, params={'f': 'json'}).json()
step = service_json["maxRecordCount"]

Note that the step size is a property of the service itself, while the total number of records depends on your where clause and the underlying data being served. That's why the step size comes from the service endpoint, whereas the count comes from the service's query endpoint.
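
To make the pagination concrete before writing the loop: the offsets are just a range from 0 to the total count in increments of the step size. With made-up numbers purely for illustration (say 3,200 matching records and a 1,000-record cap), that looks like:

# Hypothetical numbers just to show the arithmetic; the real loop below
# uses tot_records and step from the two requests above.
example_count = 3200   # pretend 3,200 records match the where clause
example_step = 1000    # pretend the server caps results at 1,000

print(list(range(0, example_count, example_step)))
# [0, 1000, 2000, 3000] -> four requests cover every record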

With those parameters in hand, we can now extract our data from the query endpoint of the service by paginating through the result set:

import urllib.parse

import geopandas as gpd
import pandas as pd

# Fields we want in the output
fields = ('OBJECTID', 'PREFIX', 'NAME', 'STREETTYPE', 'SUFFIX')

# Define query parameters
query_params = {
    "where": where,
    "outFields": ', '.join(fields),
    "outSR": 2876,  # Boulder's local state plane projection
    "f": "geojson",  # Format that geopandas can understand
    "orderByFields": "OBJECTID",  # Avoid duplicates when offsetting
    "returnGeometry": True,
    "resultRecordCount": step
}

# Loop through each page of query results
gdfs = []
for offset in range(0, tot_records, step):
    query_params['resultOffset'] = offset
    offset_query = urllib.parse.urlencode(query_params)
    offset_query_url = f"{QUERY_URL}?{offset_query}"
    offset_gdf = gpd.read_file(offset_query_url)
    gdfs.append(offset_gdf)

# Concatenate the resulting dataframes
gdf = pd.concat(gdfs, ignore_index=True)

# Verify no OBJECTIDs are duplicated
dups = gdf[gdf['OBJECTID'].duplicated()]
print(dups.empty)  # True

This particular query executes in about 2.3 seconds. If we don't supply a where clause or a set of fields to limit the results, the query jumps to 26.2 seconds. Not bad per se, but what if the endpoint we're querying contains hundreds of thousands of records, like the City of Boulder's Open Data for Fire Response Times? Well, I ran this code on that service for science, and it took a whopping 40 minutes! We can do better.
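
These timings will vary with your network and the server's load. A rough way to estimate the sequential cost for your own service is to time a single page and multiply by the number of pages (a sketch reusing QUERY_URL, query_params, tot_records, and step from the snippets above):

import math
import time
import urllib.parse

import geopandas as gpd

# Time the first full page as a rough baseline
query_params['resultOffset'] = 0
start = time.perf_counter()
one_page = gpd.read_file(f"{QUERY_URL}?{urllib.parse.urlencode(query_params)}")
per_page = time.perf_counter() - start

# Sequential pagination makes one round trip per page
pages = math.ceil(tot_records / step)
print(f"~{per_page:.1f} s per page -> roughly {per_page * pages / 60:.1f} min sequentially")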

Cue concurrency. I won't pretend to be an expert, but the key insight is that each page request spends most of its time waiting on the server, so a thread pool can issue several requests at once and overlap that waiting. Let's see how it can serve us in this particular case.

import urllib.parse
from concurrent.futures import ThreadPoolExecutor, as_completed

import geopandas as gpd
import pandas as pd
import requests


def query_map_service(service_url, where='1=1', fields='*', crs=4326):
    # Define the query endpoint
    query_url = f"{service_url}/query"

    # Determine the total number of records for the given where clause
    count_params = {
        "where": where,
        "returnCountOnly": True,
        "f": "json"
    }
    tot_records_json = requests.get(query_url, params=count_params).json()
    tot_records = tot_records_json["count"]

    # Determine the step size for pages
    step_json = requests.get(service_url, params={'f': 'json'}).json()
    step = step_json["maxRecordCount"]

    # Define query parameters
    query_params = {
        "where": where,
        "outFields": ', '.join(fields) if isinstance(fields, (list, tuple)) else fields,
        "outSR": crs,
        "f": "geojson",
        "orderByFields": "OBJECTID",  # Avoid duplicates when offsetting
        "returnGeometry": True,
        "resultRecordCount": step
    }

    # Loop through each page of query results concurrently
    with ThreadPoolExecutor(max_workers=4) as ex:  # a few threads keeps us polite to the server
        futures = []
        for offset in range(0, tot_records, step):
            # Create each offset query
            query_params['resultOffset'] = offset
            offset_query = urllib.parse.urlencode(query_params)
            offset_query_url = f"{query_url}?{offset_query}"

            # Submit each query to the thread pool
            futures.append(ex.submit(gpd.read_file, offset_query_url))

        # Compile all futures once they're complete
        gdfs = [f.result() for f in as_completed(futures)]

    # Concatenate the resulting dataframes
    gdf = pd.concat(gdfs, ignore_index=True)

    return gdf
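
Calling it looks something like this (same Boulder County streets endpoint as before; the fields and output CRS simply mirror the earlier snippets):

SERVICE_URL = ("https://maps.bouldercounty.org/arcgisadmin/rest/services/"
               "Transportation/STREETS/MapServer/0")

# Every street centerline in the county, in the local state plane projection
gdf = query_map_service(
    SERVICE_URL,
    fields=['OBJECTID', 'PREFIX', 'NAME', 'STREETTYPE', 'SUFFIX'],
    crs=2876,
)
print(len(gdf))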

Running this function on the full Boulder County street centerlines data takes 10.1 seconds, which is a 61% decrease in time taken. Nice! Meanwhile, what once took 40 minutes for Boulder’s emergency incidents now takes 19.5 minutes. Still slow, but halving the original time is pretty impressive! I’d love to hear if any folks out there have even better improvements!
