Speedy spatial data access using OGC Web Feature Service and Python

occassional data hobbist
3 min read · Mar 24, 2024


Location of Landslide Incidents in Hong Kong

Spatial data is often published through a virtual infrastructure, the so-called Spatial Data Infrastructure (SDI). Spatial datasets tend to be large, so retrieving them online is always a challenge. To avoid overloading the system, such infrastructure usually restricts users from accessing or downloading an entire dataset in one go.

Take the Location of Landslide Incidents in Hong Kong as an example: it has 10,072 records, but the Web Feature Service (WFS) can return at most 10,000 records per query.

import geopandas as gpd
import requests
import geojson
from owslib.wfs import WebFeatureService
from urllib.parse import urlencode

# URL for WFS copied from the CSDI portal
url = ("https://portal.csdi.gov.hk/server/services/common/"
       "cedd_rcd_1640249280153_17521/MapServer/WFSServer")

# Initialize owslib.wfs
wfs = WebFeatureService(url=url)

# Fetch the layer name using owslib.wfs
layer_name = list(wfs.contents)[-1]

# Specify the parameters for fetching the data
# service: service type, e.g. WFS, WMS
# request: GetFeature operation supported by WFS
# typeName: layer name
# outputFormat: format of the returned data, e.g. geojson
# count: number of rows to return per request (e.g. 10000 or 100)
# startIndex: offset at which to start returning rows
params = dict(
    service='WFS',
    version="2.0.0",
    request='GetFeature',
    typeName=layer_name,
    outputFormat='geojson',
    # count=2000,
    startIndex=0)

query_params = urlencode(params)
query_url = f"{url}?{query_params}"

# Return a GeoDataFrame
gdf = gpd.GeoDataFrame.from_features(
    geojson.loads(requests.get(query_url).content), crs="EPSG:4326")

# Returns 10,000 for each attribute even though the number of records exceeds 10,000
gdf.count()

To retrieve the remaining 72 records, we need a paging query: specify the total number of records and a step (page) size, then repeat the query with an increasing startIndex until all records have been retrieved.

import geopandas as gpd
import requests
import geojson
from owslib.wfs import WebFeatureService
from urllib.parse import urlencode

# URL for WFS from CSDI portal
url = "https://portal.csdi.gov.hk/server/services/common/cedd_rcd_1640249280153_17521/MapServer/WFSServer"

# The total can be retrieved by specifying resultType="hits"
# in the GetFeature request (see the sketch after this listing).
total = 10072
step = 10000

query_params = dict(
    service="WFS",
    version="2.0.0",
    request="GetFeature",
    typeName=layer_name,
    outputFormat="GEOJSON"
)

# Loop through each page of query results
gdfs = []
for offset in range(0, total, step):
    query_params['startIndex'] = offset
    offset_query = urlencode(query_params)
    offset_query_url = f"{url}?{offset_query}"
    offset_gdf = gpd.GeoDataFrame.from_features(
        geojson.loads(requests.get(offset_query_url).content), crs="EPSG:4326")
    gdfs.append(offset_gdf)

# Concatenate the resulting dataframes
gdf = gpd.pd.concat(gdfs, ignore_index=True)
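
Rather than hard-coding the total, it can also be fetched from the service itself. Below is a minimal sketch of such a request, assuming the server follows the standard WFS 2.0 behaviour of answering a resultType="hits" request with an XML FeatureCollection whose numberMatched attribute holds the match count (I have not verified this attribute against the CSDI server).

# Sketch: fetch the total via resultType="hits" instead of hard-coding it.
# Assumes the count is reported in the numberMatched attribute of the response.
import xml.etree.ElementTree as ET

hits_params = urlencode(dict(
    service="WFS",
    version="2.0.0",
    request="GetFeature",
    typeName=layer_name,
    resultType="hits"))

hits_response = requests.get(f"{url}?{hits_params}")
total = int(ET.fromstring(hits_response.content).attrib["numberMatched"])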

The paging query above retrieves all records; in my test it took approximately 6.9 seconds. We can speed up the process with concurrent execution in Python: define a function that fetches one page and can be submitted to a thread pool, then collect the results from as_completed as each future finishes, regardless of order. Here is the code.

from concurrent.futures import ThreadPoolExecutor, as_completed
import geopandas as gpd
import requests
import geojson
from urllib.parse import urlencode

# Fetch one page of results and return it as a GeoDataFrame
def return_result(_url):
    _offset_gdf = gpd.GeoDataFrame.from_features(
        geojson.loads(requests.get(_url).content), crs="EPSG:4326")
    return _offset_gdf

if __name__ == '__main__':

    # Specify parameters (read data in GeoJSON format)
    query_params = dict(
        service="WFS",
        version="2.0.0",
        request="GetFeature",
        typeName=layer_name,
        outputFormat="GEOJSON",
        count=step
    )

    # Loop through each page of query results concurrently
    with ThreadPoolExecutor(max_workers=5) as executor:
        futures = []
        for offset in range(0, total, step):
            # Create each offset query
            query_params['startIndex'] = offset
            encoded_args = urlencode(query_params)
            offset_query_url = f"{url}?{encoded_args}"

            # Submit each query to the thread pool
            submit = executor.submit(return_result, offset_query_url)
            futures.append(submit)

        # Collect the results in gdfs as each future completes
        gdfs = [future.result() for future in as_completed(futures)]

    # Concatenate the resulting dataframes
    gdf = gpd.pd.concat(gdfs, ignore_index=True)
    gdf.count()

I ran the concurrent code, and the time taken to complete the task dropped to 4.7 seconds, an improvement of over 30%. The gain may not be significant for a small dataset, but it becomes noticeable when the number of records to retrieve reaches 100,000 or more. Moreover, a smaller step size, say 5,000, produces more pages and therefore lets more worker threads run in parallel, which could reduce the processing time further.
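
As a rough illustration of that last point, the sketch below reuses return_result, url, total and query_params from the listing above with a smaller page size and more workers. The step of 5,000 and the ten workers are illustrative values, not benchmarked ones.

# Sketch: smaller pages and more workers (illustrative values, not benchmarked)
step = 5000
query_params["count"] = step

with ThreadPoolExecutor(max_workers=10) as executor:
    futures = [
        executor.submit(
            return_result,
            f"{url}?{urlencode({**query_params, 'startIndex': offset})}")
        for offset in range(0, total, step)
    ]
    gdfs = [future.result() for future in as_completed(futures)]

gdf = gpd.pd.concat(gdfs, ignore_index=True)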
