An Approach for Checking Overlaps and Gaps in Polygons using Geopandas

Geospatial data has become another major data source in data analytics, for example when estimating population density, calculating precipitation area, running network analysis, or doing remote sensing. For these analyses, we use geospatial data such as polygons, lines, points, or rasters.

Before doing any analysis, have we ever checked whether the data we use has been properly validated in terms of geometry? For polygons, for example, we must make sure that there are no overlaps or gaps.

Example of overlaps and gaps

These errors can cause geometric miscalculations. For example, when we calculate the area of polygons, the overlapped area is counted for two polygons instead of just one.
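To make the double counting concrete, here is a minimal sketch using Shapely only. The two squares are hypothetical shapes invented for illustration, not taken from the dataset used in this article.

```python
from shapely.geometry import box

# Two hypothetical 2 x 2 squares that overlap in a 1 x 2 strip
a = box(0, 0, 2, 2)
b = box(1, 0, 3, 2)

sum_of_areas = a.area + b.area         # the overlap is counted twice (8 here)
covered_area = a.union(b).area         # area actually covered (6 here)
overlap_area = a.intersection(b).area  # the double-counted part (2 here)

print(sum_of_areas, covered_area, overlap_area)
```

The difference between the summed areas and the covered area is exactly the overlapped strip, which is the error that ends up in any per-polygon area statistic.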

To check for gaps and overlaps, we can use any GIS application, such as QGIS. Here, however, I want to show you another approach for checking overlap and gap errors using Geopandas. Geopandas lets us load spatial data as a dataframe and do spatial analysis, manipulation, visualization, and more.

So, this is my approach for checking the overlaps and gaps.


Data Preparation

First, import the modules used (Geopandas, Pandas, Shapely, and Matplotlib). Geopandas is used to read the data and do spatial manipulation. Pandas is used for dataframe manipulation that Geopandas does not cover. Shapely is used to convert inner rings to LineString geometries (explained below). Finally, Matplotlib is used for visualization.

import geopandas as gpd
import pandas as pd
from shapely.geometry import LineString
import matplotlib.pyplot as plt

data_temp = gpd.read_file('data/temp.shp')
data_temp.head()
Output of data_temp.head()

Next, try to plot the data.

%matplotlib inline
data_temp.plot(figsize=(12, 12))
The plot of data_temp

After that, we can continue to check the overlaps and gaps.

Checking the overlaps

Overlaps in a shapefile can be identified by checking for intersections. An intersection can be defined as an area shared by two or more features. Such intersections are usually caused by digitization errors, for example digitizing an area across the actual boundary into a neighboring feature.
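One detail is worth illustrating here: two clean neighbors that share an edge also "intersect" in the geometric sense, so a plain intersection test would flag them. The sketch below, with hypothetical rectangles, shows how Shapely's `intersects`, `touches`, and `overlaps` predicates differ in that situation.

```python
from shapely.geometry import box

left = box(0, 0, 2, 2)
right_neighbor = box(2, 0, 4, 2)   # shares only the edge x = 2 with `left`
right_overlap = box(1.5, 0, 4, 2)  # covers part of `left`

# Clean neighbors: they intersect and touch, but they do not overlap
print(left.intersects(right_neighbor), left.touches(right_neighbor), left.overlaps(right_neighbor))
# -> True True False

# Digitization error: the interiors intersect, so overlaps() is True
print(left.intersects(right_overlap), left.touches(right_overlap), left.overlaps(right_overlap))
# -> True False True
```

Note that `overlaps` is False when one polygon lies completely inside the other, so a check based on it alone would not flag that kind of error.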

Here is the workflow for checking overlaps:

Overlaps check workflow

First, open the shapefile as a geo-dataframe with Geopandas. Then, check whether a feature intersects any other feature, and repeat this step for every feature. Next, create a list of the features that have intersections, together with the features they intersect; we can use the feature id to build the list. The next step is optional: compare the intersection area with a threshold, such as 9e-9, and exclude the intersection from the list when its area is smaller than the threshold. Then, create the unique combinations of overlapping features. Finally, save the list into a dataframe.

This is the implementation of that workflow in code.

overlap_parts = []
for index, row in data_temp.iterrows():
    # all features except the current one
    data_temp1 = data_temp.loc[data_temp.id != row.id]
    # check whether an intersection occurred
    overlaps = data_temp1[data_temp1.geometry.overlaps(row.geometry)]['id'].tolist()
    # compare the intersection area with the threshold (in squared CRS units)
    for y in overlaps:
        temp_area = gpd.overlay(data_temp.loc[data_temp.id == y], data_temp.loc[data_temp.id == row.id], how='intersection')
        temp_area = temp_area.loc[temp_area.geometry.area >= 9e-9]
        if temp_area.shape[0] > 0:
            overlap_parts.append(temp_area)
# combine all intersections into one geo-dataframe (assumes at least one overlap was found)
data_overlaps = gpd.GeoDataFrame(pd.concat(overlap_parts, ignore_index=True), crs=data_temp.crs)
# keep one row per unordered pair of ids
data_overlaps['sorted'] = data_overlaps.apply(lambda y: tuple(sorted([y['id_1'], y['id_2']])), axis=1)
data_overlaps = data_overlaps.drop_duplicates('sorted')
data_overlaps = data_overlaps.reset_index()[['id_1', 'id_2', 'geometry']]
Overlaps result

The result of the overlaps check (the red areas indicate where polygons overlap).

%matplotlib inline
ax=data_temp.plot(figsize=(12, 12),alpha=0.7)
data_overlaps.plot(ax=ax,color='red')
Plotting the overlaps in original data
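For readers who want to try the overlaps workflow without a shapefile, here is a compact, self-contained sketch on hypothetical data. It is a variation rather than the exact code above: it walks over unique pairs with `itertools.combinations`, so no de-duplication step is needed. The `MIN_AREA` name and the three rectangles are assumptions made for illustration.

```python
from itertools import combinations

import geopandas as gpd
from shapely.geometry import box

# Hypothetical data: "a" and "b" overlap, "b" and "c" only share an edge
data = gpd.GeoDataFrame(
    {"id": ["a", "b", "c"]},
    geometry=[box(0, 0, 2, 2), box(1.5, 0, 4, 2), box(4, 0, 6, 2)],
)

MIN_AREA = 1e-9  # hypothetical threshold, in squared CRS units

records = []
# combinations() yields every unordered pair of features exactly once
for (id_1, geom_1), (id_2, geom_2) in combinations(zip(data["id"], data.geometry), 2):
    if not geom_1.overlaps(geom_2):
        continue
    shared = geom_1.intersection(geom_2)
    # skip slivers smaller than the threshold
    if shared.area >= MIN_AREA:
        records.append({"id_1": id_1, "id_2": id_2, "geometry": shared})

# assumes at least one overlap was found, as is the case for this toy data
overlaps = gpd.GeoDataFrame(records, geometry="geometry", crs=data.crs)
print(overlaps[["id_1", "id_2"]].values.tolist())
```

Like the loop above, this compares every feature with every other one, so it is only meant for small datasets; for large ones, some form of spatial indexing would probably be needed.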

Checking the gaps

When we want to check for gaps in a shapefile, we can extract information about them from the exterior boundaries and the interior rings. The exterior boundary is the set of coordinates that forms the outer boundary of a polygon; sometimes an exterior boundary represents the boundary of an island. The interior ring, on the other hand, is an inner ring that forms a hole inside the polygon. Such a ring can result from incomplete digitizing, or it can represent an area that was intentionally left out of the digitization.
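As a quick illustration of these two terms, the following sketch builds a hypothetical square with one hole in Shapely and reads its exterior boundary and interior rings. The coordinates are made up for the example.

```python
from shapely.geometry import LineString, Polygon

# Hypothetical 4 x 4 square with a 1 x 1 hole
shell = [(0, 0), (4, 0), (4, 4), (0, 4)]
hole = [(1, 1), (2, 1), (2, 2), (1, 2)]
poly = Polygon(shell, holes=[hole])

exterior = poly.exterior            # outer boundary (a LinearRing)
inner_rings = list(poly.interiors)  # one LinearRing per hole

# Convert the inner ring to a LineString to work with its coordinates
hole_line = LineString(inner_rings[0].coords)

print(len(inner_rings), poly.area, hole_line.length)
```

The polygon area excludes the hole, and the hole itself is only available as a ring of coordinates, which is why the gap check below converts rings to LineStrings.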

Gaps can be tolerated when they represent the boundary of an island. But gaps created by incomplete digitizing must not be tolerated. So here, I simplify the gap check to only inspect the inner rings, which probably indicate incomplete digitizing.

Here is the workflow:

Gaps check workflow

First, open the shapefile as a geo-dataframe with Geopandas. Then, dissolve the data into one polygon using ‘dissolve’. To do that, we can create a dummy variable that has the same value in every row; in my case, I use ‘diss_id’. Next, extract the inner rings. After that, convert each inner ring to a LineString using Shapely to get the exact coordinates of the gaps. Finally, check the ids of the features in the shapefile that touch each gap.

This is the implementation of that workflow in code.

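# dummy column with the same value in every row, so dissolve merges all features
data_temp['diss_id'] = 1
data_temp_diss = data_temp.dissolve(by='diss_id')
# inner rings of the dissolved polygon (assumes the result is a single Polygon)
interior = data_temp_diss.interiors.values.tolist()[0]
# convert each inner ring to a LineString
gap_list = []
for i in interior:
    gap_list.append(LineString(i))
data_gaps = gpd.GeoDataFrame(geometry=gap_list, crs=data_temp.crs)
# ids of the features that touch each gap
data_gaps['feature_touches'] = data_gaps.geometry.apply(lambda y: data_temp.loc[data_temp.touches(y)]['id'].tolist())
data_gaps.head()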
Gaps result

The result of the gaps check (the red lines indicate the gaps).

%matplotlib inline
ax=data_temp.plot(figsize=(12, 12),alpha=0.7)
data_gaps.plot(ax=ax,color='red')
Plotting the gaps in original data
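To try the gaps workflow without a shapefile, here is a self-contained sketch on hypothetical data: four rectangles arranged around a 1 x 1 hole that was "forgotten" during digitizing. Two things go beyond the code above and are my own assumptions: the sketch also handles a MultiPolygon dissolve result, and it adds a `gap_area` column by rebuilding each ring as a polygon.

```python
import geopandas as gpd
from shapely.geometry import LineString, Polygon, box

# Hypothetical data: four rectangles around an undigitized 1 x 1 hole
data = gpd.GeoDataFrame(
    {"id": ["a", "b", "c", "d"]},
    geometry=[
        box(0, 0, 3, 1),  # bottom
        box(0, 2, 3, 3),  # top
        box(0, 1, 1, 2),  # left
        box(2, 1, 3, 2),  # right
    ],
)

# Dissolve everything into one geometry using a dummy column
data["diss_id"] = 1
merged = data.dissolve(by="diss_id").geometry.iloc[0]

# The dissolve result may be a Polygon or a MultiPolygon; handle both
polygons = list(merged.geoms) if merged.geom_type == "MultiPolygon" else [merged]

# Every inner ring of the dissolved geometry is a candidate gap
gap_lines = [LineString(ring.coords) for p in polygons for ring in p.interiors]
gaps = gpd.GeoDataFrame(geometry=gap_lines)

# Area enclosed by each ring, useful for ranking or filtering gaps
gaps["gap_area"] = [Polygon(line.coords).area for line in gap_lines]

# ids of the features whose boundary touches each gap
gaps["feature_touches"] = gaps.geometry.apply(
    lambda g: sorted(data.loc[data.touches(g), "id"])
)
print(gaps[["gap_area", "feature_touches"]])
```

In this toy dataset the single gap is touched by all four rectangles, which tells us which features would have to be re-digitized to close it.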

That is my approach to checking for gaps and overlaps of polygons in a shapefile. After finding these errors, we can repair them automatically or manually by re-digitizing. There are actually many kinds of errors in spatial data; maybe next time I can write about them. If you have any suggestions or other tips, feel free to let me know. Thanks.