How to improve geospatial data manipulation with MongoDB

Letiziapichon
Published in iNex Blog
5 min read · Nov 30, 2022

iNex helps the construction industry by developing digital solutions that aggregate a wide variety of land data on a single platform.

As data engineers, part of our job is to create pipelines that retrieve open data, such as Géoportail de l’urbanisme, process it, and then store it in our database. Most of this data is geospatial, such as buildings or plots of land, and we use MongoDB to handle it.

How can we best manage and process geospatial data while making the most of MongoDB's capabilities?

The advantages of handling geospatial data with MongoDB

MongoDB is well suited to storing this kind of data in GeoJSON format as a “Point”, a “Polygon” or a “MultiPolygon”. Its main advantage is that it provides dedicated query operators for this type of data.

GeoJSON in MongoDB Compass
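To make the format concrete, here is a minimal sketch of two such documents written as Python dictionaries. The property names and coordinates are invented for illustration, and `ring_is_closed` is a hypothetical helper, not part of any library. GeoJSON positions are written as [longitude, latitude], and a polygon ring must end on the position it starts with.

```python
# Hypothetical documents: names and coordinates are illustrative only.
point_doc = {
    "properties": {"name": "sample point"},
    # GeoJSON positions are [longitude, latitude]
    "geometry": {"type": "Point", "coordinates": [2.5246, 50.9560]},
}

polygon_doc = {
    "properties": {"name": "sample plot"},
    "geometry": {
        "type": "Polygon",
        # One outer ring; the first and last positions are identical.
        "coordinates": [[
            [2.5240, 50.9550], [2.5260, 50.9550], [2.5260, 50.9570],
            [2.5240, 50.9570], [2.5240, 50.9550],
        ]],
    },
}


def ring_is_closed(ring):
    """Return True when a ring has at least 4 positions and ends where it starts."""
    return len(ring) >= 4 and ring[0] == ring[-1]


print(ring_is_closed(polygon_doc["geometry"]["coordinates"][0]))
```

A check like this only covers the most basic rule. It does not detect crossing edges, which are discussed later in the article.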

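Some of these queries need MongoDB to know that the ‘geometry’ field holds geospatial data: $near requires a geospatial index, while $geoWithin and $geoIntersects work without one but run faster with it. To do that, we create a 2dsphere index on this field.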

Creating a 2dsphere Index

Before creating the index in Compass, create a collection and insert at least one document in it. For geometric data, make sure that the inserted data is valid GeoJSON (see the example above).

Once our element is inserted, we can go to the “Indexes” tab and click on “CREATE INDEX”.

Indexes Tab on MongoDB Compass

Then select the field “geometry” and set the index type to “2dsphere”.

Create an index on MongoDB Compass
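The same index can also be created from Python. The sketch below assumes `collection` is a pymongo `Collection`; the helper name `create_geometry_index`, the connection string and the database and collection names in the usage comment are hypothetical.

```python
def create_geometry_index(collection, field="geometry"):
    """Create a 2dsphere index on `field`.

    `collection` is expected to be a pymongo Collection. The string
    "2dsphere" is the value of the pymongo.GEOSPHERE constant.
    """
    return collection.create_index([(field, "2dsphere")])


# Usage sketch (assumes a reachable MongoDB server; all names are hypothetical):
# from pymongo import MongoClient
# collection = MongoClient("mongodb://localhost:27017")["land_data"]["plots"]
# create_geometry_index(collection)
```

Creating the index in code rather than in Compass keeps the step reproducible inside a pipeline.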

If an error occurs during the creation of the index, it means that either the inserted GeoJSON is not in the right format or one of the geometries is invalid. For example, MongoDB will not accept the geometry in the image below because two of its segments cross. Later in this article you will find a way to correct such geometries.

Example of an incorrect geometry
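To illustrate what "two segments crossing" means, here is a minimal planar sketch that lists the pairs of non-adjacent edges of a ring that cross. It is not how MongoDB validates geometries: MongoDB works on the sphere, so results can differ for nearly collinear edges. The function names and the bow-tie ring are made up for the example.

```python
def _orient(p, q, r):
    """Sign of the cross product (q - p) x (r - p): 1, -1, or 0 if collinear."""
    value = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (value > 0) - (value < 0)


def segments_cross(a, b, c, d):
    """True when segments ab and cd properly cross (planar approximation)."""
    return (_orient(a, b, c) * _orient(a, b, d) < 0
            and _orient(c, d, a) * _orient(c, d, b) < 0)


def crossing_edges(ring):
    """Return pairs (i, j) of non-adjacent edges of a closed ring that cross."""
    n = len(ring) - 1  # number of edges; the last position repeats the first
    found = []
    for i in range(n):
        for j in range(i + 2, n):
            if i == 0 and j == n - 1:
                continue  # the first and last edges share the closing vertex
            if segments_cross(ring[i], ring[i + 1], ring[j], ring[j + 1]):
                found.append((i, j))
    return found


bow_tie = [[0, 0], [2, 2], [2, 0], [0, 2], [0, 0]]
square = [[0, 0], [2, 0], [2, 2], [0, 2], [0, 0]]
print(crossing_edges(bow_tie))  # [(0, 2)]
print(crossing_edges(square))   # []
```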

Geospatial queries

Now that our index is created we can use the following queries.

GeoWithin

The $geoWithin operator selects elements that are entirely contained in the given geometry. An element that is only partially inside the defined bounds is therefore not selected. This is the same task as the within method of the GeoPandas library in Python.

# Example of a query with $geoWithin
mongo_collection.find(
    {"geometry": {"$geoWithin": {"$geometry": geojson_to_cross}}}
)
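As a sketch of how the search area might be built, the helpers below turn a bounding box into a closed GeoJSON polygon and wrap it in a $geoWithin filter. The names `bbox_polygon` and `within_filter` and the sample coordinates are assumptions made for this example.

```python
def bbox_polygon(min_lon, min_lat, max_lon, max_lat):
    """Closed, counterclockwise GeoJSON Polygon covering a bounding box."""
    return {
        "type": "Polygon",
        "coordinates": [[
            [min_lon, min_lat], [max_lon, min_lat], [max_lon, max_lat],
            [min_lon, max_lat], [min_lon, min_lat],
        ]],
    }


def within_filter(area, field="geometry"):
    """Build the filter document for a $geoWithin query on `field`."""
    return {field: {"$geoWithin": {"$geometry": area}}}


area = bbox_polygon(2.52, 50.95, 2.53, 50.96)
query = within_filter(area)
print(query)
```

The resulting dictionary can be passed directly to `mongo_collection.find(query)`.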

GeoIntersects

$geoIntersects performs a query close to $geoWithin. It returns the same geometries as $geoWithin, plus the geometries that only partially overlap the bounds set by our GeoJSON (similar to the ‘intersects’ method in GeoPandas).

# Example of a query with $geoIntersects
mongo_collection.find(
    {"geometry": {"$geoIntersects": {"$geometry": geojson_to_cross}}}
)
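The difference between the two operators can be illustrated without a database. The sketch below uses axis-aligned boxes written as (min_x, min_y, max_x, max_y), which is a deliberate simplification: MongoDB evaluates arbitrary GeoJSON geometries on the sphere. All names and values are invented for the example.

```python
def box_within(inner, outer):
    """True when box `inner` lies entirely inside box `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def box_intersects(a, b):
    """True when boxes `a` and `b` share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


area = (0, 0, 10, 10)
shapes = {
    "inside": (2, 2, 4, 4),
    "straddling": (8, 8, 12, 12),
    "outside": (20, 20, 22, 22),
}

within = sorted(name for name, box in shapes.items() if box_within(box, area))
intersecting = sorted(name for name, box in shapes.items() if box_intersects(box, area))
print(within)        # ['inside']
print(intersecting)  # ['inside', 'straddling']
```

The shape that straddles the boundary is the one that separates the two result sets.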

Near

The $near operator returns elements sorted from nearest to farthest from the specified point. In GeoPandas, we can get the same result by computing the distance for each geometry and sorting, so the process takes longer with GeoPandas than with MongoDB.

# Example of a query with $near
mongo_collection.find(
    {"geometry": {"$near": {"$geometry": geojson_point, "$maxDistance": 200}}}
)

The ‘$maxDistance’ argument is optional and expressed in meters.
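For comparison, the following sketch shows what the manual approach looks like for points: compute a distance for every element, sort, then filter by a maximum distance in meters. It uses the haversine formula with a mean Earth radius, which is an assumption of this example, so distances may differ slightly from the ones MongoDB computes. The sample points are invented.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius, an assumption for this sketch


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two (lon, lat) positions."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearest(points, origin, max_distance_m=None):
    """Return (name, distance) pairs sorted from nearest to farthest."""
    ranked = sorted(
        ((name, haversine_m(*origin, *pos)) for name, pos in points.items()),
        key=lambda item: item[1],
    )
    if max_distance_m is not None:
        ranked = [item for item in ranked if item[1] <= max_distance_m]
    return ranked


origin = (2.5246, 50.9560)
points = {"a": (2.5250, 50.9560), "b": (2.5300, 50.9560), "c": (2.5246, 50.9565)}
print([name for name, _ in nearest(points, origin, max_distance_m=200)])
```

This version scans every element on each call, whereas $near relies on the geospatial index.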

Correcting data with incorrect geometry

When collecting open data, the geometry might be incorrect. As we mentioned when creating the index, an erroneous geometry cannot be inserted into a MongoDB collection that has a 2dsphere index.

There are two main types of errors. We can use the Python code below to correct them.

How to detect the error type

To correct the error, we first need to identify it. To achieve this, we intercept the error raised by MongoDB when inserting data with Python.

We can reproduce an error by building an index on the following polygon:

# Example of a GeoJSON with incorrect edges in the middle of the geometry
{
    "properties": {},
    "geometry": {
        "type": "Polygon",
        "coordinates": [[
            [2.52460760194092, 50.95605670516066],
            [2.529651020828509, 50.95202942954249],
            [2.529628005044639, 50.95202375369679],
            [2.528513959639505, 50.95174915427584],
            [2.525168256108671, 50.95477944337451],
            [2.525256056398724, 50.95469992655801],
            [2.525310248343852, 50.95473395382295],
            [2.525230124934766, 50.95480651845794],
            [2.52460760194092, 50.95605670516066]
        ]]
    }
}

MongoDB will return this error:

Edges 3 and 5 cross. Edge locations in degrees: [2.5285140, 50.9517492]-[2.5251683, 50.9547794] and [2.5252561, 50.9546999]-[2.5253102, 50.9547340]

You can use the following snippet to intercept the MongoDB error in Python and find the problematic edges in the geometry.

import re
from ast import literal_eval

from pymongo.errors import WriteError

# try to insert the geometry in our MongoDB collection
try:
    collection_mongo.insert_one(geometry)
except WriteError as e:
    # intercept the MongoDB error message
    error_message = e.details.get("errmsg", "")
    # find the indices of the two edges causing the error
    edges = re.findall(r"Edges ([0-9]+?) and ([0-9]+?) cross", error_message)

    # retrieve all the coordinates of the invalid loop from the error message
    list_coord = literal_eval(
        re.search(r"Loop is not valid:(.*?) Edges", error_message).group(1).replace(" ", "")
    )

    # convert the captured indices to integers, e.g. [3, 5]
    edges = [int(ed) for ed in edges[0]]
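The parsing step can be tried without a database. The sketch below wraps it in a hypothetical `parse_crossing_error` function and runs it on the error text quoted above. The "Loop is not valid:" prefix is taken from the regular expression in the article's snippet; the quoted text does not contain it, so the loop coordinates come back as `None` here.

```python
import re
from ast import literal_eval


def parse_crossing_error(error_message):
    """Return ([edge_a, edge_b], loop_or_None), or None if no crossing is reported."""
    edge_match = re.search(r"Edges (\d+) and (\d+) cross", error_message)
    if edge_match is None:
        return None
    edges = [int(edge_match.group(1)), int(edge_match.group(2))]
    # The loop coordinates are only available when the prefix is present.
    loop_match = re.search(r"Loop is not valid:\s*(\[.*?\])\s*Edges", error_message)
    loop = literal_eval(loop_match.group(1)) if loop_match else None
    return edges, loop


message = (
    "Edges 3 and 5 cross. Edge locations in degrees: "
    "[2.5285140, 50.9517492]-[2.5251683, 50.9547794] and "
    "[2.5252561, 50.9546999]-[2.5253102, 50.9547340]"
)
result = parse_crossing_error(message)
print(result)  # ([3, 5], None)
```

Returning `None` for unrelated messages lets the caller re-raise errors that this correction does not cover.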

Incorrect edges in the middle of the geometry

As we can see from the polygon and the error above, the edges that cause the error are edges 3 and 5. They are therefore in the middle of our geometry.

We can correct this geometry with the following piece of code. It makes sure that our geometry is in the right format (a list of coordinates) and removes the vertex at the end of the first crossing edge (in our example, the vertex at index 4).

# Fix polygon with incorrect edges
if row_to_insert["geometry"]["type"] == "Polygon":
    for idx in range(len(row_to_insert["geometry"]["coordinates"])):
        # compare the rounded ring with the coordinates reported in the error message
        if list(map(
            lambda x: [float(f"{x[0]:.15f}"), float(f"{x[1]:.14f}")],
            row_to_insert["geometry"]["coordinates"][idx]
        )) == list_coord:
            # make sure the rings are mutable lists
            row_to_insert["geometry"]["coordinates"] = list(row_to_insert["geometry"]["coordinates"])
            row_to_insert["geometry"]["coordinates"][idx] = list(row_to_insert["geometry"]["coordinates"][idx])

            # delete the vertex that ends the first crossing edge
            row_to_insert["geometry"]["coordinates"][idx].pop(edges[0] + 1)
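The core of this correction can be isolated in a small function and tried on the example polygon. `drop_vertex_after_edge` is a hypothetical name, and the sketch only mirrors the `pop(edges[0] + 1)` step above. Whether the result passes MongoDB's validation should still be checked by retrying the insert.

```python
def drop_vertex_after_edge(ring, first_edge):
    """Return a copy of `ring` without the vertex that ends edge `first_edge`."""
    fixed = [list(position) for position in ring]
    fixed.pop(first_edge + 1)
    return fixed


ring = [
    [2.52460760194092, 50.95605670516066],
    [2.529651020828509, 50.95202942954249],
    [2.529628005044639, 50.95202375369679],
    [2.528513959639505, 50.95174915427584],
    [2.525168256108671, 50.95477944337451],
    [2.525256056398724, 50.95469992655801],
    [2.525310248343852, 50.95473395382295],
    [2.525230124934766, 50.95480651845794],
    [2.52460760194092, 50.95605670516066],
]

# Edges 3 and 5 were reported, so the vertex at index 4 is removed.
fixed = drop_vertex_after_edge(ring, 3)
print(len(ring), len(fixed))  # 9 8
```

Working on a copy keeps the original ring available in case the corrected version is rejected as well.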

Incorrect edges in the beginning of the geometry

The crossing edges can also be the first and the last edge, as in the following polygon:

# Example of a GeoJSON with incorrect edges at the beginning of the geometry
{
    "geometry": {
        "type": "Polygon",
        "coordinates": [
            [
                [2.52460760194092, 50.95605670516066],
                [2.524954170439705, 50.9562501457054],
                [2.525206527677853, 50.95636241761319],
                [2.526022919458763, 50.95672561783388],
                [2.526276711255924, 50.95675973555245],
                [2.526574232640361, 50.95660049785125],
                [2.526962910362348, 50.95618996171248],
                [2.527654732520641, 50.9555109582227],
                [2.524831826656184, 50.95546504708952],
                [2.524886628450981, 50.95568258355751],
                [2.52488667542939, 50.95569733043608],
                [2.524879567054954, 50.95571137647318],
                [2.524640852055885, 50.95601182009512],
                [2.524645905230251, 50.9560146406423],
                [2.52460760194092, 50.95605670516066]
            ]
        ]
    }
}

To correct this error, the code is slightly different: we delete the first and last vertices of the ring, then close the geometry again by appending a copy of the new first vertex at the end.

# Fix polygon with incorrect edges
if edges[1] - edges[0] != 2:
    # keep only the two reference coordinates used to identify the ring
    list_coord = [list_coord[edges[0]], list_coord[edges[1] + 2]]

    if row_to_insert["geometry"]["type"] == "Polygon":
        if len(row_to_insert["geometry"]["coordinates"][0]) == edges[1] + 3:
            for idx in range(len(row_to_insert["geometry"]["coordinates"])):
                coord_list = list(map(
                    lambda x: [float(f"{x[0]:.15f}"), float(f"{x[1]:.14f}")],
                    row_to_insert["geometry"]["coordinates"][idx]
                ))
                if [coord_list[edges[0]], coord_list[edges[1] + 2]] == list_coord:
                    row_to_insert["geometry"]["coordinates"] = list(
                        row_to_insert["geometry"]["coordinates"]
                    )
                    row_to_insert["geometry"]["coordinates"][idx] = list(
                        row_to_insert["geometry"]["coordinates"][idx]
                    )
                    # delete the first vertex
                    row_to_insert["geometry"]["coordinates"][idx].pop(0)
                    # delete the last vertex
                    row_to_insert["geometry"]["coordinates"][idx].pop(-1)
                    # close the ring again by appending a copy of the new first vertex
                    row_to_insert["geometry"]["coordinates"][idx].append(
                        row_to_insert["geometry"]["coordinates"][idx][0]
                    )
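The three operations at the end of this snippet can be sketched as a standalone function. `trim_ring_ends` and the sample ring are hypothetical and only illustrate the list manipulation, not a geometry that MongoDB actually rejected.

```python
def trim_ring_ends(ring):
    """Drop the first and last positions, then close the ring again."""
    fixed = [list(position) for position in ring[1:-1]]
    # Append a copy of the new first position so the ring is closed.
    fixed.append(list(fixed[0]))
    return fixed


sample_ring = [[0, 0], [1, 0], [2, 1], [1, 2], [0, 1], [0, 0]]
trimmed = trim_ring_ends(sample_ring)
print(trimmed)  # [[1, 0], [2, 1], [1, 2], [0, 1], [1, 0]]
```

The corrected ring has one position fewer than the original and starts at what used to be the second vertex.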

Conclusion

These corrections allow us to ingest open data that is not always reviewed and can contain errors like the ones described above. Combined with MongoDB's geospatial index and query operators, this methodology improves our ability to provide the best available data to our clients with good query performance.
