The Algorithm for Ranking the Segments of the River Network for Geographic Information Analysis Based on Graphs

Introduction

Mikhail Sarafanov
The Startup
11 min readAug 7, 2020

--

The topic of this article is the application of information technologies in environmental science, namely, in hydrology. Below is a description of the algorithm for ranking rivers and the plugin we implemented for the open-source geographic information system QGIS.

An important aspect of hydrological surveys is not only the collection of information received from research expeditions and automatic devices but also the analysis of all the obtained data, including the use of GIS (geoinformation systems). However, exploration of the spatial structure of hydrological systems can be difficult due to a large amount of data. In such cases, we cannot do research without using additional tools that allow us to automate the process.

Visualization plays an important role when working with spatial data. Correct visual representation of the results of the analysis helps to better understand the structure of spatial objects and to know something new. For the image of rivers in classical cartography, the following method is used: rivers are represented as a solid line with a gradual thickening (depending on the number of tributaries that flow into the river) from the source to the mouth of the river. Moreover, segments of the river network often need to be ranked by the degree of distance from the source. This type of information is important not only for visualization, but also for a more complete perception of the data structure, its spatial distribution, and subsequent processing.

The problem of ranking rivers can be illustrated as follows (Fig. 1):

Figure 1. Ranking rivers task, numbers indicate the attribute assigned to each segment of the river network for the total number of tributaries flowing in

Thus, each segment of the river needs to be mapped to a value that shows how many segments flow into this section.

Modern GIS, such as ArcGIS or its open-source competitor QGIS, have tools for working with river networks. However, the river ranking tool requires a large number of additional auxiliary materials and, as it seems to us, unnecessary transformations. For example, for an existing GIS tool to start working with river networks you need to prepare a digital elevation model. A significant disadvantage, in addition to complex and multi-stage data preparation, is the inability to use already prepared vector layers with a river network for analysis, which limits the possibility of using digital bases from open sources (OpenStreetMap or Natural Earth).

Of course, you can assign attribute values to segments without using algorithms, but this approach is no longer relevant if you need to rank the network with several thousand segments.

We decided to automate this procedure by representing the river network as a graph and then applying graph traversal algorithms. To simplify the user’s work with the implemented algorithm, a plugin for the QGIS geoinformation system — “Lines Ranking” was written. The code is distributed freely and is available in the QGIS repository, as well as on GitHub.

Installation

The plugin requires QGIS version >= 3.14, as well as the following dependencies: python libraries — networkx, pandas.

For Linux:

$ locate pip3

$ cd <your system pip3 path from previous step>

$ pip3 install pandas

$ pip3 install networkx

For Windows:

In command line OSGeo4W:

$ pip install pandas

$ pip install networkx

Using

Input data is a vector layer consisting of objects with a linear geometry type (Line, MultiLine). Custom attributes are stored in the input layer to the output layer.

We do not recommend using a field named “fid” for the input layer. At the stage of connecting gaps in the river network, using the built — in module of the GRASS package- v.clean where this field name is the “system” one.

Also, an obligatory input parameter is a point (Start Point Coordinates) that determines the position of the mouth of the river network. It can be set from the map, from a file, or from a layer uploaded to QGIS. The position of the river mouth can be approximate. The calculation is based on the segment of the river network closest to the point (the closing vertex of the future graph).

Optional input data:

  • the threshold for “tightening” gaps in the river network (Spline Threshold). This operation involves the use of a package of GRASS for QGIS. If the specified package is missing, you should fix the gaps in another way and leave this field empty;
  • custom field names for the output layer. Allows the user to assign a field name to record the rank of each segment (Rank fieldname), the number of tributaries (Flow field name), and the distance from the mouth in meters (Distance field name). If parameters are not set , the default names of the fields are Rank, Value, and Distance;
  • location of the output file. If this parameter is omitted, a temporary layer will be created and added to the QGIS layer stack;

We can define the task performed by the algorithm as follows: compare the total number of tributaries flowing into each segment of the river, calculate the number of tributaries for each segment, as well as the distance of the farthest point of the segment from the mouth.

Algorithm description

In GIS, data can be presented in two main formats: raster and vector. A raster is a matrix where a certain parameter value is stored in each pixel. Satellite images, reanalysis grids, various output layers from climate models, and others are often represented in raster format in environmental science. Vector data is represented as simple geometric objects, such as points, lines, and polygons. Each object in a vector format can be associated with some information in the form of attributes. All the actions described below will be performed on the vector layer of the river network.

As a result, the algorithm returns a vector layer in which each object is assigned attributes that determine the distance of segments from the river mouth and the total number of tributaries that flow into this segment.

Preprocessing input data

Note that the topology of the original vector layer may be corrupted. The reason for this may be export / import between different GIS, incorrect file creation, etc. Corrupted layer topology can be expressed in the absence of connections between objects, i.e. the formation of various breaks (Fig. 2), creating additional closures, intersections, etc.

Figure 2. Corrupted topology of vector objects

Therefore, the first stage of preprocessing is to correct the topology of objects: “tighten” the nodes, make the original vector layer consistent. To do this, use the tools from the data analysis panel in QGIS — “Fix geometries” (fixgeometries built-in) and v.clean (from the GRASS package).

After the topology is fixed, the layer is divided into segments at the points where the lines have intersections. The result after splitting is illustrated below (Fig. 3).

Figure 3. Split linear objects into segments

Thus, using the “splitwithlines” tool in QGIS, we divide the source layer into segments.

For each segment, we use QGIS to calculate the length and enter data in the attribute table of the layer (Fig. 4). the segment Length is calculated according to the user settings of the project (Project -> Properties -> General -> Ellipsoid).

Figure 4. Calculating the length of segments

After that, using the “line intersection” tool (built-in tool), we get a point vector layer, where information about segment intersections is set in the attribute table. This attribute table can be interpreted as an adjacency list.

The preprocessing steps are shown in the following image (Fig. 5).

Figure 5. Steps for preprocessing a vector linear layer to represent it as a graph

As a result of preprocessing, the graph is formed as a mathematical object of the networkx Python library. Thus, the river segments are vertices in the graph. If the segments are connected to each other (they have intersections), then there are edges between the graph vertices.

Algorithm for ranking linear objects

After the graph is formed, we know which vertex to start the search from (the point where it flows from the main river into the lake or sea). Let’s call this vertex the closing one, since all other segments of the river network (vertexes) “flow” into it. We have divided the algorithm into several parts:

  1. Ranking graph vertexes by distance from the closing segment, assigning the “rank” attribute (a measure of the segment’s distance) and the “offspring” attribute (the number of sections of the river network that flow directly into this segment);
  2. Assigning the “value” attribute for the total number of tributaries flowing into a given section and the “distance” attribute (the distance of the segment’s extreme point from the mouth in meters).

Both stages are divided into several blocks, but the main idea is a two-stage scheme for assigning attributes.

The assignment of the attributes “rank” and “offspring”

The first stage of graph traversal is to rank vertexes by the degree of distance from the closing one. We planned to carry out the assignment of the attribute “rank” with iterative breadth-first search (BFS). Thus, starting from the closing vertex, we would move further and further away at each step, and at the same time, we would assign an attribute “rank”. But in this case, the following conflict may occur (animation below).

And what rank should we assign to this segment? There may be other problems with this attribute assignment algorithm, but we have listed one of the most vital.

Suggested solution: we can determine the ranks for some part of the river network (the main river), and rank segments based on this information. This approach can be seen on the following picture (Fig. 6).

Figure 6. Conflict resolution by introducing a reference route in the graph

Thus, subgraphs can only join the reference route through 1 edge, and the other edges are excluded.

This raises the following problem : how can we find such a route? — We will assume that the shortest route between the two most distant vertices in the graph, one of which is the closing one, is the reference route (we will soon add an update so that the user can set this route if they have knowledge of which river is the main one, but in the absence of such information, the reference route is determined by this way). This route can be obtained using the A* (A-star) algorithm, but this algorithm works with a weighted graph, and there are no weights on the edges of our graph yet. But we can set weights for the edges of the graph based on the segment lengths (we calculated them earlier).

  • Assigning weights to the graph edges based on the lengths of segments. Simultaneously with this stage, one component is selected in the graph. The movement of the column is carried out using a breadth-first search. The assignment of weights can be demonstrated by the following figure (Fig. 7)
Figure 7. Assigning weights to graph edges

Thus, each vertex of the graph has the “length” attribute, which indicates the length of this segment of the river in meters. We also move attribute values from the graph vertices to the edges iteratively, starting to traverse the graph using BFS from the closing vertex.

This task is performed by the following function, where are

  • G — graph
  • start — the vertex from which to start traversal
  • dataframe — pandas dataframe with 2 columns: id_field (segment/vertex ID) and ‘length’ (the length of this segment)
  • id_field — the field in the dataframe to use for mapping IDs to graph vertexes
  • main_id — index for the main river in the river network (default = None)
  • A* (A-star) search for the shortest path on a weighted graph between the closing vertex and one of the most distant vertices (a segment in the river network). This shortest route between the two most distant vertices in the graph is called “reference route”;
  • Ranking by distance of all vertexes in the reference route. Vertex, from where we start the traversal, the value is assigned rank 1, the next vertex is 2, then — 3, etc.
  • Iterative traversal of a graph with the beginning at the vertices of the reference route with isolation of the considered subgraphs. If one of the graph branches already has a connection to the vertices of the reference route, the edges that link the subgraph to other reference vertices are removed.

You can see the source code here

  • G — graph
  • start — the vertex from which to start traversal
  • last_vertex — one of the farthest vertices from the closing one in the graph

You can see a demonstration of the described algorithm in the animation below.

Thus, the reference route can be considered as the main river in the river network, when all other segments are tributaries to the main one.

Moreover, this approach allows you to achieve good results when ranking rivers in “difficult” places, such as the mouth, where some branches first depart from the main river, and then, gaining new tributaries, flow into the main river again.

Assigning the “value” and “distance” attributes

So, on the graph, all vertices are assigned the values of the “rank” and “offspring” attributes.

If the vertex has no offspring, it means that no tributaries flow into this segment of the river network. Therefore, this vertex must be assigned the value “value” — 1. Then, for each node that has descendants (the rank of the descendants is always 1 less than the rank of the considering vertex) with the value “value” equal to 1, we need to count the number of descendants. The sum of “value” of all descendants of the considering vertex — the “ value” for considering vertex. Then, this procedure is repeated for other ranks.

Thus, we iteratively move to the closing vertex.

At the same time as assigning the “value” attribute to the graph vertices, the “distance” attribute is assigned, which characterizes the distance of segments from the mouth not by the number of segments that must be overcome to reach the closing one, but by the distance in meters that will need to be overcome to reach the river mouth.

The result of using the algorithm can be seen in figure 8. the Distance of river network segments is shown based on the “rank” attribute and the total number of tributaries is shown based on the “value” attribute.

Figure 8. Results of using the algorithm (OpenStreetMap data, the Distance of river network segments is shown based on the “rank” attribute, the Total number of flowing tributaries is shown based on the “value” attribute”)

Conclusion

As you can see, the presented algorithm allows you to rank rivers without using additional information, such as digital elevation model. The obligatory input parameter, except for the main layer of the river network, is the position of the closing segment (where the river flows into the sea or lake), which can be specified as a point.

Thus, with the minimum amount of required input data, it is possible to obtain structured derived information that characterizes the elements of the river basin using the implemented algorithm.

An open implementation of the algorithm in Python, as well as a plugin for QGIS, can be used by anyone. All processing is carried out by one thread, the user does not need to run all the functions separately.

We are glad to answer your questions and see your comments.

Repository with the code:

Feel free to contact us by e-mail:

--

--