Open Data and Urban Trees: a case study by nam.R

Published in

namR

8 min read · Jun 20, 2019

Who is nam.R?

namR is a French startup that provides public and private organizations with non-personal data to create or accelerate projects in the fields of energy efficiency, renewable energy and risk assessment. Our data science team works with structured data, text, and aerial and spatial imagery, coming from open data sources or from one of our many partners, in order to build a homogeneous and normalized database referential. Based on this referential, nam.R delivers data and solutions to its clients through different platforms.

Why trees?

namR’s database contains information about buildings, companies, roads and even agricultural parcels. The dataset of trees managed by public actors in France is one example of the elements integrated into our referential.

Moreover, in March 2019, nam.R won a call for proposals from the French Ministry of Ecology. The project is led by nam.R in partnership with Pouget Consultants, CEREMA, LMD, Institut Louis Bachelier and Comité 21, with the support of the Hauts-de-France region and the French Sustainable Building Plan. It is called tRees, which stands for TRansition Ecologique des Etablissements Scolaires (French for "Ecological Transition for Schools"). This project should help schools in the Hauts-de-France region identify the renovation work that should be undertaken in their buildings.

We thought it was appropriate to celebrate this success with a bit of wordplay and share our database on urban trees as open data (it can be downloaded using the link below).

Arbres en open data en France — par nam.R — data.gouv.fr

This dataset covers all urban trees…

www.data.gouv.fr

A very illustrative example of open data cleansing

Through this publication, we intend to share our experience of working with French open data, and to explain the major difficulties we faced and the solutions we came up with.

Urban Trees around Champs-Elysées in Paris

The first difficulty is identifying open data sources. Currently, there is no single French database listing all the trees managed by public institutions in France. However, numerous public organizations have published datasets on the trees they manage on their own open data portals. Best sellers like the corporate registry and land use databases are very useful but incomplete. That is why nam.R developed its own software to explore open data: the Data Library. This tool is meant to improve and accelerate data sourcing. It crawls all the open data portals that we have identified in France (~500 portals) and retrieves both the data and its associated metadata. Mining operations then generate further standardized metadata on the data stored in the library. We searched this database, which holds metadata for more than 500k files, for datasets containing open information on trees in France.

This way, we identified more than 30 relevant datasets. Some of them did not meet basic requirements, such as including each tree’s position. Ultimately, only 25 of them were suitable; they contained the location of each tree and sometimes its genus, height, width, year of planting and so on.

*Most popular urban tree genera in France (from left to right): (1) Platanus occidentalis, (2) Acer saccharum, (3) Tilia tomentosa*

Even after this filtering, we had to do substantial work to normalize the information. For instance, height was delivered in a wide variety of forms: categories, integers and decimal numbers. The variability was even higher for tree species. Interestingly, we found that most of the data sources contained information about the trees’ genus (a taxonomic rank that sits between species and family in the phylogenetic classification). However, we faced several problems in processing this information:

level of precision: some datasets contained the trees’ species while others were limited to their genus (which is a higher, less specific rank in the phylogenetic classification).
language: some genera were given in French and others in Latin.
uncertainty about the classification used for the trees: classifying species is still an active area of research in biology, and different versions of the phylogenetic tree exist.

We chose to adopt the most widely used classification (according to Wikipedia): APG III. We then mapped the different species using a dictionary that translates French to Latin and vice versa. We also chose to reduce the level of detail to the trees’ genera because we often lacked details on species. Additionally, the many spelling variants (e.g. Chene, Chêne, Chênes…) and numerous mistakes (e.g. arbustus → arbutus) in the datasets made this work even harder.
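To make the genus normalization concrete, here is a minimal Python sketch of the kind of cleaning described above: stripping accents, handling simple plurals, translating French to Latin and absorbing typos. It is not our production code. The dictionary `FRENCH_TO_LATIN`, the function names and the similarity cutoff of 0.8 are all assumptions chosen for illustration, and a real dictionary would be far larger.

```python
import difflib
import unicodedata

# Hypothetical French-to-Latin genus dictionary (illustration only).
FRENCH_TO_LATIN = {
    "chene": "Quercus",
    "platane": "Platanus",
    "erable": "Acer",
    "tilleul": "Tilia",
    "arbousier": "Arbutus",
}
KNOWN_GENERA = sorted(set(FRENCH_TO_LATIN.values()))


def strip_accents(text):
    """Remove diacritics, e.g. 'Chêne' -> 'Chene'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_genus(raw):
    """Map a free-text tree name to a Latin genus, or None if unrecognized."""
    if not raw or not raw.strip():
        return None
    # Keep only the first word: 'Platanus acerifolia' (species) -> 'platanus'.
    first_word = strip_accents(raw).strip().lower().split()[0]
    candidates = [first_word]
    if first_word.endswith("s"):
        candidates.append(first_word[:-1])  # naive plural: 'chenes' -> 'chene'
    # 1. French name found in the dictionary.
    for word in candidates:
        if word in FRENCH_TO_LATIN:
            return FRENCH_TO_LATIN[word]
    # 2. Already a known Latin genus.
    for word in candidates:
        if word.capitalize() in KNOWN_GENERA:
            return word.capitalize()
    # 3. Fuzzy match to absorb typos such as 'arbustus' (cutoff is an assumption).
    close = difflib.get_close_matches(
        first_word.capitalize(), KNOWN_GENERA, n=1, cutoff=0.8
    )
    return close[0] if close else None
```

With this sketch, `Chene`, `Chêne` and `Chênes` all resolve to the same genus, and the misspelling `arbustus` is matched to `Arbutus`. The fuzzy step is the riskiest one: a cutoff that is too low would silently merge distinct genera, so unmatched names are better returned as `None` and reviewed by hand.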

Tree Dataset Analysis

When integrating the tree data coming from multiple sources, we faced the issue of counting the same tree multiple times, as different datasets may overlap in some regions. For instance, the Île-de-France region, the Grand Paris Sud and the OpenStreetMap datasets all cover departments in the southern Paris area. As one could expect, the tree identification label is not the same across these sources and the positions are not precise enough to match exactly, so a direct comparison is not possible. Our solution was to filter out trees that are too close together based on a fixed distance threshold, assuming they represent the same tree.
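The idea of threshold-based deduplication can be sketched in a few lines of plain Python. This is only an illustration and not the database-side approach we actually use: it swaps in a simple grid (spatial hash) so that each tree is compared only with nearby trees. It assumes coordinates are in a projected system expressed in meters (for example Lambert-93), and the function name and the threshold value used in the example are hypothetical.

```python
import math
from collections import defaultdict


def deduplicate(points, threshold):
    """Keep a point only if no already-kept point lies within `threshold`.

    `points` is a list of (x, y) tuples in meters. Points are processed in
    order, so trees from earlier sources take precedence over later ones.
    """
    grid = defaultdict(list)  # (cell_x, cell_y) -> kept points in that cell
    kept = []
    for x, y in points:
        cx, cy = math.floor(x / threshold), math.floor(y / threshold)
        # With cells of side `threshold`, any point within `threshold`
        # must sit in the same cell or in one of its 8 neighbours.
        neighbours = (
            p
            for dx in (-1, 0, 1)
            for dy in (-1, 0, 1)
            for p in grid.get((cx + dx, cy + dy), ())
        )
        if any(math.hypot(x - px, y - py) <= threshold for px, py in neighbours):
            continue  # assumed to be the same tree as one already kept
        grid[(cx, cy)].append((x, y))
        kept.append((x, y))
    return kept
```

For example, `deduplicate([(0.0, 0.0), (0.5, 0.5), (10.0, 10.0)], threshold=1.0)` keeps the first and third points and drops the second. The result depends on the order of the input, so the most trusted source should come first.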

Urban Trees around Trocadero and Champ-de-Mars (Paris)

At namR, we make extensive use of PostGIS, an open source extension that adds support for geographic objects to the PostgreSQL object-relational database.

The PostGIS function ST_DWithin, which checks whether the distance between two geometries (point, line or polygon) is within a given value, is well suited to our application, as we only need to find trees whose nearest neighbor is closer than a fixed threshold. At first sight, this operation might seem very costly, as one would need to compute the distance between every pair of trees (for a database of 2 million entries, that would be on the order of 4x10¹² operations). This is where a spatial index on the geometry becomes very useful. As mentioned in the PostGIS documentation, “without indexing, any search for a feature would require a sequential scan of every record in the database. Indexing speeds up searching by organizing the data into a search tree which can be quickly traversed to find a particular record.”

Tree density in Paris (left), Lyon (middle) and Bordeaux (right). For visualization purposes, the scale is limited to 500 trees per IRIS.

Indexing the geometry is useful not only when we want to calculate distances but for all types of geometric comparisons. One particularity of PostGIS spatial indexing is that the bounding box of the geometry is indexed instead of the geometry itself (for a more technical discussion, look up the R-tree algorithm on your favorite search engine, such as DuckDuckGo or Qwant).
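Why index bounding boxes rather than geometries? A box comparison costs a handful of number comparisons, so it can cheaply rule out most candidates before any exact computation. The Python sketch below illustrates this two-phase "filter, then refine" idea for a point and a polyline (say, a tree and a road). It is a simplified illustration with hypothetical function names, not how PostGIS is implemented: a real R-tree additionally organizes the boxes in a hierarchy so that whole groups can be skipped at once.

```python
import math


def bounding_box(line):
    """Axis-aligned bounding box (min_x, min_y, max_x, max_y) of a polyline."""
    xs = [x for x, _ in line]
    ys = [y for _, y in line]
    return min(xs), min(ys), max(xs), max(ys)


def point_segment_distance(p, a, b):
    """Exact Euclidean distance from point p to segment ab."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:  # degenerate segment: a and b coincide
        return math.hypot(px - ax, py - ay)
    # Projection of p onto the segment, clamped to stay between a and b.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def within_distance(point, line, bbox, d):
    """Two-phase test: cheap bounding-box rejection, then exact distance."""
    min_x, min_y, max_x, max_y = bbox
    x, y = point
    # Phase 1 (filter): outside the box grown by d means farther than d.
    if not (min_x - d <= x <= max_x + d and min_y - d <= y <= max_y + d):
        return False
    # Phase 2 (refine): exact distance to each segment of the polyline.
    return any(
        point_segment_distance(point, a, b) <= d
        for a, b in zip(line, line[1:])
    )
```

Note that passing the box test does not guarantee a match: a point can lie inside the box of an L-shaped road and still be far from the road itself, which is why the exact check is still needed for the candidates that survive the filter.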

In our previous post, 10 Essential Operations for Spatial Data in Python, we showed how to use Python’s SciPy library to organize geometries with a KD-Tree algorithm to speed up K-Nearest Neighbours operations.

Number of urban trees per IRIS (a division used in France for statistical purposes, corresponding to a neighborhood or a small city). For visualization purposes, the scale is limited to 100 trees per IRIS. We emphasize that this visualization is not a representation of the reality of urban trees, but rather of the state of open data on this topic.

After excluding the duplicated entries, we obtained a little more than 2 million trees distributed across all 96 departments of metropolitan France. It is worth noting that, before the addition of the OpenStreetMap dataset, only 30 departments were represented. As shown in the map above, the data is clustered around regions corresponding to big cities, especially around the Paris area.

Besides the coordinates and municipality information of each tree, we also have attributes such as the height category and the tree genus.

*Most popular urban tree genera in France*

With respect to the tree genus classification, only a quarter of the data have this attribute, and coverage depends on the source. More than 250 different genera are represented, the most common being Platanus, Acer and Tilia, as illustrated above.

Finally, our dataset also contains tree height data when this information is present in the source. Urban trees tend to have a height lower than 30 meters with a median value around 8 meters. Naturally, different tree genera present different size ranges, with Prunus being one of the smallest trees widely found in France (and Platanus one of the biggest).

Working with data from multiple sources has some limitations, especially when information on data quality is not available. This becomes apparent when we find that some trees in the dataset have a height of over 1000 meters and others have null or even negative values. To make data quality assessable, every dataset we create (including this one) has a column containing the confidence of the data. We have, for example, downgraded confidence to the lowest level for every tree taller than 50 meters or with a null value. Our previous post, Why data quality matters?, explains the confidence system used at namR and why it is important.
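As an illustration of the height rule just described, here is a minimal Python sketch. The 50-meter limit comes from the text above; the numeric confidence scale, the constant names and the choice to treat missing, zero and negative heights alike are assumptions made for this example and do not describe the actual namR confidence system.

```python
# Hypothetical confidence scale (illustration only).
LOWEST_CONFIDENCE = 0
DEFAULT_CONFIDENCE = 2

# Height limit quoted in the article, in meters.
MAX_PLAUSIBLE_HEIGHT_M = 50.0


def height_confidence(height_m):
    """Return a confidence level for a tree height given in meters."""
    # Assumption: missing, zero and negative heights are all implausible.
    if height_m is None or height_m <= 0:
        return LOWEST_CONFIDENCE
    if height_m > MAX_PLAUSIBLE_HEIGHT_M:
        return LOWEST_CONFIDENCE
    return DEFAULT_CONFIDENCE
```

Keeping the suspicious value and lowering its confidence, rather than deleting the row, preserves the tree's position, which may still be correct even when its height is not.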

Urban trees tend to have a height lower than 30 meters with a median value around 8 meters. The peaks we see in the distribution (at 10, 15, 20 and 25 meters) are related to a bias in the way different sources provide the data, as it is easier to report rounded values.
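One simple way to quantify this rounding bias is to measure the share of heights that fall exactly on a multiple of 5 meters. The Python sketch below is a hypothetical helper written for this post, not part of our pipeline. As a rough reference point, if whole-meter heights were spread evenly, about one value in five would land on a multiple of 5.

```python
def share_on_multiples(heights, step=5):
    """Fraction of valid heights that fall exactly on a multiple of `step` meters."""
    # Ignore missing and non-positive heights, which are data errors.
    valid = [h for h in heights if h is not None and h > 0]
    if not valid:
        return 0.0
    on_multiple = sum(1 for h in valid if h % step == 0)
    return on_multiple / len(valid)
```

Computing this share source by source would show which providers report rounded or estimated heights rather than measured ones.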

Conclusion

Overall, the creation of the “trees of France” referential is an excellent case study of the particular challenges of working with open data. Even though datasets are openly available in usable formats, it is still very tricky to cleanse them and merge them into a normalized table.

This dataset can be used in various applications such as air quality management, urban vegetation, tourism (to find very rare species) or road management (knowing which road is close to trees can be useful whenever the wind blows a lot!).

This dataset can also serve as a source of labels for data scientists aiming to develop algorithms that detect trees in aerial images. At nam.R, we are convinced that open data is a very interesting source of labels for machine learning algorithms (as we already discussed in this blog post: https://medium.com/nam-r/deep-learning-for-roof-detection-in-aerial-images-in-3-minutes-8d845f6d7f25).

If you found this article interesting, do not hesitate to visit our website (https://namr.com) and look for open positions in our company (https://www.welcometothejungle.co/fr/companies/nam-r)!

Written by namR