Europe From Above: Space-Time Machine Learning Reveals Our Changing Environment

OpenGeoHub · Published in The Startup · Mar 1, 2021

Prepared by: Leandro Parente (OpenGeoHub), Martijn Witjes (OpenGeoHub), Tom Hengl (OpenGeoHub), Codrina Maria Ilie (Terrasigna) and Martin Landa (CVUT Prague)

Our harmonized land cover product based on spatiotemporal Machine Learning (left) and the existing CORINE land cover product (right).

OpenGeoHub, together with partners CVUT Prague, mundialis, Terrasigna, MultiOne and GiLAB, released the Open Data Science Europe data portal on 1 March 2021. It comprises terabytes of gridded data available as Cloud-Optimized GeoTIFFs and served via GeoServer. The project will collate, extend, harmonize, integrate, and distribute a wide range of freely available environmental and administrative data for Europe in the years to come. Read on to find out how to access and use this data and what its advantages and limitations are.

Layers currently available

The Open Data Science Europe data portal, hosted at https://ecodatacube.eu/, consists of environmental, land cover, terrain, climatic, soil, and vegetation layers covering the full extent of continental Europe at relatively fine spatial resolutions (30 m to 1 km). It was produced as part of the CEF Telecom project 2018-EU-IA-0095: “Geo-harmonizer: EU-wide automated mapping system for harmonization of Open Data based on FOSS4G and Machine Learning”, co-financed by the European Commission. Use it to visualize, download, serve, and share spatiotemporal datasets for the time period 2000–2020 and beyond.

Here are the layers you can presently discover via the Open Data Science Europe viewer:

  • Annual dominant land cover class based on CORINE (2000–2019);
  • Annual probabilities and uncertainties for 33 CORINE land cover classes (2000–2019);
  • Annual Landsat RGB composites for the summer season (June–September for 2000–2019);
  • Annual Landsat NDVI composite derived by season (winter, spring, summer and fall — 2000–2020);
  • Density of building areas according to OSM (2021) and Copernicus (impervious built-up — 2018);
  • Density of commercial, industrial and residential buildings according to OSM (2021);
  • Harmonized protected areas derived from OSM (2021) and NATURA2000 (2019);
  • Harmonized administrative areas (county-level) derived from OSM (2021) and NUTS (2021);
  • Probabilities of 18 OSM land cover classes (2021);
  • Density of highways and railways according to OSM (2021);
  • Monthly particulate matter (PM2.5) predictions (2018).

Annual Land Cover Product (2000–2019)

How were the maps available in the data portal generated? The annual land cover product for continental Europe was generated using Ensemble Machine Learning (EML), which used training samples obtained from other European Commission-funded projects, such as LUCAS (Land Use and Coverage Area frame Survey) and the Copernicus Land Monitoring Service, together with several harmonized raster layers (e.g. GLAD Landsat ARD imagery and the Continental EU DTM), to predict the dominant land cover, probabilities and uncertainties for 33 classes compatible with CLC (CORINE Land Cover) over 20 years (2000–2019). The workflow, presented in Fig. 1, was implemented in R and Python and is publicly available, under an Apache-2.0 license, through the eumap library.

Figure 1. Workflow to generate the annual land cover product (2000–2020) available in the Open Data Science Europe platform.

We first downloaded the Landsat ARD provided by GLAD (Potapov et al., 2020) for the years 1999 to 2020 and for the entire extent of continental Europe (the land mask area and tiling system used by the project). This imagery archive was screened to reduce cloud cover, using the quality assessment provided by GLAD as reference, and aggregated by season according to three quantiles (25th, 50th and 75th percentiles). The images for each season were selected using the same calendar dates for all periods (a minimal sketch of the per-season aggregation follows the list below):

  • Winter: December 2 of previous year until March 20 of current year,
  • Spring: March 21 until June 24 of current year,
  • Summer: June 25 until September 12 of current year,
  • Fall: September 13 until December 1 of current year.
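
As an illustration only (not the project's actual processing code), the per-season quantile aggregation can be expressed in a few lines of Python; the array name and shape below are assumptions:

import numpy as np

# `scenes` is a hypothetical stack of cloud-screened Landsat observations for
# one season and one band, shaped (n_scenes, height, width), with NaN where
# pixels were masked out by the GLAD quality assessment.
def seasonal_quantiles(scenes, quantiles=(25, 50, 75)):
    # Per-pixel 25th/50th/75th percentile composite for the season.
    return np.nanpercentile(scenes, quantiles, axis=0)  # shape (3, height, width)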

This approach produced 84 images (3 quantiles ✕ 4 seasons ✕ 7 Landsat bands) for each year, with gaps/nodata wherever all observations of a specific season were contaminated by clouds. To overcome this problem, a per-pixel approach was implemented to fill all the gaps/nodata, using as the new value the median of all observations available in different time windows, prioritizing observations from (1) the same season, (2) neighboring seasons and (3) the whole year. This gap-filling approach, called the Temporal Moving Window Median (TMWM) and publicly available in the eumap library, was used to produce the Landsat temporal composites used in the ML training and prediction steps.
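
The following is a heavily simplified sketch of the TMWM idea (the real implementation in eumap works with configurable time windows and is considerably more involved); the array layout is an assumption:

import numpy as np

def tmwm_fill_sketch(stack, season):
    """Fill gaps (NaN) for one season using per-pixel medians from progressively
    wider pools of observations: (1) the same season in other years,
    (2) the neighbouring seasons, (3) the whole archive.
    `stack` is a hypothetical array shaped (n_years, n_seasons, height, width)."""
    n_seasons = stack.shape[1]
    pools = [
        stack[:, season],                                              # same season
        stack[:, [(season - 1) % n_seasons, (season + 1) % n_seasons]]
            .reshape(-1, *stack.shape[2:]),                            # neighbouring seasons
        stack.reshape(-1, *stack.shape[2:]),                           # whole archive
    ]
    filled = stack[:, season].copy()
    for pool in pools:
        gaps = np.isnan(filled)
        if not gaps.any():
            break
        median = np.nanmedian(pool, axis=0)                            # per-pixel median
        filled = np.where(gaps, median[np.newaxis, ...], filled)       # fill remaining gaps
    return filled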

In addition to the Landsat data, we also used night lights (VIIRS/Suomi NPP), global surface water frequency (Pekel et al., 2016) and Continental EU DTM layers (i.e. elevation, slope) to help improve the predictive models. The Landsat spectral indices (SAVI, NDVI, NBR, NBR2, REI and NDWI) and the minimum/maximum monthly geometric temperatures, estimated per pixel and per month following Kilibarda et al. (2014) from latitude and elevation, were calculated on the fly during the prediction step, avoiding the overhead of maintaining this large amount of data on storage media.
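
For example, two of these indices can be computed per pixel from the red and near-infrared composites at prediction time; a minimal sketch (the band arrays and the small epsilon guard are our own assumptions):

import numpy as np

def ndvi(nir, red, eps=1e-6):
    # Normalized Difference Vegetation Index.
    return (nir - red) / (nir + red + eps)

def savi(nir, red, L=0.5):
    # Soil Adjusted Vegetation Index with the common soil-brightness factor L.
    return (1.0 + L) * (nir - red) / (nir + red + L)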

The training data were obtained from the geographic locations of LUCAS points (in-situ source) and the centroids of all CORINE land cover polygons, harmonized according to the 33 land cover classes (Table 1) and organized by year, where each unique combination of longitude, latitude and year was considered an independent sample, resulting in more than 7 million training points. LUCAS points with a unique land-cover class received a confidence rating of 100%, while CORINE points received 85%; these confidence weights were considered in the ML training step. The points were used in a space-time overlay approach, which considered both the location and the year to retrieve the pixel values of all covariates/features (e.g. for a sample from 2018 the values were retrieved from Landsat composites of the same year; see the eumap code demonstration). Some specific land cover samples (i.e. 111, 122, 131, 141, 211, 221, 222, 223, 231, 311, 312, 321, 411, 512) were screened according to their convergence with pre-existing mapping products (OSM roads, railways and buildings; Copernicus high resolution layers); for example, “111: Urban fabric” samples located in low-density building areas (> 50% according to the Copernicus-OSM building layer) were removed from the final training data. These steps produced a classification matrix with ~5.3 million samples and 178 covariates/features (including on-the-fly raster data).
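
A minimal sketch of the space-time overlay idea (not the eumap API; the column names and path pattern are hypothetical, and coordinates are assumed to be in the raster CRS):

import pandas as pd
import rasterio

def spacetime_overlay(points, path_template):
    """For every (x, y, year) training point, read the pixel value from the
    covariate raster of the matching year, e.g.
    path_template = 'ndvi_summer_{year}.tif' (hypothetical file pattern)."""
    values = pd.Series(index=points.index, dtype=float)
    for year, group in points.groupby("year"):
        with rasterio.open(path_template.format(year=year)) as src:
            coords = list(zip(group["x"], group["y"]))
            values.loc[group.index] = [v[0] for v in src.sample(coords)]
    return values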

Table 1: Harmonized LUCAS and CORINE land cover classes (33) used as reference data to train and predict the annual land cover maps (2000–2020).

Using this classification matrix we conducted hyperparameter optimization for three ML models (Random Forest: Breiman, 2001; XGBoost: Chen & Guestrin, 2016; Artificial Neural Network: Mishra & Srivastava, 2014), minimizing the log_loss metric derived from a 5-fold spatial cross-validation based on a 30✕30 km tiling system. The best hyperparameters were used to train models that predict the land-cover probabilities, which served as input to train a linear meta-learner (a logistic regression classifier: Defazio & Bach, 2014) responsible for predicting the final probabilities of all land cover classes (see the eumap code demonstration). The uncertainties were calculated for the 33 land cover classes as the standard deviation of the three predicted probabilities for each pixel. The class with the highest probability was selected as the dominant land cover class, resulting in 20 annual maps for continental Europe. All output layers (20 dominant land cover class maps, 660 per-class probabilities, and 660 per-class uncertainties) were organized as Cloud-Optimized GeoTIFF (COG) files and are publicly available through the Open Data Science Europe platform and the S3 Cloud Object Service (see the short video tutorial and the eumap code demonstration).
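
As an illustration of this ensemble setup (a sketch using scikit-learn and xgboost rather than the actual eumap implementation; the synthetic matrix and tile groups are placeholders):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

# Placeholder classification matrix and spatial blocks (tiles).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10, n_classes=5)
tiles = np.random.randint(0, 25, size=len(y))  # stand-in for the 30x30 km tiling system

eml = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100)),
        ("xgb", XGBClassifier(eval_metric="mlogloss")),
        ("ann", MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # linear meta-learner
    stack_method="predict_proba",                       # meta-learner sees base-model probabilities
)

# 5-fold spatially blocked cross-validation, scored with log loss.
scores = cross_val_score(eml, X, y, groups=tiles, cv=GroupKFold(n_splits=5),
                         scoring="neg_log_loss")
print(-scores.mean())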

A publication describing all processing steps in detail, together with a general analysis of land-cover changes in continental Europe, is in preparation. In the coming months we will evaluate this first version of the product, conduct an accuracy assessment and compile a list of improvements/fixes based on user feedback.

Harmonized OSM Product (2021)

The harmonized OpenStreetMap product comprises roads, railways, buildings, land use, administrative boundaries and protected nature areas for continental Europe. We extracted data from the vector OSM layers provided by Geofabrik GmbH and integrated them with Copernicus high resolution layers, NATURA2000 sites and NUTS (Nomenclature of Territorial Units for Statistics). The buildings, land use and protected areas layers were categorized according to the following steps (a minimal sketch follows the list):

  1. Extract all possible values of the most descriptive variable in the OSM data;
  2. Sum the number of occurrences and the total area covered by objects of each type over all countries;
  3. Assign each type to a category until at least 99% of objects are categorized.
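
A minimal sketch of this categorization logic (the tag values, column names and toy numbers are illustrative only):

import pandas as pd

# One row per OSM feature with its most descriptive tag value and its area (m2).
objects = pd.DataFrame({
    "tag":  ["residential", "farmland", "forest", "residential", "meadow", "quarry"],
    "area": [1200.0, 50000.0, 80000.0, 900.0, 30000.0, 4000.0],
})

# Steps 1-2: sum occurrences and total area per tag value over all countries.
stats = (objects.groupby("tag")
                .agg(count=("area", "size"), area=("area", "sum"))
                .sort_values("count", ascending=False))

# Step 3: keep assigning tag values to categories until at least 99% of objects are covered.
stats["cum_share"] = stats["count"].cumsum() / stats["count"].sum()
to_categorize = stats[stats["cum_share"].shift(fill_value=0) < 0.99]
print(to_categorize)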

All layers (filtered and categorized OSM data) were converted to 10-m resolution rasters, where each pixel falling inside a vector feature received the value 100. These rasters were then averaged to 30-m spatial resolution to obtain a probability/density value.
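
A minimal sketch of the 10-m to 30-m averaging step (the rasterization itself is left out; the array is a toy example):

import numpy as np

def density_30m(mask10):
    """Average a 10-m mask (0 or 100 inside features) into 30-m density values
    by taking the mean of each 3x3 block."""
    h, w = mask10.shape
    blocks = mask10[: h - h % 3, : w - w % 3].reshape(h // 3, 3, w // 3, 3)
    return blocks.mean(axis=(1, 3))  # values between 0 and 100

mask10 = np.zeros((6, 6))
mask10[:3, :3] = 100              # one fully covered 30-m block
print(density_30m(mask10))        # [[100.   0.] [  0.   0.]]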

The list of layers added to the OSM data includes:

Transportation Infrastructure

  • OSM roads: Car-accessible roads and highways, excluding vector features not accessible by car (e.g. footpaths).
  • OSM Railways: All railways.

Building density

  • Copernicus-OSM buildings: All kinds of buildings according to OSM. To correct for missing data in poorly annotated regions, the resulting raster was integrated with the Copernicus Impervious Built-up (IBU) 2018 layer. Pixel values between 0–100 represent the OSM building density, while the inverted range 200–101 represents the IBU density (101 is the highest density value); see the decoding sketch after this list.
  • OSM residential buildings: Buildings that primarily serve as homes, such as houses and apartments.
  • OSM commercial buildings: Buildings where businesses sell products to consumers, such as hotels, retail stores, and entertainment venues.
  • OSM industrial buildings: Buildings that are primarily used for the production and storage/distribution of goods, such as factories, warehouses, and distribution centers.
  • NUTS-OSM administrative areas represent all the counties within the project region, delineated using NUTS level 3 and, where that is not available, OSM data at a corresponding level. OSM was used for the following countries and territories: Bosnia and Herzegovina; Kosovo; Montenegro; Andorra; San Marino; the Crown Dependencies (Isle of Man, Jersey and Guernsey); the Faroe Islands; and the Monastic State of the Holy Mountain (Athos). The resulting layer was rasterized according to hierarchized codes representing regions, countries and counties (all the codes are available here).
  • Natura2000-OSM protected areas is an integrated product combining different IUCN (International Union for Conservation of Nature) categories annotated in OSM with the three site types provided by NATURA2000, resulting in a raster layer with the following classes: IUCN 1a: Strict Nature Reserves; IUCN 1b: Wilderness Areas; IUCN 2: National Parks; IUCN 3: Natural Monuments or Features; IUCN 4: Habitat/Species Management Areas; IUCN 5: Protected Landscapes/Seascapes; IUCN 6: Protected Areas with Sustainable Use of Natural Resources; Other: an aggregation of protected areas that do not fall under the IUCN system; Category A: classified Special Protection Area (SPA); Category B: site under the Habitats Directive (pSCI, SCI or SAC); Category C: area of pSCI/SCI/SAC the same as the designated SPA; Overlap area: areas existing simultaneously in OSM/IUCN and NATURA2000.
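
For reference, a minimal, hypothetical decoding of the combined Copernicus-OSM building raster described above (0–100 encodes OSM density, 101–200 encodes the inverted IBU density):

def decode_building_density(pixel):
    """Decode a Copernicus-OSM building raster value into source and density,
    assuming the encoding described above."""
    if pixel <= 100:
        return {"source": "OSM", "density": pixel}
    return {"source": "IBU", "density": 201 - pixel}  # 101 -> 100 (highest), 200 -> 1

print(decode_building_density(42))    # {'source': 'OSM', 'density': 42}
print(decode_building_density(101))   # {'source': 'IBU', 'density': 100}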

OSM Land Use consists of 16 rasters based on commonly used categories in the OSM land use labeling system. Some classes were rasterized directly, while others were grouped into categories:

  • Cemeteries: Places for burials.
  • Construction sites: Sites that are under active development regarding the construction of buildings or structures.
  • Dump sites: Sites for permanent or long-term storage of waste materials.
  • Farmland: Areas where crops or flowers are grown, either by businesses or local residents.
  • Farmyard: Parts of farms where no crops are grown.
  • Forest: Managed forest or woodland plantation.
  • Grass: Areas of mown and managed grass.
  • Greenhouse: Areas used for growing plants in greenhouses.
  • Harbours: Coastal areas where commercial and civilian traffic is handled.
  • Meadows: Areas primarily vegetated by grass and other non-woody plants, mainly used for hay or for grazing animals.
  • Military: Areas used for any military purpose, such as airfields, training areas, and bases.
  • Orchards: Areas containing intentionally planted trees or shrubs maintained for food production.
  • Quarries: Areas used for surface mineral extraction.
  • Reservoirs: Man-made bodies of stored water, and areas artificially graded to hold water (basins).
  • Salt: Areas where sea water is evaporated to extract its salt.
  • Vineyards: Areas where grapes are grown.

Figure 2: Natura2000-OSM areas as compared to predicted land cover classes of natural vegetation.

Open Data

OpenDataScience.eu is an Open Data project drawing inspiration from the OpenLandMap.org and OpenStreetMap.org projects. If not specified otherwise, the data available on this portal is licensed under the Open Data Commons Open Database License (ODbL) and/or the Creative Commons Attribution-ShareAlike 4.0 license and/or the Creative Commons Attribution 4.0 International license (CC BY). This means that you are free to use OpenDataScience.eu data for any purpose as long as you credit OpenDataScience.eu and its contributors. If the data is available under the CC BY-SA license, this implies that if you alter or build upon the data in certain ways, you may distribute the result only under the same license.

Data access and download

Accessing data through a file service (Cloud-Optimized GeoTIFFs): the majority of the layers available on https://ecodatacube.eu/ are also available seamlessly through our S3 Cloud Object Service as Cloud-Optimized GeoTIFFs under the Open Data Commons Open Database License (ODbL). This means that you can (a) visualize the data and run processing directly using QGIS or similar (see the short video tutorial), and (b) import, subset, crop and overlay parts of the data for a local area. We do not, however, recommend downloading whole datasets via Wasabi.com. To download complete datasets (the whole of the European continent) you should use zenodo.org or similar. The data is currently being uploaded to public data repositories.

Vector data can be imported into QGIS in a standard way through the WFS service using the URL:

Raster layers (see the complete list) can be accessed through the HTTP service using QGIS (see also the COG tutorial). These are examples of layers:

The file naming convention is explained in detail here. Download the style file (QML) here.

GDAL Access

The gdal_translate functionality can be used to download data directly from a COG URL. You just need to pass the bounding box via the “-projwin” parameter and supply the COG URL prefixed with “/vsicurl/”:

gdal_translate -co COMPRESS=LZW -projwin 3949019.319534085 3274684.0278522763 3997655.528969183 3247994.5591947753 /vsicurl/http://s3.eu-central-1.wasabisys.com/eumap/lcv/lcv_landcover.hcl_lucas.corine.rf_p_30m_0..0cm_2019_eumap_epsg3035_v0.1.tif lcv_landcover_amsterdam.tif

Python Access

To query the data for any coordinate/point in the EU, check this Jupyter notebook tutorial.
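
As a quick example, a single pixel can also be read directly from the cloud with rasterio; the URL is the same COG used in the GDAL example above, and the coordinate is an arbitrary point inside that Amsterdam bounding box (an assumption for illustration):

import rasterio

url = ("/vsicurl/http://s3.eu-central-1.wasabisys.com/eumap/lcv/"
       "lcv_landcover.hcl_lucas.corine.rf_p_30m_0..0cm_2019_eumap_epsg3035_v0.1.tif")
point = (3970000, 3260000)  # EPSG:3035 coordinates, assumed example location
with rasterio.open(url) as src:
    value = next(src.sample([point]))[0]
    print(value)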

R Access

Data can be accessed from R using the terra package. From there you can use any native operation, e.g. cropping to a polygon or resampling / aggregating values.

Summary points

We have already produced about 10 TB of data representing land cover and land use for Europe at a relatively fine spatial resolution of 30 m. Note that spatiotemporal Machine Learning produces limited accuracy, and many of the patterns visible in the maps are more a result of classifier uncertainty and noise in the Landsat images than of actual changes in land cover / land use. This is the main criticism we have received of the predictions so far. On the other hand, spatiotemporal Machine Learning offers multiple advantages:

  • A single spacetime model is used to model phenomena such as land cover — making it a holistic approach to data harmonization and prediction;
  • By using LUCAS points we base modeling and predictions on a consistent and quality-controlled data set that allows for unbiased assessment of land cover dynamics;
  • In principle, we can also use the existing model to predict land cover for the years 2020 and 2021; no extra sampling / training points are needed, i.e. preparing Landsat images for these periods would likely be enough;
  • All processes shown above are fully automated, hence we can continue improving the models and re-running predictions;

If you are aware of a similar product, and/or if you discover an efficient way to reduce uncertainty and remove problems in the output predictions, i.e. to make them more consistent and to remove highly unlikely sequences in land cover (e.g. a pixel changing from forest to cropland and back again within a few years), please contact us and join the production group.

We will be using a similar framework for mapping potential natural vegetation, forest tree species and dynamic environmental properties in the years to come. If you are interested in this work and would like to follow a dedicated training on the topic of spatiotemporal Machine Learning, please join us for a 5-day workshop in September 2021: https://opendatascience.eu/workshop-2021/

If you experience any technical problems or if you discover a bug in the Open Data Science Europe data portal, please report it here.
