Geospatial Open Data. What’s it like?

Mykola Kozyr

Published in Aspectum

5 min read · Sep 11, 2019

Analyzing ArcGIS Hub Open Data

I have always wondered what kind of data people usually work with. What could be a better way to describe data than by actually collecting and processing data about data?

It seems to me that there is no better motivation for researching and writing this article than limitless curiosity. However, I’m also sure that information from it will be valuable and helpful to those who take part in developing data-related software or in planning storage capabilities.

Data

I have collected information about all ArcGIS Hub items: Layers, Maps, Documents, and Images, 184406 objects altogether. The following information is available for each item:

name
owner
date created
type of object
number of features
link to the source
geographical extent
geometry type
attribute names and data types

In this article, I would like to present some of the most interesting insights discovered from the collected data. If you would like to dig deeper and explore it on your own, feel free to check out the whole dataset here.

For all of the visualizations below, except the geographical extent, I have used only the items whose dateType equals “Layer” (132151 items).
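The filtering step can be sketched roughly as follows. This is a minimal illustration, not the code used for the article: the sample records are made up, and it assumes each item is a dictionary carrying the dateType field mentioned above.

```python
# Hypothetical item records; only the "dateType" field name comes from the
# article, the names and values are made up for illustration.
items = [
    {"name": "Parcels", "dateType": "Layer"},
    {"name": "City basemap", "dateType": "Map"},
    {"name": "Zoning report", "dateType": "Document"},
    {"name": "Bus stops", "dateType": "Layer"},
]


def select_by_type(records, wanted, type_field="dateType"):
    """Return only the records whose type field equals `wanted`."""
    return [r for r in records if r.get(type_field) == wanted]


layers = select_by_type(items, "Layer")
print(len(layers), "of", len(items), "items are layers")
```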

The dataset was collected on July 16, 2019.

Rows

Probably the first thing to ask is how many features a typical dataset has. Here is the distribution of the number of features per dataset.
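Feature counts span many orders of magnitude, so one way to summarize such a distribution is to bucket datasets by the power of ten of their feature count. The sketch below is only an illustration under that assumption; the function name and the sample counts are hypothetical (apart from the largest value, which is the biggest dataset reported in this article).

```python
from collections import Counter


def magnitude_histogram(feature_counts):
    """Count datasets per order of magnitude of their feature count.

    Bucket k holds datasets with 10**k <= n < 10**(k + 1);
    datasets without features go into a separate "empty" bucket.
    """
    hist = Counter()
    for n in feature_counts:
        if n <= 0:
            hist["empty"] += 1
        else:
            # For a positive integer, (number of digits - 1) is floor(log10(n)).
            hist[len(str(int(n))) - 1] += 1
    return hist


# Hypothetical feature counts, one per dataset.
counts = [0, 7, 42, 350, 999, 1000, 125_192_184]
hist = magnitude_histogram(counts)
for bucket in sorted(k for k in hist if k != "empty"):
    print(f"10^{bucket} to 10^{bucket + 1}: {hist[bucket]} dataset(s)")
print("empty:", hist["empty"])
```

Counting digits instead of calling a floating-point logarithm keeps the bucket boundaries exact for integer counts.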

These are the top 5 datasets by number of features:

Microsoft Building Footprints-Features — 125 192 184
CalTRANS LaneClosures — 32 425 950
WisDOT Waze Traffic Jam History — 18 944 990
Recintos2016 — 16 056 367
Service Road — 15 112 592
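A ranking like the one above can be produced without sorting the whole collection. Here is a small sketch using the standard library; the five large counts are the ones listed above, while the two "Example" entries and the function name are hypothetical additions for illustration.

```python
import heapq

feature_counts = {
    "Microsoft Building Footprints-Features": 125_192_184,
    "CalTRANS LaneClosures": 32_425_950,
    "WisDOT Waze Traffic Jam History": 18_944_990,
    "Recintos2016": 16_056_367,
    "Service Road": 15_112_592,
    # Hypothetical smaller datasets, added only so the ranking has to choose.
    "Example bus stops": 1_204,
    "Example parcels": 88_310,
}


def top_n(counts, n=5):
    """Return the n (name, count) pairs with the largest counts, descending."""
    return heapq.nlargest(n, counts.items(), key=lambda kv: kv[1])


for name, count in top_n(feature_counts):
    print(f"{name}: {count:,}")
```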

Attributes

Let’s continue discovering insights by moving from information about rows to the information about columns. With no additional words — just questions I wanted to answer.

Question 1 — How many columns does a regular dataset have?

Question 2 — Which data types are the most common?
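Both of the questions so far come down to simple counting over the attribute metadata. The sketch below assumes each dataset's schema is available as a list of (attribute name, data type) pairs; the two datasets are made up, and the esriFieldType names are borrowed from the ArcGIS REST API as a plausible encoding rather than taken from the collected data.

```python
from collections import Counter
from statistics import median

# Hypothetical schemas: dataset name -> list of (attribute name, data type).
schemas = {
    "bus_stops": [
        ("OBJECTID", "esriFieldTypeOID"),
        ("name", "esriFieldTypeString"),
        ("routes", "esriFieldTypeInteger"),
    ],
    "parcels": [
        ("FID", "esriFieldTypeOID"),
        ("owner", "esriFieldTypeString"),
        ("area", "esriFieldTypeDouble"),
        ("zone", "esriFieldTypeString"),
    ],
}

# Question 1: number of columns per dataset.
columns_per_dataset = {name: len(fields) for name, fields in schemas.items()}

# Question 2: how often each data type occurs across all datasets.
type_counts = Counter(dtype for fields in schemas.values() for _, dtype in fields)

print("median columns per dataset:", median(columns_per_dataset.values()))
print("most common data types:", type_counts.most_common(3))
```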

Question 3 — What are the most popular attribute names?

A word cloud with the most common attribute names (only letters and numbers, lowercase)

It seems the platform itself adds some column names to the datasets that users upload (personally, I would never name a column ‘shapelength’). I decided to exclude them from the word cloud, so the following image shows the most popular column names except those in the stop list:

stop = ['objectid', 'fid', 'id', 'globalid', 'shape', 'shapelength', 'shapelen', 'shapestlength', 'shapearea', 'shapestarea']
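For the curious, the normalization and stop-list filtering can be sketched like this. It is my reading of the rule "only letters and numbers, lowercase", not necessarily the exact code behind the word cloud; the function names and the raw attribute names are hypothetical.

```python
import re
from collections import Counter

STOP = {'objectid', 'fid', 'id', 'globalid', 'shape', 'shapelength',
        'shapelen', 'shapestlength', 'shapearea', 'shapestarea'}


def normalize(name):
    """Lowercase the name and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def name_frequencies(attribute_names, stop=STOP):
    """Count normalized attribute names, skipping empty results and stop words."""
    normalized = (normalize(n) for n in attribute_names)
    return Counter(n for n in normalized if n and n not in stop)


# Hypothetical raw attribute names as they might appear in uploaded layers.
raw = ["OBJECTID", "Shape__Length", "Shape.STArea()", "Name", "NAME",
       "name_", "Address", "FID"]
freq = name_frequencies(raw)
print(freq.most_common())
```

Normalizing before filtering matters: variants such as "Shape__Length" only match the stop list once the punctuation is removed.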

Geometry Types

At this point, we are getting closer to the geospatial part.

My first hypothesis was that the proportions of geometry types in the data would change over time. The second was that point representations of data were much more popular than other geometry types.

As you can see below, I was totally wrong. The first chart shows normalized geometry types distribution over time.

Polygon geometries have been the most popular throughout. However, points and lines are quite popular as well.

There are some strange patterns at the beginning of the chart due to a small number of datasets, so the next chart shows the distribution based on the absolute number of datasets uploaded at each time interval.
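The two charts correspond to two views of the same grouping: absolute counts per time interval, and the same counts normalized so that each interval sums to 1. A minimal sketch, assuming yearly intervals and made-up (year, geometry type) records; the function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical (year created, geometry type) pairs, one per layer.
layers = [
    (2014, "Polygon"), (2014, "Point"),
    (2015, "Polygon"), (2015, "Polygon"), (2015, "Polyline"), (2015, "Point"),
]


def counts_by_year(records):
    """Absolute number of layers per geometry type for each year."""
    by_year = defaultdict(Counter)
    for year, geometry in records:
        by_year[year][geometry] += 1
    return by_year


def shares_by_year(records):
    """Same counts, normalized so that each year's shares sum to 1."""
    shares = {}
    for year, counter in counts_by_year(records).items():
        total = sum(counter.values())
        shares[year] = {g: c / total for g, c in counter.items()}
    return shares


absolute = counts_by_year(layers)
normalized = shares_by_year(layers)
for year in sorted(normalized):
    print(year, dict(absolute[year]), normalized[year])
```

With only two layers in the first year, a single upload moves a share by 50 percentage points, which is why the normalized view looks erratic when there are few datasets.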

Extent

Open data coverage is definitely one of the most interesting things to discover. In this article, I present only a visual representation of it, made by adding the layers’ extents to the map. Here you can see screenshots from Aspectum showing different regions of the global visualization of open data layer extents.
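To draw extents on a map, each bounding box has to become a polygon. The sketch below assumes an extent is stored as [[xmin, ymin], [xmax, ymax]] in longitude/latitude degrees, which may differ from how the collected dataset stores it. The function names, the tolerance, and the rough New Zealand box are all hypothetical.

```python
import json


def extent_to_polygon(extent):
    """Turn [[xmin, ymin], [xmax, ymax]] into a GeoJSON Polygon geometry.

    The ring is closed (first point repeated at the end) and ordered
    counterclockwise.
    """
    (xmin, ymin), (xmax, ymax) = extent
    ring = [[xmin, ymin], [xmax, ymin], [xmax, ymax], [xmin, ymax], [xmin, ymin]]
    return {"type": "Polygon", "coordinates": [ring]}


def is_global(extent, tolerance=1.0):
    """True if the extent spans (almost) the whole -180..180 longitude range."""
    (xmin, _), (xmax, _) = extent
    return xmin <= -180 + tolerance and xmax >= 180 - tolerance


# Hypothetical extents: the whole world and a rough box around New Zealand.
world_extent = [[-180.0, -90.0], [180.0, 90.0]]
nz_extent = [[166.0, -47.5], [179.0, -34.0]]

print(json.dumps(extent_to_polygon(nz_extent)))
print(is_global(world_extent), is_global(nz_extent))
```

Flagging global extents separately can be useful because they cover every region equally and say little about where a layer's data actually is.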

You may notice a big concentration of datasets in Western Europe. It is interesting that specific regions in Spain and France stand out clearly here. Unfortunately, Ukraine, Belarus, Moldova, and the Balkan countries do not contribute much to the Open Data Hub.

North America is definitely the champion in the number of open datasets on the portal. Just a quick reminder: the styling parameters are the same for all regions.

Specific regions in Taiwan and Japan are well-recognized.

New Zealand looks great here. Even though it lies close to the edge of the global datasets’ extent (the 180°/-180° longitude line), it is clear that a cluster of layers belongs to New Zealand.

Kenya turned out to be one of the global leaders in Open Data Hub, and it is well-recognized here.

Contributors

Last but not least: who has contributed to ArcGIS Open Data Hub the most? Here are the top 5 contributors in each category.

Layers:

SouthVancouverIslandProsperityProject — 6496
plfran_umich — 1330
ONSGeography_data — 1182
idecvingt — 1076
BIOS_Admin — 1046

Maps:

nicolas.lopez.cundina — 685
uneplive_team — 278
esrichinahk — 253
Esri_BeLux_Content — 252
NIED_CRS@bosai — 199

Images:

lucy.dixon — 270
sjanzen and mdimapdatacatalog — 210 each
PKabatha — 110
SouthVancouverIslandProsperityProject — 106

Documents:

PrefeituraRio — 3131
openMONCTONouvert — 2528
esrica_apps — 1213
LakeCounty_Illinois — 1052
cohendri_sarco — 998
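Rankings like these come from counting items per owner within each category. A minimal sketch, assuming every item record carries an owner and the dateType field used for filtering earlier; the function name, owners, and records below are made up.

```python
from collections import Counter, defaultdict

# Hypothetical item records; the owners and types are illustrative only.
items = [
    {"owner": "alice", "dateType": "Layer"},
    {"owner": "alice", "dateType": "Layer"},
    {"owner": "bob", "dateType": "Layer"},
    {"owner": "bob", "dateType": "Map"},
    {"owner": "carol", "dateType": "Map"},
    {"owner": "carol", "dateType": "Map"},
]


def top_contributors(records, n=5, type_field="dateType"):
    """Return, per category, the n owners with the most items."""
    per_type = defaultdict(Counter)
    for record in records:
        per_type[record[type_field]][record["owner"]] += 1
    return {t: counter.most_common(n) for t, counter in per_type.items()}


ranking = top_contributors(items)
for category, owners in ranking.items():
    print(category, owners)
```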

Summary

Just as I expected at the beginning of my research for this article, I found tons of valuable information. It helps me better understand the users of GIS SaaS. It is also a step toward defining the target audience in terms of the scale of data they use, and toward setting appropriate defaults for labels and styling parameters. So, is this what data-driven software development is about?