Geospatial Open Data. What’s it like?

Mykola Kozyr
Published in Aspectum
Sep 11, 2019

Analyzing ArcGIS Hub Open Data

I have always wondered what kind of data people usually work with. What could be a better way to describe data than by actually collecting and processing data about data?

It seems to me that there is no better motivation for researching and writing this article than limitless curiosity. However, I’m also sure that information from it will be valuable and helpful to those who take part in developing data-related software or in planning storage capabilities.

Data

I have collected information about all ArcGIS Hub items: Layers, Maps, Documents, and Images. 184406 objects altogether. The following information is available for all items:

  • name
  • owner
  • date created
  • type of object
  • number of features
  • link to the source
  • geographical extent
  • geometry type
  • attribute names and data types

In this article, I would like to present some of the most interesting insights discovered from the collected data. If you would like to dig deeper and explore it on your own, feel free to check out the whole dataset here.

For all of the visualizations below, except geographical extent, I have filtered the data and used only items whose dateType equals “Layer” (132151 features).

The dataset was collected on July 16, 2019.

Rows

Probably the first thing to ask is how many features a regular dataset has. Here is the distribution of the number of features per dataset.

Number of Features Distribution (10% outlier)
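
The chart title mentions a "10% outlier" without defining it. One plausible reading, and it is only an assumption, is that the largest 10% of datasets were dropped before plotting so that a few huge layers do not flatten the histogram. A minimal sketch of that idea, with hypothetical function names and made-up feature counts:

```python
def trim_top_fraction(counts, fraction=0.10):
    """Drop the largest `fraction` of values.

    Assumption: this is what "10% outlier" means in the chart title;
    the article does not say how outliers were actually defined.
    """
    ordered = sorted(counts)
    keep = len(ordered) - int(len(ordered) * fraction)
    return ordered[:keep]


def histogram(values, bin_width):
    """Count values per fixed-width bin; keys are the lower bin edges."""
    bins = {}
    for value in values:
        edge = (value // bin_width) * bin_width
        bins[edge] = bins.get(edge, 0) + 1
    return dict(sorted(bins.items()))


# Hypothetical feature counts, not taken from the real dataset.
feature_counts = [3, 12, 18, 25, 40, 41, 77, 120, 950, 125_192_184]

trimmed = trim_top_fraction(feature_counts, fraction=0.10)
print(histogram(trimmed, bin_width=50))
```

With ten sample values, trimming 10% removes only the single largest one, which is enough to keep the remaining bins readable.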

Here are the top 5 datasets by number of features:

  1. Microsoft Building Footprints-Features — 125 192 184
  2. CalTRANS LaneClosures — 32 425 950
  3. WisDOT Waze Traffic Jam History — 18 944 990
  4. Recintos2016 — 16 056 367
  5. Service Road — 15 112 592

Attributes

Let’s continue by moving from information about rows to information about columns. Without further ado, here are the questions I wanted to answer.

Question 1 — How many columns does a regular dataset have?

Number of Attributes Distribution (1% outlier)

Question 2 — Which data types are the most common?

Attribute Types Occurrence

Question 3 — What are the most popular attribute names?

A word cloud with the most common attribute names (only letters and numbers, lowercase)

It seems the platform itself adds some column names to the datasets that users upload (personally, I would never use ‘shapelength’ as a column name in my data). I’ve decided to leave them out of the word cloud, so the following image shows the most popular column names, except the ones from the stoplist:

stop = ['objectid', 'fid', 'id', 'globalid', 'shape', 'shapelength', 'shapelen', 'shapestlength', 'shapearea', 'shapestarea']
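
For readers who want to reproduce the counting behind the word cloud, here is a minimal sketch of the normalization ("only letters and numbers, lowercase") followed by the stoplist filter. The function names and the sample column lists are hypothetical, and restricting names to ASCII letters and digits is my assumption; the original processing may have treated non-ASCII letters differently.

```python
import re
from collections import Counter

STOP = {'objectid', 'fid', 'id', 'globalid', 'shape', 'shapelength',
        'shapelen', 'shapestlength', 'shapearea', 'shapestarea'}


def normalize(name):
    """Lowercase a column name and keep only ASCII letters and digits."""
    return re.sub(r'[^a-z0-9]', '', name.lower())


def attribute_frequencies(datasets, stop=STOP):
    """Count normalized column names across datasets, skipping the stoplist."""
    counts = Counter()
    for columns in datasets:
        for column in columns:
            key = normalize(column)
            if key and key not in stop:
                counts[key] += 1
    return counts


# Hypothetical column lists, one per dataset.
sample = [
    ['OBJECTID', 'Name', 'Shape__Length'],
    ['FID', 'name', 'Address', 'Shape.STArea()'],
    ['GlobalID', 'NAME', 'address', 'Zip_Code'],
]

freq = attribute_frequencies(sample)
print(freq.most_common(3))
```

Normalizing before filtering matters: variants such as 'Shape__Length' and 'Shape.STArea()' only match the stoplist once punctuation and case are stripped.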

Geometry Types

At this point, we are getting closer to the geospatial part.

My first hypothesis was that the proportion between geometry types would not remain the same over time. My second hypothesis was that point representations of data were much more popular than other geometry types.

As you can see below, I was totally wrong. The first chart shows the normalized distribution of geometry types over time.

Polygon geometries have been the most popular throughout. However, points and lines are quite popular as well.

Geometry Types Proportions in Datasets over Time

There are some strange patterns at the beginning of the chart due to a small number of datasets, so the next chart shows the distribution based on the absolute number of datasets uploaded at each time interval.

Number of Datasets Uploaded by Geometry Types
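
The difference between the two charts comes down to whether each time interval is normalized. A minimal sketch of both computations, assuming each item is reduced to a (year created, geometry type) pair; the function names and the sample data are hypothetical:

```python
from collections import Counter, defaultdict


def counts_by_year(items):
    """Absolute number of datasets per year and geometry type."""
    table = defaultdict(Counter)
    for year, geometry in items:
        table[year][geometry] += 1
    return table


def proportions_by_year(table):
    """Normalize each year's counts so that they sum to 1."""
    result = {}
    for year, counter in sorted(table.items()):
        total = sum(counter.values())
        result[year] = {g: n / total for g, n in counter.items()}
    return result


# Hypothetical (year created, geometry type) pairs.
items = [
    (2012, 'point'),
    (2018, 'polygon'), (2018, 'polygon'), (2018, 'point'), (2018, 'line'),
]

absolute = counts_by_year(items)
shares = proportions_by_year(absolute)
print(shares)
```

In this toy data the single 2012 dataset gives points a 100% share for that year, which illustrates why the early part of a normalized chart can look erratic when few datasets exist.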

Extent

Open data coverage is definitely one of the most interesting things to discover. In this article, I present only a visual representation of it by adding the layers’ extents to the map. Here you can see screenshots from Aspectum showing different regions based on the global visualization of open data layer extents.

You may notice a big concentration of datasets in Western Europe. It is interesting that specific regions in Spain and France are well recognized here. Unfortunately, Ukraine, Belarus, Moldova, and Balkan countries do not contribute much to Open Data Hub.
North America is definitely the champion by number of open datasets on the portal. As a quick reminder, the styling parameters are the same for all regions.
Specific regions in Taiwan and Japan are well recognized.
New Zealand looks great here. Even though it sits close to the edge of the global datasets' extents (the 180°/-180° longitude line), it remains clear that this cluster of layers belongs to New Zealand.
Kenya turned out to be one of the global leaders in Open Data Hub, and it is well recognized here.

Contributors

Last but not least: who has contributed the most to ArcGIS Open Data Hub? Here are the top 5 contributors in each category.

Layers:

  1. SouthVancouverIslandProsperityProject — 6496
  2. plfran_umich — 1330
  3. ONSGeography_data — 1182
  4. idecvingt — 1076
  5. BIOS_Admin — 1046

Maps:

  1. nicolas.lopez.cundina — 685
  2. uneplive_team — 278
  3. esrichinahk — 253
  4. Esri_BeLux_Content — 252
  5. NIED_CRS@bosai — 199

Images:

  1. lucy.dixon — 270
  2. sjanzen and mdimapdatacatalog — 210 each
  3. PKabatha — 110
  4. SouthVancouverIslandProsperityProject — 106

Documents:

  1. PrefeituraRio — 3131
  2. openMONCTONouvert — 2528
  3. esrica_apps — 1213
  4. LakeCounty_Illinois — 1052
  5. cohendri_sarco — 998
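
Rankings like these can be derived by counting items per owner within each item type. A minimal sketch, assuming each item is reduced to an (item type, owner) pair; the function name and the sample owners are hypothetical:

```python
from collections import Counter, defaultdict


def top_contributors(items, n=5):
    """Return the n owners with the most items for each item type."""
    per_type = defaultdict(Counter)
    for item_type, owner in items:
        per_type[item_type][owner] += 1
    return {t: counter.most_common(n) for t, counter in per_type.items()}


# Hypothetical (item type, owner) pairs.
items = [
    ('Layer', 'alice'), ('Layer', 'alice'), ('Layer', 'bob'),
    ('Map', 'carol'), ('Map', 'carol'), ('Map', 'alice'),
]

top = top_contributors(items, n=2)
print(top)
```

Note that `most_common` breaks ties by first appearance, so owners with equal counts need separate handling if they should share a rank.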

Summary

Just as I expected at the beginning of my research for this article, I found tons of valuable information. It helps me understand users of GIS SaaS better. It is the next step toward defining the target audience by the scale of the data they use, as well as toward setting appropriate default options for labels and styling parameters. So, is it about data-driven software development?
