Geospatial Open Data. What’s it like?
Analyzing ArcGIS Hub Open Data
I have always wondered what kind of data people usually work with. What could be a better way to describe data than by actually collecting and processing data about data?
Curiosity was my main motivation for researching and writing this article. Still, I’m sure the findings will be valuable and helpful to those who develop data-related software or plan storage capacity.
Data
I have collected information about all ArcGIS Hub items (Layers, Maps, Documents, and Images), 184406 items in total. The following information is available for all items:
- name
- owner
- date created
- type of object
- number of features
- link to the source
- geographical extent
- geometry type
- attribute names and data types
In this article, I would like to present some of the most interesting insights discovered from the collected data. If you would like to dig deeper and explore it on your own, feel free to check out the whole dataset here.
For all of the visualizations below, except the geographical extent, I filtered the data and used only items whose dateType equals “Layer” (132151 items).
The dataset was collected on July 16, 2019.
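The filtering step described above can be sketched as follows. This is only an illustration: the records and the keys `name`, `item_type`, and `num_features` are hypothetical and do not reflect the actual schema of the collected dataset.

```python
# Hypothetical sample records; the real dataset has its own schema.
items = [
    {"name": "Parcels", "item_type": "Layer", "num_features": 1200},
    {"name": "City Basemap", "item_type": "Map", "num_features": None},
    {"name": "Zoning Report", "item_type": "Document", "num_features": None},
    {"name": "Bus Stops", "item_type": "Layer", "num_features": 340},
]


def filter_by_type(records, wanted_type):
    """Return only the records whose item_type matches wanted_type."""
    return [r for r in records if r.get("item_type") == wanted_type]


layers = filter_by_type(items, "Layer")
layer_names = [r["name"] for r in layers]
```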
Rows
Probably the first thing to ask is how many features a typical dataset has. Here is the distribution of the number of features per dataset.
The top 5 datasets by number of features are:
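Feature counts span many orders of magnitude, so one reasonable way to build such a distribution is to group datasets by the number of digits in their feature count. The sketch below assumes that approach; the function name and the sample counts are made up for illustration and are not necessarily how the chart was produced.

```python
from collections import Counter


def magnitude_histogram(counts):
    """Count datasets per order of magnitude of their feature count.

    Bucket k holds datasets with 10**k <= count < 10**(k + 1).
    Datasets without features are grouped under the key "empty".
    """
    histogram = Counter()
    for count in counts:
        if count <= 0:
            histogram["empty"] += 1
        else:
            # Digit count minus one equals floor(log10(count)) for integers,
            # without any floating-point rounding issues.
            histogram[len(str(count)) - 1] += 1
    return histogram


# Hypothetical feature counts for seven datasets.
histogram = magnitude_histogram([0, 5, 12, 340, 999, 1000, 125192184])
```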
- Microsoft Building Footprints-Features — 125 192 184
- CalTRANS LaneClosures — 32 425 950
- WisDOT Waze Traffic Jam History — 18 944 990
- Recintos2016 — 16 056 367
- Service Road — 15 112 592
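A ranking like this can be computed without sorting the whole collection. The sketch below uses `heapq.nlargest` on the five counts listed above plus one made-up small dataset; the helper name `top_n` is my own.

```python
import heapq

# Feature counts for the five datasets listed above, plus one made-up
# small dataset added purely for illustration.
feature_counts = {
    "Microsoft Building Footprints-Features": 125_192_184,
    "CalTRANS LaneClosures": 32_425_950,
    "WisDOT Waze Traffic Jam History": 18_944_990,
    "Recintos2016": 16_056_367,
    "Service Road": 15_112_592,
    "Example Small Layer": 42,  # hypothetical
}


def top_n(counts, n):
    """Return the n (name, count) pairs with the largest counts, descending."""
    return heapq.nlargest(n, counts.items(), key=lambda pair: pair[1])


top3 = top_n(feature_counts, 3)
```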
Attributes
Let’s continue by moving from rows to columns. Without further ado, here are the questions I wanted to answer.
Question 1 — How many columns does a regular dataset have?
Question 2 — Which data types are the most common?
Question 3 — What are the most popular attribute names?
It seems that the platform itself adds some columns to the datasets that users upload (personally, I would never name a column ‘shapelength’). I decided to exclude them from the word cloud, so the following image shows the most popular column names, except the ones from the stop list:
stop = ['objectid', 'fid', 'id', 'globalid', 'shape', 'shapelength', 'shapelen', 'shapestlength', 'shapearea', 'shapestarea']
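Here is a minimal sketch of how the stop list could be applied before counting names for the word cloud. The normalization step (lowercasing and keeping only letters and digits) is an assumption on my part, suggested by entries such as 'shapestlength'; the sample column lists are hypothetical.

```python
from collections import Counter

# Stop list from the article: column names that appear to be added by the platform.
stop = ['objectid', 'fid', 'id', 'globalid', 'shape', 'shapelength',
        'shapelen', 'shapestlength', 'shapearea', 'shapestarea']


def normalize(name):
    """Lowercase a column name and keep only letters and digits (assumed rule)."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def count_attribute_names(datasets, stop_names):
    """Count how many datasets use each normalized name, skipping stop_names."""
    blocked = set(stop_names)
    counts = Counter()
    for columns in datasets:
        # A set makes each name count at most once per dataset.
        names = {normalize(column) for column in columns}
        counts.update(names - blocked)
    return counts


# Hypothetical column lists for three datasets.
sample = [
    ["OBJECTID", "Name", "Shape_Length"],
    ["FID", "name", "Address"],
    ["GlobalID", "NAME", "Zip_Code", "SHAPE.STLength()"],
]
name_counts = count_attribute_names(sample, stop)
```

The resulting counter can be passed to any word cloud tool that accepts name frequencies.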
Geometry Types
At this point, we are getting closer to the geospatial part.
I had two hypotheses. The first was that the proportions of geometry types in the data would not remain the same over time. The second was that point data would be much more popular than other geometry types.
As you can see below, I was totally wrong. The first chart shows the normalized distribution of geometry types over time.
Polygon geometries are consistently the most popular. However, points and lines are quite popular as well.
There are some strange patterns at the beginning of the chart due to a small number of datasets, so the next chart shows the distribution based on the absolute number of datasets uploaded at each time interval.
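The difference between the two views is easy to show in code. The sketch below assumes each layer is reduced to a (year created, geometry type) pair; the function names and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def geometry_counts_by_year(layers):
    """Return {year: Counter({geometry_type: number_of_layers})}."""
    by_year = defaultdict(Counter)
    for year, geometry_type in layers:
        by_year[year][geometry_type] += 1
    return by_year


def normalize_counts(by_year):
    """Convert absolute counts into per-year shares that sum to 1."""
    shares = {}
    for year, counts in by_year.items():
        total = sum(counts.values())
        shares[year] = {g: n / total for g, n in counts.items()}
    return shares


# Hypothetical (year created, geometry type) pairs.
sample = [
    (2014, "Polygon"),
    (2015, "Polygon"), (2015, "Polygon"), (2015, "Point"), (2015, "Polyline"),
]
absolute = geometry_counts_by_year(sample)
shares = normalize_counts(absolute)
```

In this toy data, 2014 contains a single layer, so its normalized share is 100% polygons. That is the kind of artifact that small counts produce in a normalized chart.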
Extent
Open data coverage is definitely one of the most interesting things to discover. In this article, I present only a visual representation of it, made by adding the layers’ extents to the map. Here you can see screenshots from Aspectum showing different regions of the global visualization of open data layer extents.
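To draw an extent on a map, it first has to become a geometry. The sketch below assumes the extent is stored as [[xmin, ymin], [xmax, ymax]] in longitude and latitude and converts it into a GeoJSON polygon. Both the storage format and the function name are assumptions, and the coordinates are made up.

```python
def extent_to_polygon(extent):
    """Convert [[xmin, ymin], [xmax, ymax]] into a GeoJSON Polygon geometry."""
    (xmin, ymin), (xmax, ymax) = extent
    ring = [
        [xmin, ymin],
        [xmax, ymin],
        [xmax, ymax],
        [xmin, ymax],
        [xmin, ymin],  # a GeoJSON ring repeats its first point to close itself
    ]
    return {"type": "Polygon", "coordinates": [ring]}


# Hypothetical extent of one layer.
polygon = extent_to_polygon([[-123.5, 48.3], [-123.2, 48.7]])
```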
Contributors
Last but not least: who has contributed the most to ArcGIS Open Data Hub? Here are the top 5 contributors in each category.
Layers:
- SouthVancouverIslandProsperityProject — 6496
- plfran_umich — 1330
- ONSGeography_data — 1182
- idecvingt — 1076
- BIOS_Admin — 1046
Maps:
- nicolas.lopez.cundina — 685
- uneplive_team — 278
- esrichinahk — 253
- Esri_BeLux_Content — 252
- NIED_CRS@bosai — 199
Images:
- lucy.dixon — 270
- sjanzen and mdimapdatacatalog — 210 each
- PKabatha — 110
- SouthVancouverIslandProsperityProject — 106
Documents:
- PrefeituraRio — 3131
- openMONCTONouvert — 2528
- esrica_apps — 1213
- LakeCounty_Illinois — 1052
- cohendri_sarco — 998
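Rankings like these come from grouping items by type and counting items per owner. Here is a minimal sketch with hypothetical records and key names (`owner`, `item_type`):

```python
from collections import Counter, defaultdict


def top_contributors(items, n=5):
    """Return {item_type: [(owner, item_count), ...]} with the n biggest owners."""
    per_type = defaultdict(Counter)
    for item in items:
        per_type[item["item_type"]][item["owner"]] += 1
    return {t: owners.most_common(n) for t, owners in per_type.items()}


# Hypothetical records.
sample = [
    {"owner": "alice", "item_type": "Layer"},
    {"owner": "alice", "item_type": "Layer"},
    {"owner": "bob", "item_type": "Layer"},
    {"owner": "bob", "item_type": "Map"},
]
ranking = top_contributors(sample, n=1)
```

Note that `most_common` lists owners with equal counts in the order they were first encountered, so ties such as the one in the Images list need to be handled separately.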
Summary
Just as I expected at the beginning of my research for this article, I found tons of valuable information. It helps me better understand users of GIS SaaS. It is also a step toward defining the target audience by the scale of the data they use, and toward setting appropriate defaults for labels and styling parameters. So, is this what data-driven software development is about?