Geospatial Machine Learning: Structuring Unstructured, Structured Data

Shay Strong
2 min read · May 23, 2019


My first, out-of-body-moment reaction to structured vs. unstructured data occurred in the fall of 2016. I was invited to participate in a panel at the IBM Watson developers conference. I met with a group of six people, including the panel moderator, to discuss elements of scaling machine learning workflows and hash out our list of panel topics. One of the topics was the idea of structured vs. unstructured data and how our respective companies use these data types. I think I have been plagued by this topic ever since.

Honestly, structured vs. unstructured buzzwords start to sound a lot like our trendy friend ‘BigData’: weighty but meaningless terms. Internally, I started a fragmented feedback loop in my brain: ‘Do I have structured or unstructured data as a geospatial-based company?’ My brain has been digesting this for the last 3 years.

What I feel like when I think too much about unstructured vs. structured data.

Inherently, geospatial data has a structure. There is an underlying coordinate system for (all?!) geospatial records: a way to map the data back to a latitude and longitude point on the Earth. Address/parcel (vector) data is structured in polygons or points with geo-awareness, using known Shapefile or GeoJSON schemas. Imagery (raster) data are pixels captured by a digital camera system. Each pixel may be geo-registered to the ground. The pixels within the raster are quantized elements in spectral (e.g. red, green, blue, infrared) space, limited to a certain electronic well-depth, and constrained in signal responsivity by wavelength via optical filters, bandpasses, and/or the inherent properties of the CCD or CMOS semiconductor detectors.
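To make that "structured in capture" idea a bit more concrete, here is a minimal Python sketch of both flavors. Everything in it is an assumption for illustration: the coordinates, the address, the image origin, the pixel size, and the band values are made up, and the raster mapping assumes a simple north-up image in longitude/latitude with no rotation.

```python
import json

# Hypothetical vector record: a GeoJSON Feature.
# GeoJSON stores positions as [longitude, latitude].
parcel = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [-122.3321, 47.6062]},
    "properties": {"address": "123 Example St (made up)"},
}

# Hypothetical raster geo-registration for a north-up image:
# the upper-left corner of the image and the size of one pixel in degrees.
ORIGIN_LON, ORIGIN_LAT = -122.40, 47.70
PIXEL_SIZE_DEG = 0.0001


def pixel_to_lonlat(row, col):
    """Map the center of pixel (row, col) to (lon, lat)."""
    lon = ORIGIN_LON + (col + 0.5) * PIXEL_SIZE_DEG
    lat = ORIGIN_LAT - (row + 0.5) * PIXEL_SIZE_DEG  # rows increase southward
    return lon, lat


# One 8-bit pixel: each band is quantized to an integer in 0..255.
pixel = {"red": 112, "green": 98, "blue": 87, "nir": 201}

if __name__ == "__main__":
    print(json.dumps(parcel))
    print(pixel_to_lonlat(0, 0), pixel)
```

Notice what the sketch can and cannot tell you: it knows where the pixel sits on the ground and how bright it is in each band, and nothing about what is actually there.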

(Top) The black & white world of structured and unstructured data. (Bottom) My geospatial world of starting with structured data, asking unstructured questions, and using machine learning to structure the answers over time and content.

This is all to say that the geospatial imagery and vector information itself is structured in capture but not in content. When you get to the level of extracting information from an image, you may know where that pixel maps to on the ground, but you don’t know what it contains beyond being a digital, electronic signal. It is machine learning, and particularly deep learning, that enables us to scale the content extraction in a meaningful way. We start with structured (captured) data and ask unstructured questions of that data. We then structure the information content that can answer those questions at scale.

Oh, and we structure this content not just once! We structure it as a function of time, regional location, complexity (are we looking to just find something? Or categorize it?), and resolution. We are structuring unstructured, structured content. Damn, I wish there were a cooler way to say that.
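For the code-minded, here is a rough sketch of what "structuring the answers" could look like as a record. It is only an illustration: `ContentRecord`, `toy_label`, the region name, and all the values are hypothetical, and the near-infrared-vs-red rule is a toy stand-in for a trained model, not real machine learning.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ContentRecord:
    """One structured answer extracted from one pixel (hypothetical schema)."""
    lon: float
    lat: float
    captured_on: date    # time
    region: str          # regional location
    task: str            # complexity: "detect" or "classify"
    resolution_m: float  # ground resolution in meters
    label: str           # the extracted content


def toy_label(pixel):
    """Toy stand-in for a model: call it vegetation if NIR exceeds red."""
    return "vegetation" if pixel["nir"] > pixel["red"] else "other"


def structure(pixel, lon, lat, captured_on, region, resolution_m):
    """Turn a geo-registered pixel into a structured content record."""
    return ContentRecord(lon, lat, captured_on, region, "classify",
                         resolution_m, toy_label(pixel))


# The same ground location, captured on two made-up dates.
records = [
    structure({"red": 112, "nir": 201}, -122.39995, 47.69995,
              date(2019, 1, 15), "seattle", 3.0),
    structure({"red": 150, "nir": 90}, -122.39995, 47.69995,
              date(2019, 5, 15), "seattle", 3.0),
]

# Content as a function of time for that location.
by_time = {r.captured_on: r.label for r in records}
```

The capture fields (location, date, resolution) come for free with the data; the `label` is the part that has to be extracted, and it changes depending on when, where, and how closely you look.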


Shay Strong

VP Analytics at ICEYE. Data Scientist + Artist + Astronomer @shaybstrong