Structuring Unstructured Data

Jeff Jonas
Feb 20, 2018

When asked about unstructured data, this is all I have to say:

“Unstructured data is only useful if structure can be extracted from it.”

Let me explain: A picture taken in pitch black without a flash is useless as it contains no discernible features. The mobile phone call that suddenly goes bonkers and becomes all garbled is equally useless as there is no way to extract meaning from the noise.

On the other hand, a parking garage video has the potential to be much more useful because license plate reading software can extract plate numbers. Combine this with lat/long and date/time (metadata), and this becomes a truly useful observation.

The principle that observations are only useful if features can be extracted from them has helped me simplify system architectures:

Observe -> Feature Extract -> Contextualize -> Decide -> Act

When an observation arrives pre-structured (e.g., a database transaction), the Feature Extract step is skipped. Because all inputs to the Contextualize step are structured, contextualization can be streamlined: it is indifferent to the nature of the original observation (structured or unstructured).
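To make the flow concrete, here is a minimal sketch of the five steps in Python. It is an illustration under stated assumptions, not a description of any real system: every name in it (Observation, toy_plate_reader, run_pipeline and so on) is hypothetical, the "video frame" is just a string, and a regular expression stands in for real license plate reading software.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Observation:
    payload: object            # raw content, or a dict if already structured
    structured: bool           # True when the observation arrives pre-structured
    metadata: dict = field(default_factory=dict)  # e.g. lat/long, date/time


def toy_plate_reader(frame_text: str) -> dict:
    """Hypothetical stand-in for plate reading software (assumed plate format AAA-0000)."""
    match = re.search(r"\b[A-Z]{3}-\d{4}\b", frame_text)
    return {"plate": match.group(0)} if match else {}


def feature_extract(obs: Observation, extractor: Callable[[object], dict]) -> Optional[dict]:
    """Return structured features plus metadata, or None if nothing could be extracted."""
    # Pre-structured observations skip the extractor entirely.
    features = dict(obs.payload) if obs.structured else extractor(obs.payload)
    if not features:
        return None  # no discernible features: the observation is useless
    return {**features, **obs.metadata}


def contextualize(features: dict, history: list) -> dict:
    """Relate the new features to what has been seen before (here: same plate)."""
    prior = [h for h in history if h.get("plate") == features.get("plate")]
    history.append(features)
    return {"features": features, "prior_sightings": len(prior)}


def decide(context: dict) -> str:
    # Assumed toy rule: a plate seen before is worth an alert.
    return "alert" if context["prior_sightings"] >= 1 else "record"


def act(decision: str, context: dict) -> str:
    return f"{decision}: {context['features']['plate']}"


def run_pipeline(obs: Observation, extractor: Callable[[object], dict], history: list) -> Optional[str]:
    features = feature_extract(obs, extractor)
    if features is None:
        return None
    context = contextualize(features, history)
    return act(decide(context), context)
```

Note that contextualize, decide and act only ever see dictionaries of features. Whether those features came from a video frame or a database row no longer matters by the time they arrive, which is the simplification described above.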

Some common feature extraction algorithms you may have heard of:

Optical character recognition, e.g., converting a picture of words into a text document

Object recognition, e.g., detecting pictures of cats

Facial recognition, e.g., unlocking the iPhone X without a password

Acoustic fingerprinting, e.g., detecting an artist/song based on a small audio sample

Named entity recognition, e.g., suggesting a new contact based on an email’s contents
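As a rough illustration of the last item, the sketch below pulls contact features out of free-form email text. It is deliberately simplistic: plain regular expressions stand in for a real named entity recognizer, and the function name, patterns and output keys are all assumptions made for this example.

```python
import re

# Assumed, simplified patterns; real-world addresses and phone numbers vary far more.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def extract_contact_features(text: str) -> dict:
    """Turn unstructured email text into a small structured record.

    Returns an empty dict when no features can be extracted.
    """
    features = {}
    emails = EMAIL_RE.findall(text)
    phones = [p.strip() for p in PHONE_RE.findall(text)]
    if emails:
        features["emails"] = emails
    if phones:
        features["phones"] = phones
    return features


if __name__ == "__main__":
    body = "Hi, reach me at pat@example.com or 702-555-0100. Thanks, Pat"
    print(extract_contact_features(body))
    print(extract_contact_features("zzzz garbled noise"))
```

The second call returns an empty dictionary, which is the garbled phone call case: no structure extracted, so nothing useful to pass downstream.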

Unfortunately, commercially available feature extraction technology has a long way to go. The error rates are often just too high. As a consequence, downstream processes (e.g., Entity Resolution) become the victim. Technology breakthroughs in the field of unstructured feature extraction are much needed. I keep waiting. Come on already.

Jeff Jonas is founder and CEO of Senzing. Prior to Senzing, Jonas served as IBM Fellow and Chief Scientist of Context Computing.