The Myth of Unstructured Data

Matthew G. Johnson
Published in DataSeries, Feb 10, 2019
Figure 1 — Alice and the White Rabbit

“Unstructured data is a myth, I tell you”, said the White Rabbit, “It never existed, and it never will!”

“You’ve gone bananas!”, retorted Alice. “Everyone knows that’s not true.”

“Oh dear! Oh dear! I shall be late!”, he moaned as he dove into the rabbit hole.

“But you can’t go now!”, cried Alice, “We haven’t finished talking”, and she ran after the White Rabbit. At first, she could not see a thing; the tunnel was far too dark and dusty. Finally, she made out the faint image of a young boy seated at a table in a corner, holding a Rubik’s cube. He was turning the blocks and trying to put all the colours in order. He was becoming increasingly annoyed, and his movements were becoming faster and more erratic. This was not going to end well. A couple of moments later, he slammed the scrambled cube down on the table, shouting, “This is stupid, stupid, stupid!”, and stormed off into the tunnel. Alice pondered the situation for a moment and then crawled onward to look for the White Rabbit.

As she passed the next bend, the tunnel widened into a large hall filled with rows upon rows of hard wooden benches. An old man with a silver beard stood at the front. At the back she spotted the White Rabbit, and as she sat down beside him, the man started to speak. “There are two main types of data: structured and unstructured.” With a long wooden pointer, he tapped on a large placard behind him.

Figure 2— The Orthodoxy

“Ah ha!”, shouted Alice, “I knew that unstructured data exists, and this proves it!”

“Shush!”, whispered the White Rabbit, “Please be respectful of the Professor.” As the man put up the next placard, Alice started to look a little more closely.

Figure 3— The Example

The list of names surely had a structure, but then again so did the text and the photograph; perhaps even more so.

Figure 4— The Complex Reality

As the man finished his lecture, Alice looked puzzled. She turned around and asked the White Rabbit, “If there is so much structure in these texts and images, why did the Professor call them unstructured?” The White Rabbit replied, “Did you see that boy trying to solve the Rubik’s cube? When he called it stupid, which do you think was more stupid, the cube or the boy?”

“The boy, of course!”, shouted Alice, “Anyone could see that. He just didn’t know how to solve the puzzle.”

“So, if the Professor found a puzzle with complex structured data that he couldn’t solve, do you think he would be humble enough to admit it? Or, like the boy, might he just blame the puzzle?”

“Ah, I see!”, shrieked Alice, “So the Professor called those texts and images unstructured, not because they were, but because he didn’t know how to solve the puzzle of how to analyse them with his computer!”

“Eureka!” exclaimed the White Rabbit.

Author’s Notes

Computer science has long made the mistake of classifying data as either structured or unstructured. For every computer scientist who suggests that text or images are unstructured, you will find at least two scholars who will explain the intricate structure of Shakespeare’s sonnets or da Vinci’s paintings.

It is a fair observation that computers of the twentieth century were unable to comprehend the structure of language and images, but it is a fundamental mistake to confuse the failure to comprehend structure with the absence of structure.

As computers of the twenty-first century find structure in data where those of the twentieth found none, it is time to retire the misguided language of the past and find more representative language for the future. Let us recognise the structure in both forms of data but realise that one is fundamentally more complex than the other. Perhaps we should use these terms in future?

· Simply-structured data: data in regular tabular and relational formats

· Complex-structured data: data in complex natural structures, such as language, images and sounds
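
As a small illustration of structure hiding in so-called unstructured text, the sketch below recovers the rhyme scheme of the opening quatrain of Shakespeare's Sonnet 18. It is only a sketch under stated assumptions: the names `last_word` and `rhyme_scheme` are invented here for illustration, and treating two lines as rhyming when their final words share the same last two letters is a crude spelling heuristic, not real phonetics.

```python
import re

# Opening quatrain of Shakespeare's Sonnet 18 (punctuation varies by edition).
QUATRAIN = [
    "Shall I compare thee to a summer's day?",
    "Thou art more lovely and more temperate:",
    "Rough winds do shake the darling buds of May,",
    "And summer's lease hath all too short a date;",
]


def last_word(line):
    """Return the final word of a line, lowercased, ignoring punctuation."""
    words = re.findall(r"[A-Za-z']+", line)
    return words[-1].lower()


def rhyme_scheme(lines, suffix_len=2):
    """Label each line A, B, C... by the ending of its last word.

    Assumption: lines "rhyme" if their last words share the same final
    `suffix_len` letters. This is a spelling heuristic, not phonetics.
    """
    labels = {}
    scheme = []
    for line in lines:
        ending = last_word(line)[-suffix_len:]
        if ending not in labels:
            # First time this ending is seen: assign the next letter.
            labels[ending] = chr(ord("A") + len(labels))
        scheme.append(labels[ending])
    return "".join(scheme)


print(rhyme_scheme(QUATRAIN))  # ABAB
```

Even this naive approach finds a regular pattern in the verse, and one that a table of names does not have. The structure was there all along; only the means of reading it was missing.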

Matthew G. Johnson

I am an informatician, fine arts photographer and writer who is fascinated by AI, dance and all things creative. https://photo.mgj.org