Why Unstructured Data is Having a Moment, Thanks to AI

Alex Gnibus
Feb 26, 2024

Welcome to my newsletter, which I call Drop-In Class because each edition is like a short, fun Peloton class for technology concepts. Except unlike a fitness instructor, I’m not an expert yet: I’m learning everything at the same time you are. Thanks for following along with me as I “learn in public”!

Unstructured data, the thing nobody wanted to deal with, is now something we should deal with.

You know how some things seem to be more trouble than they’re worth, so you put off figuring out how to deal with them? For me, it’s having a front lawn. All it does is grow weeds. Our solution to the front lawn has been to smother the grass with mulch and avoid gardening there.

The front lawn of the data world is unstructured data. It’s obnoxious dealing with documents and images and things that take up a lot of storage and don’t fit neatly into a table to analyze. Many organizations just let their unstructured data sit around, unsure how to manage it or do anything useful with it. Like letting the grass die.

Footage of me saving copies of “Document1” in SharePoint. (Source: GIPHY)

But then, BOOM. You know what happened next. Generative AI. You know, GenAI has a funny way of making boring things important. Like data quality. And governance. And unstructured data.

So today’s uncool-but-cool topic is unstructured data. And in the spirit of a drop-in class (like Peloton) I’m going to throw on some music.

Here’s “Pages,” by White Reaper, which reminds me of unstructured data because I’m picturing the guy tearing the pages out of his journal in frustration and that’s how I feel about searching through SharePoint documents.

What is unstructured data? The messy paper trail of being a human

Structured data is the nice, organized spreadsheet you picture when you think of data: Numbers and text that fit in neatly labeled, categorized columns. Like financial transactions and customer records. It’s easy to analyze and easy to store in a database.

Unstructured data, on the other hand, likes to be difficult. It comes in all sorts of formats, like audio, images and video.

It’s an estimated 90% of all the data we generate.

But most of it isn’t used, because unstructured data is hard for machines to work with. Unstructured data is human-centric.

Humans don’t generate numbers, we generate text. We speak, we write, we take pictures. We make TikTok videos. We write in our journals, like White Reaper. We get X-rays taken at the dentist. We send emails and documents and way too many slides.

what structured vs. unstructured data looks like in my head

Unstructured objects like giant video files are expensive to store (I pony up for additional iCloud storage thanks to excessive videos of my cat), and they don’t work in traditional databases, so we have to dump them into a data lake. And even when you do manage to get them organized, they’re hard to analyze in any useful way.
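To make the structured vs. unstructured difference concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the names, the amounts, the email, and both helper functions. With the table, the question "how much did Ana pay?" is a lookup on a labeled column. With the email, the same question needs a hand-written pattern that only works if people keep phrasing things the same way.

```python
import re

# Structured: every record has the same labeled fields (hypothetical data).
transactions = [
    {"customer": "Ana", "amount": 42.50},
    {"customer": "Ben", "amount": 17.25},
    {"customer": "Ana", "amount": 10.00},
]

def total_structured(rows, customer):
    """Sum amounts for one customer by reading a known column."""
    return sum(row["amount"] for row in rows if row["customer"] == customer)

# Unstructured: the same facts, buried in a (hypothetical) email.
email = (
    "Hi team, Ana paid $42.50 on Monday. Ben sent $17.25 after lunch, "
    "and Ana paid another $10.00 on Friday. Thanks!"
)

def total_unstructured(text, customer):
    """Guess at amounts with a pattern: the name, then the next
    dollar amount in the same sentence. Fragile by design."""
    pattern = rf"{re.escape(customer)}\b[^.$]*\$(\d+(?:\.\d+)?)"
    return sum(float(amount) for amount in re.findall(pattern, text))

print(total_structured(transactions, "Ana"))
print(total_unstructured(email, "Ana"))
```

Both calls agree on this one email, but reword it slightly ("Ana’s payment came to 42.50 dollars") and the pattern finds nothing. That fragility is a big part of why so much unstructured data goes unused.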

So that’s why I say unstructured data is like a front lawn. Growing fast, taking up space, annoying to landscape, and not useful unless you put a lot of work into it.

GenAI: the gardener to the rescue

If you ignored your front lawn before, it’s time to get weeding, because we just got an awesome gardener. (AI. The gardener is AI. Thank you for running with my strange analogies.)

The unstructured data nobody wanted to deal with is suddenly more relevant. Because it’s exactly what LLMs (large language models) like GPT-3 got really good at: Interpreting and generating unstructured data.

LLMs have made it astonishingly fast and effective to do things with unstructured data that we couldn’t do as efficiently before. They take your neglected front lawn and somehow dig up hidden gold, discovering useful patterns and uncovering buried information.

How do they do it? With natural language processing (NLP), which is going to be my next topic. It’s how computers work with words. And LLMs can do NLP like never before, with advancements that allow them to better understand context and the nuances of language.

What to do now with this information:

Now that you’re thinking about your unstructured data and the new ways you might actually use it, start investigating what you can do that you were ignoring before. Can you retrieve information faster? Transcribe audio from meetings and summarize what was covered? Analyze patterns in your writing?

Some examples of unstructured data that got exciting again with AI:

  • Knowledge documentation: Imagine all the documents just sitting in SharePoint collecting virtual dust. Not anymore: AI can help you search all those dusty documents. The most common genAI use case I’ve seen so far is a chatbot that helps quickly find relevant details that would have been otherwise buried. For instance, in the medical field, AI helps extract information from electronic health records.
  • Emails: When models crunch through your emails, there’s a lot they can help with. They’re writing your responses for you, deciding whether an email lands in your junk folder, and helping you search through your inbox (although I have thoughts on whether Outlook is effective on that last one).
  • PowerPoints: You were sick of making slides, but now AI is the one making the slides! Now you’re back to never having enough slides! The circle of life.

Extra credit reading:

I oversimplify my topics because this is a 101-level drop-in class. Here are helpful sources to dig deeper:

See you in the next drop-in! We’ll cover natural language processing (NLP) and see how a machine starts working with unstructured text and “reading” your documents.

-Alex G

This article was originally posted on my LinkedIn newsletter, Drop-In Class: Short, fun and entertaining explainers on trending technical concepts. To follow along, you can subscribe here.

Alex Gnibus

Word nerd in tech | AI/ML product marketer | Analytical acrobat