You are only human. How topic models can help direct the deluge of data

Catrinel Bartolomeu
Published in Oz Content
Aug 13, 2015 · 6 min read

About a decade ago, my father started downsizing. The first place I saw the evidence was his desk. The loose magazines were gone and so were the bookshelves, paper trays, and binders. But I guess the severity of the situation didn’t really hit me until one day I found myself in his closet, where he had compiled perfectly folded rectangular stacks of t-shirts, shorts, and long sleeve shirts, each of approximately the same size and quantity.

He’d eliminated all excess.

No item had more than 9 of its kind. 9 salmon-colored crew-neck polos. 8 white fitted t-shirts. If he were to buy a new item, he would discard an older one.

When pressed about his reasons for doing so, it didn’t take long for my dad to start pontificating about there being “too much sh*t,” and it all being too hard to find. He found the items that worked and was hell-bent on sticking with them.

This was more than a physical uncluttering phenomenon; it was a systemic effort to limit information. While he’s used computers for decades, his general reaction to performing Google searches is annoyance. They return too many results, and many of them are irrelevant, he explains. He’s not alone in feeling this way.

It’s kind of baffling to reckon with the fact that 90% of the world’s data was created in the last two years. Information overload is a real thing, and it’s no wonder we now spend over 8 hours a day consuming data.

It’s a modern predicament: as we create more and more digital data, it feels like more knowledge exists, and it’s all at our fingertips. Paradoxically, it’s also increasingly difficult to discover what we’re looking for. The density of information itself prevents us from extracting knowledge and drawing conclusions.

Extracting information from very large datasets is not mankind’s greatest strength. It is, however, a task very well suited to the talents of a computer. In fact, a new collection of topic modeling algorithms is being introduced into today’s technology to help do exactly that.

Our current limitations:

You’re human:

Sorry about that. Learning takes time, much of which is dedicated to reading materials, identifying connections, and trying to understand how things are related. We pour hours of manual labor into studying, sorting, labeling, and connecting ideas. I love this process, but I’m always trying to find ways to make it more efficient. With so many data sources to sort through, though, I find that I get mired in minutiae, having to consume and discard many irrelevant materials before finding commonalities among related ones.

You’re not maximizing the potential of technology:

Traditional search is limited. Searching by keyword is great, but results come back in a list; they lack context, themes, and relationships. Moreover, an algorithm you don’t really understand sorts the results.

But there are people out there developing topic modeling algorithms that can help make sense of large amounts of data and discover the thematic structure of large text archives. Using different taxonomical structures, they reveal relationships, contexts, and insights much faster than a human can.

In his seminal paper on topic modeling, “Introduction to Probabilistic Topic Models,” David Blei, a professor of computer science at Columbia University, explains that the most advanced topic modeling algorithms can track how underlying themes in various texts are connected to each other and how they change over time.
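If you’re curious what “discovering thematic structure” looks like in practice, here is a minimal sketch of one classic topic model, latent Dirichlet allocation, fitted with a toy collapsed Gibbs sampler in plain Python. Everything in it is an assumption made for illustration: the function name `fit_lda`, the four tiny made-up documents, and the settings `alpha`, `beta`, and `iterations`. It is not the method used in any of the projects described in this post, and real work would use a tested library rather than this loop.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics=2, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative only).

    docs is a list of token lists. Returns the top words per topic and
    the estimated topic mixture of each document.
    """
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})

    # Count tables that the sampler keeps up to date.
    doc_topic = [[0] * num_topics for _ in docs]         # tokens per topic in each doc
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                       # total tokens per topic
    assignments = []                                     # topic of every token

    # Start by giving every token a random topic.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            k = rng.randrange(num_topics)
            doc_assignments.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word] += 1
            topic_total[k] += 1
        assignments.append(doc_assignments)

    # Repeatedly resample each token's topic given all the other tokens.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][word] -= 1
                topic_total[k] -= 1

                # A topic is likely if this doc uses it and it uses this word.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][word] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]

                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][word] += 1
                topic_total[k] += 1

    top_words = [
        [w for w, count in topic_word[t].most_common(3) if count > 0]
        for t in range(num_topics)
    ]
    doc_mix = [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    return top_words, doc_mix


# Hypothetical mini-corpus: two documents about war, two about fossils.
docs = [
    "army war soldiers battle army".split(),
    "war battle soldiers enlist".split(),
    "fossil species evolution fossil".split(),
    "species evolution database fossil".split(),
]
top_words, doc_mix = fit_lda(docs)
print(top_words)  # most frequent words in each discovered topic
print(doc_mix)    # share of each topic in each document
```

The model never sees labels like “war” or “fossils.” It only sees which words tend to appear together, and the topics fall out of those co-occurrences. With a corpus this small the result depends heavily on the random seed, which is one reason the real projects work with thousands of documents.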

Big companies and scholars are using this technology

Historical research

Over 100,000 articles (that’s 24 million words) from the major wartime Confederate newspaper “Dispatch” had been floating around for years, but not even the most dedicated scholar had signed up to analyze the texts, even though the compendium was certain to offer some valuable insights. Clearly it was the drudgery and labor involved in the process that stood in the way.

Then, in 2011, topic modeling analysis was applied. The intention was to understand which arguments and appeals put forth in the “Dispatch” actually convinced men to join the army.

The model revealed that articles on the topics “Anti-northern diatribe” and “Patriotism and Poetry” worked together to convince men that it was worth it to engage in the war, kill other men, and risk their own lives. It was the kind of pattern that no scholar had been willing to dig out by hand.

Evolutionary Research

In 1998, paleontologists who were developing a macroevolutionary theory banded together to create the Paleobiology Database (PBDB), a collection of fossil finds organized by location, relationship, and position in the evolutionary timeline. This database has been extremely helpful, but until recently researchers were still manually reviewing papers and entering information.

Enter PaleoDeepDive, a system that converts images of journal articles into digital text so that the language can be processed automatically.

“Ideally, we’d like to get to a point where that time, that energy, and that effort, could be put into analyzing the results of data and syntheses and thinking creatively about leveraging them and assessing them,” said Shanan Peters, a professor of paleobiology at the University of Wisconsin-Madison and co-director of the PBDB’s IT team.

Social Research

Topic modeling can do even more than help us understand our history and where we came from. It can help us better understand our current selves too.

Like many colleges, Tufts has a moderated “confessions” Facebook page where students can anonymously post their feelings about student life.

Soubhik Barari, a Tufts computer science undergrad, ran the posts through a topic modeling algorithm to reveal the connections. Over 24,000 posts, organized into 13 topic groups, revealed that “feeling lonely” was talked about most frequently.

Talking to Motherboard, Barari explained that he thinks “social networks can give us unprecedented scale and insight into how collegiate culture and psyche mesh (or don’t mesh).” As he put it, “the vital question is how we leverage those insight[s] for good and not evil.” He thinks natural language processing (NLP) could be a more genuine campus temperature checker, since all the information is anonymous.

Topic Modeling in Content Marketing

The potential applications are extensive, and we’ll cover them on this blog regularly. But suffice it to say that, as a marketer and a writer, you need to make sense of many, many gigs of data for your content research in order to come up with smart topics to write about, differentiate yourself, and stay ahead of the curve. There are already folks out there suggesting that you work relationship concepts into your writing, and Moz’s Rand Fishkin himself gave a Whiteboard Friday presentation on semantic connectivity. Here at Oz, topic models are a key part of our idea generation and research software. All this activity can only mean one thing: there’s a good chance there’s more to come. Whether or not smarter searching will help contain the data deluge for you (or for someone suffering from severe data-aversion like my dad) remains to be seen.

Originally published at ozcontent.com on August 13, 2015.

Catrinel Bartolomeu
Oz Content

Writer, agency apostate, soul-searcher, professional conversationalist. Head of Editorial @ Duarte. Reporting back on what matters.