Creating structure in unstructured data

Katerina
Published in What's your data?
7 min read · Feb 2, 2019

Have you ever had to analyze text data, like transcripts or company documents, by hand, trying to find patterns? I have. The most striking aspect of doing this manually is that another person doing the exact same task as you will find different patterns. If you are lucky, not completely different patterns, but complementary ones.

One advantage of text mining techniques is that they discover patterns that are stable. This does not mean that every type of text data can be analyzed with every type of text mining technique. Technically you can, of course, but does it make sense? Never, ever forget the question or the business challenge you are supposed to solve.

Business background

The context of the following analysis is a project on public agencies and their ability to meet their goals through their employees. The challenge for public agencies is that they may have conflicting goals and, depending on the country, be exposed to changes in government. This can lead to volatile budgets and sudden, drastic changes in staffing. Just like private companies, public agencies have stakeholders: partner organizations, governments, non-profits, other public institutions, and you, the citizen. Each stakeholder has their own view on what the main mission of the public agency is. Public agencies have to deal with all of this while delivering their services.

For this article, that description of the background is sufficient. The aim of this post is to describe the techniques I used to create structure in unstructured text data. You can read more about the background in the original post.

Data source

The data for the analysis was taken from science funding institutions in several countries. The goal was to discern differences and similarities between the mission texts of science foundations. I used the English text published on each institution's 'About us' page. Sometimes this was the mission text; other times it was a bit broader and also contained objectives. Sometimes the text that should have been English only also had parts in other languages. Keep an eye open: you can spot non-English words in the results. This creates a bit of noise in the data. This post is a demonstration of techniques using R, not a research report.

The tutorial by Julia Silge and David Robinson is my go-to source for text mining in R.

Analytical steps

I wanted to find out how similar the mission texts of different science foundations are. Science foundations all have the same goal: give money to researchers so that they can execute their research projects. If these institutes have the same overarching goal, shouldn't they all have a similar mission text? Before trying to answer that question, it is good to explore the text. This is done in several ways: word frequency, a network figure of bi-grams, sentiment analysis, and a readability score.

Frequency of words

Below you see a very simple example of computing the frequency of words and creating a bar chart. The mission text (stored in column mission in the data set ms) is broken up into words using the function unnest_tokens. After this, stop words are removed. Stop words are words that are used often but don't add meaning to the text, such as 'the' and 'and'. You can add custom words to the stop word list. I could have added the word 'mission': counting how often the word 'mission' appears in the mission texts doesn't provide me with meaningful information, because I still wouldn't know what the mission texts contain. The anti_join function acts as a filter: it matches the stop word list against my data set of words from the mission texts and keeps only the words that appear in the mission texts but not in the stop word list. Using ggplot I create a simple bar chart. It is good practice to always, really always, give your charts a meaningful title. A title that is not only meaningful to you, but also to another person, including your future self.
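The steps above can be sketched roughly as follows. The data set ms and its mission column come from the article; the threshold used for filtering is my assumption, chosen to match the figure caption below.

```r
library(dplyr)
library(tidytext)
library(ggplot2)

# Break the mission texts into words, drop stop words, count
word_counts <- ms %>%
  unnest_tokens(word, mission) %>%        # one row per word
  anti_join(stop_words, by = "word") %>%  # remove common stop words
  count(word, sort = TRUE)

# Keep only words more than one standard deviation above the mean frequency
threshold <- mean(word_counts$n) + sd(word_counts$n)

word_counts %>%
  filter(n > threshold) %>%
  ggplot(aes(reorder(word, n), n)) +
  geom_col() +
  coord_flip() +
  labs(title = "Most frequent words in science foundation mission texts",
       x = NULL, y = "Frequency")
```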

Showing only words that appear more than one standard deviation above the mean frequency

The figure above can be made more complex by, for example, showing the word frequency by a grouping variable. In my case country is a good grouping variable: each country has its own science foundation, and the words used by different countries can vary. This can be done by filling the bars by country (ggplot(aes(word, n, fill=country)) + geom_col(position='dodge')) or by giving each country its own panel (add facet_wrap(~ country) to the code).
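A minimal sketch of both grouped variants, assuming the same ms data set with a country column:

```r
library(dplyr)
library(tidytext)
library(ggplot2)

# Count words per country instead of overall
word_counts_country <- ms %>%
  unnest_tokens(word, mission) %>%
  anti_join(stop_words, by = "word") %>%
  count(country, word, sort = TRUE)

# Variant 1: one panel, bars per country side by side
ggplot(word_counts_country, aes(word, n, fill = country)) +
  geom_col(position = "dodge") +
  coord_flip()

# Variant 2: one panel per country
ggplot(word_counts_country, aes(word, n)) +
  geom_col() +
  facet_wrap(~ country) +
  coord_flip()
```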

word frequency per country (stacked bars)
word frequency per country (a country per panel)

However, I don't think it makes much sense to plot these graphs showing country differences. In the grouped graph the words on the y-axis can be read, but the differences between the countries aren't clearly visible; there are too many groups. The other bar chart, with the many panels, makes the country differences clearer, but the words aren't readable.

Network figure of word combinations (bi-gram)

Looking at the word frequency provides information about how often a word appears in the text. However, words do not appear alone; they are part of sentences. Sentences can be analyzed in different ways. For example, words can be tagged by their grammatical function (verbs, nouns, etc.). That can be used, for example, to calculate idea density (check out this repository to compute idea density).

I'm going to look at how often two words appear next to each other. For example, the word public appears quite often. That could mean that mission statements refer to society ('the public'). It could also mean public stakeholders. Basic could refer to basic research. To find out whether these interpretations are correct, a network of words is useful.

To do this, I created a weighted edgelist (see bigram_network.R). This is a data set with three columns: word one, word two, and how often the two words appear next to each other.
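A sketch of how such a weighted edgelist can be built and plotted with tidytext, igraph and ggraph (bigram_network.R itself is not reproduced here; the cutoff of 2 is an arbitrary assumption):

```r
library(dplyr)
library(tidyr)
library(tidytext)
library(igraph)
library(ggraph)

# Tokenize into two-word sequences, split them, and drop bigrams
# where either word is a stop word
bigram_counts <- ms %>%
  unnest_tokens(bigram, mission, token = "ngrams", n = 2) %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE)   # the weighted edgelist

# Keep the more frequent pairs and turn them into a graph
bigram_graph <- bigram_counts %>%
  filter(n > 2) %>%                  # arbitrary cutoff, adjust to taste
  graph_from_data_frame()

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE) +
  geom_node_point() +
  geom_node_text(aes(label = name), repel = TRUE)
```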

Network graph of two words

The network graph shows that the most frequent combination of words is basic and research. This means that most mission texts include the phrase basic research. What about public? It does not refer to society as a stakeholder, but to public funding. What this graph shows is that mission texts are about research institutions, scientific knowledge, innovation, industry, and internationalization.

Sentiment analysis

I also did a sentiment analysis on the mission texts, though this was more for fun than to get any answers about the mission texts. Still, citizens should be interested in how scientific institutions spend the money and what kind of scientific progress is made in their country. Running a sentiment analysis can provide some insight into how appealing the text is: how do people feel when they read it?

To do the sentiment analysis, the process is similar. The code, in sentiment_analysis.R, shows that two data sets are merged: one contains the words from the mission texts, the other contains words and their sentiment. Using the column word, which exists in both data sets, as the key, the two data sets are combined.
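The join described above can be sketched as follows; sentiment_analysis.R is not reproduced here, and the choice of the Bing lexicon (which labels words "positive" or "negative") is my assumption.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Join the mission-text words with a sentiment lexicon on the
# shared `word` column, then compute a score per country
sentiment_by_country <- ms %>%
  unnest_tokens(word, mission) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(country, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(score = positive - negative)
```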

Sentiment score per country.

Readability of mission text

When you read a text, you are trying to make sense of the words. Science foundations are public organizations, so people should be able to understand what the mission text means. How can the public judge whether science foundations are doing their job if they cannot understand the mission text? Different readability indices exist which compute how easy a text is to understand. Most of these indices use a combination of words per sentence and syllables per word.

To calculate the readability score, I didn't use the tokenized words but the raw mission text. The resulting indices are plotted in a chart. The code is available in this gist:
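One way to compute such indices on the raw texts is quanteda's textstat_readability; this is a sketch under the assumption that the gist takes a similar approach, and the three measures chosen here are just common examples.

```r
library(quanteda)
library(quanteda.textstats)

# Build a corpus from the raw mission texts and compute
# several readability indices per document
readability <- textstat_readability(
  corpus(ms$mission, docnames = ms$country),
  measure = c("Flesch", "Flesch.Kincaid", "FOG")
)

head(readability)
```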

Readability Indices per country

Similarity between mission text

Ok, admittedly, some of these analyses were done purely for fun, to test out some code. But that's all right, as long as I don't use them for interpretation.

Now to the final question: how similar are the mission texts? The sentiment and readability analyses indicate that there are some differences. The code below shows the steps. Similarity of texts is based on word frequency in a text. A document-feature matrix is calculated, which represents how often a feature (word) occurs in a document. It reduces all the mission texts to a two-dimensional array. This matrix can be used to calculate the distance between documents, using the word frequencies as coordinates. The result can be plotted in a two-dimensional scatter plot. Mission texts that are more similar are closer together.
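The steps can be sketched with quanteda; the distance measure (Euclidean) and the use of classical multidimensional scaling for the 2-D plot are assumptions, not necessarily the article's exact choices.

```r
library(dplyr)
library(quanteda)
library(quanteda.textstats)

# Document-feature matrix: one row per mission text,
# one column per word, cells hold word frequencies
dfmat <- ms$mission %>%
  corpus(docnames = ms$country) %>%
  tokens(remove_punct = TRUE) %>%
  tokens_remove(stopwords("en")) %>%
  dfm()

# Distance between documents, using word frequencies as coordinates
d <- textstat_dist(dfmat, method = "euclidean")

# Classical MDS projects the distances onto two dimensions;
# similar mission texts end up close together
coords <- cmdscale(as.dist(d), k = 2)
plot(coords, type = "n",
     main = "Similarity of mission texts")
text(coords, labels = rownames(coords))
```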

Done!

If you are interested in reading about why I did certain steps, and not so much about the code, read my research summary.

The most enjoyable analysis is…

the bi-gram network. I'm a social network researcher. If things can be put into a meaningful network, I will do it. My specialty is communication processes within teams. If you think your team's communication structure should be mapped and analyzed, reach out to me.


Behind every problem is a web of connectors and links. I look for patterns and offer solutions. I'm also raising 4 humans: I know problems can be solved.