Open Text Analysis: Bigrams, Word Clouds, The Octocat’s Party House

I truly hate using word clouds to present open text data, which is a research lover's pot of gold. Word clouds, well, I feel like they cheapen the richness and elegance of open text. However, every once in a while, a project comes along where word clouds make just enough sense to use as a visualization tool, since they are especially accessible for diverse audiences across an organization.

As part of the annual GitHub Tools & Workflows survey, we asked an optional open-ended question at the end, with a single open-text field:

How would you describe GitHub in three words?

The single field was a deliberate design element of the instrument. Depending on their interpretation of the question, however, some people wrote three-word sentences, others listed out words, and a few went beyond the soft constraint of three.

As you can imagine, this generated a lot of messy (gorgeous!) text data, which allowed us to try several types of analyses. In the end, we decided that we didn’t have enough data to constitute “big data,” which would have allowed us to use exciting new machine learning techniques like topic modeling, sentiment analysis, and various classification techniques. Unfortunately for us, most of these models require either lengthy, structured documents or large volumes of data to perform well.

We tried a few of these, but our relatively small collection of tiny documents was poorly suited to these methods. We regrouped and settled on some basic methods of analysis that stay as close to the raw data as possible without imposing assumptions through modeling. We still needed some automation, because we had too much data for two researchers to hand code alone.

Data & Method

We received 3,215 responses across two separate surveys (tenured users and new account creators), with over 5,300 distinct words. Given that this was an open text question that asked for an unusual response format, the data presented particular challenges: a mixture of coherent phrases, discrete concepts, and full sentences; creative spelling choices and formatting; and a number of different languages.

We did some basic preprocessing to normalize spelling and formatting, concatenate multiword phrases (e.g. “open source” into “opensource”), and remove punctuation and stopwords. We then calculated the frequency distribution of unique tokens/words, as well as the frequency distribution of bigrams, or pairs of words that occur in adjacent positions in a given text.
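To make the preprocessing steps concrete, here is a minimal sketch in Python using only the standard library. It is an illustration under assumptions, not the pipeline we actually ran: the `STOPWORDS` set, the `PHRASES` map, and the function names `preprocess` and `token_frequencies` are all hypothetical, and spelling normalization is left out entirely.

```python
import re
from collections import Counter

# Hypothetical stopword list and phrase map, for illustration only.
STOPWORDS = {"a", "an", "and", "the", "of", "for", "to", "is", "it"}
PHRASES = {"open source": "opensource"}

def preprocess(response):
    """Lowercase, strip punctuation, join known phrases, drop stopwords."""
    text = response.lower()
    # Replace every punctuation character with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace so phrase matching is reliable.
    text = " ".join(text.split())
    # Concatenate multiword phrases into single tokens.
    for phrase, joined in PHRASES.items():
        text = text.replace(phrase, joined)
    return [tok for tok in text.split() if tok not in STOPWORDS]

def token_frequencies(responses):
    """Count how often each token appears across all responses."""
    counts = Counter()
    for response in responses:
        counts.update(preprocess(response))
    return counts

if __name__ == "__main__":
    sample = ["Open-source, social, and awesome!", "open source rocks"]
    print(token_frequencies(sample).most_common(3))
```

Stripping punctuation before joining phrases means “open-source” and “open source” end up as the same token, which is one reasonable choice among several.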

These methods helped to expedite the research process and surfaced trends related to frequency and some relationships, but no amount of number crunching can pick up on or adequately express the nuances the way manually reading the entries does.

We promised Word Clouds

Word clouds are a method for visually presenting text data, and are popular for text analysis because they make it easy to spot word frequencies. The more frequently a word is used, the larger and bolder it is displayed (you get the point!). We organized the responses to “Describe GitHub in three words” by tenured and new users.
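The sizing idea behind a word cloud can be sketched in a few lines. This is not the tool we used to draw the clouds; it assumes a simple linear mapping from word count to font size, and the function name `font_sizes` and the point-size defaults are hypothetical.

```python
def font_sizes(counts, min_pt=10.0, max_pt=60.0):
    """Map each word's count to a font size, scaled linearly."""
    if not counts:
        return {}
    lo = min(counts.values())
    hi = max(counts.values())
    span = hi - lo
    sizes = {}
    for word, n in counts.items():
        # If every word has the same count, give them all the largest size.
        scale = (n - lo) / span if span else 1.0
        sizes[word] = min_pt + scale * (max_pt - min_pt)
    return sizes

if __name__ == "__main__":
    # Toy counts, not survey data.
    print(font_sizes({"easy": 30, "social": 20, "svn": 10}))
```

Real word cloud libraries often use non-linear scaling so that a handful of very common words don't crowd out everything else.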

Words unique to new users

However, the individual words and even the bigrams aren’t as powerful as the individual strings themselves, which we can attach back to GitHub user accounts for a more complete picture. So, in a surprise plot twist, the entire @github/marketing team (full of tenured and new faces in November 2015) got together in real time and reviewed slices of the data by new and tenured users (about 400 entries each) in teams of three.

Words used by tenured users

The group surfaced insights like the lack of Octocat mentions, and noted that it was interesting to see words like freedom, indispensable, and simple come up. Group-driven analysis in small teams is a great way to handle a data set like this one.

Bigrams are pairs of words that occur in adjacent positions in text. You may notice that the counts are sometimes a little off when the same word appears multiple times in the same response, but generally the ordering and relative frequencies are right.
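For anyone who wants to try this at home, counting bigrams takes very little code. The sketch below assumes each response has already been turned into a list of cleaned tokens; the function names `bigrams` and `bigram_frequencies` are hypothetical, and this is an illustration rather than the exact code behind our charts.

```python
from collections import Counter

def bigrams(tokens):
    """Return adjacent word pairs from one tokenized response."""
    return list(zip(tokens, tokens[1:]))

def bigram_frequencies(token_lists):
    """Count bigrams across many tokenized responses."""
    counts = Counter()
    for tokens in token_lists:
        # Pairs never span two responses, because each list is handled alone.
        counts.update(bigrams(tokens))
    return counts

if __name__ == "__main__":
    # Toy responses, not survey data.
    responses = [["social", "coding"], ["social", "coding", "tool"]]
    print(bigram_frequencies(responses).most_common(2))
```

Because the pairs are built per response, a one-word answer contributes no bigrams at all.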

Insightful Quotes

What we couldn’t see with machine learning, word clouds, and bigrams alone:

a lost soul — New user, Explorer
library of alexandria — New user, Creator
i love you — 3 New users
firststep of staircase — New user, Explorer
Phil Haack’s fault — New user, Creator (Windows/Visual Studio user)
swiss army knife — 2 New users
mismanaged, soon acquired — Tenured user
the watering hole — Tenured user
Facebook for [programmers/coders/open-source/code/etc] — Many
Unfair to women — Tenured user
Programmers best friend — Tenured user
The most powerful tool for traveling through time — Tenured user
A helvetica-faced emporium — Tenured user
good-looking, popular, I-wish-I-had-something-to-upload-there — Tenured user
scary for newcomers — Tenured user
svn++ — Tenured user
sjw and assholes — Tenured user
all eggs basket — Tenured user
Gamified. Social. Men — Tenured user
Everyone is there. — Tenured user
Octocat’s party house — Tenured user

Key Takeaways

  • New users were more likely to have provided a response to the GitHub-in-Three question (we suspect this is probably because their survey was a lot shorter and less cognitively taxing).
  • New users showed wider variance in word choice: 1,326 unique tokens from new users, compared to 975 unique tokens from tenured users.
  • “Creator” new users are more similar to tenured users than “explorer” new users in their word choices (see: this post for user definitions).
  • The top 10 are not well differentiated; the divergences become clearer further down into the distribution.
  • The top 100 word choices from tenured users are largely free of negatively valenced words. The top 100 word choices from new users contain a number of negatively charged words, including:
confusing (27)
complex (23)
difficult (22)
hard (21)
expensive (17)
challenging (17)
needs (10, used in context of “needs more x or y”)
overwhelming (10)
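The group comparisons above can be sketched with plain counters. This assumes each group's word counts are already in a `collections.Counter`; the function names (`vocabulary_size`, `unique_to`, `flagged_in_top`) and the toy data are hypothetical, and the list of negative words would have to come from a human reading the data, as ours did.

```python
from collections import Counter

def vocabulary_size(counts):
    """Number of unique tokens used by a group."""
    return len(counts)

def unique_to(group, other):
    """Words (with counts) that appear in one group but not the other."""
    return {word: n for word, n in group.items() if word not in other}

def flagged_in_top(counts, flagged, n=100):
    """Flagged words that land in a group's top-n most frequent tokens."""
    return [(word, c) for word, c in counts.most_common(n) if word in flagged]

if __name__ == "__main__":
    # Toy counts, not survey data.
    new_users = Counter({"easy": 5, "confusing": 3, "hard": 2})
    tenured = Counter({"easy": 9, "social": 4})
    print(vocabulary_size(new_users), vocabulary_size(tenured))
    print(unique_to(new_users, tenured))
    print(flagged_in_top(new_users, {"confusing", "hard"}))
```

Looking further down the ranking (a larger `n`) is where the two groups start to diverge, since the very top of both lists looks much the same.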

There’s nothing like reading open text; otherwise, we would have missed these gems:

a lost soul — New user, Explorer
library of alexandria — New user, Creator
The most powerful tool for traveling through time — Tenured user
A helvetica-faced emporium — Tenured user

I’m most interested in the absence of Octocat mentions by both tenured and new users. It’s one of the brand levers we can pull, and we can then measure whether perception is positively or negatively affected.

Jane Goodall Cat (our research mascot)

Research is better when you get to discover new things about the world with someone who has a different perspective — someone who is willing to challenge you. I’m grateful to the people I worked with at GitHub for being part of a small-but-mighty team.