Open Text Analysis: Bigrams, Word Clouds, The Octocat’s Party House

Chrissie Brodigan
Jan 18, 2016 · 6 min read

I truly hate using word clouds to present open text data, which is a research lover’s pot of gold. Word clouds, well, I feel like they cheapen the richness and elegance of open text. However, every once in a while, a project comes along where word clouds make just enough sense to use as a visualization tool: they are especially accessible for diverse audiences across an organization.

As part of the annual GitHub Tools & Workflows survey we asked an optional open-ended question at the end with a single open-text field:

How would you describe GitHub in three words?

The single field was a deliberate design element of the instrument. However, depending on their interpretation of the question, some people wrote three-word sentences, others listed out words, and a few went beyond the soft constraint of three.

As you can imagine, this generated a lot of messy (gorgeous!) text data, which allowed us to try several types of analyses. In the end, we decided that we didn’t have enough data to constitute “big data,” which would have allowed us to use some exciting new machine learning techniques like topic modeling, sentiment analysis, and various classification methods. Unfortunately for us, most of these models require either lengthy, structured documents or large volumes of data to perform well.

We tried a few of these, but our relatively small collection of tiny documents was poorly suited to these methods. We regrouped and settled on some basic methods of analysis that stay as close to the raw data as possible without imposing assumptions through modeling. Purely manual coding was not an option either: we had too much data for two researchers to hand code alone.

Data & Method

We did some basic preprocessing to normalize spelling and formatting, concatenate multiword phrases (e.g. “open source” into “opensource”), and remove punctuation and stopwords. We then calculated the frequency distribution of unique tokens/words, as well as frequency distribution for bigrams, or word pairs that occur at adjacent locations in a given text.
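The preprocessing and token counting described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code we actually ran: the stopword list, the phrase map, and the sample responses are all made up for the example, and spelling normalization is left out.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real analysis would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "is", "it", "for", "of", "to", "i", "my"}

# Hypothetical phrase map for concatenating multiword phrases into single tokens.
PHRASES = {"open source": "opensource", "version control": "versioncontrol"}


def preprocess(response):
    """Lowercase, merge known phrases, strip punctuation, drop stopwords."""
    text = response.lower()
    for phrase, merged in PHRASES.items():
        text = text.replace(phrase, merged)
    text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with spaces
    return [tok for tok in text.split() if tok not in STOPWORDS]


def token_frequencies(responses):
    """Count how often each token appears across all responses."""
    counts = Counter()
    for response in responses:
        counts.update(preprocess(response))
    return counts


# Made-up sample responses, not real survey data.
responses = [
    "Open source, social, simple.",
    "Simple and powerful!",
    "The home for open source",
]
freq = token_frequencies(responses)
print(freq.most_common(3))
```

Merging phrases before stripping punctuation keeps “open source” together as one token, so it is counted as a single idea instead of two unrelated words.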

These methods helped expedite the research process and surfaced trends in frequency and some relationships, but no amount of number crunching can pick up on or adequately express the nuances that manually reading the entries reveals.

We promised Word Clouds

[Image: Words unique to new users]

However, the individual words and even the bigrams aren’t as powerful as the individual strings themselves, which we can attach back to GitHub user accounts for a more complete picture. So, in a surprise plot twist, the entire @github/marketing team (full of tenured and new faces in November 2015) got together in real time and reviewed slices of the data, split by new and tenured users (about 400 entries each), in teams of three.

[Image: Words used by tenured users]

The group surfaced insights like the lack of Octocat mentions, and noted that it was interesting to see words like freedom, indispensable, and simple come up. Group-driven analysis in small teams is a great way to handle a data set like this one.


Bigrams are pairs of words that occur in adjacent positions in text. You may notice that the counts are sometimes a little off when the same word appears multiple times in the same response, but generally the ordering and relative frequencies are right.
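For readers who want to try this themselves, here is a minimal sketch of bigram counting over already-tokenized responses. The token lists are invented for illustration, and the helper names are hypothetical. The last made-up response shows one way a repeated word can shift a count; whether that is exactly the effect mentioned above is a guess on my part.

```python
from collections import Counter


def bigrams(tokens):
    """Return adjacent token pairs, e.g. [a, b, c] -> [(a, b), (b, c)]."""
    return list(zip(tokens, tokens[1:]))


def bigram_frequencies(tokenized_responses):
    """Count bigrams per response so pairs never span two different responses."""
    counts = Counter()
    for tokens in tokenized_responses:
        counts.update(bigrams(tokens))
    return counts


# Made-up tokenized responses, not real survey data.
tokenized = [
    ["social", "coding", "platform"],
    ["social", "coding"],
    ["code", "code", "code"],  # one response, but it contributes the same pair twice
]
bf = bigram_frequencies(tokenized)
print(bf.most_common())
```

Counting each response separately matters here: concatenating all responses into one long string would create spurious pairs such as the last word of one answer followed by the first word of the next.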


Insightful Quotes

a lost soul — New user, Explorer

library of alexandria — New user, Creator

i love you — 3 New users

firststep of staircase — New user, Explorer

Phil Haack’s fault — New user, Creator (Windows/Visual Studio user)

swiss army knife — 2 New users

mismanaged, soon acquired — Tenured user

the watering hole — Tenured user

Facebook for [programmers/coders/open-source/code/etc] — Many

Unfair to women — Tenured user

Programmers best friend — Tenured user

The most powerful tool for traveling through time — Tenured user

A helvetica-faced emporium — Tenured user

good-looking, popular, I-wish-I-had-something-to-upload-there — Tenured user

scary for newcomers — Tenured user

svn++ — Tenured user

sjw and assholes — Tenured user

all eggs basket — Tenured user

Gamified. Social. Men — Tenured user

Everyone is there. — Tenured user

Octocat’s party house — Tenured user

Key Takeaways

  • New users show wider variance in word choice: 1,326 unique tokens, compared to 975 unique tokens from tenured users.
  • “Creator” new users are more similar to tenured users than “Explorer” new users in their word choices (see: this post for user definitions).
  • The top 10 are not well differentiated; the divergences become clearer further down into the distribution.
  • The top 100 word choices from tenured users are largely free of negatively valenced words. The top 100 word choices from new users contain a number of negatively charged words, including:

confusing (27)

complex (23)

difficult (22)

hard (21)

expensive (17)

challenging (17)

needs (10, used in context of “needs more x or y”)

overwhelming (10)
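The comparisons in these takeaways (vocabulary size per group, and which flagged words land in the top of each distribution) could be computed along these lines. This is only a sketch: the counts below are invented, the negative-word set is a hand-picked assumption, and the function name is hypothetical.

```python
from collections import Counter


def flagged_in_top(freq, flagged, n=100):
    """Return (word, count) pairs for flagged words among the n most frequent tokens."""
    return [(word, count) for word, count in freq.most_common(n) if word in flagged]


# Made-up frequency tables, not the survey's real counts.
new_users = Counter({"easy": 9, "confusing": 7, "social": 6, "complex": 4, "hard": 1})
tenured = Counter({"essential": 8, "social": 5, "simple": 3})

# Hand-picked set of negatively charged words (an assumption for this sketch).
NEGATIVE = {"confusing", "complex", "difficult", "hard"}

# Vocabulary size is the number of distinct tokens in each group.
print(len(new_users), len(tenured))

# Only look at the top 4 here because the toy tables are so small.
print(flagged_in_top(new_users, NEGATIVE, n=4))
print(flagged_in_top(tenured, NEGATIVE, n=4))
```

A word list like this only flags candidates. Something like “needs” is negative only in context, which is why reading the raw responses still matters.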

There’s nothing like reading open text; otherwise we would have missed these gems:

a lost soul — New user, Explorer

library of alexandria — New user, Creator

The most powerful tool for traveling through time — Tenured user

A helvetica-faced emporium — Tenured user

I’m most interested in the absence of Octocat mentions among both tenured and new users. The Octocat is one of the brand levers we can pull, and we can then measure whether perception is positively or negatively affected.

[Image: Jane Goodall Cat (our research mascot)]

Research is better when you get to discover new things about the world with someone who has a different perspective — someone who is willing to challenge you.
