Clojure Survey 2014

Juan Facorro
Oct 29, 2014

Tag Clouds

An attempt to visualize the free-form text answers from the 2014 Clojure (and ClojureScript) Community Survey.

Since the analysis of this year’s Clojure Survey results only covered the multiple-choice questions, I thought I would take a shot at creating some sort of visualization for the free-form text answers.

Having no prior experience in doing such a thing, the first option that came to my mind (and what I ended up doing) was a tag cloud, which I think is a fair representation of the content in the answers for each question.

The resulting tag clouds are shown first, followed by a “Methodology” section where I explain how I generated them.

(Please contact me if you have any suggestions on how these can be improved.)

Clojure Results

Clojure — Name one language feature you would like to see added?
Clojure — What do you think is the most glaring weakness/problem?
Clojure — General thoughts or opinions

ClojureScript Results

ClojureScript — Name one language feature you would like to see added?
ClojureScript — What do you think is the most glaring weakness/problem?
ClojureScript — General thoughts or opinions

Methodology

Since it was trivial to do, my first attempt took only single words into account. I just copied and pasted the contents of one of the plain-text answer files into a cool word cloud tool and voilà…

Unigrams only

The result made little or no sense, at least not to me. Words like namespaces stand out even though Clojure has had them since its early days. Other words make more sense, especially after skimming through the original answers. For example, there was a high probability that startup went hand in hand with time.

So the next natural step was to include bigrams and (why not?) trigrams as well, since the latter could also shed some light on the results. Including larger n-grams seemed futile, from a purely intuitive point of view.

This step involved a little Clojure coding to extract the n-grams (n ∈ {1, 2, 3}) from each answer, count their occurrences and generate a text file that could be used to plot the word cloud. (The code can be found in this GitHub repository.)
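The extraction step described above can be sketched roughly as follows. This is not the code from the repository (which is written in Clojure); it is a minimal Python illustration, and the tokenization rule, the function names and the sample answers are all assumptions made up for the example.

```python
from collections import Counter
import re


def tokenize(answer):
    """Lowercase an answer and split it into word tokens (hypothetical rule)."""
    return re.findall(r"[a-z0-9']+", answer.lower())


def ngrams(words, n):
    """Return every run of n consecutive words as one space-joined string."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def count_ngrams(answers, sizes=(1, 2, 3)):
    """Count n-gram occurrences over all answers; n-grams never span two answers."""
    counts = Counter()
    for answer in answers:
        words = tokenize(answer)
        for n in sizes:
            counts.update(ngrams(words, n))
    return counts


# Made-up sample answers, for illustration only.
answers = ["Startup time is slow", "Better startup time please"]
counts = count_ngrams(answers)

# One "count<TAB>phrase" line per n-gram, a plausible input format for a cloud tool.
for phrase, count in counts.most_common(3):
    print(f"{count}\t{phrase}")
```

Processing each answer separately keeps the last word of one answer from being glued to the first word of the next.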

Uni, bi and trigrams

The result still seemed heavily biased towards single words, since no bigram carries any weight in it. What to do then? Give n-grams with a higher n more weight? Remove single words? I decided to try both options. The first one raised the question of how to quantify the weight for a given n. An option that seemed to make sense was multiplying the count of each n-gram by its own n. This slightly improved the visibility of bigrams and trigrams, but they were still overshadowed by the unigrams, so I tried n*n first and then n*sqrt(n).
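To make the three weighting schemes concrete, here is a small Python sketch (again, not the Clojure code from the repository). The function name and the counts are invented for illustration; only the formulas n, n*n and n*sqrt(n) come from the text.

```python
import math

# The three weighting schemes mentioned in the text, keyed by a label.
WEIGHTS = {
    "n": lambda n: n,
    "n*n": lambda n: n * n,
    "n*sqrt(n)": lambda n: n * math.sqrt(n),
}


def weighted(counts, weight):
    """Scale each n-gram's raw count by a weight that depends only on its length n."""
    return {phrase: count * weight(len(phrase.split()))
            for phrase, count in counts.items()}


# Made-up counts, for illustration only.
counts = {"time": 10, "startup time": 4, "slow startup time": 2}
for label, weight in WEIGHTS.items():
    print(label, weighted(counts, weight))
```

With these made-up counts, the linear weight still leaves the unigram on top (10 versus 8 and 6), while n*n pushes the trigram to first place (18 versus 16 and 10).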

Using weight = n * sqrt(n)

This did exactly what I wanted, but it felt like I was just making stuff up, which is actually what was happening, so I did a little research on how other people assign weights to n-grams for tag clouds. A couple of hours later, after reading about Pointwise Mutual Information, tf-idf and some reports on how other people implemented their own word clouds using n-grams, I came to the conclusion that there was no single answer. Most of the time you just end up inventing some constant factors so that the results make sense (I’m most probably extremely wrong about this, so please let me know if I am).

So finally I decided to simply remove all single words, based on the observation that most of them didn't make much sense on their own. No magic weight function or value was applied, and the results are the ones shown in the Clojure(Script) Results section above.
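In code, this final choice amounts to a plain filter over the n-gram counts. A minimal Python sketch, with a hypothetical function name and made-up counts:

```python
def drop_unigrams(counts):
    """Keep only n-grams made of two or more words; counts stay unweighted."""
    return {phrase: count for phrase, count in counts.items()
            if len(phrase.split()) >= 2}


# Made-up counts, for illustration only.
counts = {"time": 10, "startup": 9, "startup time": 4, "slow startup time": 2}
cloud_input = drop_unigrams(counts)
print(cloud_input)
```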

Thanks for reading!
