Lately I’ve been itching to return to one of my favorite activities: visualizing communities. In particular, I want to look at a community that I myself have been a member of for the last four years: the Clojure community.
Today we are gonna look at data from the GitHub Archive, which is a scrape of the GitHub public events feed going back to 2012. (It actually goes back a bit further, but those early events don’t carry the all-important language tag, among other things.)
I’ve filtered that data looking for events associated with repos that have Clojure in the language tag or the description. This dataset is available for download (130MB); there’s a gist showing the basics of reading it in. In all there are about 500,000 public GitHub events encompassing the Clojure community.
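For the curious, here is a rough Python sketch of that kind of filter. It is not the code I actually ran: the flat keys `repo_language` and `repo_description` are hypothetical stand-ins (real GitHub Archive records are nested and shaped differently), and matching case-insensitively is an assumption too.

```python
def mentions_clojure(event):
    """True if the repo's language tag or description mentions Clojure.

    `repo_language` and `repo_description` are hypothetical field names
    used for illustration only.
    """
    language = event.get("repo_language") or ""
    description = event.get("repo_description") or ""
    # Assumption: match case-insensitively in either field.
    return "clojure" in language.lower() or "clojure" in description.lower()


# Tiny made-up sample, just to exercise the filter.
sample_events = [
    {"repo_language": "Clojure", "repo_description": "A web framework"},
    {"repo_language": "Java", "repo_description": "Clojure interop helpers"},
    {"repo_language": "Ruby", "repo_description": None},
]

clojure_events = [e for e in sample_events if mentions_clojure(e)]
```

Checking the description as well as the language tag is what catches repos like the second one, which is tagged as Java but is clearly about Clojure.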
The fact that we are getting the data stream at a midpoint rather than from the very beginning of time is something to keep in mind, and causes some artifacts we’ll see later.
The most basic question is: how is the community growing over time? And the most basic version of that is just looking at push events:
Couple of interesting things about this graph.
First, the rate of pushes has increased by about 2x during this period. More on this later.
Second, in the broadest outline this is a step function: all the growth has come in two distinct waves. This is surprising. A more common pattern is to see linear growth punctuated by surges.
And there is another mystery: while it may look like the surges are a seasonal phenomenon, they are not. The 2013 surge is actually in late February and early March, whereas the 2014 surge starts in early January.
Surges in early January are common — people return from the holidays and start new projects. But what was happening in late February and early March 2013? Well, one thing that was happening is Clojure 1.5, released March 1, 2013.
Given these surges, an obvious question is: are these new users coming into Clojure, or existing users starting new projects and getting more active?
So here is a simple experiment. In the event stream, the push events contain a user and a timestamp. So we can consume the stream, and make a note every time we see a user for the first time. Then, for every day, we can count how many “new” users we’ve seen for the first time:
Remember how we are jumping into the data stream at a mid-point? That’s the artifact on the far left. There’s already a bunch of active Clojure users by the time we join the stream in early 2012, so when we start counting the new users that we see, there is a flood of them. But within a couple of months, it settles down and with high confidence we can say that new users we see going forward are legitimately new to the Clojure GitHub community.
There’s a bunch of little spikes that might be interesting to investigate later. But on the point of surges, it’s pretty clear that new users are a component of them. In fact, the slope is even higher in this graph, a very strong signal.
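The first-seen counting described above is simple enough to sketch in a few lines of Python. This is a minimal illustration rather than my actual script, and it assumes the push events have already been reduced to chronologically ordered `(user, timestamp)` pairs with ISO-formatted timestamps; the sample users are made up.

```python
from collections import Counter
from datetime import datetime


def new_users_per_day(push_events):
    """Count, per day, the users whose first push in the stream is that day.

    push_events: (user, iso_timestamp) pairs, assumed to be in
    chronological order.
    """
    seen = set()
    counts = Counter()
    for user, timestamp in push_events:
        if user not in seen:
            # First time this user shows up: credit the day of this push.
            seen.add(user)
            counts[datetime.fromisoformat(timestamp).date()] += 1
    return counts


# Made-up sample stream.
pushes = [
    ("alice", "2012-03-11T09:00:00"),
    ("bob", "2012-03-11T10:30:00"),
    ("alice", "2012-03-12T08:15:00"),
    ("carol", "2012-03-12T17:45:00"),
]

daily_new = new_users_per_day(pushes)
```

Note that the very first days of any stream will look inflated under this scheme, since everyone is "new" the first time we see them.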
What about existing users? What is their role in these surges? And can we get a breakdown of activity between the new users and the existing ones?
A quick and easy way to get insight into these questions is to bucket the users into monthly cohorts based on the date we first saw them, and then plot the pushes for each cohort in a kind of stacked area chart:
Looking at this, it’s clear that the surges are caused not only by new users joining the community, but also by an uptick in activity among existing users. Notice how all the existing cohorts spike at the dates we’ve seen in the previous graphs.
GitHub provides another measure of activity: watches and stars. In this data they are aggregated into a single number, which we will call “watches”:
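The cohort bucketing behind that chart can be sketched like this. Again, this is a hedged illustration and not the exact code behind the graph: it assumes chronologically ordered `(user, timestamp)` pairs with ISO timestamps, and the function name and sample data are invented for the example.

```python
from collections import defaultdict


def pushes_by_cohort(push_events):
    """Return {cohort_month: {activity_month: push_count}}.

    push_events: (user, iso_timestamp) pairs, assumed to be in
    chronological order. A user's cohort is the month ("YYYY-MM")
    of the first push we see from them.
    """
    cohort_of = {}
    table = defaultdict(lambda: defaultdict(int))
    for user, timestamp in push_events:
        month = timestamp[:7]  # "YYYY-MM" prefix of an ISO timestamp
        # setdefault records the cohort only on the user's first push.
        cohort = cohort_of.setdefault(user, month)
        table[cohort][month] += 1
    return table


# Made-up sample stream.
cohort_pushes = [
    ("alice", "2013-01-15T12:00:00"),
    ("alice", "2013-03-02T09:00:00"),
    ("bob", "2013-03-02T10:00:00"),
    ("bob", "2013-03-05T11:00:00"),
]

cohort_table = pushes_by_cohort(cohort_pushes)
```

Each cohort's row of monthly counts becomes one band in the stacked area chart, so a surge that lifts older bands as well as adding a new one points to existing users getting more active.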
Again, there is on the order of 2x growth over time.
It’s interesting that there is actually a significant spike in January 2013, before the spike in pushes two months later. Could it be that those people started getting into Clojure at that point, watching some projects, and then started contributing later?
It’s also interesting that the numbers for January 2014 totally dwarf everything else. An initial glance at the data shows it is not caused by any particular individual. Perhaps by early 2014 a considerable mass of casual-interest programmers decided to take a peek at Clojure?
There are a lot more questions that could be asked here, and hopefully someone will be motivated to dig a little deeper. I’m still very intrigued by the lack of a spike in pushes in January 2013… has the nature of public perception of Clojure changed somehow since then?
Ok, so what are the takeaways here?
First, if you are a member of the broader programming community, these numbers may reassure you that Clojure is indeed growing at a healthy pace, and you wouldn’t be alone in adopting it.
Second, interest in Clojure and growth of the community seem to come in waves, corresponding to new Clojure releases and the turn of the new year.
So if you are a library author, you’d probably want to try to ride the wave. Have your project “public ready”, with good docs, to take advantage of these waves of interest.
Finally, I hope this post gives a taste of the richness of the data available in this space. There’s a lot more in the data, so stay tuned for the next installment: Networks.