156 Million GitHub commits, 48 Thousand F-bombs.

My Twitter friend Felipe Hoffa recently posted an excellent story called "400,000 GitHub repositories, 1 billion files, 14 terabytes of code: Spaces or Tabs?", which reveals the stats behind the never-ending war between two code formatting styles. The article inspired me to try some research of my own: what’s the situation with swearing in commit messages? Admittedly the idea is not new at all, and there have been multiple earlier attempts to do the same.

But the recently published BigQuery data provides new opportunities to look at it. Also, since I am doing this mostly for fun, to practice my writing and to contribute to the BigQuery community, of which I am a big fan, the research is pretty simple: we will look at how often f-bombs are dropped in commit messages, plus a couple of examples of such messages.

So, the “commits” table holds info about ~156 million commits in the public repositories of GitHub. Let’s select the ones that contain f-bombs in the subject or message:

Selecting commits that contain f-bombs in the subject and/or message
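
The query itself is not reproduced in this text, so here is a minimal sketch of the same filtering idea in Python, run over a handful of in-memory rows rather than the real table. The field names `subject` and `message` come from the description above. The sample rows, the helper names and the exact pattern (a case-insensitive match on "fuck") are assumptions made for illustration, not the original query.

```python
import re

# Assumption: a case-insensitive substring match on "fuck" is what counts as
# an f-bomb; the pattern used by the original query is not shown in the text.
F_BOMB = re.compile(r"fuck", re.IGNORECASE)


def has_f_bomb(commit):
    """Return True if the commit's subject or message contains an f-bomb."""
    return any(
        F_BOMB.search(commit.get(field) or "")
        for field in ("subject", "message")
    )


def select_f_bomb_commits(commits):
    """Keep only the commits that mention an f-bomb in subject or message."""
    return [c for c in commits if has_f_bomb(c)]


# Hypothetical sample rows standing in for the "commits" table.
SAMPLE_COMMITS = [
    {"subject": "fix z-index, the motherfucker",
     "message": "fix z-index, the motherfucker"},
    {"subject": "Update README",
     "message": "Update README\n\nNo fucking idea what that will do"},
    {"subject": "Bump version", "message": "Bump version to 1.2.0"},
    {"subject": "Initial commit", "message": None},
]

if __name__ == "__main__":
    for commit in select_f_bomb_commits(SAMPLE_COMMITS):
        print(commit["subject"])
```

Checking both fields matters because a clean subject line can still hide an f-bomb in the body of the message, as the second sample row shows.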

Reading through these results may raise a smile or two. For example (the comments in parentheses are mine):

  • “Add some fucking class to #a Needs more Benedict Cumberbatch” (??)
  • “fix z-index, the motherfucker” (I know bro, z-index might be painful)
  • “no fuck this i’m done. i am so done.” (get some coffee)
  • “Rolled back license from “What the fuck you want” to the unlicense” (really?)
  • “WHAT THE HELL?! Who the fuck copied the CMS file to core folder and core file to CMS folder?!” (shit happens)
  • “Oh my fucking God I did a release with this debug code left in. HORROR.” (shit happens)
  • “switch to java 8… fuck backward compability” (fuck indeed:)
  • “Only that it needs to be the fucking opposite. Don’t drink and code…” (good idea!)
  • “No fucking idea what that will do” (classic)
  • “Fucking password is too fucking short” (fuck yeah)
  • “Hard, or Soft? So much to do. So decent salary. “You are unhappy? You do not want to stay here? Then there are bunches of people who are waiting to take the place of you.” You stay, or you fuck off. Choose one.”

Okay, so how many f-bombs were dropped? To derive this number, I saved the results shown above to a new “commits_with_f_bombs” table and ran the next query. Perhaps it’s not optimized, but it should be very easy to understand:
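
The counting query is also missing from this text, so the following is only a sketch of one way to do the tally, again in Python over made-up rows. It assumes that every occurrence of the word counts separately and that `message` already repeats the subject line, so only one of the two fields is scanned per commit. Both assumptions, and the sample rows, are mine rather than the original query's.

```python
import re

# Assumption: every case-insensitive occurrence of "fuck" is one f-bomb.
F_BOMB = re.compile(r"fuck", re.IGNORECASE)


def count_f_bombs(commits):
    """Sum f-bomb occurrences over all commits.

    Assumption: `message` already repeats the subject line, so only `message`
    is scanned (falling back to `subject` when it is empty) to avoid counting
    the same word twice.
    """
    total = 0
    for commit in commits:
        text = commit.get("message") or commit.get("subject") or ""
        total += len(F_BOMB.findall(text))
    return total


# Hypothetical rows standing in for the "commits_with_f_bombs" table.
SAMPLE = [
    {"subject": "Fucking password is too fucking short",
     "message": "Fucking password is too fucking short"},
    {"subject": "No fucking idea what that will do",
     "message": "No fucking idea what that will do"},
    {"subject": "fix z-index, the motherfucker", "message": ""},
]

if __name__ == "__main__":
    print(count_f_bombs(SAMPLE))
```

Counting occurrences rather than commits means a message like the first sample row contributes two f-bombs, so the total can exceed the number of matching commits.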

So, only 48K F-bombs, not bad. Of course, if I had expanded the queries to look for the seven dirty words and their variations, the numbers would have been much larger :)

It would also be interesting to model the correlation between a repository’s programming language and the number of its commits containing f-bombs, but there are 2 problems here:

  • The BigQuery data do not include information about the repository language.
  • The GitHub data actually has multiple languages per repository. Take, for example, ak72ti/Whynot, which hosts 297 f-bomb commits:
Languages of repository

Which language caused the f-commits? What if the main language (DM) is OK, but the contributors struggled with the JavaScript part? Any ideas on this one are highly appreciated.

That’s it. Have nice commit messages :)