What are the most confusing software development topics?

…or Take some StackOverflow and add a pinch of BigQuery

Irina Truong
Parse.ly Engineering
Mar 24, 2017


Struggling? Image source: http://mlp.wikia.com/wiki/File:FANMADE_Twilight_facedesk.gif

StackOverflow question and answer data has been available as a public dataset on Google BigQuery for quite a while. I read about it some time ago, and promptly forgot — until recently, when struggling with something at work, I had a thought: “Is it really that hard? Do other people struggle with the same thing?”

And then I thought, this dataset could provide an answer!

The dataset contains around 20 tables, but the relevant one here is post_questions, which has an answer_count field. Each row also contains tags, a pipe-separated string of tags that I will call software development topics.

I wrote a query that returns the number of answered and unanswered questions for every topic (tag). The results are sorted by the number of unanswered questions in descending order and limited to the top 1000 records.

The magic with split(tags) followed by unnest(tags) is there to turn the pipe-separated string into an array and then flatten it, so that every individual tag gets its own record.
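To make the flatten-and-count step concrete, here is a minimal Python sketch of the same idea. It is not the BigQuery query itself: the sample rows, the function name count_by_tag and the tie-breaking by tag name are all assumptions made for illustration, and a question counts as unanswered when its answer_count is 0.

```python
from collections import Counter

# Hypothetical sample rows shaped like post_questions: (tags, answer_count).
questions = [
    ("python|pandas", 2),
    ("python|apache-spark", 0),
    ("javascript", 0),
    ("javascript|webpack", 1),
    ("python", 0),
]


def count_by_tag(rows, limit=1000):
    """Count answered and unanswered questions per individual tag."""
    answered, unanswered = Counter(), Counter()
    for tags, answer_count in rows:
        # Equivalent of split + unnest: one record per individual tag.
        for tag in tags.split("|"):
            if answer_count > 0:
                answered[tag] += 1
            else:
                unanswered[tag] += 1

    result = [
        {
            "tag": tag,
            "answered": answered[tag],
            "unanswered": unanswered[tag],
            "total": answered[tag] + unanswered[tag],
        }
        for tag in set(answered) | set(unanswered)
    ]
    # Most unanswered first; ties broken by tag name (an assumption).
    result.sort(key=lambda r: (-r["unanswered"], r["tag"]))
    return result[:limit]


top = count_by_tag(questions)
for row in top:
    print(row)
```

In this toy data, python comes first because two of its three questions have no answers.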

Once I had this data, I exported it in CSV format and uploaded it to Google Spreadsheets for further exploration. I also added a new column, percent_unanswered, with values equal to (100 * unanswered) / total.
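The percent_unanswered column is a one-line formula. A small Python sketch of the same calculation, with made-up tags and counts (the function name add_percent_unanswered is also just illustrative):

```python
# Hypothetical rows; the tags and counts are made up for illustration.
rows = [
    {"tag": "tag-a", "unanswered": 250, "total": 1000},
    {"tag": "tag-b", "unanswered": 30, "total": 40},
]


def add_percent_unanswered(rows):
    """Return copies of the rows with percent_unanswered = (100 * unanswered) / total."""
    return [
        {**r, "percent_unanswered": 100 * r["unanswered"] / r["total"]}
        for r in rows
    ]


with_percent = add_percent_unanswered(rows)
for row in with_percent:
    print(row["tag"], row["percent_unanswered"])
```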

Here is a subset of the top 20 records returned by the query:

Here is a chart of the same:

Original at plot.ly: https://plot.ly/~itruong/2/.

My first line of reasoning was that, surely if the topic generates a lot of unanswered questions, it must be very confusing? As we can see, javascript, android, java, php and c# generate the most unanswered questions. Then, are they the most confusing? I agree, modern javascript is intimidating (if you don’t believe me, read this), but C# or Java… maybe not so much?

So perhaps those topics are something wildly popular that a lot of beginners take up. For the next attempt, let’s sort the dataset by the percentage of unanswered questions in total questions asked. If nobody can answer a high percentage of questions on the topic, surely they must be difficult questions? Here is another chart:

Original at plot.ly: https://plot.ly/~itruong/7/.

The topics with the highest percentage of unanswered questions are magento-1.9, jupyter-notebook, jsf-2.2, cordova-plugins, and internet-explorer-11. Now, this looks as though we hit the other end of the spectrum: topics that are too specific or too obscure, so there are not enough experts to answer the questions.

Let’s try combining the two approaches, and sort the dataset by the number of unanswered questions again, but also apply a filter to only see the topics where 20% or more questions went unanswered:
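I did this filtering in the spreadsheet, but the same step can be sketched in Python. The rows below are made up, and the function name confusing_to_many is hypothetical; only the 20% threshold and the top-20 cut come from the text.

```python
# Hypothetical per-tag counts; only the 20% threshold comes from the article.
rows = [
    {"tag": "tag-a", "unanswered": 900, "total": 10000},  # 9% unanswered
    {"tag": "tag-b", "unanswered": 500, "total": 2000},   # 25% unanswered
    {"tag": "tag-c", "unanswered": 50, "total": 100},     # 50% unanswered
    {"tag": "tag-d", "unanswered": 700, "total": 3500},   # 20% unanswered
]


def confusing_to_many(rows, min_percent=20.0, limit=20):
    """Keep tags with at least min_percent unanswered, most unanswered first."""
    kept = [
        r for r in rows
        if 100 * r["unanswered"] / r["total"] >= min_percent
    ]
    kept.sort(key=lambda r: r["unanswered"], reverse=True)
    return kept[:limit]


top = confusing_to_many(rows)
print([r["tag"] for r in top])
```

Here tag-a drops out despite having the most unanswered questions, because only 9% of its questions went unanswered.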

And a chart:

Original at plot.ly: https://plot.ly/~itruong/8/.

Those must be very confusing, right? They generate a lot of questions, and nobody can answer a fair chunk of those questions. Leading the pack are facebook, cordova, google-chrome, android-studio, and opencv.

And look! The issue that I was struggling with belongs to a topic in this top 20 as well! (It was Apache Spark, in 15th position, with 24,460 questions, of which 25% went unanswered.)

Let’s combine the two approaches in a different way: sort the topics by the percentage of unanswered questions, and apply a filter to only see the topics whose number of unanswered questions exceeds the 3rd quartile (75th percentile). The 75th percentile here is 2806. Here is the new top 20:
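A rough Python sketch of this second combination follows. The rows are made up, the function name lacking_experts is hypothetical, and the choice of the "inclusive" quantile method (linear interpolation, similar to what a spreadsheet percentile function typically does) is an assumption; other percentile definitions can give a slightly different cutoff.

```python
import statistics

# Hypothetical rows: (tag, unanswered, total). All numbers are made up.
rows = [
    ("tag-a", 100, 1000),
    ("tag-b", 200, 400),
    ("tag-c", 300, 3000),
    ("tag-d", 400, 1000),
    ("tag-e", 500, 5000),
    ("tag-f", 600, 1200),
    ("tag-g", 700, 7000),
    ("tag-h", 800, 4000),
    ("tag-i", 900, 3000),
]


def lacking_experts(rows, limit=20):
    """Keep tags above the 75th percentile of unanswered counts,
    then sort them by percentage of unanswered questions, highest first."""
    unanswered_counts = [unanswered for _, unanswered, _ in rows]
    # quantiles(n=4) returns the three quartile cut points; index 2 is the 3rd quartile.
    p75 = statistics.quantiles(unanswered_counts, n=4, method="inclusive")[2]
    kept = [
        (tag, unanswered, 100 * unanswered / total)
        for tag, unanswered, total in rows
        if unanswered > p75
    ]
    kept.sort(key=lambda r: r[2], reverse=True)
    return p75, kept[:limit]


p75, top = lacking_experts(rows)
print(p75, top)
```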

And a chart:

Original at plot.ly: https://plot.ly/~itruong/25/.

Similar but different! Here, we have webpack, woocommerce, ionic-framework, visual-studio-2015, and server as the leaders. Incidentally, my old friend, apache-spark, landed in position 15 again. I would call the former top 20 “The topics confusing to the biggest number of developers”, and the latter top 20 “The topics confusing even to experts, or lacking experts”.

I feel slightly better now, knowing that almost 25K people had a problem with my topic that was difficult enough to make them go to StackOverflow and ask a question, and not a single expert was sufficiently knowledgeable to answer almost 6K of those.

Google BigQuery lets you process 1 TB of data per month for free. So if you’re interested, they have many public datasets, and perhaps those can help you answer your burning questions.

Curiosity killed the cat, but it may actually be beneficial to us humans.