Analytics Vidhya
Published in

Analytics Vidhya

Stack Overflow Tag Prediction

If you are a software engineer or a programmer you must have used StackOverflow at least once in your lifetime. But have you ever wondered how StackOverflow predicts the tags for a given question ? In this blog, I will discuss the StackOverflow tag predictor case study.

Contents

  1. Overview of Stack OverFlow Dataset.
  2. Exploratory Data Analysis.
  3. Data Preprocessing.
  4. Downscaling of data.
  5. Train-Test split.
  6. Text Featurization using Tfidf Vectorizer.
  7. Hyper Parameter Tuning.
  8. Logistic Regression with OneVsRest Classifier
  9. OneVsRestClassifier with SVM
  10. Conclusions.
  11. Enhancements.

Stack Overflow is the largest, most trusted online community for developers to learn, share their programming knowledge, and build their careers.

Stack Overflow is something which every programmer use one way or another. Each month, over 50 million developers come to Stack Overflow to learn, share their knowledge, and build their careers. It features questions and answers on a wide range of topics in computer programming. The website serves as a platform for users to ask and answer questions, and, through membership and active participation, to vote questions and answers up or down and edit questions and answers in a fashion similar to a wiki or Digg. As of April 2014 Stack Overflow has over 4,000,000 registered users, and it exceeded 10,000,000 questions in late August 2015. Based on the type of tags assigned to questions, the top eight most discussed topics on the site are Java, JavaScript, C#, PHP, Android, jQuery, Python, and HTML.

Data can Found here.

Youtube Explanation about the problem.

For More Info Read the Research paper here.

Github code repo: StackOverflow tag preditor

Problem Statement

Predict the tags ( keywords, topics, summaries), given only the question text and its title.

Read the full problem statement on Kaggle

  • Predict as many tags as possible with high precision and recall.
  • Incorrect tags could impact customer experience on StackOverflow.
  • No strict latency constraints.

Data

Data contains 4 fields

  1. Id — Unique identifier for each question
  2. Title — The question’s title
  3. Body — The body of the question
  4. Tags — The tags associated with the question

Click here for more details.

All of the data is in 2 files: Train and Test.

Size of Train.csv - 6.75GB
Size of Test.csv - 2GB
Number of rows in Train.csv = 6034195

The questions are randomized and contain a mix of verbose text sites as well as sites related to math and programming. The number of questions from each site may vary, and no filtering has been performed on the questions.

Sample Data Point

Title: Implementing Boundary Value Analysis of Software Testing in a C++ program?Body :#include< iostream>\n #include< stdlib.h>\n\n using namespace std;\n\n int main()\n {\n int n,a[n],x,c,u[n],m[n],e[n][4];\n cout<<"Enter the number of variables";\n cin>>n;\n\n cout<<"Enter the Lower, and Upper Limits of the variables";\n for(int y=1; y<n+1; y++)\n {\n cin>>m[y];\n cin>>u[y];\n }\n for(x=1; x<n+1; x++)\n {\n a[x] = (m[x] + u[x])/2;\n }\n for(int i=1; i<n+1; i++)\n {\n for(int l=1; l<=i; l++)\n {\n if(l!=1)\n {\n cout<<a[l]<<"\\t";\n }\n }\n for(int j=0; j<4; j++)\n {\n cout<<e[i][j];\n for(int k=0; k<n-(i+1); k++)\n {\n cout<<a[k]<<"\\t";\n }\n cout<<"\\n";\n }\n } \n\n system("PAUSE");\n return 0; \n }\n\n\nThe answer should come in the form of a table like\n\nThe output is not coming,can anyone correct the code or tell me what\'s wrong?\n' Tags : 'c++ c'

Mapping the Business problem to a Machine Learning Problem

It is a multi-label classification problem.

In Multi-label Classification, multiple labels (in this problem its tags) may be assigned to each instance and there is no constraint on how many of the classes the instance can be assigned to. Source: Wiki

Find more about multi-label classification problem here

A question on Stackoverflow might be about any of C, Pointers, JAVA, Regex, FileIO, and/or memory-management at the same time or none of these.

Performance metric

  1. Micro F1 score
  2. Macro F1 score

Please read this blog to know more about metrics.

Exploratory Data Analysis

I have used pandas library to load the data. Please visit My GitHub repo to see the full code. I have taken a sample of 200k (2 lakh) data points from Train.csv. Here is a list of major observations from EDA.

  1. Number of rows in the database: 200000
  2. Number of unique tags: 42048
  3. Top 10 important tags: [‘.a’, ‘.app’, ‘.asp.net-mvc’, ‘.aspxauth’, ‘.bash-profile’, ‘.class-file’, ‘.cs-file’, ‘.doc’, ‘.drv’, ‘.ds-store’]
  4. There are total 153 tags which are used more than 10000 times.
  5. 14 tags are used more than 100000 times.
  6. Most frequent tag (i.e. c#) is used 331505 times.
  7. Since some tags occur much more frequently than others, Micro-averaged F1-score is the appropriate metric for this problem.

8. Tags Analysis

  • Maximum number of tags per question: 5
  • Minimum number of tags per question: 1
  • Avg. number of tags per question: 2.899440
  • Questions with 3 tags appeared more in number

C# appears most number of times, Java is the second most. Majority of the most frequent tags are programming language. And here is the chart for top 20 tags

  • Majority of the most frequent tags are programming language.
  • C# is the top most frequent programming language.
  • Android, IOS, Linux and windows are among the topmost frequent operating systems.

Cleaning and preprocessing of Questions

Due to hardware limitations, I am considering only 200K data points

  1. questions contains HTML tag code tag ,So separate out code-snippets from the Body.
  2. Remove Special Characters from title and Body (not in code).
  3. Remove stop words (Except ‘C’).
  4. Remove HTML Tags.
  5. Convert all the characters into small letters.
  6. Use SnowballStemmer to stem the words. Stemming is the process of reducing a word to its word stem. For Example: “python” is the stem word for the words [“python” “pythoner”, “pythoning”,”pythoned”].
  7. Give more weightage to title: Add title three times to the question. Title contains the information which is more specific to the question and also only after seeing the question title, a user decides whether to look into the question in detail. At least most of the users do this if not all.

Sample question after preprocessing

Machine Learning Models

  • Total number of questions: 199999
  • Total number of tags: 23747

Here we are going to use Problem Transformation(Binary Relevance) method to solve the problem.

Binary Relevance

Here we are going to convert the multi-label classification problem into multiple single class classification problems. For example, if we are having 5 multi-label classification problem, then we need to train 5 single class classification models.

Basically, in this method, we treat each label (in our case its tag) as a separate single class classification problem. This technique is simple and is widely used.

Please refer to this blog to know more about the techniques to solve a Multi-Label classification problem.

Downscaling of data

Coming back to our stackoverflow predictor problem, we need to train 23437 models literally!!! That's really huge (both in terms of time & speed) for a system with 8GB RAM & i5 processor. So we will sample the number of tags instead considering all of them. But how many tags to be sampled with the minimal information loss ? Plotting ‘percentage of questions covered’ Vs ‘Number of tags’ would help to solve this.

Observations

  1. with 500 tags we are covering 90.446% of questions
  2. with 5000 tags we are covering 98.77 % of questions
  3. By choosing only 500 tags (2% approximately) of the total 23437 tags we are losing only 9% of the questions & also training 500 models is reasonably good, So we shall choose 500 tags.

Train and Test data

If the data had timestamps attached for each of the questions, then splitting data with respect to its temporal nature would have made more sense than splitting data randomly. But since the data is not of temporal nature (i.e., no timestamp), we are splitting data randomly into 80% train set & 20% test set.

Featuring Text Data with TfIdf vectorizer

There are various ways to feature text data. I have explained this in my blog post. First, let's feature the question data with a TfIdf vectorizer.

Applying Logistic Regression with OneVsRest Classifier

Hyperparameter tuning to obtain the best alpha

Let's use Logistic Regression also to train 500 models (500 tags) is nothing but SGDClassifier with Log loss.

Results

  • Micro F1-measure: 0.492
  • Accuracy: 0.222
  • Hamming Loss:0.0029

OneVsRestClassifier with SVM

Let's use the SVM algorithm to train 500 models. Linear-SVM is nothing but SGDClassifier with loss as hinge. After finding the hyperparameter alpha using GridSearchCV , I found out alpha to be 10**6.

Results

  • Micro F1-measure: 0.487
  • Accuracy: 0.201
  • Hamming Loss:0.0031

Conclusions

From all the models we used so far, Logistic Regression with TfIdf vectorizer and n_grams=(1,3) performed better than the rest of the models. But we have trained the Logistic Regression model with a large number of data points, so comparing this model with the rest of the models, which are trained with lesser data points, will not make sense.

Here is the result of all the models….

Here we have trained only two linear models. Other models like tree-based models will not work well here due to the high dimension of data. Linear SVM has high time complexity so training 500 models on linear SVM is not feasible. Hence, SGD classifier is the best choice in our case.

Problem with complex models like Random Forests or GBDT ?

As you might have noticed, I have taken the simplest model like Logistic Regression & Linear SVM to train the model. Here is the two primary main reasons why the complex models were not tried

High dimensional data: since we are converting text to TfIdf or BOW vectors, the dimensions we get are very large in size. And when the dimensions are large, typically Random Forests & GBDT won’t work well.

Too many models to train: We have literally 500 models to train (after downscaling of data). And Logistic Regression is the simplest model one can use & it is comparatively faster. If we start using other models like RBF-SVM or RF, it will take too much time to train the model. For me, it took more time to train Linear SVM, that too after downscaling of data by a large margin.

Enhancements

To try with more data points (on a system with 32GB RAM & highend processor) Featurizing Text Data with Word2Vec: When you try Word2Vec, the dimentionality of data reduces & hence complex models like Random Forests or GBDT might work well.

Try using the scikit-multilearn library. Please note that this library doesn’t take a sparse matrix as input, you need to give a dense matrix as input. So obviously you need to have more RAM to use this library.

To understand the full code please visit my GitHub link.

Thanks for reading and your patience. I hope you liked the post, let me know if there are any errors in my post.

References

  • Applied AI
  • Wikipedia
  • Coursera
  • Data Camp

Contact:Email LinkedIn Github Medium Twitter

The blog is also published at https://sachin-d-n.github.io.

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Sachin D N

Sachin D N

Data Engineer and Trained on Data Science