Tag Predictor

Perul Jain
Published in Analytics Vidhya · 6 min read · Dec 24, 2019

If you are a developer, this problem statement will excite you. As an Android developer and machine learning enthusiast, I often look up functions I have forgotten, and Stack Overflow helps me with this. Have you ever wondered how Stack Overflow predicts tags? Tags are essential because they route each question to the experts on that topic.

What is Stack Overflow?

Stack Overflow is a platform where students and professionals post queries and answer questions about programming. It is a place to showcase their knowledge. It is the flagship site of the Stack Exchange Network. Answers are up-voted based on their usefulness to the community. Users can also use Stack Overflow to advance their careers. It is a community of more than seven million programmers.

Problem statement

We have to predict the tags of a question posted on Stack Overflow.

Business objective and constraints

Suppose a user asks the question “How to change color of button in android when clicked”. Given the question title and its description, we want to predict some tags. In this example the tag is android, so the Stack Overflow ecosystem will send the question to people who have already answered many questions related to Android.

If a predicted tag is irrelevant to the question, the question will be sent to the wrong people, which hurts the user experience. Misclassified tags could therefore have a large impact on the business.

There are no strict latency constraints, so we can take some time to predict. Our main goal is to predict the tags of each data point correctly.

Machine Learning Problem Statement

It is a multi-label classification problem.

There is a difference between a multi-class classification problem and a multi-label classification problem.

Multi-class classification means a classification task with more than two classes; e.g., classify a set of images of fruits which may be oranges, apples, or pears. Multi-class classification makes the assumption that each sample is assigned to one and only one label: a fruit can be either an apple or a pear but not both at the same time.

Multi-label classification assigns to each sample a set of target labels. This can be thought of as predicting properties of a data point that are not mutually exclusive, such as topics that are relevant for a document. A text might be about any of religion, politics, finance or education at the same time, or none of these.
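To make the multi-label target concrete, here is a minimal sketch that turns space-separated tag strings into a 0/1 matrix with scikit-learn's MultiLabelBinarizer. The tag strings are made up for illustration and are not taken from the dataset.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical tag strings in the space-separated format of the Tags column.
tag_strings = ["android button", "python pandas sqlite", "android"]

# Each question becomes a list of tags, then one 0/1 row per question.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform([s.split() for s in tag_strings])

print(list(mlb.classes_))  # column order of the label matrix
print(Y)                   # one row per question, one column per tag
```

A row can contain several ones, which is exactly what separates this setting from multi-class classification, where every row would have a single one.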

Performance metric

The F1 score is the harmonic mean of precision and recall: F1 = 2 * (precision * recall) / (precision + recall).

Precision is the fraction of predicted tags that are correct. Recall is the fraction of the correct tags that were actually predicted.
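As a small worked example, the sketch below computes precision, recall and F1 over a handful of made-up questions. It pools the counts across all questions (micro-averaging), which is an assumption on my part, since the averaging scheme is not stated above.

```python
def micro_f1(true_tags, predicted_tags):
    """Micro-averaged precision, recall and F1 over lists of tag sets."""
    # True positives: predicted tags that are also in the true tag set.
    tp = sum(len(t & p) for t, p in zip(true_tags, predicted_tags))
    n_pred = sum(len(p) for p in predicted_tags)
    n_true = sum(len(t) for t in true_tags)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up example: one wrong tag ("java") and one missed tag ("button").
true_tags = [{"android", "button"}, {"python"}]
predicted_tags = [{"android", "java"}, {"python"}]
precision, recall, f1 = micro_f1(true_tags, predicted_tags)
print(precision, recall, f1)
```

Here two of the three predicted tags are correct and two of the three true tags are recovered, so precision, recall and F1 all come out at 2/3.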

Dataset

Id — unique identifier for each question

Title — Question’s title

Body — Body of the question

Tags — tags associated with the question in a space separated format.

Exploratory data analysis

This is the most important part of any machine learning problem. It gives us deeper insight into the data and lets us derive new features from existing ones.

Using pandas with SQLite to load the data:
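A minimal sketch of this loading pattern is shown below. It uses an in-memory SQLite database and a tiny made-up table so that it runs anywhere; the table name `data` and the sample rows are assumptions for illustration, not the original setup.

```python
import sqlite3

import pandas as pd

# Tiny stand-in for the training data (Id, Title, Body, Tags); values are made up.
raw = pd.DataFrame({
    "Id": [1, 2, 3],
    "Title": ["Change button color", "Sort a list", "Read a csv file"],
    "Body": ["<p>body one</p>", "<p>body two</p>", "<p>body three</p>"],
    "Tags": ["android", "python list", "python pandas"],
})

# A real run would connect to a database file on disk instead of ":memory:".
con = sqlite3.connect(":memory:")
raw.to_sql("data", con, index=False)  # "data" is a hypothetical table name

# Pull only the columns needed for the analysis back into pandas.
df = pd.read_sql_query("SELECT Title, Body, Tags FROM data", con)
con.close()
print(df.shape)
```

Keeping the data in SQLite means later steps can query only the rows and columns they need instead of re-reading the whole file.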

Analysis of Tags

There are some NaN values in the Tags column that we have to replace. For simplicity, I replace them with a blank string.

code-snippet

df_no_dup.fillna(" ",inplace=True)

Number of duplicate rows -> 1827881

We have to remove duplicates because they bias our model.

Create non-duplicate database
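One possible way to do this, sketched with pandas and an in-memory SQLite database. Treating rows with the same Title, Body and Tags as duplicates is an assumption here, and the table name `no_dup_train` and the sample rows are hypothetical.

```python
import sqlite3

import pandas as pd

# Made-up sample in which the first two rows are exact duplicates.
df = pd.DataFrame({
    "Title": ["How to sort a list", "How to sort a list", "Change button color"],
    "Body": ["<p>body a</p>", "<p>body a</p>", "<p>body b</p>"],
    "Tags": ["python list", "python list", "android"],
})

# Assumption: two rows are duplicates when Title, Body and Tags all match.
key_columns = ["Title", "Body", "Tags"]
n_duplicates = int(df.duplicated(subset=key_columns).sum())
df_no_dup = df.drop_duplicates(subset=key_columns).reset_index(drop=True)

# Store the de-duplicated rows so that later steps can start from them.
con = sqlite3.connect(":memory:")  # a real run would use a file on disk
df_no_dup.to_sql("no_dup_train", con, index=False)
n_rows = con.execute("SELECT COUNT(*) FROM no_dup_train").fetchone()[0]
con.close()
print(n_duplicates, n_rows)
```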

Number of Unique_Tags -> 1827881

The plot above shows that some tags are far more frequent than others. Let’s look at the most frequently occurring tags.

Most of the tags are programming languages. As I mentioned in the introduction, Stack Overflow is widely used by developers. Some operating systems also appear, such as Windows and Linux.
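Counting unique tags and finding the most frequent ones can be sketched with the standard library alone. The tag strings below are made up for illustration.

```python
from collections import Counter

# Hypothetical tag strings in the space-separated format of the Tags column.
tag_strings = ["c# .net", "java android", "java", "python", "java python"]

# Count how many questions each tag appears in.
tag_counts = Counter(tag for s in tag_strings for tag in s.split())

n_unique_tags = len(tag_counts)
top_tags = tag_counts.most_common(2)  # the two most frequent tags
print(n_unique_tags, top_tags)
```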

Number of Tags in question

Most of the questions have 2 or 3 tags.
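The number of tags per question can be derived directly from the space-separated Tags column. This is a small sketch on made-up values, not the real distribution.

```python
import pandas as pd

# Hypothetical Tags column.
tags = pd.Series(["android button", "python pandas sqlite", "java",
                  "c# .net", "php mysql pdo"])

# Number of tags attached to each question.
tag_count = tags.str.split().str.len()

# How many questions have 1, 2, 3, ... tags.
distribution = tag_count.value_counts().sort_index()
print(distribution)
```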

Data preprocessing

1. Remove HTML tags such as <br> from the question.

2. Remove all code from the body.

3. Remove stop words, except “c”, because C is a programming language.

4. Convert all characters to lower case and remove special characters from the question.

5. Apply a Snowball or Porter stemmer for stemming.
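The five steps above can be sketched with the standard library as follows. Several details are assumptions rather than the original implementation: code is taken to live inside `<code>...</code>` blocks, the stop-word list is a tiny illustrative one, single letters other than "c" are dropped, and the stemmer is passed in as a function (the default leaves words unchanged).

```python
import html
import re

# Tiny illustrative stop-word list (an assumption); a real run would use a
# complete list such as the one shipped with NLTK.
STOP_WORDS = {"a", "an", "the", "is", "in", "of", "to", "how", "when", "i", "my"}


def clean_question(title, body, stem=lambda word: word):
    """Apply the five preprocessing steps to one question."""
    # Step 2: drop code, assumed to sit inside <code>...</code> blocks.
    body = re.sub(r"<code>.*?</code>", " ", body, flags=re.DOTALL)
    text = title + " " + body
    # Step 1: strip remaining HTML tags such as <br>, then decode entities.
    text = html.unescape(re.sub(r"<[^>]+>", " ", text))
    # Step 4: lower-case and replace every run of non-letters with a space.
    text = re.sub(r"[^a-z]+", " ", text.lower())
    # Step 3: remove stop words and single letters, but keep "c".
    words = [w for w in text.split()
             if w not in STOP_WORDS and (len(w) > 1 or w == "c")]
    # Step 5: stem each remaining word with the supplied function.
    return " ".join(stem(w) for w in words)


cleaned = clean_question(
    "How to change color of button in C",
    "<p>I tried:<br></p><code>btn.setColor(1)</code>",
)
print(cleaned)
```

To get real stemming, a stemmer such as `SnowballStemmer("english").stem` from `nltk.stem.snowball` can be passed as the `stem` argument.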


After preprocessing the dataset looks like this

Machine Learning Models

This is a multi-label classification problem, so we will first convert it into a set of single-label problems.

Binary Relevance Technique

It treats each label as a separate binary classification problem.

X denotes the data points and Y1, Y2, Y3, Y4 are the labels. If the entry for X2 and Y1 equals one, X2 has label Y1.

The problem now becomes: X is a data point and Yi is a single label. In other words, it reduces to one binary classifier per label.
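The decomposition can be sketched with a small label matrix. The values are made up; the only point is that each column becomes the target of its own binary problem.

```python
import numpy as np

# Rows are data points X1..X4, columns are labels Y1..Y4 (values are made up).
Y = np.array([
    [0, 1, 1, 0],  # X1
    [1, 0, 0, 0],  # X2 has label Y1
    [0, 1, 0, 0],  # X3
    [1, 0, 0, 1],  # X4
])
label_names = ["Y1", "Y2", "Y3", "Y4"]

# Binary relevance: one 0/1 target vector per label, each learned independently.
binary_problems = {name: Y[:, j] for j, name in enumerate(label_names)}

for name, target in binary_problems.items():
    print(name, target)
```

Each of these vectors can then be fed, together with the same feature matrix, to any ordinary binary classifier.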

Logistic Regression (One-vs-Rest Classifier)

In a One-vs-Rest classifier we train n binary classifiers, one per label. For a multi-class problem, the predicted class is the one whose classifier reports the highest confidence score; for a multi-label problem, each classifier decides independently whether its label applies.

The multi-label algorithm accepts a binary mask over multiple labels. The result for each prediction is an array of 0s and 1s marking which class labels apply to each input sample.
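A minimal sketch of this setup with scikit-learn is shown below. The toy questions, the TF-IDF features and the default LogisticRegression settings are assumptions chosen to keep the example small; they are not the configuration used for the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up questions and tags, for illustration only.
questions = [
    "how to change button color in android when clicked",
    "android activity lifecycle explained",
    "convert string to int in java",
    "merge two dataframes in pandas",
    "read csv file with python",
    "java arraylist sort example",
]
tags = [["android", "java"], ["android"], ["java"],
        ["python", "pandas"], ["python"], ["java"]]

# Text features and the 0/1 label matrix (the binary mask over labels).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(questions)
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)

# One logistic regression per tag.
clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X, Y)

# Prediction is an array of 0s and 1s, one column per tag.
pred = clf.predict(vectorizer.transform(["button color in android"]))
predicted_tags = mlb.inverse_transform(pred)
print(pred, predicted_tags)
```

With only six training questions the predictions themselves are not meaningful; the sketch is meant to show the shape of the inputs and outputs.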


You can also try other techniques such as classifier chains and label powerset.
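As one example of these alternatives, here is a rough sketch of a classifier chain using scikit-learn's ClassifierChain, in which each classifier also sees the labels handled earlier in the chain. The toy data and the choice of LogisticRegression as the base model are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up questions and tags, for illustration only.
questions = [
    "how to change button color in android when clicked",
    "android activity lifecycle explained",
    "convert string to int in java",
    "merge two dataframes in pandas",
    "read csv file with python",
    "java arraylist sort example",
]
tags = [["android", "java"], ["android"], ["java"],
        ["python", "pandas"], ["python"], ["java"]]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(questions)
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)

# Each link in the chain is a binary classifier for one tag; later links
# receive the earlier tags as extra input features.
chain = ClassifierChain(LogisticRegression(), random_state=0)
chain.fit(X, Y)

chain_pred = chain.predict(vectorizer.transform(["read csv in pandas"]))
print(chain_pred)
```

Unlike binary relevance, the chain can pick up dependencies between tags, at the cost of making the result depend on the order of the labels.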

Conclusions:

  1. I spent most of my time on text preprocessing.
  2. I took the top 15% most frequently used tags and used only these tags for the machine learning models.
  3. I saved three different databases during preprocessing; otherwise, rerunning the Jupyter notebook would take 3–4 hours.
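The tag selection in point 2 can be sketched as follows. The tag lists are made up, and rounding the 15% cut-off up to a whole number of tags is an assumption.

```python
import math
from collections import Counter

# Made-up tag lists, one per question.
tag_lists = [
    ["java", "android"],
    ["java", "spring"],
    ["java", "android", "gradle"],
    ["java", "maven"],
    ["android", "kotlin"],
    ["python", "pandas"],
    ["c#", ".net"],
    ["php"],
]

counts = Counter(tag for tag_list in tag_lists for tag in tag_list)

# Keep the top 15% of tags by frequency (rounded up, at least one tag).
keep_n = max(1, math.ceil(0.15 * len(counts)))
top_tags = {tag for tag, _ in counts.most_common(keep_n)}

# Drop all other tags and measure how many questions still have a tag.
filtered = [[t for t in tag_list if t in top_tags] for tag_list in tag_lists]
coverage = sum(1 for tag_list in filtered if tag_list) / len(tag_lists)
print(sorted(top_tags), coverage)
```

Checking the coverage this way shows how many questions are left without any tag after the cut, which helps when choosing the threshold.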
