Tag Predictor
If you are a developer, this problem statement will excite you. As an Android developer and machine learning enthusiast, I often look up functions I have forgotten, and Stack Overflow helps me with that. Have you ever wondered how Stack Overflow predicts tags? Tags are essential because they route a question to the experts on those tags.
What is Stack Overflow?
Stack Overflow is a platform where students and professionals post queries and answer questions about programming. It is a place to showcase their knowledge. It is owned by the Stack Exchange Network. Answers are upvoted based on their usefulness to the community. Users can also use Stack Overflow to advance their careers. It is a community of more than seven million programmers.
Problem statement
We have to predict the tags of a question posted on Stack Overflow.
Business objective and constraints
Suppose a user asks the question “How to change color of button in android when clicked”. Given the question and its description, we want to predict some tags. In the picture above the tag is android, so the Stack Overflow ecosystem sends this question to people who have already answered many questions related to Android.
If the predicted tag is irrelevant to the question, the question goes to the wrong person, which hurts the user experience. Misclassified tags could therefore have a large impact on the business.
There are no strict latency constraints, so we can take some time to predict. Our main goal is to predict the tags of each data point correctly.
Machine Learning Problem Statement
It is a multi-label classification problem.
There is a difference between a multi-class classification problem and a multi-label classification problem.
Multi-class classification means a classification task with more than two classes; e.g., classify a set of images of fruits which may be oranges, apples, or pears. Multi-class classification makes the assumption that each sample is assigned to one and only one label: a fruit can be either an apple or a pear but not both at the same time.
Multi-label classification assigns to each sample a set of target labels. This can be thought of as predicting properties of a data point that are not mutually exclusive, such as the topics that are relevant for a document. A text might be about any of religion, politics, finance or education at the same time, or none of these.
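To make the difference concrete, here is a small sketch of how the targets look in each case. The sample labels are made up for illustration; the point is only that a multi-label target is a set per sample, which can be encoded as a row of 0s and 1s.

```python
# Multi-class: exactly one label per sample.
fruit_labels = ["apple", "pear", "orange", "apple"]

# Multi-label: a (possibly empty) set of labels per sample.
doc_topics = [
    {"politics", "finance"},
    {"religion"},
    set(),                       # none of the topics apply
    {"education", "politics"},
]

# Encode the multi-label targets as a binary indicator matrix,
# one column per topic, in a fixed (sorted) column order.
topics = sorted(set().union(*doc_topics))
indicator = [[int(topic in labels) for topic in topics] for labels in doc_topics]

print(topics)
for row in indicator:
    print(row)
```

Each row can contain any number of 1s, which is exactly what a multi-class target cannot express.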
Performance metric
The F1 score is the harmonic mean of precision and recall: F1 = 2 * (precision * recall) / (precision + recall).
Precision is, out of all the tags predicted, the fraction that are correct. Recall is, out of all the correct tags, the fraction that were predicted.
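As a worked example, the sketch below computes precision, recall and F1 for a handful of made-up questions. It pools the counts over all questions and tags (micro-averaging); that pooling is my assumption here, since the formula above does not say how to average across tags.

```python
def micro_precision_recall_f1(true_tags, predicted_tags):
    """Pool true/false positives over all questions and tags (micro-averaging)."""
    tp = fp = fn = 0
    for truth, pred in zip(true_tags, predicted_tags):
        tp += len(truth & pred)   # predicted and correct
        fp += len(pred - truth)   # predicted but wrong
        fn += len(truth - pred)   # correct but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical ground truth and predictions for three questions.
true_tags = [{"android", "java"}, {"python"}, {"c", "linux"}]
predicted_tags = [{"android"}, {"python", "pandas"}, {"c", "windows"}]

precision, recall, f1 = micro_precision_recall_f1(true_tags, predicted_tags)
print(precision, recall, f1)
```

Here 3 of the 5 predicted tags are correct and 3 of the 5 true tags are found, so precision, recall and F1 all come out at about 0.6.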
Dataset
Id — unique identifier for each question
Title — Question’s title
Body — Body of the question
Tags — tags associated with the question in a space separated format.
Exploratory data analysis
This is the most important part of any machine learning problem. It gives us a deeper understanding of our data and lets us derive new features from the existing ones.
Using Pandas with sqlite to load data:
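A minimal sketch of this step, assuming the data sits in a SQLite table. The table name `questions` and the sample rows are hypothetical; an in-memory database is used only so the example runs on its own.

```python
import sqlite3

import pandas as pd

# Build a tiny in-memory database so the example is self-contained.
# The table name "questions" and these rows are made up for illustration.
conn = sqlite3.connect(":memory:")
sample = pd.DataFrame({
    "Id": [1, 2, 3],
    "Title": ["Change button color", "Merge two dataframes", "Untitled"],
    "Body": ["<p>How do I ...</p>", "<p>I have two ...</p>", "<p>Help</p>"],
    "Tags": ["android button", "python pandas", None],
})
sample.to_sql("questions", conn, index=False)

# Load the columns we need into a DataFrame with a plain SQL query.
df = pd.read_sql_query("SELECT Id, Title, Body, Tags FROM questions", conn)
conn.close()

print(df.shape)
print(df["Tags"].isna().sum(), "rows have no tags")
```

Keeping the data in SQLite means a later step can select only the rows or columns it needs instead of re-reading one large CSV file.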
Analysis of Tags
There are some NaN values in the Tags column which we have to replace. For simplicity I replace them with a blank string.
code-snippet
df_no_dup.fillna(" ",inplace=True)
Number of duplicate rows -> 1827881
We have to remove duplicates because they bias our model.
Create non-duplicate database
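One way to do this with pandas is sketched below. Treating two rows as duplicates when Title, Body and Tags all match is my assumption, and the table name `no_dup_train` is hypothetical.

```python
import sqlite3

import pandas as pd

# Made-up rows; the first two are exact duplicates of each other.
df = pd.DataFrame({
    "Title": ["How to sort a list", "How to sort a list", "Change button color"],
    "Body": ["<p>body a</p>", "<p>body a</p>", "<p>body b</p>"],
    "Tags": ["python list", "python list", "android"],
})

# Rows that repeat an earlier (Title, Body, Tags) combination.
key_columns = ["Title", "Body", "Tags"]
dup_mask = df.duplicated(subset=key_columns)
print("Number of duplicate rows ->", int(dup_mask.sum()))

# Keep the first occurrence of each question.
df_no_dup = df.drop_duplicates(subset=key_columns).reset_index(drop=True)

# Store the de-duplicated rows in their own table so later steps can reuse them.
conn = sqlite3.connect(":memory:")
df_no_dup.to_sql("no_dup_train", conn, index=False)
stored_rows = conn.execute("SELECT COUNT(*) FROM no_dup_train").fetchone()[0]
conn.close()
print("Rows stored ->", stored_rows)
```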
Number of Unique_Tags -> 1827881
We see in the plot above that some tags are much more frequent than others. Let’s look at the most frequent tags.
Most of the tags are programming languages, which fits with Stack Overflow being widely used by developers, as I mentioned in the introduction. Some operating systems also appear, such as Windows and Linux.
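Counting tag frequencies can be sketched with a `Counter` over the space-separated Tags column. The sample tag strings below are made up.

```python
from collections import Counter

# Each entry mimics one row of the space-separated Tags column.
tags_column = ["android java", "python pandas", "java", "c linux", "python"]

# Split every row into individual tags and count how often each one occurs.
tag_counts = Counter(tag for tags in tags_column for tag in tags.split())

print("Number of unique tags ->", len(tag_counts))
print("Most frequent tags ->", tag_counts.most_common(2))
```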
Number of Tags in question
Most of the questions have 2 or 3 tags.
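The number of tags per question can be checked the same way. This is only a sketch on made-up rows, not the real distribution.

```python
from collections import Counter

# Each entry mimics one row of the space-separated Tags column.
tags_column = ["android java", "python pandas numpy", "java", "c linux", "python django"]

# Number of tags attached to each question.
tags_per_question = [len(tags.split()) for tags in tags_column]

# How many questions have 1 tag, 2 tags, 3 tags, ...
distribution = Counter(tags_per_question)
for n_tags in sorted(distribution):
    print(n_tags, "tag(s):", distribution[n_tags], "question(s)")
```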
Data preprocessing
1. Remove HTML tags from the question, such as <br>.
2. Remove all code from the body.
3. Remove stop words, except “c”, because C is a programming language.
4. Convert all the characters into lower case and remove special characters from question.
5. Use snowball stemmer or porter stemmer for stemming.
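The steps above can be sketched with regular expressions from the standard library. Several things here are assumptions on my side: the tiny stop-word list stands in for a full one, code is assumed to live inside `<code>` tags, stray single letters other than “c” are dropped along with the stop words, and the stemmer is passed in as a function so the sketch runs without extra packages.

```python
import html
import re

# Small illustrative stop-word list (an assumption); a real run would use a
# full English list. Note that "c" is deliberately not in it.
STOP_WORDS = {"a", "an", "the", "is", "in", "to", "of", "and", "how", "i", "my", "when", "it"}


def preprocess(body, stem=lambda word: word):
    # Step 2: drop <code>...</code> blocks first, while the tags still exist.
    text = re.sub(r"<code>.*?</code>", " ", body, flags=re.DOTALL | re.IGNORECASE)
    # Step 1: strip the remaining HTML tags such as <p> or <br>.
    text = re.sub(r"<[^>]+>", " ", text)
    text = html.unescape(text)
    # Step 4: lower-case, then replace everything except letters and digits.
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    # Step 3: drop stop words and stray single letters, but keep "c".
    words = [w for w in text.split()
             if w not in STOP_WORDS and (len(w) > 1 or w == "c")]
    # Step 5: stem each remaining word (identity function by default).
    return " ".join(stem(w) for w in words)


raw = "<p>How to change the color of a Button in C?</p><code>int x = 1;</code><br>"
print(preprocess(raw))
```

To get real stemming, the `stem` method of NLTK's `SnowballStemmer("english")` can be passed as the `stem` argument, assuming NLTK is installed.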
After preprocessing the dataset looks like this
Machine Learning Models
This is a multi-label classification problem, and we will now convert it into a set of single-label problems.
Binary Relevance Technique
It treats each label as a separate binary classification problem.
X1, X2, ... are data points and Y1, Y2, Y3, Y4 are labels. If the entry for X2 under Y1 is 1, then X2 has label Y1.
For each label Yi the problem now becomes: given a data point X, does Yi apply or not? So the problem breaks down into one binary classifier per label.
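A small sketch of this decomposition, with a made-up label matrix: each column of the matrix becomes the 0/1 target of its own binary problem.

```python
labels = ["Y1", "Y2", "Y3", "Y4"]

# Rows are data points X1..X3, columns are labels Y1..Y4 (made-up values).
Y = [
    [0, 1, 1, 0],   # X1
    [1, 0, 0, 0],   # X2
    [0, 1, 0, 1],   # X3
]

# Binary relevance: one independent binary target vector per label.
binary_targets = {name: [row[j] for row in Y] for j, name in enumerate(labels)}

for name, target in binary_targets.items():
    print(name, target)
```

The target for Y1 is 1 only at X2, so the first binary classifier learns "does Y1 apply?" and ignores the other labels entirely.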
Logistic Regression (One vs Rest-Classifier)
In a One-vs-Rest classifier we train n binary classifiers, one per label. In the multi-class setting the predicted class is the one whose classifier reports the highest confidence score; in the multi-label setting each classifier decides independently whether its label applies.
The multi-label algorithm accepts a binary mask over multiple labels. The result for each prediction will be an array of 0s and 1s marking which class labels apply to each input sample.
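Here is a minimal sketch of this setup with scikit-learn. The toy questions and tags are made up, and using TF-IDF features is my assumption for the example rather than a statement about the features used in this project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy, already-preprocessed questions with their tags (hypothetical data).
questions = [
    "change button color android click",
    "android activity lifecycle java",
    "pandas dataframe merge columns python",
    "python list comprehension sort",
    "java string split regex",
    "pandas read csv python encoding",
]
tags = [
    ["android", "java"],
    ["android", "java"],
    ["python", "pandas"],
    ["python"],
    ["java"],
    ["python", "pandas"],
]

# Text -> sparse feature matrix.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(questions)

# Tag lists -> binary mask with one column per tag.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)

# One logistic regression per tag, trained independently.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, Y)

pred = clf.predict(vectorizer.transform(["android button java"]))
print(mlb.classes_)
print(pred)                         # one row of 0s and 1s, one column per tag
print(mlb.inverse_transform(pred))  # the same row mapped back to tag names
```

With so little data the predicted tags are not meaningful; the sketch only shows the shapes involved and how the binary mask maps back to tag names.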
You can also try other techniques, such as classifier chains or the label powerset method.
Conclusions:
- I spent most of my time on preprocessing the text.
- I took the top 15% most used tags and used only these tags for the machine learning models.
- I created three different databases during preprocessing; otherwise re-running the Jupyter notebook would take 3–4 hours.
References