How To Make an Auto Tagger Using Multi-label Classifier — Part 1
Multi-label Classification for NLP in Python
Multi-label classification is the type of problem where each object is assigned a set of one or more target labels. A familiar example is predicting the genres of a movie or song: a movie or song can have several genres at the same time, and these target labels don’t need to be mutually exclusive.
You must have come across StackOverflow. Have you ever wondered how relevant tags are suggested while you are asking a question on the platform? This is a typical use case of multi-label classification.
Here is the demo
If you are excited to know how this works, then buckle up, because I’m going to walk you through a step-by-step approach to building a tool that suggests relevant tags for a question.
This is going to be a 5 part tutorial series, and this is the very first part. In this part, we are going to explore our data and do some data analysis. Sounds interesting? But where is the data?
We are going to use the StackSample: 10% of Stack Overflow Q&A dataset.
numpy:- NumPy is the fundamental package for scientific computing with Python.
pandas:- It is used for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.
[‘Questions.csv’, ‘Tags.csv’, ‘Answers.csv’]
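The file list above can be produced by listing the dataset directory with `os.listdir`. A minimal sketch, assuming the Kaggle archive was extracted into a folder named `stacksample` (the path is an assumption; adjust it to your setup):

```python
import os

# Hypothetical path to the extracted StackSample dataset
data_dir = "stacksample"

# Guarded so the snippet runs even if the folder isn't present yet
if os.path.isdir(data_dir):
    print(os.listdir(data_dir))  # e.g. ['Questions.csv', 'Tags.csv', 'Answers.csv']
```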
Load the dataset:-
Here we are using the read_csv function of pandas to load our data into a data frame.
A DataFrame is a table or a two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column.
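As a sketch of the loading step: for the real files you would call something like `pd.read_csv("Questions.csv", encoding="ISO-8859-1")` (this dump commonly needs that encoding). Here a tiny in-memory CSV keeps the example self-contained; the column names are illustrative:

```python
import io
import pandas as pd

# Tiny stand-in CSV; with the real dataset, pass the file path instead:
# questions = pd.read_csv("Questions.csv", encoding="ISO-8859-1")
sample = io.StringIO("Id,Score,Title\n10,12,How do I sort a list?\n20,3,What is a dict?\n")
questions = pd.read_csv(sample)

print(questions.head())
print(questions.columns.tolist())  # ['Id', 'Score', 'Title']
```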
Let’s peek into the question’s data frame.
- Questions contain the title, body, creation date, closed date (if applicable), score, and owner ID for all non-deleted Stack Overflow questions whose Id is a multiple of 10.
This is how the answers data frame looks:
- Answers contain the body, creation date, score, and owner ID for each of the answers to these questions. The ParentId column links back to the Questions table.
And the tags data frame:
- Tags contain the tags on each of these questions
Let’s see how many questions, answers, and tags we have got.
The shape attribute returns the dimensions of the data frame. In our case it is two-dimensional: the first number represents how many entries we have, and the second number represents how many properties each entry has.
Ques shape: (1264216, 7)
Ans shape: (2014516, 6)
Tags shape: (3750994, 2)
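The printed tuples come straight from the `.shape` attribute. A small stand-in frame to illustrate; with the real data loaded, `questions.shape`, `answers.shape`, and `tags.shape` report the numbers above:

```python
import pandas as pd

# Stand-in frame; .shape returns (number of rows, number of columns)
questions = pd.DataFrame({"Id": [10, 20, 30],
                          "Title": ["a", "b", "c"]})

print("Ques shape:", questions.shape)  # (3, 2) -> 3 rows, 2 columns
```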
Let’s visualize our dataset to understand it better.
Plotting the distribution of question Ids vs their answer counts. To plot this graph we will be using matplotlib.
matplotlib:- Matplotlib is an amazing visualization library in Python for 2D plots of arrays.
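A minimal sketch of such a distribution plot, using made-up counts purely for illustration (x is the number of answers, y is how many questions have that many answers):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs anywhere
import matplotlib.pyplot as plt

# Stand-in counts: how many questions have 0, 1, 2, 3 answers
answer_counts = [0, 1, 2, 3]
question_counts = [120, 300, 150, 40]

plt.bar(answer_counts, question_counts)
plt.xlabel("Answers per question")
plt.ylabel("Number of questions")
plt.title("Distribution of answer counts")
plt.savefig("answer_count_distribution.png")
```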
A Counter is a child class of dict which, as its name suggests, counts hashable objects. Basically, it stores elements as dictionary keys and their counts as dictionary values:
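For example, counting how many answers each question received can be done by feeding the `ParentId` column of the answers table to a Counter (stand-in ids here):

```python
from collections import Counter

# Stand-in ParentId values from the answers table: each occurrence
# of an id is one answer to that question
parent_ids = [10, 10, 20, 30, 30, 30]
answer_counts = Counter(parent_ids)

print(answer_counts)      # Counter({30: 3, 10: 2, 20: 1})
print(answer_counts[30])  # 3
```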
Plotting the distribution of Answers count vs Questions count
Plotting the distribution of Tags count vs Questions count
Let’s see which tags are the most popular on StackOverflow.
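One way to find the most popular tags is `value_counts` on the Tag column, which counts how often each tag occurs (tiny stand-in data below):

```python
import pandas as pd

# Stand-in tags table; the real one has the same Id/Tag columns
tags = pd.DataFrame({"Id": [10, 10, 20, 30, 30],
                     "Tag": ["python", "pandas", "python", "python", "java"]})

# Most frequent tags first
top_tags = tags["Tag"].value_counts()
print(top_tags.head(3))
```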
Let’s explore the question data set and see how the trend of questions posting on the platform is changing over time.
We are going to resample the question dataset on a monthly basis, and to do this we need to reindex our data on the CreationDate of each question.
The dataframe.resample() function is primarily used for time series data.
A time series is a series of data points indexed (or listed or graphed) in time order.
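The reindex-then-resample step can be sketched like this, with a few stand-in dates (the real CreationDate column parses the same way):

```python
import pandas as pd

# Stand-in questions with creation dates
questions = pd.DataFrame({
    "Id": [10, 20, 30],
    "CreationDate": ["2008-08-01", "2008-08-15", "2008-09-03"],
})

# Reindex on CreationDate so resample can bucket by time
questions["CreationDate"] = pd.to_datetime(questions["CreationDate"])
questions = questions.set_index("CreationDate")

# Count questions posted per month
monthly = questions.resample("M").size()
print(monthly)
```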
As we know a question can have more than one tag assigned to it. In the tags dataset, the ‘Id’ field represents the question id and ‘Tag’ represents the tag assigned to it.
To see all the tags assigned to each question, let’s group the tags by the ‘Id’ field and join them with a space character. After grouping, we need to reset the index of the tags dataset.
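The group-and-join step above can be sketched as follows (stand-in data):

```python
import pandas as pd

# Stand-in tags table: question 10 has two tags, question 20 has one
tags = pd.DataFrame({"Id": [10, 10, 20],
                     "Tag": ["python", "pandas", "java"]})

# One space-separated tag string per question Id
grouped_tags = tags.groupby("Id")["Tag"].apply(lambda t: " ".join(t))
grouped_tags = grouped_tags.reset_index()
print(grouped_tags)
```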
Filter the question dataset
We are handling way too many questions right now. If we are going to train our machine learning model with these many questions, it is going to take forever.
Now we are going to keep only those questions whose Score is greater than or equal to 5 and drop unnecessary columns. This serves two purposes.
1st:- We will not have to handle loads of questions.
2nd:- Our model will train only on quality questions.
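The filtering step can be sketched like this; the column names follow the Questions.csv schema described earlier, and which columns count as "unnecessary" is a judgment call (here `OwnerUserId` stands in for the dropped ones):

```python
import pandas as pd

# Stand-in questions table
questions = pd.DataFrame({
    "Id": [10, 20, 30],
    "Score": [12, 3, 7],
    "OwnerUserId": [1, 2, 3],
    "Title": ["a", "b", "c"],
})

# Keep only well-scored questions, then drop a column we won't need
filtered = questions[questions["Score"] >= 5]
filtered = filtered.drop(columns=["OwnerUserId"])
print(filtered)
```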
You have heard this millions of times, haven’t you?
Garbage In, Garbage Out (GIGO)
A program gives inaccurate results when it is fed inaccurate data, because a computer will always attempt to process the data given to it. Said another way, the output quality of a system usually can’t be any better than the quality of its inputs.
Let’s make sure we don’t have any null or duplicate entries in our dataset.
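A quick sanity check with `isnull` and `duplicated` (stand-in frame with one null and one duplicate row for illustration):

```python
import pandas as pd

# Stand-in frame: one missing Title and one fully duplicated row
questions = pd.DataFrame({"Id": [10, 20, 10],
                          "Title": ["a", None, "a"]})

print(questions.isnull().sum())      # nulls per column
print(questions.duplicated().sum())  # number of fully duplicate rows
```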
For further data processing, we are going to merge the questions data frame with the tags data frame on the ‘Id’ column.
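The merge can be sketched like this, assuming the tags have already been grouped into one space-separated string per question:

```python
import pandas as pd

# Stand-in frames: questions and their grouped tag strings
questions = pd.DataFrame({"Id": [10, 20], "Title": ["a", "b"]})
grouped_tags = pd.DataFrame({"Id": [10, 20],
                             "Tags": ["python pandas", "java"]})

# Join the two tables on the shared 'Id' column
merged = questions.merge(grouped_tags, on="Id")
print(merged)
```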
That’s all for now.
In the next part, we are going to use natural language processing to get some more insights about questions and answers.
How many claps does this article deserve?
If you enjoyed this article, feel free to clap many times (you know you want to!) or share it with a friend. There’s a limit of 50 claps you can give to each story, so try not to exceed it :) 💥BOOM! It fuels my focus to write more.
Speaking of which…
If I managed to retain your attention to this point, leave a comment describing how this post made a difference for you, and what other topics you’d like me to cover.