Multi-label Classification for NLP in Python

Gopal Sharma
Jul 17 · 6 min read

Introduction

Multi-label classification is the type of problem where each object is assigned a set of one or more target labels. Predicting the genres of a movie or song is a good example: a movie or song can belong to several genres at the same time, so the target labels are not mutually exclusive.

Motivation

You must have come across Stack Overflow. Have you ever wondered how relevant tags are suggested when you ask a question on the platform? This is a typical use case of multi-label classification.

Here is the demo:

[Figure: The Demo]

If you are excited to know how this works, then buckle up, because I’m going to explain, step by step, how to build a tool which suggests relevant tags for a question.

This is going to be a 5 part tutorial series, and this is the very first part. In it, we are going to explore our data and do some data analysis. Sounds interesting? But where is the data?

We are going to use StackSample: 10% of Stack Overflow Q&A

Let’s start by importing numpy and pandas

numpy:- NumPy is the fundamental package for scientific computing with Python.

pandas:- It is used for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.
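The import step, plus a quick look at which files are available, could be sketched like this. The `../input` folder and the `list_data_files` helper are assumptions made for illustration, so adjust the path to wherever you unpacked the dataset.

```python
import os

import numpy as np   # numerical arrays
import pandas as pd  # data frames


def list_data_files(data_dir):
    """Return the file names found in data_dir, or an empty list if it does not exist."""
    if not os.path.isdir(data_dir):
        return []
    return sorted(os.listdir(data_dir))


# "../input" is an assumed location for the unpacked StackSample files.
print(list_data_files("../input"))
```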

Output:-
['Questions.csv', 'Tags.csv', 'Answers.csv']

Load the dataset:-

Here we are using the read_csv function of pandas to load our data into data frames.

A DataFrame is a table or a two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column.
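Here is a minimal sketch of read_csv on a tiny in-memory CSV with made-up rows, so that it runs without the dataset. The commented lines show how the real files might be loaded. The paths and the encoding argument there are assumptions, not something confirmed by this article.

```python
import io

import pandas as pd

# A tiny in-memory CSV standing in for Questions.csv (made-up rows).
sample_csv = io.StringIO(
    "Id,Score,Title\n"
    "80,26,First example question\n"
    "90,144,Second example question\n"
)
ques_sample = pd.read_csv(sample_csv)
print(ques_sample.head())

# With the real files the calls might look like this
# (the paths and the encoding value are assumptions):
# ques = pd.read_csv("../input/Questions.csv", encoding="ISO-8859-1")
# ans = pd.read_csv("../input/Answers.csv", encoding="ISO-8859-1")
# tags = pd.read_csv("../input/Tags.csv", encoding="ISO-8859-1")
```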

Let’s peek into the questions data frame.

  • Questions contain the title, body, creation date, closed date (if applicable), score, and owner ID for all non-deleted Stack Overflow questions whose Id is a multiple of 10.
ques.head()
Questions dataframe

This is what the answers data frame looks like.

  • Answers contain the body, creation date, score, and owner ID for each of the answers to these questions. The ParentId column links back to the Questions table.
Answers dataframe

And here is the tags data frame.

  • Tags contain the tags on each of these questions
Tags Dataframe

Let’s see how many questions, answers, and tags we have got.

The shape attribute returns the dimensions of the data frame. In our case it has two dimensions: the first number is how many entries (rows) we have, and the second is how many properties (columns) each entry has.
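As a quick illustration on a toy data frame (the values are made up), shape is a plain (rows, columns) tuple. The real data frames are far larger, but the attribute behaves the same way.

```python
import pandas as pd

# Toy frame with 3 entries (rows) and 2 properties (columns).
toy = pd.DataFrame({"Id": [80, 90, 120], "Score": [26, 144, 21]})

n_rows, n_cols = toy.shape  # shape is a tuple, so it can be unpacked
print("Toy shape:", toy.shape)
print("Entries:", n_rows, "Properties:", n_cols)
```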

Ques shape:  (1264216, 7)
Ans shape: (2014516, 6)
Tags shape: (3750994, 2)

Let’s visualize our dataset to understand it better.

Let’s plot the distribution of question Ids against their answer counts. To plot this graph we will use matplotlib.

matplotlib:- Matplotlib is an amazing visualization library in Python for 2D plots of arrays.

A Counter is a subclass of dict which, as its name suggests, counts hashable objects. It stores elements as dictionary keys and their counts as dictionary values.
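To make this concrete, here is a small sketch that counts answers per question with a Counter. The ParentId values are made up. With the real data they would come from the ParentId column of the answers data frame.

```python
from collections import Counter

# Hypothetical ParentId values: one entry per answer, naming the question it answers.
parent_ids = [80, 80, 90, 80, 120, 90]

# Question Id -> number of answers.
answers_per_question = Counter(parent_ids)
print(answers_per_question[80])
print(answers_per_question.most_common(1))

# Counting the counts gives the distribution:
# number of answers -> how many questions have that many answers.
questions_per_answer_count = Counter(answers_per_question.values())
print(questions_per_answer_count)
```

A key that was never seen simply has a count of zero, which is handy when looking up questions that have no answers.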

Plotting the distribution of Answers count vs Questions count

Plotting the distribution of Tags count vs Questions count
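A distribution plot like this one could be produced roughly as follows. This is only a sketch on made-up Ids, not the exact code behind the chart. It counts tags per question with a Counter and draws a bar chart with matplotlib.

```python
from collections import Counter

import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch also runs without a display
import matplotlib.pyplot as plt

# Hypothetical values of the Id column of the tags table:
# one entry per (question, tag) pair.
tag_rows = [80, 80, 80, 90, 90, 120, 180, 180]

tags_per_question = Counter(tag_rows)               # question Id -> number of tags
distribution = Counter(tags_per_question.values())  # number of tags -> number of questions

xs = sorted(distribution)
ys = [distribution[x] for x in xs]

fig, ax = plt.subplots()
ax.bar(xs, ys)
ax.set_xlabel("Tags per question")
ax.set_ylabel("Number of questions")
ax.set_title("Tags count vs questions count")
# In a notebook, drop the "Agg" line and call plt.show() to display the chart.
```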

Let’s see which tags are the most popular on Stack Overflow.
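One simple way to rank tags by popularity is value_counts, sketched here on a made-up tags table. The real table has the same two columns, Id and Tag.

```python
import pandas as pd

# Made-up rows in the same layout as the tags table.
tags = pd.DataFrame({
    "Id": [80, 80, 90, 90, 120, 180],
    "Tag": ["python", "pandas", "python", "numpy", "python", "pandas"],
})

# value_counts sorts by frequency, most common first.
top_tags = tags["Tag"].value_counts().head(2)
print(top_tags)
```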

Let’s explore the question dataset and see how the number of questions posted on the platform changes over time.

We are going to resample the question dataset on a monthly basis. To do this, we need to reindex our data by the CreationDate of each question.

The pandas DataFrame.resample() function is primarily used for time series data.
A time series is a series of data points indexed (or listed or graphed) in time order.
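A minimal sketch of monthly resampling on made-up dates is shown below. It uses the "MS" (month start) alias as one possible choice. Months with no questions show up with a count of zero.

```python
import pandas as pd

# Made-up questions with creation timestamps stored as strings.
ques = pd.DataFrame({
    "Id": [80, 90, 120, 180],
    "CreationDate": [
        "2008-08-01 13:57:07",
        "2008-08-15 09:30:00",
        "2008-09-02 18:12:45",
        "2008-11-20 07:05:10",
    ],
})

# resample needs a datetime index, so parse the strings and reindex on them.
ques["CreationDate"] = pd.to_datetime(ques["CreationDate"])
monthly = ques.set_index("CreationDate").resample("MS")["Id"].count()
print(monthly)
```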

Graph of the number of the questions posted with time

As we know, a question can have more than one tag assigned to it. In the tags dataset, the ‘Id’ field represents the question id and ‘Tag’ represents a tag assigned to it.

To see all the tags assigned to each question, let’s group the tags by the ‘Id’ field and join them with a space character. After grouping, we need to reset the index of the tags dataset.
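The grouping step can be sketched as follows on a few illustrative rows. The cast to str is a precaution I am assuming is useful in case a tag is read as a missing value. It is not something stated above.

```python
import pandas as pd

# Illustrative rows: one row per (question Id, tag) pair.
tags = pd.DataFrame({
    "Id": [80, 80, 80, 90, 90],
    "Tag": ["flex", "actionscript-3", "air", "svn", "tortoisesvn"],
})

# Assumed precaution: make sure every tag is a string before joining.
tags["Tag"] = tags["Tag"].astype(str)

# One row per question, with its tags joined by a space.
grouped_tags = (
    tags.groupby("Id")["Tag"]
    .apply(lambda t: " ".join(t))
    .reset_index()  # turn the Id index back into a regular column
)
print(grouped_tags)
```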

Merged the tags of each question separately

Filter the question dataset

We are handling way too many questions right now. If we train our machine learning model on this many questions, it is going to take forever.

Now we are going to keep only those questions which have a Score greater than or equal to 5, and drop the unnecessary columns. This serves two purposes.

1st:- We will not have to handle loads of questions.

2nd:- Our model will train only on quality questions.
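The filter could look roughly like this on a made-up frame. The names of the dropped columns (OwnerUserId, CreationDate, ClosedDate) are my assumption, based on the column description given earlier, so check them against your copy of the data.

```python
import pandas as pd

# Made-up questions with the assumed column layout of Questions.csv.
ques = pd.DataFrame({
    "Id": [80, 90, 120],
    "OwnerUserId": [26.0, 58.0, 83.0],
    "CreationDate": ["2008-08-01", "2008-08-01", "2008-08-02"],
    "ClosedDate": [None, "2012-12-26", None],
    "Score": [26, 3, 5],
    "Title": ["First title", "Second title", "Third title"],
    "Body": ["First body", "Second body", "Third body"],
})

# Keep questions with Score >= 5, then drop the columns we no longer need.
new_ques = ques[ques["Score"] >= 5].drop(
    columns=["OwnerUserId", "CreationDate", "ClosedDate"]
)
print(new_ques)
```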

You have heard this a million times, haven’t you?

Garbage In, Garbage Out (GIGO)
A program gives inaccurate results when it is given inaccurate data, because a computer will always attempt to process whatever data it receives. Said another way, the output quality of a system usually can’t be any better than the quality of its inputs.

Questions with a score of 5 or more

Let’s check whether we have any null or duplicate entries in our dataset.
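A check along these lines can be written with isnull and duplicated. The frame below is a made-up stand-in for the filtered questions.

```python
import pandas as pd

# Made-up stand-in for the filtered questions frame.
new_ques = pd.DataFrame({
    "Id": [80, 120],
    "Score": [26, 5],
    "Title": ["First title", "Third title"],
    "Body": ["First body", "Third body"],
})

null_counts = new_ques.isnull().sum()        # missing values per column
duplicate_count = new_ques.duplicated().sum()  # fully identical rows

print("isNull")
print(null_counts)
print("isDuplicate")
print(duplicate_count)
```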

isNull
Id 0
Score 0
Title 0
Body 0
dtype: int64
isDuplicate
0

For further data processing, we are going to merge the questions dataframe with the tags dataframe on the ‘Id’ field.
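A sketch of the merge on made-up frames follows. It relies on the default inner join of merge, which is my choice here, so only questions that survived the filter and also have tags are kept.

```python
import pandas as pd

# Made-up filtered questions.
new_ques = pd.DataFrame({
    "Id": [80, 120],
    "Score": [26, 5],
    "Title": ["First title", "Third title"],
    "Body": ["First body", "Third body"],
})

# Made-up grouped tags: one row per question, tags joined by a space.
grouped_tags = pd.DataFrame({
    "Id": [80, 90, 120],
    "Tag": ["flex actionscript-3 air", "svn tortoisesvn", "sql asp.net"],
})

# Inner join on Id: question 90 is dropped because it is not in new_ques.
final = new_ques.merge(grouped_tags, on="Id")
print(final)
```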

Final Question Dataframe

That’s all for now.

In the next part, we are going to use natural language processing to get some more insights about questions and answers.


Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com
