Multioutput-Multiclass Classification

Mani Ratnam🌵

We all know how to predict one target column given multiple feature columns, let’s see how to predict two columns at once.


Multi-output: a single feature set (one set of independent values) maps to multiple outputs (two or more).

Beginners often confuse multi-class with multi-label.

Multi-Class: each data point can belong to only one label. For example, a fraud detection model classifies a feature set as either “fraud” or “non fraud”. It can’t be both, and there’s no middle ground.

Multi-Label: one data point can belong to one or more labels. For example, a movie genre prediction model can assign a single movie to more than one label, since a movie can be action, thriller, or both at once.
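To make the distinction concrete, here is a minimal sketch of how the target y looks in each setting (the labels below are made up for illustration, not taken from the jobs dataset):

import numpy as np

# Multi-class: one label per sample, drawn from a single set of classes
y_multiclass = np.array(["fraud", "non fraud", "fraud"])

# Multi-label: each sample can carry several binary labels at once
# (columns: action, thriller, comedy)
y_multilabel = np.array([[1, 1, 0],   # action + thriller
                         [0, 0, 1],   # comedy only
                         [1, 0, 0]])  # action only

# Multiclass-multioutput: several targets, each one multi-class
# (columns: job_type, category -- the case this post builds)
y_multioutput = np.array([["Full Time", "Engineering"],
                          ["Contract",  "Healthcare"]])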

Image: the multiclass/multioutput taxonomy diagram from https://scikit-learn.org/stable/modules/multiclass.html

Let’s build a multiclass-multioutput classifier using scikit-learn.

import re
import string
import warnings

import contractions
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

warnings.filterwarnings("ignore")

nlp = spacy.load("en_core_web_sm")
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("averaged_perceptron_tagger")

df = pd.read_csv(r"path_to_your_Dataset.csv")
df.head()

The dataset used here is a jobs dataset. You can find it on Kaggle (link in the references below).

We’ll use job_description to predict job_type and category together.

Let’s clean the data and do some feature engineering before we fit the model.

# Encode everything into ASCII and remove duplicates
df = df[~df.duplicated(subset=list(df.columns[1:]))]
df = df.astype(str).apply(lambda x: x.str.encode("ascii", "ignore").str.decode("ascii"))

punkt = string.punctuation
stopwords_list = nltk.corpus.stopwords.words("english")
stopwords_list.extend(["please", "apply", "resume", "following",
                       "client", "medical", "work", "opportunity"])

Let’s clean the text by removing irrelevant characters, stopwords, URLs, punctuation, newline characters, extra spaces, etc.

def clean_desc(text):
    text = contractions.fix(text)              # expand contractions ("don't" -> "do not")
    text = re.sub(r"https\S+", " ", text).strip()
    text = re.sub(r"www\S+", " ", text)        # remove URLs
    text = re.sub(r"\d+", " ", text)           # remove digits
    text = re.sub(r"\n|\r", " ", text)         # remove newlines
    text = re.sub(r"-+|\.", " ", text)         # remove hyphens and periods
    text = re.sub(r"\([^()]*\)", " ", text)    # remove parenthesised text
    text = "".join([char for char in text if char not in punkt]).strip()
    text = " ".join([word for word in text.split()
                     if word not in stopwords_list and len(word) > 2])
    return str(text).lower().strip()

def lemmatizer(text):
    doc = nlp(text)
    return " ".join([token.lemma_ for token in doc])

df["job_desc"] = df["job_description"].apply(clean_desc).apply(lemmatizer)
df["len"] = df["job_desc"].apply(lambda x: len(x.split()))

At this point we have cleaned the data and created a len column that holds the number of words in each document.

df = df[df["len"]<=600]
sns.boxplot(df["len"])

We keep only documents with a maximum word count of 600, which helps the accuracy and stability of the model. Unlike a typical predictive model on numerical data, here we don’t treat data points outside the interquartile range as outliers.
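If you would rather derive the cutoff from the data than hard-code 600, a quick look at the length distribution helps (a small sketch; the 95th percentile is just one reasonable choice):

print(df["len"].describe(percentiles=[0.25, 0.5, 0.75, 0.95]))
cutoff = int(df["len"].quantile(0.95))  # e.g. keep the shortest 95% of docs
df = df[df["len"] <= cutoff]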

Let’s train a logistic regression model wrapped in a MultiOutputClassifier.

df.reset_index(drop=True, inplace=True)

cv = CountVectorizer(max_df=0.98, min_df=5, max_features=9000,
                     ngram_range=(1, 3), stop_words=stopwords_list)

X = df["job_desc"]                  # the cleaned, lemmatised text
y = df[["job_type", "category"]]    # two target columns at once

lr = LogisticRegressionCV(
    class_weight="balanced",    # accounts for class imbalance
    solver="saga",              # faster convergence on large, sparse data
    multi_class="multinomial",  # accounts for multiclass targets
    cv=5,                       # picks the best score across the splits
    scoring="f1_weighted",      # F1 suits imbalanced classification
)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

pipe = Pipeline([("c_v", cv), ("lr", MultiOutputClassifier(lr))])
pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)

predictions is now a two-dimensional array with one row for every sample in X_test and one column per target.
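Since y has two columns, the natural way to score the model is per output (a sketch using the classification_report imported earlier; the column order follows y):

for i, target in enumerate(["job_type", "category"]):
    print(f"--- {target} ---")
    print(classification_report(y_test.iloc[:, i], predictions[:, i]))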

pipe.predict([df.loc[1003, "job_desc"]])

The result is again a two-dimensional array holding both outputs (job_type and category) for that single document.

As the example shows, you can predict multiple targets with a simple model like logistic regression once it’s wrapped in a multioutput classifier.

On top of this baseline, further feature engineering and data visualisation can improve accuracy significantly.

Note: not all estimators support multioutput classification. The scikit-learn docs list the algorithms that do (see reference 1 below).
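For example, tree ensembles such as RandomForestClassifier handle multiclass-multioutput natively, so no wrapper is needed (a minimal sketch; the hyperparameters are illustrative, not tuned):

from sklearn.ensemble import RandomForestClassifier

rf_pipe = Pipeline([
    ("c_v", CountVectorizer(max_df=0.98, min_df=5, max_features=9000,
                            ngram_range=(1, 3), stop_words=stopwords_list)),
    ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
])
rf_pipe.fit(X_train, y_train)             # y_train has two columns; no MultiOutputClassifier
rf_predictions = rf_pipe.predict(X_test)  # shape (n_samples, 2), same as before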

Acknowledgement: I believe in the power of community and the right to free learning in this Web3 era. Inspired by the efforts, ideologies, and lectures of David J. Malan from CS50. Hope you had a good time reading through.

References:

  1. Multioutput classifier : https://scikit-learn.org/stable/modules/multiclass.html
  2. The dataset used in this example : https://www.kaggle.com/datasets/cactuscode7/job-descriptions-dataset
  3. LogisticRegressionCV : https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html
