AutoML for Multi-Label Classification using Ludwig

With code in Python for text classification

Mehul Gupta
Data Science in your pocket

--

https://github.com/ludwig-ai/ludwig

After covering AutoML for Classification & Regression using MLJAR, and Time Series forecasting using AutoTS, this time I got a chance to work on multi-label classification (and not multi-class). To try out something new, I googled AutoML for multi-label classification, and after some digging, I came across Ludwig, a declarative machine learning framework that can be used for multi-label classification.

So, let’s get started

But before jumping ahead

What is a Declarative Machine Learning system?

So, there can be three types of ML systems:

  • Flexible: In such systems, the dev codes everything from scratch using libraries like TensorFlow, Keras, or PyTorch, i.e. you decide every detail of the architecture you wish to have, but the catch is you are going to code a lot. An example is a neural network you design in TensorFlow or Keras from scratch.
  • AutoML: In AutoML, the dev does nothing and takes a complete back seat. The AutoML system takes care of everything: which model to choose, what preprocessing to do. Everything!! The only catch is that such systems offer very little flexibility, so the dev can’t tweak much.
  • Declarative systems: A middle ground between Flexible and AutoML systems, where the dev can choose the different components of the pipeline (though coding isn’t required), while whatever is not specified goes to the AutoML part of the system. So, Declarative systems maintain a trade-off between flexibility and AutoML. Ludwig is one such system.

If you wish, you can specify every detail of the pipeline; if not, everything is taken care of by the AutoML side of Ludwig. Simple!!
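To make this concrete, below is a minimal config sketch (the column names are placeholders): only the input and output features are declared, and everything else, from preprocessing to the encoder to the training loop, falls back to Ludwig’s defaults.

from ludwig.api import LudwigModel

# a minimal sketch: declare only what you care about,
# Ludwig's defaults handle everything unspecified
minimal_config = {
    'input_features': [{'name': 'TITLE', 'type': 'text'}],
    'output_features': [{'name': 'labels', 'type': 'set'}]
}
# model = LudwigModel(minimal_config)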

So, without wasting a sec, let’s get our hands dirty with Ludwig.

I would be using this dataset for the rest of the post; unzip it and read train.csv.

  1. Import required libraries and load the dataset
#pip3 install ludwig

from ludwig.api import LudwigModel
import pandas as pd
from sklearn.model_selection import train_test_split as tts

train = pd.read_csv('train.csv')
labels = train.columns[3:]  # everything after ABSTRACT is a label column

Columns after ABSTRACT are the labels for this problem statement.

For Ludwig to work on multi-label problems, we need to convert the output into a single string, with the multiple labels as space-separated words in that string. Also, if a label name has multiple words, those words are joined by underscores.


def label_list(x):
    # collect all label columns (everything after ABSTRACT) set to 1 for this row
    labels = []
    for key in list(x.keys())[3:]:
        if x[key] == 1:
            labels.append('_'.join(key.split(' ')))
    return ' '.join(labels)

train['labels'] = train.apply(lambda x: label_list(x), axis=1)
Observe how multiple labels sit in one string, separated by spaces (last column). Labels with multiple words are joined by underscores after preprocessing.

Splitting the data into train and test

train, test = tts(train, test_size=0.2)

Here comes Ludwig.

config = {
    'input_features': [
        {
            'name': 'TITLE',
            'type': 'text',
            'preprocessing': {'word_tokenizer': 'space'},
            # pretrained BERT encoder with frozen weights
            'encoder': {'type': 'bert', 'trainable': False}
        },
        {
            'name': 'ABSTRACT',
            'type': 'text',
            'preprocessing': {'word_tokenizer': 'space'},
            'encoder': {'type': 'bert', 'trainable': False}
        }
    ],
    # the 'set' datatype handles multi-label outputs
    'output_features': [
        {'name': 'labels', 'type': 'set',
         'loss': {'type': 'sigmoid_cross_entropy'}}
    ],
    # internal train/validation/test split
    'preprocessing': {'split': {'column': 'labels',
                                'probabilities': [0.7, 0.1, 0.2]}},
    'trainer': {
        'type': 'trainer',
        'epochs': 100,
        'batch_size': 32,
        'checkpoints_per_epoch': 2,
        'early_stopping': 5,
        'learning_rate': 0.0005,
        'optimizer': {'type': 'adam'}
    }
}

ludwig_model = LudwigModel(config)
train_stats, _, model_dir = ludwig_model.train(train[['TITLE', 'ABSTRACT', 'labels']])

Many things to understand here:

  • The config file is the heart of this entire process. As we discussed earlier, in a declarative system the dev can provide input for different segments of the pipeline; this config holds the segments the dev wants to control. There are many sections in the config file; we will discuss a few major ones.

input_features: It’s a list of dictionaries, with each dictionary holding information about one input feature: its name, datatype, preprocessing required for the feature, etc. In our case these are the 2 columns ‘TITLE’ and ‘ABSTRACT’. We have also specified how embeddings are generated for them (a frozen BERT encoder).

output_features: Similar to input_features but holding the output feature / label column. For this section, we can specify the loss function as well. In this case, it’s the ‘labels’ column with loss=’sigmoid_cross_entropy’. Notice that for multi-label classification, we have kept the datatype as ‘set’. It does make sense, right?

The preprocessing section lets us specify preprocessing steps to be followed:

1. For the entire dataset (like train-test-split)

2. Preprocessing specific to a particular datatype (like all string or float features in the dataset)

So if there are some general preprocessing steps, this section can be used. We have used it to internally split the data into train, validation, and test sets using a random strategy.
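As an example of the second kind (datatype-wide preprocessing), recent Ludwig versions (0.6+) accept a separate ‘defaults’ section. A hedged sketch, since the exact schema has moved between releases; check your version’s docs:

# assumption: 'defaults' applies preprocessing to every feature of a datatype
config['defaults'] = {
    'text': {
        'preprocessing': {'lowercase': True}  # lowercase all text features
    }
}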

The trainer section specifies parameters around how the training of the neural network will take place, such as the number of epochs, batch_size, optimizer, etc.

Ludwig can choose between ‘ecd’ and ‘gbm’ models via the model_type parameter in the config, where the default is ‘ecd’.

‘gbm’ is nothing but tree-boosting algorithms (like LightGBM).
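Switching model types is just one more key in the config. A hedged sketch below; note that, as far as I know, ‘gbm’ only works with tabular inputs (binary / category / number) and a single output feature, so it would not fit our text problem here. The feature names are hypothetical:

gbm_config = {
    'model_type': 'gbm',  # default is 'ecd'
    'input_features': [{'name': 'age', 'type': 'number'}],       # hypothetical tabular feature
    'output_features': [{'name': 'target', 'type': 'category'}]  # single output feature
}
# gbm_model = LudwigModel(gbm_config)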

But, what is ‘ECD’ (Encoder-Combiner-Decoder)?

It is the basic neural network architecture Ludwig follows.

So the design is simple:

  • Intake multiple input features (which can be of different datatypes)
  • Do preprocessing (default steps are applied if nothing is specified by the dev in the config for that feature)
  • Encode each feature using an encoder model
  • Combine all the embeddings formed for the different features (concatenation is the default, but other merge operations are possible; see the sketch after this list)
  • Using a decoder and postprocessing steps, generate the output
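The combiner step, for instance, can be pinned down explicitly in the same config. A minimal sketch, assuming the default ‘concat’ combiner (other combiner types are listed in the docs):

# explicitly pick the combiner that merges the TITLE and ABSTRACT embeddings
config['combiner'] = {'type': 'concat'}  # 'concat' is also the default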

Below you can see how the architecture is tweaked depending on different problem statements

https://ludwig.ai/latest/user_guide/how_ludwig_works/#ecd-architecture

Though a few more parameters are possible in the config file, which you can read about in the documentation.

Finally, let’s produce the output for the test dataset

predictions, output_directory = ludwig_model.predict(test[['TITLE','ABSTRACT']])

‘predictions’ is a dataframe holding, among other columns, the predicted label set for every test row.
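The exact set of columns can be inspected directly; with a set output feature named ‘labels’, you should at least see ‘labels_predictions’ (used below), along with probability columns whose names may vary across Ludwig versions:

# peek at the prediction dataframe returned by Ludwig
print(predictions.columns.tolist())
print(predictions['labels_predictions'].head())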

Now, we will calculate micro-averaged precision, recall, and F1 score (summing TP/FP/FN over all samples) so as to analyze Ludwig’s performance.

# basic preprocessing required
# align Ludwig's predicted label sets with the test dataframe
test['predictions'] = list(predictions['labels_predictions'].values)
# turn the space-separated label string back into a list
test['labels'] = test['labels'].transform(lambda x: x.split(' '))

# initializing an empty dataframe to hold per-sample counts
predict = pd.DataFrame(index=list(test.index), columns=['TP', 'FP', 'FN']).fillna(0)

Calculating TP, FP, and FN for each sample by comparing the predicted and true label sets

for index, rows in test.iterrows():
    labels_ = set(rows['labels'])
    predict_ = set(rows['predictions'])
    predict.at[index, 'TP'] = len(labels_.intersection(predict_))  # labels predicted correctly
    predict.at[index, 'FP'] = len(predict_ - labels_)  # labels predicted but not present
    predict.at[index, 'FN'] = len(labels_ - predict_)  # labels present but missed

# micro-averaging: sum TP/FP/FN over all samples, then compute the metrics
sum_dict = predict.sum()
precision = sum_dict['TP'] / (sum_dict['FP'] + sum_dict['TP'])
recall = sum_dict['TP'] / (sum_dict['FN'] + sum_dict['TP'])
f1 = 2 * precision * recall / (precision + recall)
print("precision=", precision)
print("recall=", recall)
print("f1=", f1)

The results are quite good given the fact that we have hardly done anything!!

So, that’s a wrap. Do try out other modeling problems with Ludwig and see for yourself how good Ludwig is. Also, the documentation is pretty dope so if you plan to try Ludwig, just go straight to the documentation.
