Text Classification using Watson NLP
Leverage the Watson NLP library to build the best classification models by combining the power of classic ML, deep learning, and transformer-based models.
Text classification is one of the most widely used NLP tasks, powering use cases such as email spam filtering and the tagging and classification of content, blogs, and metadata. Some of the more specific use cases are sentiment, emotion, and tone classification.
In this blog, you will walk through the steps of building several ML and deep learning based classification models using the Watson NLP library. Let's get started.
1. Collecting the dataset
The use case for the text classification is based on the Consumer complaint database which is a collection of complaints about consumer financial products and services. A complaint contains the consumer’s narrative description of their experience. The goal is to classify the complaint into one of the product categories accurately.
Once you have downloaded the dataset, you can upload it to the Watson Studio instance by going to the Assets tab and then dropping the data files as shown below.
You can access the data from the notebook once it has been added to the Watson Studio project.
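For example, you can read the uploaded file into a pandas DataFrame. Below is a minimal sketch that assumes a project_lib Project object named project (the same object used later to save models) and a file named complaints.csv; the exact access code that Watson Studio generates for you via its Insert to code option may differ.
import pandas as pd

# read the uploaded CSV into a DataFrame; the file name is an assumption
train_test_df = pd.read_csv(project.get_file('complaints.csv'))
train_test_df.head()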
2. Data processing & Exploratory Data Analysis
You can downsample the dataset in the data processing step to reduce the model training time, and observe the distribution of product categories. Some product categories have far fewer instances than others, so you can drop those under-represented categories before training the model (see the sketch below). Finally, you can carry out the train-test split using the sample method on the pandas DataFrame.
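Here is a minimal sketch of inspecting the class distribution and dropping rare categories; the count threshold is purely illustrative:
# distribution of complaints per product category
print(train_test_df['Product'].value_counts())

# drop under-represented categories (the threshold of 1000 is illustrative)
counts = train_test_df['Product'].value_counts()
train_test_df = train_test_df[train_test_df['Product'].isin(counts[counts >= 1000].index)]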
# 80% training data
train_orig_df = train_test_df.groupby('Product').sample(frac=0.8, random_state=6)
print("Training data:")
print("Number of training samples: {}".format(len(train_orig_df)))
print("Samples by product group:\n{}".format(train_orig_df['Product'].value_counts()))

# 20% test data
test_orig_df = train_test_df.drop(train_orig_df.index)
print("\nTest data:")
print("Number of test samples: {}".format(len(test_orig_df)))
print("Samples by product group:\n{}".format(test_orig_df['Product'].value_counts()))

# re-index after sampling
train_orig_df = train_orig_df.reset_index(drop=True)
test_orig_df = test_orig_df.reset_index(drop=True)
One crucial step here is to convert the DataFrame into the JSON or CSV format required by the Watson NLP classification algorithms: each record needs a text field holding the complaint narrative and a labels field holding the target categories. A sketch of this conversion is shown below.
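This is a minimal sketch of writing the training split to a JSON file; the input column names come from the Consumer complaint dataset, while the output field names match the expected_keys used in the next step:
# keep only the narrative and the target label, renamed to the fields
# expected by the classifier ('text' and 'labels')
train_df = train_orig_df[['Consumer complaint narrative', 'Product']].rename(
    columns={'Consumer complaint narrative': 'text', 'Product': 'labels'})
# wrap each label in a list, since multi-label input is supported
train_df['labels'] = train_df['labels'].map(lambda label: [label])
training_data_file = './train_data.json'
train_df.to_json(training_data_file, orient='records')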
The classification models in the Watson NLP library expect the training data as data streams. You can use the DataStreamResolver class of the Watson NLP library to convert the raw data into data streams.
# DataStreamResolver ships with the watson_core package bundled with
# Watson NLP (the exact import path may vary by release)
from watson_core.data_model.streams.resolver import DataStreamResolver

data_stream_resolver = DataStreamResolver(target_stream_type=list, expected_keys={'text': str, 'labels': list})
training_data = data_stream_resolver.as_data_stream(training_data_file)
3. Model Building
The Watson NLP library offers implementations of classification algorithms from three different families: classic ML, deep learning, and transformers, covering binary, multi-class, and multi-label classification tasks.
You first need to convert the raw text into features by mapping it into a vector space. The Watson NLP library offers three types of embeddings/vectorization: GloVe embeddings, Universal Sentence Encoder (USE) embeddings from TF Hub, and a TF-IDF vectorizer.
You can leverage the Syntax model in the Watson NLP library to carry out NLP primitive tasks on the input text, such as sentence detection, tokenization, part-of-speech tagging, lemmatization, and dependency parsing. You can then use these tokens, lemmas, etc. to create embeddings using one of the three methods available, as shown below:
import watson_nlp

# download and load the stock syntax and USE embedding models
syntax_model = watson_nlp.load(watson_nlp.download('syntax_izumo_en_stock'))
use_model = watson_nlp.load(watson_nlp.download('embedding_use_en_stock'))

# split the resolved data stream into text and label streams
text_stream, labels_stream = training_data[0], training_data[1]
# run the syntax model over the text stream
syntax_stream = syntax_model.stream(text_stream)
# embed each document with USE, then zip the embeddings back with the labels
use_train_stream = use_model.stream(syntax_stream, doc_embed_style='raw_text')
use_svm_train_stream = watson_nlp.data_model.DataStream.zip(use_train_stream, labels_stream)
Now that you have created features, you can start training your classification model. For example, you can train a classic ML classification algorithm like SVM using the USE embedding as the feature for training:
# UseSvm comes from Watson NLP's classification workflows
# (the exact import path may vary by release)
embeddings_classification_model = UseSvm.train(training_data=training_data,
                                               syntax_model=syntax_model,
                                               use_embedding_model=use_model,
                                               use_svm_epochs=1,
                                               multi_label=True)
Alternatively, you can build an ensemble model that combines the following three models:
- SVM with TF-IDF features
- CNN with GloVe features
- SVM with USE (Universal Sentence Encoder) features
The ensemble model computes the weighted mean of the classification predictions using their confidence scores, so it typically performs better than the individual algorithms, and the ensemble workflow is very easy to use in the Watson NLP library.
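To make the weighted mean concrete, here is a conceptual sketch with made-up confidence scores; it illustrates the idea only and is not the library's internal implementation:
# hypothetical per-class confidence scores from the three base classifiers
preds = [
    {'Mortgage': 0.7, 'Credit card': 0.3},  # SVM with TF-IDF
    {'Mortgage': 0.6, 'Credit card': 0.4},  # CNN with GloVe
    {'Mortgage': 0.8, 'Credit card': 0.2},  # SVM with USE
]
weights = [1, 1, 1]
# weighted mean of the confidence scores per class
ensemble = {c: sum(w * p[c] for w, p in zip(weights, preds)) / sum(weights)
            for c in preds[0]}
print(max(ensemble, key=ensemble.get))  # -> Mortgage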
The ensemble workflow depends on the syntax model and the GloVe and USE embeddings. You have to pass these arguments, along with the training data stream, as shown below:
# GenericEnsemble, TFidfSvm, UseSvm, and GloveCNN come from Watson NLP's
# classification workflows (the exact import path may vary by release).
# The ensemble additionally needs the stock GloVe embedding model:
glove_embedding_model = watson_nlp.load(watson_nlp.download('embedding_glove_en_stock'))

ensemble_model = GenericEnsemble.train(
    training_data,
    syntax_model,
    base_classifiers_params=[
        TFidfSvm.TrainParams(syntax_model=syntax_model, tfidf_svm_epochs=10),
        UseSvm.TrainParams(syntax_model=syntax_model,
                           use_embedding_model=use_model,
                           use_svm_epochs=10),
        GloveCNN.TrainParams(syntax_model=syntax_model,
                             glove_embedding_model=glove_embedding_model,
                             cnn_epochs=10)],
    weights=[1, 1, 1])
You can also save the trained model to the Cloud Object Storage (COS) instance associated with the Watson Studio environment. You can then download the saved model and deploy it anywhere to embed AI into your application.
project.save_data('ensemble_model', data=ensemble_model.as_file_like_object(), overwrite=True)
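When you need the model again, you can load it back with watson_nlp.load; a minimal sketch, assuming the saved model has been downloaded from COS to the local path shown:
# load the saved model from a local copy (the path is an assumption)
ensemble_model = watson_nlp.load('ensemble_model')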
4. Model Evaluation
Now that you have trained your models, you can evaluate them on the test data (the 20% of the dataset kept aside for evaluation). New complaints will arrive in the same format, so you should use the data in its original format to better understand the model predictions. You can create a helper method that runs both models on a single complaint and returns the predicted product groups of both models.
def predict_product(text):
    # run the syntax model first
    syntax_result = syntax_model.run(text)
    # run the SVM model (trained on the USE embeddings above) on top of the syntax result
    svm_preds = svm_model.run(use_model.run(syntax_result, doc_embed_style='raw_text'))
    predicted_svm = svm_preds.to_dict()["classes"][0]["class_name"]
    # the ensemble workflow runs directly on the raw text
    ensemble_preds = ensemble_model.run(text)
    predicted_ensemble = ensemble_preds.to_dict()["classes"][0]["class_name"]
    return (predicted_svm, predicted_ensemble)
Notice that the SVM model requires you to run the syntax model on the input texts first.
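For example (the complaint text here is illustrative):
sample = "I made my mortgage payment on time but it was never applied to my account."
print(predict_product(sample))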
You can also plot the confusion matrix to observe model performance for each class and compare the two models:
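One way to produce such a plot is with scikit-learn; a minimal sketch, assuming the per-complaint predictions have been collected into the test DataFrame under a hypothetical Predicted SVM column:
from sklearn.metrics import ConfusionMatrixDisplay

# 'Product' holds the true labels; 'Predicted SVM' is an assumed column
# filled with the per-row outputs of predict_product above
ConfusionMatrixDisplay.from_predictions(
    test_orig_df['Product'], test_orig_df['Predicted SVM'],
    xticks_rotation='vertical')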
You can observe that the precision, recall, and F1-measure for some classes are much lower than for others, likely because those classes are harder to tell apart. The ensemble model performs on par with the SVM model, but the SVM model was much faster to train.
Conclusion
We have seen how easily you can leverage the Watson NLP library to simplify NLP tasks like document/text classification. We covered both a classic ML algorithm and an ensemble of SVM and deep learning algorithms with the three embedding types: GloVe, USE, and TF-IDF. To learn more about text classification using Watson NLP, follow this tutorial on IBM Developer.
As a partner, you can start your AI journey by browsing and building AI models through a guided wizard.
The IBM Build Lab team is here to work with you on your AI journey. You can browse the collection of self-serve assets on GitHub, and if you are an IBM Business Partner, you can also start building AI solutions on Tech Zone.