Designing a Machine Learning System for NLP/BERT — MLOps

End-to-end implementation of a BERT-based comment toxicity detection API, following the MLOps lifecycle.

Karan Shingde
7 min read · Feb 11, 2023
Made with DALL-E 2

Welcome! In this article, I will share how to build and deploy your own machine learning system in a real-time environment. We will cover the main concepts of MLOps such as data versioning, feature engineering/EDA, text cleaning, the DVC pipeline, model training using BERT, model evaluation, API integration, Docker, CI/CD using GitHub Actions, and Azure Container Registry for deployment.

GitHub link for the project: TAP HERE🎯

NOTE 1: This blog is based on my MLOps project. You can find the code, project structure, approach, etc. there. I also assume that you have a good understanding of Python, TensorFlow, Sklearn, Pandas, BERT, Flask, and Docker.

NOTE 2: Click the GitHub link above and check the README file, where I describe the approach and project directory structure in detail. Do check it; if I included all of that here, this story would become lengthy.

Introduction to MLOps

Photo by Austin Distel on Unsplash

MLOps is a set of practices that aims to deploy and maintain machine learning models in production reliably and efficiently. The word is a compound of “machine learning” and the continuous development practice of DevOps in the software field. — Wikipedia

MLOps lifecycle

1. About Data

We all know that to build any predictive system we need data, and in most ML projects the data is the primary thing from which we predict, detect, or solve business problems. In this project, I collected data from the Kaggle competition called Jigsaw Toxic Comment Classification Challenge. The data is text stored in a .csv file, with multiple labels for each comment: toxic, severe_toxic, obscene, threat, insult, identity_hate.
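To make this concrete, here is a minimal sketch of loading and inspecting the data, assuming the competition's train.csv sits in a local data/ folder (the path is an illustrative assumption, not the exact one from my repo):

import pandas as pd

# Load the Jigsaw training data (path is an assumption for illustration)
df = pd.read_csv("data/train.csv")

# One text column plus six binary label columns
labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
print(df[["comment_text"] + labels].head())

# How many comments carry each label
print(df[labels].sum())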

2. Feature Engineering

Data scientists usually spend most of their time on feature engineering. For text, its subsets include text visualization (most frequent words, n-grams) and text cleaning (removing HTML, stopwords). That’s why we have to focus more on this process; it helps us gain insights from text data. Data cleaning is essential in the case of text data. I combined many text-cleaning techniques, such as removing HTML, stopwords, duplicates, and special characters.

Note: In text analysis, we should clean the data first and then apply the visualization/EDA processes. Formatting the text is the most essential step in this case.
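Here is a minimal cleaning sketch along those lines (the exact steps in my pipeline differ; the regexes and the NLTK stopword list below are illustrative assumptions):

import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("english"))

def clean_comment(text: str) -> str:
    text = text.lower()
    text = re.sub(r"<.*?>", " ", text)    # strip HTML tags
    text = re.sub(r"[^a-z\s]", " ", text)  # drop special characters and digits
    tokens = [w for w in text.split() if w not in STOPWORDS]
    return " ".join(tokens)

Apply it column-wise with df["comment_text"].apply(clean_comment), and drop duplicates with df.drop_duplicates().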

  1. Data Visualization: The next step in feature engineering is visualization. I plotted some visuals using matplotlib and seaborn to gain insight into the text (see the bigram sketch after this list). Some of the visuals are:
  • Number of words in each comment (Violin plot)
  • Word count vs Unique Word count (Bar and line plot)
  • Common IP addresses/links/usernames (Venn diagram)
  • N-grams/N-frequent words such as Top bigrams (Bar plot)
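As an example of the last item, here is a sketch of extracting top bigrams with scikit-learn, assuming the df from the loading sketch above (the plotting details in my notebook differ):

from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.pyplot as plt

# Count bigrams across all comments
vec = CountVectorizer(ngram_range=(2, 2), stop_words="english")
counts = vec.fit_transform(df["comment_text"])
freqs = counts.sum(axis=0).A1
top = sorted(zip(vec.get_feature_names_out(), freqs), key=lambda t: -t[1])[:15]

# Horizontal bar plot of the most frequent bigrams
names, values = zip(*top)
plt.barh(names[::-1], values[::-1])
plt.title("Top bigrams")
plt.show()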

2. DVC pipeline: DVC is a data and model versioning tool for MLOps (like Git for data and models). We are using DVC to construct a data-cleaning pipeline. I constructed a pipeline, which you can see here. To learn more about DVC, see here.
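A DVC pipeline is described in a dvc.yaml file. Here is a minimal sketch of what a cleaning stage can look like (the stage name, script path, and file names are assumptions, not the exact ones from my repo):

stages:
  clean_data:
    cmd: python src/clean_data.py data/train.csv data/clean.csv
    deps:
      - src/clean_data.py
      - data/train.csv
    outs:
      - data/clean.csv

Running dvc repro re-executes the stage only when its dependencies change.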

3. Training the data

Now it’s time to train our data using neural networks (every ML engineer’s favorite part ;)). Instead of building a neural network architecture from scratch, it is better to use a pre-trained model, in our case BERT — Bidirectional Encoder Representations from Transformers.

BERT Architecture for NLP:

Transformers | State-Of-The-Art-Models

In this task, I used a BERT model from TensorFlow Hub. By adding a preprocessor and encoder using KerasLayer, I trained a baseline model that reaches an accuracy of 90%. It is better to start with a very basic model and then add or remove layers/neurons from that baseline. So I trained just a basic model, but in the future I will perform hyperparameter tuning, or I may train the data on a GPT model.
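For reference, here is a sketch of that wiring, assuming the standard TF Hub preprocessor/encoder pair (the exact hub handles and classification head I used may differ):

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # registers the ops the preprocessor needs

# Preprocessor turns raw strings into BERT inputs; encoder produces embeddings
preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
encoder = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/2",
    trainable=True)

inputs = tf.keras.layers.Input(shape=(), dtype=tf.string, name="comment")
x = encoder(preprocess(inputs))["pooled_output"]
x = tf.keras.layers.Dropout(0.1)(x)
# Six sigmoid outputs, one per toxicity label (multi-label, so no softmax)
outputs = tf.keras.layers.Dense(6, activation="sigmoid")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])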

4. Model Evaluation

Model evaluation is the most important phase, as we all know, and I will not go deep into it here. The data contains multiple labels for each comment. For this section, I published a Kaggle kernel (check it out) which includes a detailed explanation with Python code. There are many metrics to evaluate multi-label classification, such as accuracy score, precision and recall, Jaccard score, and Hamming loss.
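A quick sketch of those metrics with scikit-learn, assuming y_true and y_pred are binary arrays of shape (n_samples, 6):

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             jaccard_score, hamming_loss)

# Subset accuracy: every one of the six labels must match exactly
print("Accuracy:", accuracy_score(y_true, y_pred))
# Micro-averaged precision/recall across all label decisions
print("Precision:", precision_score(y_true, y_pred, average="micro"))
print("Recall:", recall_score(y_true, y_pred, average="micro"))
# Per-sample intersection-over-union of predicted vs. true label sets
print("Jaccard:", jaccard_score(y_true, y_pred, average="samples"))
# Fraction of individual label decisions that were wrong
print("Hamming loss:", hamming_loss(y_true, y_pred))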

We can also use MLflow to monitor a model’s performance metrics such as precision, recall, and accuracy. I committed MLflow code for this project, but it is still under construction (ah, I need to run it again on Kaggle for better compute). You can try it out!
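The basic logging pattern looks like this (the parameter and metric names below are placeholders, not the ones in my repo):

import mlflow

with mlflow.start_run():
    mlflow.log_param("epochs", 3)           # hypothetical hyperparameter
    mlflow.log_metric("precision", 0.88)    # hypothetical scores
    mlflow.log_metric("recall", 0.85)
    mlflow.log_metric("accuracy", 0.90)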

Now it’s time to take our model out of Jupyter Notebook!

Most of the time we work on our ML projects or Kaggle notebooks only up to model evaluation, but the real fun starts outside the Jupyter Notebook. We saved our model as a binary, .h5 in our case. After successfully performing the experiments, we load the model in a microservice. From this stage, software development comes into play. We can use Python web frameworks such as Flask, Django, or FastAPI; for this project, I chose Flask. We will create an API that can detect toxicity in a comment.

I trained my model on a Kaggle GPU and did the further processing on my local machine, where I faced GPU configuration issues.

Here is a snippet that can load the Kaggle-GPU-trained model on a local machine:

## Importing libraries
import numpy as np
import tensorflow as tf
import tensorflow_text as text
import tensorflow_hub as hub
import os

def load_model(model_path):
    """
    This function loads the model from the path specified.
    """
    # Force I/O onto the local host so a model trained on Kaggle loads locally
    load_options = tf.saved_model.LoadOptions(experimental_io_device='/job:localhost')
    model = tf.keras.models.load_model(model_path,
                                       custom_objects={'KerasLayer': hub.KerasLayer},
                                       options=load_options)
    print("Model loaded successfully!")
    return model

And here is a function that predicts the output for a given text input:

import numpy as np

# Function to predict the toxic classes of inappropriate comments
def comment_toxicity_detection(comment, model):
    classes = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
    predictions = model.predict([comment])
    output = np.where(predictions > 0.5, 1, 0)
    for i, j in zip(classes, output[0]):
        if j == 1:
            # using 'yield' so the Flask app returns all classes with their outputs;
            # 'return' would break the loop and return only one class
            yield i, "YES"
        if j == 0:
            yield i, "NO"

5. API construction

## Importing libraries
import numpy as np
import json
import re
import tensorflow as tf
import tensorflow_text as text
import tensorflow_hub as hub
from src.load_model import load_model
from prediction_service.predict import comment_toxicity_detection
from flask import Flask, request, jsonify

## Loading the tf model
model_path = 'saved_models/comment-toxicity-bert.h5'
model = load_model(model_path)

## FLASK API
app = Flask(__name__)

## Defining a microservice to detect comment toxicity

@app.route('/', methods=["GET", "POST"])
def home():
    if request.method == 'GET' or request.method == 'POST':
        return "Welcome to the comment toxicity detection API. \nRoute on `detect-comment-toxicity` to detect toxicity in a comment. \n- Karan S."


@app.route('/detect-comment-toxicity', methods=["GET", "POST"])
def predict():
    if request.method == 'POST':
        try:
            if request.json:
                comment = request.json['comment']
                print(comment)
                return jsonify(list(comment_toxicity_detection(comment, model)))
        except Exception as e:
            print(e)
            return {"error": str(e)}  # stringify so the dict is JSON-serializable

    if request.method == 'GET':
        return "Wrong method. Use POST"

if __name__ == '__main__':
    app.run(debug=False, port=4000, host='0.0.0.0')

Just run,

python app.py

Test the API:

Non-toxic comment:

Testing an API using Thunder Client — VSCode

Toxic comment:

Testing an API using Thunder Client — VSCode
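You can also hit the endpoint from the command line; a minimal sketch, assuming the API is running locally on port 4000 and the comment text is just a sample:

curl -X POST http://localhost:4000/detect-comment-toxicity \
     -H "Content-Type: application/json" \
     -d '{"comment": "You are a wonderful person!"}'

The response is a JSON list of (label, YES/NO) pairs, one per toxicity class.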

6. Containerization for ML

The next step is to containerize our model using Docker, a very important step in MLOps. Every developer has a different OS and software configuration, and we are already using many libraries plus a CUDA GPU, which can vary from person to person. To solve this issue, we write a Dockerfile for the CUDA GPU setup.

Dockerfile for a CUDA-based GPU setup (thanks to ChatGPT):

FROM pure/python:3.8-cuda10.2-cudnn7-runtime

LABEL \
    maintainer="KARAN SHINGDE <karanshingde@gmail.com>" \
    version="1.0" \
    description="Docker image with CUDA10.2 & Python 3.8" \
    python-version="3.8.x" \
    cuda-version="10.2" \
    license="Apache License 2.0"

# Set the working directory
WORKDIR /comment-toxicity-detection-app

# Copy the current directory contents into the container at /comment-toxicity-detection-app
COPY . /comment-toxicity-detection-app


# Install any needed packages specified in requirements.txt
COPY requirements.txt ./requirements.txt
RUN pip3 install -r requirements.txt

# Make port 4000 available to the world outside this container
EXPOSE 4000

# Run app.py when the container launches
CMD ["python3", "app.py"]

After building the Docker image, first test it on your local machine. Next, push the image to the desired platform (Docker Hub, AWS/Azure/GCP). In this project, I used Azure Container Registry.
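The local test plus push flow looks roughly like this (the image and registry names below are placeholders, not my actual ones):

# Build and test locally
docker build -t toxicity-api .
docker run -p 4000:4000 toxicity-api

# Tag and push to an Azure Container Registry
az acr login --name myregistry
docker tag toxicity-api myregistry.azurecr.io/toxicity-api:v1
docker push myregistry.azurecr.io/toxicity-api:v1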

To learn more about the Azure deployment process, see this video. That tutorial helped me a lot during deployment. You can follow the same steps for a basic deployment.

You can check the whole project on my GitHub.

How to scale up this project?

  • We can use Kubernetes to run the application across nodes.
  • I trained the model for only a few epochs. We can train for more epochs and even perform hyperparameter tuning, for example with KerasTuner.

If you have any suggestions for scaling up the project, please let me know.

> Suggestions are welcome! The project is open for contribution and licensed under MIT.

Thanks for reading this article. See you soon🚀🧑‍🚀

My LinkedIn | My GitHub | 📩 karanshingde@gmail.com


Karan Shingde

I write about AI/ML, LLM, MLOps and whatever I found valuable in my learning and research.