Streamlining Text Classification Models with MLflow: A Comprehensive Guide

Vasista Reddy
Published in ScrapeHero
Jul 29, 2023

Text classification is a fundamental natural language processing (NLP) task that involves categorizing text documents into predefined classes or categories. Machine learning models play a crucial role in solving this problem efficiently. However, developing and managing multiple models can be challenging, especially as projects grow in complexity. This is where MLflow comes in handy. MLflow is an open-source platform that simplifies the end-to-end machine learning lifecycle, making it easier to develop, compare, and deploy text classification models.

In this blog, we’ll explore the benefits of using MLflow for text classification tasks and guide you through building, tracking, and optimizing your text classification model with MLflow.

Why MLflow for Text Classification?

MLflow offers several advantages when it comes to developing and managing text classification models:

  1. Experiment Tracking: MLflow allows you to log and track model training runs, hyperparameters, and performance metrics. This helps you keep a comprehensive record of your experiments, making it easier to reproduce and compare results.
  2. Model Versioning: With MLflow, you can version control your models, enabling you to manage multiple model iterations and choose the best-performing one for deployment.
  3. Collaboration: MLflow promotes seamless collaboration among team members. By sharing MLflow experiments, everyone can access, analyze, and build upon the work done by others.
  4. Reproducibility: By maintaining a detailed record of the environment, data, and code used during model training, MLflow facilitates the reproduction of results, even across different systems.
  5. Deployment Flexibility: MLflow supports multiple deployment options, making it easier to serve your trained model on different platforms, whether on-premises or in the cloud.

Setting up MLflow

Before diving into text classification, let’s first set up MLflow in our environment. Ensure you have Python installed, and then use the following commands:

pip install mlflow
mlflow ui
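
The mlflow ui command serves the tracking dashboard, by default at http://127.0.0.1:5000, and runs are recorded in a local mlruns folder. If you already operate a tracking server, you can optionally point the client at it; a minimal sketch, where the URI is only a placeholder:

import mlflow

# Optional: use a remote tracking server instead of the local ./mlruns folder.
# The URI below is a placeholder for illustration.
mlflow.set_tracking_uri("http://your-mlflow-server:5000")
print(mlflow.get_tracking_uri())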

Building a Text Classification Model

For this tutorial, we’ll build a text classification model using a news dataset generated by ScrapeHero, one of the leading web-scraping companies in the world. ScrapeHero can scale and crawl the internet for relevant data to help train your AI models. The news dataset comes from the marketplace API of ScrapeHero Cloud, and each article in this supervised dataset is labeled with one of five classes: tennis, politics, crime, entertainment, and business.

We’ll use ktrain and scikit-learn for this classification task. Below is the outline of the steps we’ll follow:

  • Load the dataset. (We will be loading an already preprocessed dataset.)
import logging

import pandas as pd

logger = logging.getLogger(__name__)

try:
    df = pd.read_csv("data/sample.csv")
    classes = list(set(df.label.tolist()))
except Exception as e:
    logger.exception(
        "Unable to load training & test CSV, check the file path. Error: %s", e
    )
  • Split the data into training and validation sets, and preprocess it for DistilBERT.
from sklearn import model_selection
from ktrain import text

train_x, valid_x, train_y, valid_y = model_selection.train_test_split(
    df['content'].tolist(), df['label'].tolist()
)

trn, val, preproc = text.texts_from_array(
    x_train=train_x, y_train=train_y,
    x_test=valid_x, y_test=valid_y,
    class_names=classes,
    preprocess_mode='distilbert',
    maxlen=256,
    max_features=10000,
)
  • Train the text classification model using DistilBERT.
import ktrain

model = text.text_classifier('distilbert', train_data=trn, preproc=preproc)
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=6, use_multiprocessing=True)
  • Find the optimal learning rate for our dataset using ktrain’s built-in functions.
learner.lr_find(max_epochs=2) # finding the learning rate

# The optimal learning rate is `2e-5`.
[Figure: learning-rate finder plot showing the optimal learning rate]
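
The learning-rate plot referenced above can be reproduced with ktrain’s built-in plotting helper; a minimal sketch, assuming the learner object from the previous step:

learner.lr_plot()  # plot loss against learning rate from the lr_find sweep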
  • Log the model, metrics such as accuracy and f1-score, and parameters such as learning rate and epochs using MLflow.

Tracking with MLflow

Once you have your model ready, you can log your experiment using MLflow. This includes logging model parameters, metrics, and artifacts (e.g., saved models, plots). Here’s a code snippet to illustrate the process:

import sys

import ktrain
import mlflow
import mlflow.sklearn
from sklearn.metrics import accuracy_score, f1_score

lrate = float(sys.argv[1]) if len(sys.argv) > 1 else 2e-5
epochs = int(sys.argv[2]) if len(sys.argv) > 2 else 1
print("training ktrain distilbert model (lrate={}, epochs={}):".format(lrate, epochs))

mlflow.set_experiment("text_classification_ktrain")
experiment = mlflow.get_experiment_by_name("text_classification_ktrain")

with mlflow.start_run(experiment_id=experiment.experiment_id, nested=True):
    # train with the one-cycle learning-rate policy
    learner.fit_onecycle(lrate, epochs)
    predictor = ktrain.get_predictor(learner.model, preproc)
    y_pred = predictor.predict(valid_x)
    accuracy = accuracy_score(valid_y, y_pred)
    f1 = f1_score(valid_y, y_pred, average="weighted")
    # log the metrics and parameters to MLflow
    mlflow.log_metric("accuracy", round(accuracy, 4))
    mlflow.log_metric("f1-score", round(f1, 4))
    mlflow.log_param("lrate", lrate)
    mlflow.log_param("epochs", epochs)
    # save the trained model as an artifact of this run
    mlflow.sklearn.log_model(learner, "text_classification1")

You can get the complete working code from this GitHub repo.

Run the main file distilbertTraining.py using the command:

python distilbertTraining.py 2e-6 1
[Screenshot: command-line output of the training run]

Once the command runs successfully, MLflow automatically creates an mlruns folder containing an experiment ID folder and, inside it, a run ID folder with the artifacts, metrics, and params, as shown below.

[Screenshot: mlruns folder structure]
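
For reference, the local layout MLflow creates looks roughly like the sketch below; the experiment and run IDs are placeholders:

mlruns/
└── <experiment_id>/
    ├── meta.yaml
    └── <run_id>/
        ├── artifacts/   # the logged model
        ├── metrics/     # accuracy, f1-score
        ├── params/      # lrate, epochs
        └── tags/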

Experiment Comparison and Deployment

After logging multiple runs with different models or hyperparameters, you can compare their performance using the MLflow UI. This visual representation helps you make informed decisions about which model to choose for deployment.

[Screenshot: command to launch the MLflow UI]

We trained two models with different learning rates, 2e-5 and 2e-6. The learning rate 2e-5 is optimal and produced an accuracy of 93.5%, whereas 2e-6 produced just 63.17%.

[Screenshot: model comparison in the MLflow UI]

Find the Best Model Using MLflow Tracking via Code

from mlflow.tracking import MlflowClient
import mlflow
import numpy as np

# If you called your experiment something else, replace the name here
current_experiment = mlflow.get_experiment_by_name("text_classification_ktrain")
experiment_id = current_experiment.experiment_id

# To access MLflow programmatically, we work with MlflowClient
client = MlflowClient()

# Fetch all runs of the experiment
runs = client.search_runs([experiment_id])
print("total runs", len(runs), "experiment id", experiment_id)

# Select the best run according to the logged accuracy metric
best_run = np.argmax([run.data.metrics['accuracy'] for run in runs])
best_accuracy = np.round(runs[best_run].data.metrics['accuracy'], 4)
best_runname = runs[best_run].info.run_name
best_runid = runs[best_run].info.run_id
print(f"Experiment had {len(runs)} rounds")
print(f"Best run name - {best_runname} with run id - {best_runid} has an accuracy of {best_accuracy}")
[Screenshot: output of the best-run search]

We can cross-check this in the MLflow UI:

[Screenshot: MLflow UI listing the logged runs]

Other than finding the best model, we can register the model and promote it to production through Python code.
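
As a rough sketch of how that can look in code, building on the best_runid found above (the registry name "news_text_classifier" is just an example, and the artifact path matches the mlflow.sklearn.log_model call earlier; note that the model registry requires a database-backed tracking store rather than the plain mlruns folder):

import mlflow
from mlflow.tracking import MlflowClient

# Register the model logged by the best run under an example registry name.
model_uri = f"runs:/{best_runid}/text_classification1"
result = mlflow.register_model(model_uri, "news_text_classifier")

# Promote the newly registered version to the Production stage.
client = MlflowClient()
client.transition_model_version_stage(
    name="news_text_classifier",
    version=result.version,
    stage="Production",
)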

When it comes to deployment, MLflow offers various options, including serving the model via REST API or integrating it into a web application.
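
For example, a logged model can typically be served locally with the MLflow CLI; the run ID below is a placeholder, and the artifact path matches the logging code above:

mlflow models serve -m "runs:/<run_id>/text_classification1" -p 5001

This exposes a REST endpoint at http://127.0.0.1:5001/invocations that accepts JSON prediction requests.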

The MLflow documentation has many more examples, and they can be viewed here.

You can get the complete code of streamlining text classification models with MLflow mentioned above from this GitHub repo.

Conclusion

MLflow provides a powerful toolset for building and managing ML models. By leveraging MLflow’s experiment tracking, model versioning, and collaboration features, data scientists and NLP practitioners can streamline their workflow, leading to more efficient and reliable ML models.

Remember, this blog only scratches the surface of what MLflow can do. As you dive deeper into the world of NLP and text classification, MLflow will undoubtedly become an indispensable part of your machine learning toolkit. Happy streamlining & classifying!

If you’ve found this article helpful or intriguing, don’t hesitate to give it a clap! As a writer, your feedback helps me understand what resonates with my readers.

Follow ScrapeHero for more insightful content like this. Whether you’re a developer, an entrepreneur, or someone interested in web scraping, machine learning, AI, etc., ScrapeHero has compelling articles that will fascinate you.

Vasista Reddy
ScrapeHero

Works at Cognizant. Ex-Turbolab-ian and loves trekking. Reach me on LinkedIn: https://www.linkedin.com/in/vasista-reddy-100a852b/