Building a Machine Learning Text Classifier Backend for a PTSD Assessment Chatbot

Using MLFlow to structure a machine learning project and support the backend of the risk classifier chatbot.

Natu Lauchande
Omdena
7 min read · Sep 4, 2019

--

Problem & Background

This is my second article about my experience in the Omdena PTSD Challenge. My previous article dealt with the problem context and the application of transfer learning to a classifier. This article focuses on the machine learning engineering side: tracking experiments and productionizing the model.

To support the chatbot developed by a separate team, led by Petar King, we needed a backend to serve the most successful model developed by the modeling team. Given my previous experience in this space, I decided to explore MLFlow.

As depicted in the adapted diagram, the user interacts with the chatbot on a channel (e.g. WhatsApp), and the conversational logic of the project is handled by a library called Rasa. To cold-start the system we were considering using humans as part of the conversation backend. At the end of the interaction, the backend would have produced an entire conversation suitable to be classified by our previously trained classifier.

The input

A text transcript of the chatbot conversation.

The output

Low Risk -> 0, High Risk -> 1

One of the requirements of this project was to have a productionized model for classification that could communicate with a frontend.

As part of the solution to this problem, we decided to explore the MLFlow framework.

MLFlow

MLflow is an open-source platform to manage the ML lifecycle, including experimentation, reproducibility, and deployment. It currently offers three components:

MLFlow Tracking: Allows you to track experiments and projects.

MLFlow Models: Provides a format and framework to persist, version, and serialize models across multiple platforms.

MLFlow Projects: Provides a convention-based approach to set up your ML project so you benefit as much as possible from the work the developer community has put into the platform.

The main benefits identified in my initial research were the following:

  • Works with any ML library and language
  • Runs the same way anywhere
  • Designed for both small and large organizations
  • Provides a best-practices approach for your ML project
  • Serving layers (REST + batch) come almost for free if you follow the conventions

Solution

The focus of this article is to show the baseline ML model and how MLFlow was used to aid experiment tracking during model training and the productionization of the model.

Installing MLFlow

pip install mlflow

Model development tracking

After data munging, our cleaned data consisted of one conversation transcript per row with its corresponding risk label.

Our baseline (dummy) model was a simple logistic regression pipeline.
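A minimal sketch of such a baseline, assuming a TF-IDF vectorizer feeding scikit-learn's LogisticRegression (the file path and the "text"/"label" column names are illustrative assumptions):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# One conversation transcript per row, labelled 0 (low risk) or 1 (high risk)
data = pd.read_csv("transcripts.csv")
X_train, X_test, y_train, y_test = train_test_split(
    data["text"], data["label"], test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),    # turn transcripts into n-gram features
    ("clf", LogisticRegression()),   # baseline linear classifier
])
pipeline.fit(X_train, y_train)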

One of the first useful things you can do with MLFlow during model development is to log a model training run. You log, for instance, an accuracy metric, and the generated model is also associated with that run.
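A minimal sketch of logging such a run, building on the pipeline sketch above (the metric name and the "LogisticRegressionPipeline" artifact path are illustrative):

import mlflow
import mlflow.sklearn
from sklearn.metrics import accuracy_score

with mlflow.start_run():
    pipeline.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, pipeline.predict(X_test))
    # Parameters and metrics show up in the tracker UI for this run
    mlflow.log_param("model_type", "tfidf_logistic_regression")
    mlflow.log_metric("accuracy", accuracy)
    # Persists the fitted pipeline (pickle + conda.yaml) as a run artifact
    mlflow.sklearn.log_model(pipeline, "LogisticRegressionPipeline")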

At this point, the model above is saved and reproducible if needed at any point in time.

You can spin up the MLFlow tracking UI to look at the different experiments:

╰─$ mlflow ui -p 60000
[2019-09-01 16:02:19 +0200] [5491] [INFO] Starting gunicorn 19.7.1
[2019-09-01 16:02:19 +0200] [5491] [INFO] Listening at: http://127.0.0.1:60000 (5491)
[2019-09-01 16:02:19 +0200] [5491] [INFO] Using worker: sync
[2019-09-01 16:02:19 +0200] [5494] [INFO] Booting worker with pid: 5494

The backend of the tracker can be either the local file system or a cloud distributed file system (S3, Google Drive, etc.). It can be used locally by a single team member or shared across a distributed team so that experiments stay reproducible.
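As a sketch, a shared tracker could be started with the mlflow server command, pointing artifacts at a bucket (the store path and bucket name below are illustrative), with clients pointing at it before logging runs:

mlflow server \
    --backend-store-uri ./mlruns \
    --default-artifact-root s3://omdena-ptsd-mlflow/artifacts \
    --host 0.0.0.0 --port 60000

# on the client side, before logging runs:
# import mlflow
# mlflow.set_tracking_uri("http://<tracker-host>:60000")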

The image below shows a couple of model training runs together with the metrics and model artifacts collected:

Sample of experiment tracker in MLFlow

Once your models are stored you can always go back to a previous version of the model and re-run it based on the id of the artifact. The logs and metrics can also be committed to GitHub so they are stored in the context of a team and everyone has access to the different experiments and resulting metrics.
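For instance, a previously logged model can be reloaded by its run id (the id and artifact path below are illustrative) and used for new predictions:

import mlflow.sklearn

# "runs:/<run_id>/<artifact_path>" points at the artifact logged earlier
model = mlflow.sklearn.load_model(
    "runs:/104dea9ea3d14dd08c9f886f31dd07db/LogisticRegressionPipeline"
)
print(model.predict(["concatenated text of the transcript"]))  # e.g. [0]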

Now that our initial model is stored and versioned we can revisit the artifact and the project at any point in the future. The integration with Sklearn is particularly good because the model is automatically pickled in a Sklearn-compatible format and a Conda file is generated. You could also have logged a reference to a URI and checksum of the data used to generate the model, or the data itself if it is within reasonable size limits (preferably stored in the cloud).
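A sketch of what that data reference could look like inside the training run shown earlier (the paths and parameter names are illustrative):

import hashlib
import mlflow

# Fingerprint the exact file the model was trained on
with open("transcripts.csv", "rb") as f:
    checksum = hashlib.sha256(f.read()).hexdigest()

mlflow.log_param("data_uri", "s3://omdena-ptsd-data/transcripts.csv")
mlflow.log_param("data_sha256", checksum)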

Setting up a training job

Once you are done with model development, you will need to organize your project in a productionizable way.

The most basic component is the MLproject file. There are multiple options to package your project: Docker, Conda, or a bespoke setup. We will use Conda for its simplicity in this context.

name: OmdenaPTSD
conda_env: conda.yaml
entry_points:
  main:
    command: "python train.py"

The entry point defines the command that is run when you execute the project, in this case a training script.

The conda file contains a name and the dependencies used in the project:

name: omdenaptsd-backend
channels:
  - defaults
  - anaconda
dependencies:
  - python==3.6
  - scikit-learn=0.19.1
  - pip:
    - mlflow>=1.1

At this point you just need to run the project with the mlflow run command.
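For example, from the project root containing the MLproject file (MLFlow resolves the Conda environment and executes the main entry point):

mlflow run .

The same command also accepts a Git URI, so the training job can be run straight from the repository.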

Setting up the REST API classifier backend

To set up a REST classifier backend you don’t need any job setup; you can use a model persisted from a Jupyter notebook.

To serve a model you just need to run the models serve command with the URI of the saved artifact:

mlflow models serve -m runs://0/104dea9ea3d14dd08c9f886f31dd07db/LogisticRegressionPipeline
2019/09/01 18:16:49 INFO mlflow.models.cli: Selected backend for flavor 'python_function'
2019/09/01 18:16:52 INFO mlflow.pyfunc.backend: === Running command 'source activate mlflow-483ff163345a1c89dcd10599b1396df919493fb2 1>&2 && gunicorn --timeout 60 -b 127.0.0.1:5000 -w 1 mlflow.pyfunc.scoring_server.wsgi:app'
[2019-09-01 18:16:52 +0200] [7460] [INFO] Starting gunicorn 19.9.0
[2019-09-01 18:16:52 +0200] [7460] [INFO] Listening at: http://127.0.0.1:5000 (7460)
[2019-09-01 18:16:52 +0200] [7460] [INFO] Using worker: sync
[2019-09-01 18:16:52 +0200] [7466] [INFO] Booting worker with pid: 7466

Voilà: a scalable backend server (running gunicorn) is ready without writing any code beyond training your model and logging the artifact following the MLFlow packaging conventions. It frees machine learning engineering teams that want to iterate fast from the cumbersome infrastructure work of setting up a repetitive, uninteresting boilerplate prediction API.

You can immediately start sending prediction requests to your server:

curl http://127.0.0.1:5000/invocations -H 'Content-Type: application/json' -d '{"columns":["text"],"data":[["concatenated text of the transcript"]]}'
[0]

The smart thing here is that the MLFlow scoring module uses the Sklearn model input (a pandas schema) as the spec for the REST API. Sklearn is just the example used here; MLFlow also has bindings for H2O, Spark, Keras, TensorFlow, ONNX, PyTorch, etc. It basically infers the input from the model packaging format and offloads the data to the scoring function. It’s a very neat software engineering approach to a problem machine learning teams face every day, freeing engineers and scientists to innovate instead of working on repetitive boilerplate code.
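As a sketch, the same call from Python, assuming the pandas split-oriented JSON format used in the curl example above (the transcript text is illustrative):

import pandas as pd
import requests

# One row per conversation, same single "text" column the pipeline was trained on
payload = pd.DataFrame({"text": ["concatenated text of the transcript"]})

response = requests.post(
    "http://127.0.0.1:5000/invocations",
    headers={"Content-Type": "application/json"},
    data=payload.to_json(orient="split", index=False),
)
print(response.json())  # e.g. [0] for low risk, [1] for high risk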

Going back to the Omdena challenge: this backend is now available for the frontend team to connect the chatbot app to the risk classifier at the most convenient point (most likely after a critical mass of open-ended questions).

Some final comments

Pros:

  • Lightweight compared to other, more opinionated tools (for example PredictionIO or Kubeflow).
  • Works with any ML library and language.
  • Supports training and serving of multiple model types (TensorFlow, H2O, custom-made, etc.) in multiple environments (SageMaker, Azure ML, Databricks).
  • Good ideas in terms of standardizing ML pipelines end to end.

Cons:

  • MLFlow is still in its early days (version 1.x).
  • Not a complete machine learning platform (an orchestrator is still needed to handle the data preparation phases and trigger the training process).

Next steps

  • Implement a champion/challenger pipeline with the state-of-the-art classifier described in [1]
  • Deployment scalability
  • Running multiple pipelines in parallel
  • Improve the deployment approach

If you want to join one of Omdena’s challenges and make a real-world impact, apply here.

If you want to receive updates on our AI Challenges, get expert interviews, and practical tips to boost your AI skills, subscribe to our monthly newsletter.

We are also on LinkedIn, Instagram, Facebook, and Twitter.


Natu Lauchande
Omdena

Principal Data Engineer in the Fintech space. Currently tackling problems at the intersection of AI, ML, and Software Engineering.