How to Train a Machine Learning Model as a REST API and Build a Spam Classifier (Part 1)

DataStax
Building Real-World, Real-Time AI
Jan 31, 2022

Author: Pieter Humphrey


In this tutorial, Part 1 of a 2-part series, we show you how to create a machine learning model, train it, and turn it into a REST API with Astra DB, a serverless, managed Database-as-a-Service built on Apache Cassandra®.

In this step-by-step video tutorial taught by Coding for Entrepreneurs’ Justin Mitchel and sponsored by DataStax, you will learn to create a machine learning model, train it, and turn it into a REST API. Whether you are new to machine learning or keen to learn best practices, this is a fantastic hands-on project to build a custom machine learning model from scratch.

This is the first part of a two-part series covering how to:

  • train and build an AI model
  • integrate a NoSQL database (to store inference results)
  • deploy an AI model into production

In Part 1, you’ll create and train a spam detection machine learning model from scratch and turn it into a production-ready REST API. In Part 2, you’ll actually deploy it into production.

Figure 1. Demo showing the deployed version on ngrok predicting spam or ham (not spam).

With deep learning and AI, handling spam content has become much easier. Over time, and with the aid of direct user feedback, our spam classifier should rarely produce erroneous results.

This tutorial build uses the following technologies:

  • Astra DB, DataStax’s managed Apache Cassandra database-as-a-service
  • Jupyter Notebook, a web-based interactive computing platform
  • Google Colaboratory, a hosted Jupyter Notebook service to write and execute arbitrary Python code
  • Keras, a high-level deep learning API developed by Google for implementing neural networks
  • FastAPI, a popular web framework for developing REST APIs in Python based on pydantic
  • pydantic, a library for data parsing and validation

The requirements

We assume you already have experience with Python; if you don’t, check out Coding for Entrepreneurs’ 30 Days of Python series. If you have a solid foundation in Python, you can jump straight in. If not, our video tutorial will show you how to do the build, step by step.

All of our code is on GitHub: AI as an API and the AI as an API Course Reference, which is updated over time.

We will use DataStax Astra DB to automatically connect to Cassandra, which gives us up to 80 gigabytes for free. Start by signing up for an account here. You can also go through the checklist of what you need to download and install in this requirements.txt on GitHub.

Set up a project on VSCode and prepare your datasets

The first step is to configure our project using Python and Visual Studio Code. We used Python 3.9 in this tutorial and recommend at least Python 3.6. Download both Python and VS Code and save the project in a folder. Then, create and activate a virtual environment. If the name of the environment appears in parentheses at the start of your terminal prompt, you’ll know it has been activated.

Once that’s ready, here is an overview of how to set up, prepare, and export your datasets to prepare your training data for the machine learning model. We walk you through these steps in detail in our video tutorial.

Figure 2. Turning dataset labels into number representations, then into vectors.
  1. Prepare datasets: In order to create our AI algorithm, we have to start with a dataset. We will use a dataset from the machine learning repository of the University of California at Irvine (UCI). UCI has all kinds of open source datasets we can use in our projects. Download it by following the guide to this process on AI as an API GitHub.
  2. Download and unzip datasets: Our preferred method is to automate as much as possible. When you experiment with building the actual AI models, you need an easy way to grab these datasets from anywhere without extra configuration. We show you how to do this on Jupyter Notebook and how to create a base directory using Python here.
  3. Extract, review, and combine the dataset: Using Python pandas, enter the code to review and combine your datasets, then export them, with these step-by-step instructions. Once you’ve exported the dataset, you’ll see all the data in VS Code: about 7,000 texts from UCI’s SMS Spam Collection dataset. There might be some duplicated texts, but don’t worry about that as the dataset is still small. For now, we just need to prepare this dataset for training by turning the texts into vectors.
  4. Convert the dataset into vectors: Much of machine learning is based on linear algebra, which means we work with matrix multiplication and vectors. The challenge with matrix multiplication is that all of our data has to have the same sequence length. We use Keras to convert our dataset into equal-length vectors, which we explain in our tutorial.
  5. Split and export the vectorized dataset: Next, split up and export your training data so it contains variants. For this step, imagine that you are playing a game with a friend who beats you for the first 10 games using the same strategy. If he never changes his strategy, you will eventually learn his moves and beat him. Machine learning works the same way: we can’t focus too much on one way of being “correct”, so split up the data as much as possible in the early days of building out the algorithm. You’ll find all the code and instructions for this on GitHub and in our video tutorial. A consolidated sketch of steps 2 through 5 follows this list.
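To make these steps concrete, here is a minimal, hedged sketch of steps 2 through 5 in Python. The UCI download URL, file layout, and the vocabulary and sequence-length constants are illustrative assumptions; the course notebooks linked above remain the authoritative version.

```python
# A hedged sketch of steps 2-5; URL, paths, and constants are assumptions.
import zipfile
from pathlib import Path
from urllib.request import urlretrieve

import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

BASE_DIR = Path("datasets")
BASE_DIR.mkdir(exist_ok=True)

# Step 2: download and unzip the dataset in a repeatable way.
URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip"
zip_path = BASE_DIR / "sms-spam.zip"
if not zip_path.exists():
    urlretrieve(URL, zip_path)
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall(BASE_DIR / "sms-spam")

# Step 3: review and combine the data with pandas. The raw file is
# tab-separated with no header row: <label>\t<text>.
df = pd.read_csv(BASE_DIR / "sms-spam" / "SMSSpamCollection",
                 sep="\t", header=None, names=["label", "text"])
df["label_int"] = (df["label"] == "spam").astype(int)  # ham -> 0, spam -> 1

# Step 4: convert the texts into equal-length integer sequences.
MAX_WORDS, MAX_SEQ_LEN = 280, 300  # illustrative values
tokenizer = Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts(df["text"])
X = pad_sequences(tokenizer.texts_to_sequences(df["text"]), maxlen=MAX_SEQ_LEN)
y = df["label_int"].values

# Step 5: split into training and validation sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
```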

Training our machine learning model

Now that we’ve prepared our dataset, it’s time to train our AI Spam Classifier. This blog post from Coding for Entrepreneurs will have the most up-to-date information on the actual classifier training itself. We recommend that you launch this notebook on Google Colab, since it offers free GPUs, and copy it to your drive so you can make changes to it.

Figure 3. The Long Short-Term Memory (LSTM) machine-learning model.

Keras offers substantial documentation created by other great machine learning data scientists, who spend considerable time finding the best models for any given problem area. The model we are using is the Long Short-Term Memory (LSTM) network, an architecture that is common for text-related data and straightforward to build in Keras. LSTM works well for classification across two or more categories, which suits our spam classifier and its two labels, spam or ham.
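As a rough picture of what such a model looks like, here is a minimal Keras LSTM classifier that continues from the dataset sketch above (it reuses X_train, y_train, X_test, and y_test). The layer sizes, dropout rates, and epoch count are illustrative assumptions, not the exact values from the course notebook.

```python
# A minimal LSTM text classifier in Keras; hyperparameters are assumptions.
from tensorflow.keras.layers import Dense, Embedding, LSTM
from tensorflow.keras.models import Sequential

MAX_WORDS, MAX_SEQ_LEN, EMBED_DIM = 280, 300, 128  # match the data sketch

model = Sequential([
    Embedding(MAX_WORDS, EMBED_DIM, input_length=MAX_SEQ_LEN),
    LSTM(128, dropout=0.2, recurrent_dropout=0.2),
    Dense(2, activation="softmax"),  # two classes: ham and spam
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam", metrics=["accuracy"])

# X_train, y_train, X_test, y_test come from the earlier dataset sketch.
model.fit(X_train, y_train, validation_data=(X_test, y_test),
          batch_size=32, epochs=5)
model.save("spam-model.h5")  # the H5 file uploaded to object storage later
```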

Once you’re done with training the model, which we outlined in our video and in our post, it’s important to upload your model, tokenizer, and metadata to object storage. Since our machine learning model is going to get massive, cloud-based services are best for storing a nearly unlimited amount of data. You can set up two object storage providers, Linode and DigitalOcean, and upload these three files to one or both of them:

  • the spam classifier metadata
  • the spam classifier tokenizer
  • the H5 spam model (not the CSV or pickle file)

The next step is to implement a pipeline and a script that download the same three files from the cloud storage provider you chose, in a repeatable manner. Make sure you have pypyr and boto3 installed. Click here for the code and follow the video for instructions and examples.
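The course wires this up as a pypyr pipeline; the sketch below shows only the boto3 half of the job. The endpoint URL, bucket name, object keys, and environment variable names are all placeholders, not the course’s exact values.

```python
# A hedged sketch of pulling the three artifacts from an S3-compatible
# object store (Linode or DigitalOcean) with boto3; names are placeholders.
import os
import pathlib

import boto3

client = boto3.session.Session().client(
    "s3",
    region_name="us-east-1",
    endpoint_url="https://us-east-1.linodeobjects.com",  # assumed endpoint
    aws_access_key_id=os.environ["OBJECT_STORAGE_KEY"],
    aws_secret_access_key=os.environ["OBJECT_STORAGE_SECRET"],
)

BUCKET = "ai-as-api"  # placeholder bucket name
DEST = pathlib.Path("models")
DEST.mkdir(exist_ok=True)

for key in ("exports/spam-sms/spam-model.h5",
            "exports/spam-sms/spam-classifier-tokenizer.json",
            "exports/spam-sms/spam-classifier-metadata.json"):
    client.download_file(BUCKET, key, str(DEST / pathlib.Path(key).name))
```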

Configuring FastAPI

Now, let’s turn to the API portion of this project. There are three things you’ll need to do:

  1. Create a base directory for FastAPI.
  2. Load the Keras model and the tokenizer in FastAPI.
  3. Implement the predict method with Keras and FastAPI.

Once those steps are executed, we are going to create a reusable AI model class. This step is somewhat tedious because it re-implements what we already did, so coding along with the refactoring isn’t strictly necessary; watching how it’s done is enough. That said, refactoring into a data class makes the model easier to reuse. If you want to focus on deploying an AI model into production as soon as possible, skip this part.

After you’ve configured your base FastAPI app and loaded your Keras model and predictions, you’ll have a production-ready machine learning REST API.
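To give a sense of the shape of that app, here is a minimal sketch. The file names, the 300-token sequence length, and the label order are assumptions carried over from the earlier sketches, not the course’s exact code.

```python
# A minimal FastAPI app that loads the Keras artifacts at startup;
# paths, constants, and label order are assumptions.
from pathlib import Path

from fastapi import FastAPI
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import tokenizer_from_json

app = FastAPI()
MODEL_DIR = Path("models")
MAX_SEQ_LEN = 300  # must match the length used at training time
LABELS = ["ham", "spam"]

model = None
tokenizer = None

@app.on_event("startup")
def load_artifacts():
    global model, tokenizer
    model = load_model(str(MODEL_DIR / "spam-model.h5"))
    tokenizer = tokenizer_from_json(
        (MODEL_DIR / "spam-classifier-tokenizer.json").read_text())

@app.post("/predict")
def predict(q: str):
    x = pad_sequences(tokenizer.texts_to_sequences([q]), maxlen=MAX_SEQ_LEN)
    probs = model.predict(x)[0]
    top = int(probs.argmax())
    return {"query": q, "label": LABELS[top], "confidence": float(probs[top])}
```

Run it with `uvicorn main:app --reload` (assuming the file is named main.py); we’ll tighten up the request body with a pydantic schema in the testing section below.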

Setting up Astra DB and integrating Cassandra

This REST API has three primary purposes:

  1. provide accurate predictions on whether a string of text is spam or not
  2. improve the conditions for the model to make better predictions
  3. deploy the model publicly to production so other applications and users around the world can use it

To improve the conditions of the model, you’ll store your spam-or-ham queries, which are capped at 280 characters, and some of the prediction results in the NoSQL database Cassandra, managed by DataStax through the Astra DB service.

After signing up for your free Astra DB account, you’ll be able to store up to 80 gigabytes of data free each month, and you’ll get the API keys you need to configure your environment variables with these step-by-step instructions.

Then, install pydantic for FastAPI and create a base settings class. It holds your environment-variable configuration, and anything else you might want to configure on your project, in one place. Watch how to do this here.
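Here is a minimal sketch of what such a settings class can look like, using pydantic v1’s BaseSettings (current at the time of this tutorial). The field names are assumptions based on what the Astra DB integration needs.

```python
# A hedged sketch of pydantic base settings; field names are assumptions.
from functools import lru_cache

from pydantic import BaseSettings  # pydantic v1; moved to pydantic-settings in v2

class Settings(BaseSettings):
    astra_db_client_id: str
    astra_db_client_secret: str
    keyspace: str = "spam_inferences"  # placeholder keyspace name

    class Config:
        env_file = ".env"  # load values from a local .env file

@lru_cache
def get_settings() -> Settings:
    return Settings()
```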

Once your project is ready to integrate with the Astra DB database, jump into the Astra DB console, create a database, and download its secure connect bundle for the Cassandra driver. Then, configure your Astra DB cluster and session with these instructions.
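Under those instructions, the connection boils down to something like the following sketch; the bundle path and environment variable names are placeholders.

```python
# A minimal sketch of connecting to Astra DB with the Cassandra driver;
# the bundle path and credential names are placeholders.
import os

from cassandra.auth import PlainTextAuthProvider
from cassandra.cluster import Cluster

def get_session():
    cloud_config = {"secure_connect_bundle": "ignored/astradb_connect.zip"}
    auth_provider = PlainTextAuthProvider(
        os.environ["ASTRA_DB_CLIENT_ID"],
        os.environ["ASTRA_DB_CLIENT_SECRET"],
    )
    cluster = Cluster(cloud=cloud_config, auth_provider=auth_provider)
    return cluster.connect()
```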

Then, let’s go ahead and create the Cassandra model to store your data. The first field is the actual query used for the inference, shown in Figure 4; the second is the inference/prediction itself. There are only two labels in this case, so you just need to store one of them, for instance the “ham” label, along with the confidence ratio. Watch the video here to create the Cassandra model and here to store inference data in the model. Finally, you’ll also want to paginate the Cassandra model to see all the data objects that you’ve listed in there. A sketch of such a model follows Figure 4.

Figure 4. Inference data to be stored on Cassandra.
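Here is a hedged sketch of that model using the driver’s cqlengine object mapper; the class, keyspace, and column names are assumptions based on the inference data above.

```python
# A hedged sketch of the Cassandra model with cqlengine; names are assumptions.
import uuid

from cassandra.cqlengine import columns
from cassandra.cqlengine.models import Model

class SMSInference(Model):
    __keyspace__ = "spam_inferences"  # placeholder keyspace
    uuid = columns.TimeUUID(primary_key=True, default=uuid.uuid1)
    query = columns.Text()        # the original text, up to 280 characters
    label = columns.Text()        # predicted label, e.g. "ham"
    confidence = columns.Float()  # the model's confidence for that label
```

After registering your Astra DB session with cqlengine and calling `sync_table(SMSInference)`, each prediction can be written with `SMSInference.objects.create(query=..., label=..., confidence=...)`.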

By now, you have an easy way to extract a good chunk of the data that you stored in your Astra DB database. Cassandra offers eventual consistency, meaning every node in the cluster will eventually hold the correct data, and records are located by their primary keys.

What we really love about Astra DB is that we can add any fields and data that we need, really fast. It would be interesting to test your limits (if you’re comfortable) and see how fast Astra DB would respond if you put in a million entries. Let us know in the comments if you tried it!

Testing it out

Before you go into production, let’s test your AI as an API through ngrok. This emulates a production environment by exposing a local project to the world so you can test it from Google Colab. You’ll also need to add a schema with pydantic so that the correct data comes through from your POST requests; a sketch follows. Watch the video to test out your AI as an API.
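The schema can be as small as the sketch below; the field name `q` is an assumption, so match it to whatever your endpoint expects.

```python
# A minimal pydantic schema for the POST body; the field name is an assumption.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class SMSQuery(BaseModel):
    q: str  # the text to classify, up to 280 characters

@app.post("/predict")
def predict(query: SMSQuery):
    # Run the model on query.q as in the earlier sketch, then store
    # the inference in Cassandra before returning it.
    return {"query": query.q, "label": "ham", "confidence": 0.98}
```

With the app running locally, `ngrok http 8000` gives you a public URL you can call from a Colab notebook.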

Conclusion

If you’ve followed this article along with our video tutorial, you now know how to:

  • prepare datasets for machine learning using Python, Jupyter Notebook, and Keras
  • train a Long Short-Term Memory (LSTM) machine learning model using Keras
  • convert your machine learning model into a production-ready REST API with FastAPI
  • integrate your AI as an API with Astra DB, DataStax’s managed Apache Cassandra database, and test it out with ngrok

But it doesn’t end there. In Part 2 of this series, you’ll deploy this application from scratch on a virtual machine into production so that anyone in the world can use it. This is a bit different from deploying a standard Python web application because of the nature of our machine learning model. We’ll go through everything you need to ensure that your model runs perfectly, so stay tuned! In the meantime, check out our blog series on Real-World Machine Learning with Apache Cassandra and Apache Spark.

Follow the DataStax Tech Blog for more developer stories, check out our YouTube channel for tutorials, and follow DataStax Developers on Twitter for the latest news about our developer community.

Resources

  1. DataStax: The Open Stack for Modern Data Apps
  2. Astra DB: Multi-cloud Database-as-a-service Built on Cassandra
  3. Astra DB Sign Up Link
  4. Apache Cassandra: Open-source NoSQL Database
  5. DataStax Medium
  6. DataStax YouTube Channel
  7. DataStax Developer Twitter
  8. DataStax Cassandra Developer Workshops
  9. DataStax Academy: Apache Cassandra Course
  10. Real-World Machine Learning with Apache Cassandra and Apache Spark Part 1
  11. Real-World Machine Learning with Apache Cassandra and Apache Spark Part 2
  12. YouTube Tutorial: AI as an API Part 1
  13. Coding for Entrepreneurs
  14. Coding for Entrepreneurs YouTube Channel
  15. Coding for Entrepreneurs GitHub
  16. AI as an API GitHub
  17. AI as an API Course Reference GitHub
  18. 30-Days of Python Series
