Deploying a machine learning model as an API with Datmo, Falcon, Gunicorn, and Python

Nicholas Walsh
Published in Datmo
17 min read · Dec 7, 2017

How Datmo helps you wrangle the zoo of programming animals to create an easy-to-deploy machine learning API with an adaptable architecture.

What’s an API?

  • APIs (Application Programming Interfaces) are software communication methods built on a particular standard. Many companies offer public API products that solve specific problems for developers: by passing the API a parameter as an input, the user receives an output without needing to know (or understand) how the underlying task is done. APIs allow for cross-platform functionality, made possible by a platform-agnostic standard such as the REST specification. The beauty of APIs is that they're super accessible: whether you are building a mobile app, IoT devices, a server, or simply want a way to have your own microservices communicate with one another, APIs help you accomplish this.
  • In addition to having the peace of mind that your API will work within any sort of app architecture you’ll potentially use, you can even choose to make it available to other users without requiring them to download or install any software.

Why should I build one for my ML model?

  • Now, when it comes to machine learning, AI, and computer vision, there are a lot of API solutions that enable you to pass data as an input and receive a prediction or classification as the result. This is often referred to as MLaaS (Machine Learning as a Service) or AIaaS (Artificial Intelligence as a Service). With a little bit of elbow grease and a cheap/free cloud platform account, you too can have your very own API that you control.
  • While these blackbox solutions make sense for certain projects where the user can find an API with the exact model they need, accepting the perfect types of inputs, and at the price they are willing to pay, it’s rare for all of these stars to align. More importantly, from a quantitative modeling perspective, it doesn’t allow you to train a custom model that is perfectly tailored towards your use-case.
  • The goal of this tutorial is to showcase how you can go from a local machine learning model to a deployed API, empowering yourself (or others) to develop smart applications that leverage the ML and AI work you're already doing without needing to localize the machine learning code.

Background

You’ll need:

*Any cloud service provider can work here since Datmo is platform agnostic, but I’ll be providing instructions specifically for AWS during the server setup section.

The example model we’ll be using today:

Training on Fisher’s Iris Dataset, I’ve created a model using a Random Forest classifier with scikit-learn in Python3. You can check out the model training code in classifier.py, and the dataset in Iris.csv. You can read more about the Iris dataset here. All of the code in the example we’ll be using is included in the GitHub repo + Datmo model.

In order to use the model that our training code built, we need to save it in some form. For this, we'll be using Python's data serialization tool, `pickle`. Pickle works well for saving models in Python because deserializing the data structure (unpickling) is just as easy and uses the same module. For machine learning work in Python, pickling is preferable to JSON serialization because it is compatible with many more data structures and custom class types, and doesn't need to be human-readable in its serialized form.
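
To make the save/load step concrete, here is a minimal sketch of how a training script like classifier.py might pickle its fitted classifier to model.dat and later rebuild it. It uses scikit-learn's bundled copy of the Iris data as a stand-in for Iris.csv, so the details will differ from the repo's actual training code.

# Minimal sketch of pickling a fitted model (a stand-in for classifier.py;
# scikit-learn's bundled Iris data replaces Iris.csv here).
import pickle

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X, y)

# Serialize ("pickle") the fitted model to disk...
with open("model.dat", "wb") as f:
    pickle.dump(clf, f)

# ...and later rebuild it in memory ("unpickle") without retraining.
with open("model.dat", "rb") as f:
    restored_clf = pickle.load(f)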

Getting Started

There are three small python files that we’ll need to create here that will handle everything we need to make our model deployable.

predict.py — Given a set of unclassified data, perform a prediction from the saved model.

falcon_gateway.py — Accept API requests, API error handling, and serving API responses.

data_handler.py — Middleware that serves as the handoff code between the falcon_gateway and the predict function, handling any data transformations in between.

Here’s how all of the moving parts will work together (warning, big image):

X-Ray overview of the data states and app architecture

1. Writing a predict function for our model — predict.py

First we’ll need to write a function that can take an unclassified entry and perform a prediction on it. To do this, the script will need to rebuild the model in memory based on the pickle file (model.dat, in this case), and feed it a new entry to allow it to make a prediction. While it’s possible to retrain a model from scratch each time we want to make a prediction, this is incredibly resource intensive (especially in larger examples) and is a fundamentally different process from making a standalone inference, and as such, is very bad practice in machine learning.

Starting with the pre-existing model training code in classifier.py, I've written a predict function within a new file, predict.py (below), which enables me to take a set of flower measurements and make an inference on its class using the pickled model.dat model.

For this prediction, the model requires 4 numerical inputs (sepal_length, sepal_width, petal_length, petal_width — in this order) and returns a class prediction containing one of three species (Iris-setosa, Iris-versicolor, Iris-virginica). It’s important to understand the scope for your predict function, and assess what type of data (and format) it will accept. This will be important later when we write the data_handler.py file that will serve as the middleware between our API gateway and this predict function.

In the context of this model, our predict function assumes an input of a 1D pandas DataFrame with the values ordered based on the feature titles listed above in the commented out X_test object on line 6.
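
Here is a minimal sketch of what such a predict function can look like, assuming the model.dat pickle and the four feature columns described above; the variable names are illustrative rather than copied from the repo.

# Sketch of a predict function: rebuild the pickled model and classify one entry.
import pickle

import pandas as pd

def predict(model_usable_data):
    """Takes a 1-row pandas DataFrame with columns [sepal_length, sepal_width,
    petal_length, petal_width] and returns the predicted Iris species."""
    with open("model.dat", "rb") as f:
        model = pickle.load(f)  # rebuild the trained model in memory
    return model.predict(model_usable_data)

if __name__ == "__main__":
    # Example entry, ordered the same way as the training features.
    X_test = pd.DataFrame([[6.9, 3.2, 5.7, 2.3]],
                          columns=["sepal_length", "sepal_width",
                                   "petal_length", "petal_width"])
    print(predict(X_test))  # an array containing one species label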

2. Setting up our Falcon API Gateway — falcon_gateway.py

For the API framework, I chose to go with Falcon instead of Flask because it is more performant (roughly 4x faster for simple tasks in published benchmarks, and it can be compiled with Cython for further gains) and more readable, since it was designed exclusively for building APIs. Between the performance benefits and how easy it is for a reader to follow an example someone else coded, the choice was clear.

In Falcon, there are direct and intuitive links between Python objects and API framework entities.

To define an API Resource, instantiate a new Python class. In our example, I wanted two different resources: one for general info about the model, and another for predicts. Each has its own behaviors defined within its respective class.

To establish behavior for a particular type of request (e.g. GET) within a given Resource, define a Python method. All API gateway logic for that request can be defined here. In this example, we tell Falcon to return a 200 response status code, with a body that consists of a string of info about the API.

For our API to leverage our model, we’ll also need a predicts resource (class) that is able to accept POST requests (Python Method) in order to receive data and pass it along to other files/functions that will handle data massaging and prediction. Here’s the PredictsResource class. I also gave it a GET endpoint to describe the valid input and output schema for the Iris model.

Our main code for the Falcon gateway goes at the bottom. Lines 5 and 6 of the code snippet instantiate the Python classes for InfoResource and PredictsResource , while lines 8 and 9 define routes for the Falcon app — directing Python towards the behavior defined in the InfoResource for requests served at the /info endpoint, and the same for PredictsResource for /predicts requests.

The beauty of this architecture is that since this is a fairly standard schema for a machine learning prediction API Gateway, there isn’t much you’d need to change in this file beyond the names of the endpoints and the metadata they’re returning for the GET requests if you were to use a different model.
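
For reference, here is a condensed sketch of what the gateway can look like. The line numbers and exact response strings won't match the repo's falcon_gateway.py, and the data_handler helper called in on_post is a stand-in name, not the repo's actual API.

# Condensed sketch of a Falcon gateway with two resources and two routes.
import falcon

import data_handler  # our middleware (see the next section)

class InfoResource(object):
    def on_get(self, req, resp):
        resp.status = falcon.HTTP_200
        resp.body = ("This API serves an Iris species classifier. "
                     "POST measurements to /predicts for a prediction.")

class PredictsResource(object):
    def on_get(self, req, resp):
        resp.status = falcon.HTTP_200
        resp.body = ("Send a POST with JSON fields sepal_length, sepal_width, "
                     "petal_length, and petal_width to receive a species prediction.")

    def on_post(self, req, resp):
        raw_json = req.stream.read()  # raw request body
        resp.status = falcon.HTTP_200
        resp.body = data_handler.invoke_predict(raw_json)  # hypothetical helper name

app = falcon.API()

info = InfoResource()
predicts = PredictsResource()

app.add_route('/info', info)
app.add_route('/predicts', predicts)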

3. Writing our data handling middleware script — data_handler.py

This file is where a lot of the dirty work happens. The middleware's role in our app is to handle data wrangling so that our functional portions (API gateway, predictor) can be built in a robust and standard fashion. In our app, data_handler.py will deserialize raw API requests from falcon_gateway.py and convert them into the valid data structure for the predict function in predict.py. It then immediately passes the model_usable_data into the predict function, and once it has a prediction (raw_model_output), it performs the reverse operations: transforming raw_model_output into a serializable structure, then serializing it in a way that the falcon_gateway can serve as a response.
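
A minimal sketch of that flow, matching the stand-in helper name used in the gateway sketch above (the real file may structure this differently):

# Sketch of data_handler.py: JSON request body in, JSON response body out.
import json

import pandas as pd

from predict import predict

def invoke_predict(raw_json):
    # 1. Deserialize the raw request body passed along by the Falcon gateway.
    request_dict = json.loads(raw_json)

    # 2. Massage it into the structure predict() expects: a 1-row DataFrame
    #    with the four feature columns in the right order.
    model_usable_data = pd.DataFrame(request_dict,
                                     columns=["sepal_length", "sepal_width",
                                              "petal_length", "petal_width"])

    # 3. Hand it off to the predict function.
    raw_model_output = predict(model_usable_data)

    # 4. Convert the raw model output (a numpy array) back into something
    #    serializable, and serialize it for the gateway to return.
    return json.dumps({"species": raw_model_output.tolist()})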

4. Making our complete dev environment reproducible with Datmo

The beauty of Datmo is that we can rebuild our model and its environment as a perfect replica of the original. By leveraging containers, it handles code, package installations, as well as any other system level configurations we’ll need for our code to behave the same way elsewhere as it does locally.

Begin by going here to sign up and install Datmo.

Once installed, we’re going to perform the following steps to create a Datmo snapshot.

$ cd your-model-folder
$ datmo init
$ datmo task run 'python3 classifier.py'
$ datmo task ls
$ datmo snapshot task --id [task-id]
$ datmo snapshot create

$ datmo init will create a new GitHub repo + Datmo model.

$ datmo task run will run the training code classifier.py .

$ datmo task ls will show us all tasks run with this model in Datmo.

$ datmo snapshot task --id [task-id] will designate the last task to be queued for the snapshot.

$ datmo snapshot create will create the snapshot, syncing your local work with your Datmo model as well as the code storage repository on GitHub.

Now that our snapshot is created, we’re ready to start setting up the server for deployment.

Let’s do it live! (Deploying example using AWS)

Provision our Server

First, we'll need to provision a server for our code to "live" in on the cloud. On AWS, this is accessible through an EC2 (Elastic Compute Cloud) instance.

When you sign into the AWS dashboard, you should see EC2 available near the top, as it's one of AWS' most popular services.

Click on the EC2 icon. If it’s not there, we can search for it by typing “EC2” into the search bar.

Then click on “Launch an Instance”.

When prompted to select an AMI, select Ubuntu 16.04.

We don’t need a particularly high performance instance for a personal API that uses a lightweight model, so we’re going to go with the t2.micro, which also happens to be free for the first year of your account (be sure to hover over the “free tier” emblem if you want more details).

Once t2.micro is selected, press Review and Launch, and then Launch.

You will then be prompted to create a key pair, one half of which is the private key you'll need to send along with your access requests to the EC2 instance. You can select any name for your key pair, as it only changes the filename.

Select “Create a New Key Pair”, submit a name, and press “Download Key Pair”.

Once that’s downloaded, we should now be able to press “Launch Instances”, and then head back to our EC2 dashboard to see the status of our running instances.

Once you see that your Instance State and Status Checks are both running/passed, your instance is officially launched and ready!

Setting up our Instance’s Security Policies

Creating a new Security Group

Next up are setting permissions. Much like how you have a firewall and antivirus tools that prevent others from accessing your personal computer and doing some nasty things, it’s imperative for this same logic to be applied to our API server. Remember, as magical as it may seem, the cloud is just someone else’s computer :)

In AWS, this is accomplished through VPC Security Groups. When we provisioned our EC2 instance before, it was automatically put within a VPC (Virtual Private Cloud), which has a default set of access permissions defined in the policy for that particular security group.

EC2 instances are locked down by default for the sake of protection, so we’ll need to make ours available to our IP so that we can SSH into our server and start setting up our API. (Note: there are many different ways you can grant yourself permission, either through IP whitelisting, public/private keys, certificates, etc. For more info, check out the “Advanced Permissioning” section at the end of the blog post).

  • First, on the left hand menu under “Network & Security”, click on “Security Groups”.
  • Next, click “Create Security Group”. Add a security group name and description.

We’ll have to define rules for this group, specifically the policies that allow inbound and outbound traffic from our server. By default, the outbound traffic policy is fine here, as it allows for all types of traffic on all ports.

We’ll need to edit the Inbound traffic rules, though. We’ll be adding two custom rules here:

  • 1: Allow traffic on port 8000 (a Custom TCP rule). You can also set which locations/IPs to accept traffic from: Everywhere makes it completely public, while your own IP restricts it to the one you're working from currently. You can choose more fine-grained options as well, but I suggest starting with open settings to get it set up, and then nailing down the security (based on your desired use cases) afterwards. While this isn't ideal for security purposes, it prevents you from needing to debug multiple things at once.
  • 2: SSH (secure shell) connections will be enabled on port 22, allowing us to remotely connect to our EC2 instance and set up our code and files. For convenience, we will change the source to anywhere. Note: this means that anybody is allowed to send SSH connection attempts to your server; however, they will still require the private key supplied by AWS that we generated before.
  • Once your security group settings match the above layout, press “Create” or “Save”.

Pairing the EC2 instance with the new security group

Now that we have our instance provisioned, and a new security group setup, we’ll need to add the instance to that security group. Jump back over to “Instances” from the left menu, select the instance, and follow the “Actions” menu chain. Click on “Change Security Groups”.

By default, your instance will have the “launch-wizard-1” security group included, but we’ll also need to include the new group we just made — check the box and press “Assign Security Groups”.

Awesome, we’re finally done messing around on the AWS console. Your server (EC2 instance) is up and running, and you’ve set all the security policies that will enable us to both SSH into the server, as well as eventually serve up the API requests from our Falcon gateway.

Setting up our server from the inside

Establish an SSH connection with the server

Now that we've set our permissions, we can use the key (.pem file) that we downloaded to connect directly to the server over the command line. We'll do this with a secure shell (SSH) connection, which is a native utility in macOS and Linux operating systems. If you're on Windows, check out this guide for installing PuTTY, an SSH client workaround.

The format for our SSH connection will be as follows:

$ ssh -i "<LocalKeyName>" ubuntu@<Public Server IP>

For example, if your key file was at "documents/keys/my_key.pem" and your server's public DNS was ec2-55-666-777-888.compute-1.amazonaws.com, your SSH request would look like the following:

$ ssh -i "documents/keys/my_key.pem" ubuntu@ec2-55-666-777-888.compute-1.amazonaws.com

Alternatively, you can find this by pressing “Connect” along the top of the EC2 Instances AWS Console dashboard.

Install Datmo’s CLI on the Remote Server

Because Datmo tracks everything that a model needs to be reproduced and run anywhere, we won’t need to spend any time manually installing environments, programming languages, or libraries/packages on our remote server.

Before we fetch the installer from Datmo, we’ll want to run the following commands to make sure our server is up to date:

$ sudo apt-get update
$ sudo apt-get install gcc
$ sudo apt-get install make

Next, run the following commands to fetch and install the Datmo installer.

$ curl -OL https://datmo.com/download/datmo-cli-ce-installer.sh 
$ sudo bash datmo-cli-ce-installer.sh

Note: Since the installer includes everything you’ll need to handle your environments, it’ll take a little while to download. While waiting for it to finish, be sure to follow any prompts it may give you along the way.

Bring the Model to the server:

Jump on over to Datmo’s web platform and sign in. We’ll first have to fork the example model at https://datmo.com/nmwalsh/datmo_falcon_api . Once the forked model appears on your account, you’re ready to clone it to the server.

In the terminal you’re using to SSH into the remote server, type:
$ datmo clone YOUR_USERNAME/datmo_falcon_api

Deploying your Model:

While the following deployment commands are quick and easy, there’s a lot of power packed inside.

$ screen
$ cd datmo_falcon_api
$ datmo task run "/usr/local/bin/gunicorn --access-logfile - -b 0.0.0.0:8000 falcon_gateway:app" --port 8000

Some notes on the deployment commands:

The $ screen command allows our process to run detached from the current session, so if we close our SSH connection, the API process will not terminate.

--access-logfile - enables connection logging in gunicorn (disabled by default), with the trailing - sending the log to standard output. This lets you see a running log of the connections made to your API, and their corresponding responses!

-b 0.0.0.0:8000 binds the gunicorn WSGI server to port 8000 on all network interfaces (0.0.0.0) of the container that the model (and gunicorn) are running in within the EC2 instance. This designates which port gunicorn listens on to receive a data stream from outside of the container.

falcon_gateway:app allows gunicorn to bind our falcon python app as the official WSGI app which will serve requests and responses to/from the server. Gunicorn and Falcon are a dynamic duo!

The --port 8000 flag opens up port 8000 on the Docker container, so that it can receive requests from the physical server and return responses as well. This is the outward-facing port of the container that gunicorn is also set to be listening on.

CTRL+A, then D detaches the current screen session (the gunicorn app and all of the internal ML code), pushing it to the background so that we can continue to do other things on the server without needing to terminate the current process.

Testing it out:

In our falcon_gateway.py file, we defined our endpoints and the proper responses. There are two GET endpoints and one POST endpoint.

First, let’s test the GET endpoints with:

curl <Your Public Instance IP>:8000/info
curl <Your Public Instance IP>:8000/predicts

Alternatively, you can check directly in the browser with:

http://<Your Public Instance IP>:8000/info
http://<Your Public Instance IP>:8000/predicts

Now, let’s perform an inference using the POST endpoint:

curl http://<Your Public Instance IP>:8000/predicts -L -X POST -d '{"sepal_length": [6.9], "sepal_width": [3.2], "petal_length": [5.7], "petal_width": [2.3]}' -H 'Content-type: application/json'

For our live example, this looks like the following:

curl ec2-54-183-245-15.us-west-1.compute.amazonaws.com:8000/predicts -L -X POST -d '{"sepal_length": [6.9], "sepal_width": [3.2], "petal_length": [5.7], "petal_width": [2.3]}' -H 'Content-type: application/json'
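
The exact response body depends on how your data_handler.py serializes the model output; with a serialization like the one sketched earlier, it would look something like:

{"species": ["Iris-virginica"]}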

Congrats! You now have your own machine learning model deployed as an API on a remote server!

Adapting this boilerplate to your model:

1) In the falcon_gateway.py file, you'd need to change the info you want your API to respond with at GET /info or GET /predicts.
2) Your predict.py file, while straightforward, is contingent on your specific model. There's no catch-all here; you'll just need to ensure that, given some data structure, you can feed it into your model and get a prediction output. This is typically identical to the format your data takes when it is used for training, after pre-processing is done.
3) data_handler.py is where the majority of the changes would need to happen. Because there are many different permutations of potential API request methods, model inputs, model outputs, and API responses, data_handler.py needs to play the middleman, taking you from your API request to a data structure that is ingestible by predict, and vice versa.

As an example:
Let's say you have an image identification model that takes an image as an input and returns a string describing what it thinks is in it. You could send a URL (or the image itself as bytes) as the input in the API request, and data_handler.py would then need to fetch the file, perform any size or color transformations necessary for your model to use it, and then call the predict function on the transformed image.
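
A hypothetical sketch of that middleware, assuming the request supplies an image_url field and that requests and Pillow are installed (every name here is illustrative, not part of the example repo):

# Hypothetical data_handler for an image model; all names are illustrative.
import io
import json

import requests
from PIL import Image

from predict import predict  # your image model's predict function

def invoke_predict(raw_json):
    request_dict = json.loads(raw_json)

    # Fetch the image from the URL supplied in the request...
    response = requests.get(request_dict["image_url"])
    image = Image.open(io.BytesIO(response.content))

    # ...apply whatever size/color transformations the model expects...
    model_usable_data = image.convert("RGB").resize((224, 224))

    # ...then call predict and serialize whatever label it returns.
    return json.dumps({"label": predict(model_usable_data)})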

Going Forward:

  • Better Model Deserialization Solution: In our example we unpickle the model.dat file each time we make a prediction, which isn't best practice for two reasons. 1: For larger models (especially neural nets), this operation can become computationally intensive and add noticeable latency to each request. 2: Under a large number of requests, it puts unnecessary computational strain on your system, which the large models from point 1 only make worse. The solution is to persist the model in memory on app instantiation: unpickle it at the bottom of falcon_gateway.py, which is loaded once on app startup, rather than in data_handler.py or predict.py, which are invoked on every API request (see the sketch after this list).
  • Advanced Permissioning: Edit your cloud security group in AWS to enable more nuanced permissions.
  • Develop Tests/Health Checks: Postman is an amazing tool that enables you to quickly and easily send requests to RESTful APIs in a non-programmatic fashion. Once downloaded, you can save a collection of example GET/POST calls for testing your API’s functionality when you make changes, as well as have health checks set up to alert you in case your API goes down. Postman is also useful for messing around with other APIs once you’ve acquired your access key, without needing to write any code.
  • Scaling up infrastructure: Services like AWS Elastic Beanstalk allow for automated resource provisioning, so that you can reserve (and pay) for only the resources you need.
  • Advanced Error Handling: There are a lot of different situations in which the API in its current form will not work. By developing more specific error responses, we can help the user understand what they need to change in order to properly utilize the resource. Situations that come to mind include: not passing all four required inputs for the model, using datatypes that aren’t accepted, or invalid data structure (non-JSON) passed with the -d flag during POST request.
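
As a rough sketch of that first point, the gateway could unpickle the model once at module level (once per gunicorn worker) and pass it along, instead of re-reading model.dat on every request; the names below are illustrative, not the repo's code.

# Sketch of in-memory model persistence: unpickle once when gunicorn imports
# the gateway module, not on every request. Names are illustrative.
import pickle

import falcon

import data_handler

with open("model.dat", "rb") as f:
    MODEL = pickle.load(f)  # loaded once per worker process

class PredictsResource(object):
    def on_post(self, req, resp):
        resp.status = falcon.HTTP_200
        # data_handler now receives the already-loaded model instead of
        # unpickling model.dat itself.
        resp.body = data_handler.invoke_predict(req.stream.read(), MODEL)

app = falcon.API()
app.add_route('/predicts', PredictsResource())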

I hope you enjoyed this exercise in turning your ML model into a personal API. At the enterprise level, there are countless additional concerns you'd need to worry about, from model versioning and building to deploying and post-deployment monitoring, all of which we solve for in Datmo's enterprise version.

If you’re a company looking to deploy machine learning to production with an end-to-end pipeline, feel free to reach out, we’d love to chat and show you a demo.

Welcome to Datmo — your AI workflow, simplified. Check out a live demo of our tool and sign up for free on our website.

If you like what we’re doing at Datmo, show us some love by clapping and sharing the story!


Nicholas Walsh

Developer Relations at Amazon Web Services. Formerly MLH, Datmo, Wolfram Research. ❤️ esports, AI/ML, and Dunkin' Donuts coffee.