MLOps at Edge Analytics | Model Deployment

Part Five of Five

Connor Davis
Edge Analytics
7 min read · May 5, 2023

--


As machine learning models become more widely deployed, ML practitioners have shown increasing interest in MLOps. In our introductory blog, we gave a brief background on how we think about MLOps at Edge Analytics.

So far in this series we have stored raw data and curated methods for accessing it, processed the data in preparation for modeling, trained and tuned a variety of predictive models, and tracked those models to find the ones that perform best. The last step we cover is how to get those models into the hands of users. This post is the final installment in the five-part series.

You can find the other blogs in the series by following the links below:

Running in a Jupyter notebook is typically not the end state of a trained model. To be useful in the wild, it must be deployed somewhere for inference. Where and how the model is deployed depends on how it is used. Some general questions to consider when figuring out how to serve a model to users include:

  • Who needs access to the model outputs?
  • How much does inference time matter?
  • How often does the model need to be run?
  • Where will the model be hosted?

There are many ways to package and deliver a TensorFlow, PyTorch, or scikit-learn model, including solutions from a variety of cloud providers. When a user needs occasional access and has an active internet connection, we often set up an API endpoint on AWS Lambda. If we need a GPU for faster inference, we may instead use AWS SageMaker. When the user needs continuous input and real-time output on an edge device, we might use TensorFlow Lite to package the model and deploy it to a microcontroller. We will briefly touch on each of these three deployment methods in this blog, with the most focus on deployment using AWS SageMaker.

Model deployment in AWS Lambda

AWS Lambda is an event-driven, serverless compute resource. When creating an AWS Lambda function, the developer specifies an event that triggers AWS to provision resources and run a model inference function. Because you are only charged for the time it takes to run your hosted function, this serverless approach keeps running costs low for models that are only occasionally accessed. Lambda also handles scaling, memory allocation, and security, making it a safe and easy-to-use model deployment resource. One drawback is that Lambda does not currently offer GPU resources, which can be a bottleneck for model inference. See this blog we wrote for an in-depth look at getting started with AWS Lambda!
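As a rough sketch of what this can look like, the handler below loads a model once per container and runs inference on each triggering event. The file names, event format, and use of a scikit-learn model packaged with joblib are assumptions for illustration, not details of a specific deployment.

import json

import joblib  # assumes a scikit-learn model packaged alongside the function code

# Load the model outside the handler so it persists across warm invocations.
model = joblib.load("model.joblib")

def lambda_handler(event, context):
    # Assumes the triggering event (e.g. an API Gateway request) carries a
    # JSON body with a "features" list for a single example.
    body = json.loads(event["body"])
    prediction = model.predict([body["features"]])
    return {
        "statusCode": 200,
        "body": json.dumps({"prediction": prediction.tolist()}),
    }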

Model deployment in AWS SageMaker

Unlike AWS Lambda, with AWS SageMaker we can provision persistent instances to run models. These instances are always running, so you avoid the cold starts that Lambda functions can experience after periods of inactivity or after code updates. SageMaker offers many of the perks of AWS Lambda, with the added bonuses of being a model-first service and offering large, GPU-based instances as options.

Recall that for our example MLOps pipeline, we used the Blood Cell Images dataset from Kaggle to generate a model that predicts cell type from a microscope image. We can think of deploying this model to SageMaker in three steps:

  1. Formatting the model for deployment
  2. Deploying the model
  3. Accessing the deployed model

Here we take a closer look at each of these steps.

Step 1: Formatting the model for deployment

Formatting your model for deployment will vary depending on which deep learning framework you choose. Although there are several file formats for saving a TensorFlow model, SageMaker requires the SavedModel format. The four components of the SavedModel for our example look like this:

variables 
|--variables.data-00000-of-00001
|--variables.index
keras_metadata.pb
saved_model.pb

If the ModelCheckpoint callback was used in your TensorFlow model training script, it will automatically save models using this layout. Conveniently, if you used Weights and Biases for model tracking and included the WandbModelCheckpoint, the checkpoint models will also be saved with this format. Otherwise, if the model is saved in a different format, such as an H5 file, you will need to load and save it again:

Converting a TensorFlow model file to the proper format for deployment.
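A minimal sketch of that conversion, assuming the model was originally saved as an H5 file (the file and folder names are placeholders):

import tensorflow as tf

# Load the H5 model and re-save it; a path with no .h5 extension is written
# in the SavedModel format that SageMaker expects.
model = tf.keras.models.load_model("my_model.h5")
model.save("export_folder/001")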

Once you have the model files in the proper format, they need to be placed in a larger folder structure for export to the SageMaker endpoint. The overarching file structure should look like this:

export_folder
|--001
   |--variables
      |--variables.data-00000-of-00001
      |--variables.index
   |--keras_metadata.pb
   |--saved_model.pb

You can give the “export_folder” any name you want, but the naming conventions for its contents are stricter. Importantly:

  • The model files, especially the “variables” files, are sensitive to name changes and should be named as shown above.
  • The folder containing the model files (in this example, the “001” folder) should be named with only the model version number. Even changing the name to “model_001” may cause the model deployment to fail.

Once the model has been properly saved in the “export_folder,” the folder needs to be converted to a tar.gz file. This is done easily with the tarfile package in Python:

Saving a TensorFlow model as a tar.gz file.
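A minimal sketch, assuming the folder structure above and a placeholder archive name:

import tarfile

# Compress the export folder so that the "001" version directory sits at the
# top level of the archive.
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("export_folder", arcname=".")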

Step 2: Deploying the model

If the setup in Step 1 has been done correctly, actual deployment to SageMaker is simple and can be done in four lines of code using the Python SageMaker SDK.

Deploying a TensorFlow model to a SageMaker endpoint.
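A sketch along those lines is shown below; the S3 path, framework version, instance count, and instance type are placeholder assumptions you would replace with your own values.

import sagemaker
from sagemaker.tensorflow import TensorFlowModel

# The role grants SageMaker permission to pull the model artifact and spin up instances.
my_sagemaker_role = sagemaker.get_execution_role()

model = TensorFlowModel(
    model_data="s3://my-bucket/model.tar.gz",  # the tar.gz created in Step 1
    role=my_sagemaker_role,
    framework_version="2.11",  # TensorFlow version used to train the model
)
predictor = model.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")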

A few notes about the last two lines of code:

  • For the TensorFlowModel instance, make sure you know what version of the TensorFlow package was used to create the model. This will be your framework_version argument.
  • The variable my_sagemaker_role can be found by calling get_execution_role() from the SageMaker SDK.
  • Once the model is deployed, it will run constantly on the number and type of instances that you specify. Make sure to select an appropriate initial_instance_count and instance_type so you don’t incur unexpected costs. Here is a list of SageMaker instance types and their prices.

Step 3: Accessing the deployed model

With the model deployed to a SageMaker endpoint, we can now access it to make live predictions on new data. First, make sure to preprocess the new data the same way you preprocessed the training data. To do this, we recommend using the same data processing pipeline that was set up prior to model development.
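For example, if the training pipeline resized images and scaled pixel values to [0, 1], the new images need the same treatment. The sketch below assumes that convention and uses placeholder file names and image size.

import numpy as np
from tensorflow.keras.utils import img_to_array, load_img

def preprocess(image_path, target_size=(224, 224)):
    # Resize and rescale exactly as the training pipeline did (assumed here).
    image = load_img(image_path, target_size=target_size)
    return img_to_array(image) / 255.0

images = np.stack([preprocess(path) for path in ["cell_001.jpg", "cell_002.jpg"]])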

There are three main methods for sending images to the model endpoint and retrieving results. We can use the SageMaker SDK, the Boto3 SDK, or the CLI. Each method follows the same general process:

  1. Serialize the input images.
  2. Send the images to the model endpoint.
  3. Receive and parse the output.

With the SageMaker SDK, the first and third steps are mostly handled by the package. For example, we can use the SDK's JSON serializer and deserializer classes; when passed as arguments to the Predictor class, these serialize data sent to the model endpoint and deserialize data returned from it. Remember that if our input images are numpy arrays, they need to be converted to Python lists to be JSON serializable.

Accessing a SageMaker model endpoint with the SageMaker SDK.
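A minimal sketch with the SageMaker SDK, assuming a placeholder endpoint name and image shape:

import numpy as np
from sagemaker.deserializers import JSONDeserializer
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer

# Endpoint name and image shape are placeholder assumptions.
predictor = Predictor(
    endpoint_name="blood-cell-classifier",
    serializer=JSONSerializer(),      # serializes the request for us
    deserializer=JSONDeserializer(),  # parses the JSON response for us
)

images = np.zeros((2, 224, 224, 3))        # stand-in for preprocessed images
payload = {"instances": images.tolist()}   # numpy arrays must become lists
predictions = predictor.predict(payload)["predictions"]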

Using the Boto3 SDK requires us to serialize the input and deserialize the output. Also, recall from our data storage blog that the Boto3 client returns metadata along with the model response, so the output needs to be parsed.

Accessing a SageMaker model endpoint with the Boto3 SDK.
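A minimal sketch with Boto3, again assuming a placeholder endpoint name and preprocessed images:

import json

import boto3
import numpy as np

runtime = boto3.client("sagemaker-runtime")

images = np.zeros((2, 224, 224, 3))                   # stand-in for preprocessed images
payload = json.dumps({"instances": images.tolist()})  # serialize the input ourselves

response = runtime.invoke_endpoint(
    EndpointName="blood-cell-classifier",
    ContentType="application/json",
    Body=payload,
)

# The response includes metadata; the model output lives in the "Body" stream.
predictions = json.loads(response["Body"].read())["predictions"]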

To use the CLI, the serialized input images need to be saved to disk (e.g. as input_images.json). Similarly, the model response will be written to disk before it can be parsed (e.g. as model_response.json). Then, when running the CLI from the same directory as the saved input_images.json, we call:

Accessing a SageMaker model endpoint with the CLI.
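A sketch of that call, assuming the same placeholder endpoint name (the AWS CLI writes the model response to the output file given as the final argument):

aws sagemaker-runtime invoke-endpoint \
    --endpoint-name blood-cell-classifier \
    --content-type application/json \
    --body fileb://input_images.json \
    model_response.json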

All three of these SageMaker access methods are simple and effective. You should use the one that is most appropriate for your application!

Model deployment to the edge

Low-powered edge devices, including wearables and smartwatches, are ubiquitous. Many wearables benefit from models making predictions on the hardware itself. Deploying models to an edge device, like a microcontroller, is a very different process from cloud deployment. You'll likely be working in a low-level programming language like C, C++, or Rust, and there may be specific constraints related to the ARM architecture or firmware compiler. Memory and power constraints also range from moderate to severe; thinking about RAM in terms of kilobytes or even bytes is commonplace. Managing data inputs and outputs also requires careful attention.

In general, deploying models to microcontrollers is a bespoke process that depends heavily on the hardware and firmware setup, and it calls for someone experienced with low-level languages for embedded devices. We've had success writing our own C implementations of algorithms, but we have also found that TensorFlow Lite for Microcontrollers can reduce development cycle times once its libraries are integrated into the code base.

What now?

The raw data has been stored, and data I/O methods have been developed. The data has been processed and used to train and tune a large number of models. Those models have been logged and tracked to find the most effective one. Now, that selected model has been deployed and is ready to be integrated into your application.

However, the work of MLOps doesn’t stop here. The deployed model still needs to be rolled out to your users. It must be monitored to ensure there are no performance issues or unexpected outputs. And as new data comes in, it should be stored, processed, and used to tune your models. The need to maintain and improve ML models is constant, but as this work continues, we hope you return to the five pillars of MLOps we’ve explored in this series.

Machine learning at Edge Analytics

Edge Analytics helps companies build MLOps solutions for their specific use cases. More broadly, we specialize in data science, machine learning, and algorithm development both on the edge and in the cloud. We provide end-to-end support throughout a product’s lifecycle, from quick exploratory prototypes to production-level AI/ML algorithms. We partner with our clients, who range from Fortune 500 companies to innovative startups, to turn their ideas into reality. Have a hard problem in mind? Get in touch at info@edgeanalytics.io.
