Part 4: Deploying the Model to Serve X-Ray Diagnosis in Production

Alfonso Santacruz Garcia
8 min read · Dec 17, 2021


This is the fourth article in a five-part series on Using Computer Vision and NLP to Caption X-Rays.

This project aims to measure the similarity of machine-predicted captions to the actual captions provided by doctors. Our process has been broken down into several topics, each covered in its own article.

The code is hosted and usable at this GitHub repository.

Figure 1. Analyzing X-ray images. Photo by Jonathan Borba on Unsplash

In general, companies don’t care about state-of-the-art models, they care about machine learning models that actually create value for their customers. -Tyler Folkman-

The purpose of your ML model

Unless a machine learning model is meant to be purely experimental or a proof of concept, it is expected to make it out of the notebook and reach its full potential. Companies generally want these findings deployed in production, turned into a feature of an application, and creating value for everyone involved, especially if (let’s face it) that means generating revenue. With this in mind, it pays to build the model with an intended purpose from the start, embedded in a data pipeline that makes it easy to re-deploy, maintain and scale.

Model Deployment

There are multiple ways to deploy a machine learning model, each with its own tradeoffs. For example, the model can be embedded inside the application codebase: with the appropriate files that load the model weights, it can serve predictions like any other application function, route or API endpoint. This method usually entails low latency but high memory use, which makes it less popular for ML features in mobile applications. Since the predictive model is already encapsulated in the application, frameworks that efficiently scale the application will usually have no trouble scaling the predictive features as well.
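To make the embedded strategy concrete, here is a minimal sketch assuming a Flask application and a Keras model already saved to disk; the file name, route and payload format are placeholders, not part of our project:

```python
# Minimal sketch of the embedded approach: the model lives inside the app
# process and serves predictions from a regular route. "model.h5" and the
# payload format are hypothetical.
from flask import Flask, request, jsonify
import numpy as np
import tensorflow as tf

app = Flask(__name__)
# Loaded once at startup and kept in memory alongside the application.
model = tf.keras.models.load_model("model.h5")

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON payload with a nested list representing the input array.
    pixels = np.array(request.get_json()["pixels"], dtype="float32")
    prediction = model.predict(pixels[np.newaxis, ...])
    return jsonify({"prediction": prediction.tolist()})
```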

Another option is deploying the model as a serverless API endpoint. Buzzwords aside, this simply means that the model and its predict function are hosted in the cloud and the endpoint is just a URL. A request to that URL, carrying some inputs as a payload, triggers a handler function declared inside the serverless function interface. The handler reconstructs the model with its weights, uses the input to generate a prediction and returns the result. Applications using this approach need a stable internet connection, since every prediction is an API request to an endpoint in the cloud. As a result, this avenue usually implies higher latency with lower memory use. Compared with the strategy described in the previous paragraph, it can also be harder to scale: even though several optimizations can be made, many requests to the same endpoint in a short period of time could exhaust its memory and/or CPU capacity, resulting in rejected requests.

Exporting the model weights

For this article, we decided to create a serverless API endpoint to generate captions from chest X-ray images in production. Once the model has been trained, its weights and checkpoints are exported. Some cases are more straightforward than others: there is substantial documentation on how to export a Keras model and save its weights after training, and the same applies to loading the weights back into another model instance. Nevertheless, each of these approaches suits some cases better than others. In our specific scenario, the encoder and decoder were subclasses (submodels) of the standard Keras model class (`tensorflow.keras.Model`).

Even though it would be more comfortable to simply export the models as serialized files and load them back somewhere else, the best approach we found was to save the models’ weights as binary files into Google Drive, as seen in the code snippet below from a Google Colab implementation:

```python
decoder.save_weights('/content/drive/My Drive/Colab Notebooks/XRAY/decoderRNN_model/', save_format='tf')
encoder.save_weights('/content/drive/My Drive/Colab Notebooks/XRAY/encoderCNN_model/', save_format='tf')
```

Recreating the model with the exported weights

Once the models are safely stored, we can modify the share link for each of these binary files so that it becomes a direct download upon request. There are multiple websites capable of producing such links from Google Drive URLs, provided the file already has the right sharing permissions. This allows us to make requests to these open, direct-download links, read the content of the binaries, and load it into the models. An example of a function that does this can be found here.
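As a rough illustration of that step, the sketch below downloads the exported weight files into a local directory that `load_weights()` can later point to. The URLs, file names and directory are placeholders, not the project’s actual links:

```python
# Hedged sketch: fetch the exported weight binaries from direct-download links
# and write them to a temporary directory. URLs and file names are placeholders.
import os
import requests

ENCODER_WEIGHT_FILES = {
    "checkpoint": "https://drive.google.com/uc?export=download&id=<FILE_ID_1>",
    "ckpt.index": "https://drive.google.com/uc?export=download&id=<FILE_ID_2>",
    "ckpt.data-00000-of-00001": "https://drive.google.com/uc?export=download&id=<FILE_ID_3>",
}

def download_weights(files=ENCODER_WEIGHT_FILES, target_dir="/tmp/encoder_weights"):
    """Download each weight file and return the directory holding them."""
    os.makedirs(target_dir, exist_ok=True)
    for filename, url in files.items():
        response = requests.get(url)
        response.raise_for_status()
        with open(os.path.join(target_dir, filename), "wb") as f:
            f.write(response.content)
    return target_dir
```

The decoder weights can be fetched the same way into their own directory.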

It is important to note that the model objects need to be declared and initialized in code, so that those class objects exist and can accept the content of the binaries as their weights. An example of the model for our specific case can be found here.
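To show the general shape of that step, here is a toy subclassed model being instantiated and pointed at a saved checkpoint prefix. `ToyEncoder`, the layer size and the paths are stand-ins; our real encoder and decoder classes are linked above:

```python
# Hedged sketch: a subclassed Keras model must be instantiated before weights
# can be loaded into it. ToyEncoder and the paths below are placeholders for
# the project's actual encoder/decoder classes and checkpoint locations.
import tensorflow as tf

class ToyEncoder(tf.keras.Model):
    def __init__(self, embedding_dim):
        super().__init__()
        self.fc = tf.keras.layers.Dense(embedding_dim, activation="relu")

    def call(self, x):
        return self.fc(x)

encoder = ToyEncoder(embedding_dim=256)
# load_weights() takes the same checkpoint prefix that save_weights() wrote.
encoder.load_weights("/tmp/encoder_weights/ckpt")
```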

Now that we have our models instantiated and loaded with the trained weights, we need a function that takes an image as input and uses the models to generate a prediction. This may require the input image to undergo some preprocessing and feature extraction to match the encoder and decoder inputs. You can take a look at an example of a self-made predict function for our project here.
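The sketch below captures the general flow of such a predict function under several assumptions: an InceptionV3-style input size, a Keras `Tokenizer` with `<start>`/`<end>` tokens, and a decoder that takes the previous token, the encoder features and a hidden state and returns logits plus the new state. Our real function (linked above) differs in its details, particularly the feature-extraction step and the decoder’s signature:

```python
# Hedged sketch of a greedy-decoding predict function. The preprocessing,
# tokenizer vocabulary and the decoder call signature are assumptions made for
# illustration, not the project's exact implementation.
import tensorflow as tf

def predict_caption(image_bytes, encoder, decoder, tokenizer, max_length=40):
    # Decode and resize the raw bytes to the input size assumed by the encoder.
    img = tf.image.decode_jpeg(image_bytes, channels=3)
    img = tf.image.resize(img, (299, 299))
    img = tf.keras.applications.inception_v3.preprocess_input(img)

    # One forward pass through the encoder to get the image features.
    features = encoder(tf.expand_dims(img, 0))

    # Greedy decoding: start from <start>, stop at <end> or max_length.
    dec_input = tf.expand_dims([tokenizer.word_index["<start>"]], 0)
    hidden = tf.zeros((1, decoder.units))  # assumes the decoder exposes `units`
    words = []
    for _ in range(max_length):
        logits, hidden = decoder(dec_input, features, hidden)  # assumed signature
        predicted_id = int(tf.argmax(logits[0]))
        word = tokenizer.index_word.get(predicted_id, "")
        if word == "<end>":
            break
        words.append(word)
        dec_input = tf.expand_dims([predicted_id], 0)
    return " ".join(words)
```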

Last, the whole process described above needs to be abstracted and encapsulated in a single handler function that takes the request parameters and payload as input, executes the functionality described above and returns the result of the prediction. In addition, it is best practice to have this handler perform sanity checks on the input. What do we mean by this? Mainly checks and conditions that return errors when the input is not what our model and processing steps expect. Then, if all the sub-functions run successfully, the handler returns the prediction.
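Below is a rough sketch of what such a handler could look like, reusing the placeholder helpers from the earlier sketches (`download_weights`, `build_models` and `predict_caption` are illustrative names, and the base64 payload format is an assumption):

```python
# Hedged sketch of a handler with basic sanity checks. The helper functions and
# the payload format are placeholders, not the project's exact code.
import base64
import json

def handler(request):
    # Sanity checks: reject anything that is not the JSON payload we expect.
    payload = request.get_json(silent=True)
    if payload is None or "image" not in payload:
        return (json.dumps({"error": "expected JSON with an 'image' field"}), 400)
    try:
        image_bytes = base64.b64decode(payload["image"])
    except ValueError:
        return (json.dumps({"error": "'image' must be base64-encoded"}), 400)

    # Rebuild the models and produce the caption (steps sketched earlier).
    weights_dir = download_weights()                          # placeholder helper
    encoder, decoder, tokenizer = build_models(weights_dir)   # placeholder helper
    caption = predict_caption(image_bytes, encoder, decoder, tokenizer)
    return (json.dumps({"caption": caption}), 200)
```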

Deploying the model in a Cloud Function

For our specific project, we decided to use Google Cloud Platform (GCP) Cloud Functions. There are decent resources to get started. Once you create a project and start a cloud function, the functions described above can be implemented in Python (.py) files. However, only one of the functions declared in the `main.py` file can be selected as the entry point, which is basically a fancy name for the single function that will be executed when the trigger receives a request. In our case, that function is the handler, which runs the rest of the functions to output a prediction, as mentioned before.
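Put together, a `main.py` for the function could be organized roughly like this; the helper names mirror the placeholder sketches above, and only `handler` would be selected as the entry point in the console:

```python
# Hedged sketch of a possible main.py layout (not the project's actual file).
# Only "handler" is chosen as the entry point; the rest are plain helpers.

def download_weights():
    """Fetch the exported weight binaries into the function's temp directory."""
    ...

def build_models(weights_dir):
    """Instantiate the encoder/decoder subclasses and load their weights."""
    ...

def predict_caption(image_bytes, encoder, decoder, tokenizer):
    """Preprocess the image and greedily decode a caption."""
    ...

def handler(request):
    """Entry point: GCP invokes this with a Flask request object."""
    ...
```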

It is important to mention that the imported libraries need to be supported somewhere, somehow. What does this mean? The Python environment in GCP needs to have these packages accessible in its runtime and, sadly, not all dependencies are available. You can check this documentation for the list of pre-installed packages for each Python runtime version. For example, not every version of TensorFlow 2.0 is available, but the beta is supported, as shown in this requirements.txt. These constraints should ideally be considered before implementing the function itself. Trust me, it can be very time consuming to have your entire ML pipeline done only to realize that there are multiple fixes to make simply because a specific library is not supported for deployment.

Image 1. Image by the author. A screenshot of the code environment for GCP Cloud Functions

Once the function compiles successfully, the cloud function’s name appears next to a green check, meaning that no errors have been found so far. Clicking on the TRIGGER section at the top shows the endpoint’s URL.

Image 2. Image by the author. Example of an ML model that is ready to work in GCP Cloud Functions.

Incorporating the trigger in our app

Once all of this is done, you can incorporate a POST request into your application’s services and expect the prediction in the response, as seen in this example. Then, just make sure the HTML side renders the response from the request, as seen in the screenshot below.
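As a hedged illustration of that request from the application side (the trigger URL, file name and payload format below are placeholders):

```python
# Hedged sketch: POST a base64-encoded X-ray image to the Cloud Function's
# trigger URL and read the caption from the JSON response. The URL and payload
# format are placeholders.
import base64
import requests

TRIGGER_URL = "https://<region>-<project-id>.cloudfunctions.net/<function-name>"

with open("chest_xray.png", "rb") as f:
    encoded_image = base64.b64encode(f.read()).decode("utf-8")

response = requests.post(TRIGGER_URL, json={"image": encoded_image})
response.raise_for_status()
print(response.json()["caption"])
```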

Image 3. Image by the author. Screenshot of the predicted caption in the application’s UI.

One important aspect to think about is the function’s overall memory use. In some cases there will be no response because the trigger consumes more memory than is available. The memory limit can easily be raised in the function’s dashboard, up to 2 GB, which should generally be enough. Conveniently, GCP Cloud Functions also provides a dashboard to evaluate different metrics, including memory usage.

Future work

We have discussed and implemented one of the many existing deployment strategies for ML models. While some models are easier to deploy than others, the fundamental principles don’t change much: the model and its trained weights need to be accessible somehow to serve predictions from an ad-hoc input.

Some optimizations can be made to this approach, for example by using Cloud Storage to store the weight files and the tokenizer, making them easier to access than importing them from Google Drive into a Cloud Function’s temporary directory.
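For reference, fetching an artifact from a bucket with the official client library could look roughly like this; the bucket and blob names are placeholders:

```python
# Hedged sketch of the Cloud Storage alternative: download a weight file (or
# the tokenizer) from a bucket into the function's temp directory at startup.
# Bucket and blob names are placeholders.
from google.cloud import storage

def fetch_from_gcs(bucket_name="xray-captioning-artifacts",
                   blob_name="encoder_weights/ckpt.index",
                   local_path="/tmp/ckpt.index"):
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    bucket.blob(blob_name).download_to_filename(local_path)
    return local_path
```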

I hope these bits of information are helpful to show how we deployed our ML model in a toy application and how you can do it too, regardless of your level of expertise and the complexity of your model.

If you want to learn more, I’d strongly recommend taking a look at Rustem Feyzkhanov’s article about serving deep learning models using TF 2.0 with Cloud Functions.

So now, go ahead and make the world a better place using Machine Learning!
