DICOM Image Classification to detect Pneumonia using Snowflake!

Image classification is the task of assigning a label or class to an entire image, where each image is expected to belong to exactly one class. Image classification models take an image as input and return a prediction about which class the image belongs to.

An example of Labeled Image Data

An image can potentially belong to any of n classes. Manually checking and classifying images is a tedious task, especially when the images are massive in number (say 10,000, 100K, or 1 million), so it is highly beneficial, especially in healthcare, to automate the entire process with computer vision that predicts the class of an image.

Some examples of image classification include:

  • Labeling an x-ray as pneumonia or not (binary classification).
  • Classifying a handwritten digit (multi-class classification).
  • Assigning a name to a photograph of a face (multi-class classification).

Medical images (DICOM) are an unstructured data type, and Snowflake allows you to vectorize these images by leveraging image-reading libraries hosted in the Anaconda distribution, such as scikit-image and OpenCV, through Snowpark for Python. TensorFlow/Keras are supported libraries in Snowpark that can be used to train deep learning models by pushing the training workload down to Snowflake as a stored procedure running on the newly released Snowpark-optimized warehouse. After training is complete, we can save the model file in an internal stage.

Snowflake ML Architecture to train Image Classification model

We then deploy the Snowpark-trained TensorFlow model file as a User Defined Function (UDF) for inference, which can detect the probability of pneumonia in a new image.

We also used the Streamlit library, which works seamlessly to orchestrate the end-to-end flow, visualize the model metrics using any library hosted in Anaconda, and serve it all as a web app for a data science use case such as this one, where we detect pneumonia from a given image.

Now let's look at how we built this deep learning model, step by step.

Set Up the Snowflake Connection, Warehouse, Database, Schema, and External Stage to the AWS S3 Bucket

Set Up the Snowflake Database, Schema, Warehouse, and Stage

Database, Schema & Warehouse set up

-- Create database
Create or replace database IND_SOL_DICOM;
-- Create schema
Create or replace schema DICOM;
-- Snowpark-optimized warehouse for model training (assumed to already exist)
Use warehouse snowopt_wh;
-- Standard warehouse for all other workloads (assumed to already exist)
Use warehouse standard_wh;

External Stage

Create an AWS S3 bucket if AWS is your cloud provider (Azure Blob Storage and Google Cloud Storage are also supported). Gather the following information needed to create the external stage on top of the bucket:

  • S3 bucket URI
  • aws_key_id
  • aws_secret_key

Define external stage pointing to the S3 bucket

create or replace stage "DB"."Schema".<stage_name>
  url = '<s3_bucket_URI>'
  credentials = (aws_key_id = '<aws_key_id>'
                 aws_secret_key = '<aws_secret_key>')
  directory = (enable = true);

-- Validate that the stage was created
list @<stage_name>;

Image Vectorization in Snowflake

The scikit-image library is available within the secure Snowflake Anaconda repository. It can be imported and used to read images from a stage and vectorize their content into Snowflake tables. This step brings the advantage of having the images in a format that allows faster iteration and processing. Because the vectorization function can be registered as a UDF, image processing scales easily with Snowflake and integrates into the data pipeline.

Read images using scikit-image

Using the scikit-image read shown above, we can wrap it in a User Defined Function (UDF) and call that UDF from SQL within Snowflake.
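Here is a minimal sketch of what such a UDF could look like; the UDF name, stage locations, and 150x150 target size are placeholders, and the actual implementation shown in the screenshots may differ. The caller is expected to pass a scoped file URL so the image can be read with dynamic file access.

from snowflake.snowpark import Session
from snowflake.snowpark.functions import udf
from snowflake.snowpark.files import SnowflakeFile
from snowflake.snowpark.types import ArrayType, FloatType, StringType


def register_vectorizer_udf(session: Session) -> None:
    @udf(name="image_vectorizer", is_permanent=True, stage_location="@models",
         packages=["snowflake-snowpark-python", "scikit-image", "numpy"],
         return_type=ArrayType(FloatType()), input_types=[StringType()],
         replace=True, session=session)
    def image_vectorizer(scoped_file_url: str) -> list:
        import io

        import numpy as np
        from skimage import color, io as skio, transform

        # Dynamic file access: open the staged image through its scoped URL
        with SnowflakeFile.open(scoped_file_url, "rb") as f:
            img = skio.imread(io.BytesIO(f.read()))

        # Collapse RGB to grayscale and resize to a fixed 150x150 matrix
        if img.ndim == 3:
            img = color.rgb2gray(img)
        img = transform.resize(img, (150, 150))

        # Return the pixels as a flat list that Snowflake stores as an ARRAY
        return img.flatten().tolist()

Once registered, the UDF can be called from SQL over the stage's directory table (for example, passing build_scoped_file_url(@dicom_images, relative_path)) to produce the vectors described below.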

When executed, this parses all the image files, vectorizes them, and can store each one as a 150x150 (or any size) matrix in a Snowflake table.

Remember, storing the vectors in a table is optional. If you are not interested in inspecting the vectors after parsing the images, you can skip the table; the vectors can be passed directly as input to the training pipeline within the stored procedure.

When you query the table where the data is stored, it looks something like this:

Vectorized images stored in a table

When you show the images next to their actual vectors in a Streamlit app, it looks like this. This was done simply by wrapping the URL generated by GET_PRESIGNED_URL, pointing to the file's relative path in the Snowflake stage, to display the images in the Streamlit app.

DICOM image and vector side by side
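A rough sketch of that Streamlit display, assuming a stage named @dicom_images and a table image_vectors(relative_path, image_vector) holding the output of the vectorization UDF:

import json

import numpy as np
import streamlit as st
from snowflake.snowpark import Session


def show_image_and_vector(session: Session, relative_path: str) -> None:
    # Presigned URL for the staged image file
    url = session.sql(
        f"select get_presigned_url(@dicom_images, '{relative_path}') as url"
    ).collect()[0]["URL"]

    # Stored pixel vector for the same file (ARRAY columns come back as JSON)
    raw_vector = session.sql(
        f"select image_vector from image_vectors where relative_path = '{relative_path}'"
    ).collect()[0]["IMAGE_VECTOR"]
    vector = np.array(json.loads(raw_vector)).reshape(150, 150)

    # Show the image and its vector side by side
    left, right = st.columns(2)
    left.image(url, caption=relative_path)
    right.write(vector)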

Model Definition

TensorFlow is an open-source, end-to-end platform and library for a wide range of machine learning tasks, while Keras is a high-level neural network library that runs on top of TensorFlow. Both provide high-level APIs for easily building and training models, and Keras is a bit more user-friendly because it is built in Python. More information on TensorFlow Keras can be found here.

Both libraries are included within Snowflake Anaconda repository and can be used for training models and running inference within Snowflake.

Snowflake integration with Anaconda-hosted deep learning libraries such as TensorFlow & Keras

First things first, let's import all the required TensorFlow libraries.
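A representative set of imports (not necessarily the exact list used here):

import numpy as np
import tensorflow as tf
from tensorflow.keras.callbacks import ReduceLROnPlateau
from tensorflow.keras.layers import (BatchNormalization, Conv2D, Dense,
                                     Dropout, Flatten, MaxPooling2D)
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.image import ImageDataGenerator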

Along with this, you need to import the Snowpark libraries and set up the connection to the database, schema, and warehouse created above. If you are new to Snowpark, use the Snowpark for data science playbook to set up Snowpark and connect to your Snowflake database.
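For reference, a minimal Snowpark session setup against the objects created earlier looks roughly like this (connection values are placeholders):

from snowflake.snowpark import Session

connection_parameters = {
    "account": "<account_identifier>",
    "user": "<user>",
    "password": "<password>",
    "role": "<role>",
    "warehouse": "STANDARD_WH",   # switch to SNOWOPT_WH for the training step
    "database": "IND_SOL_DICOM",
    "schema": "DICOM",
}
session = Session.builder.configs(connection_parameters).create()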

Now let's get into the model pipeline steps and definitions.

Using the TensorFlow and Keras libraries with stored procedures and UDFs, training can be executed entirely within Snowflake. At a high level, we perform the following steps to train the model:

  1. Read Image Array
  2. Split images between Train (75%) and Test (25%)
  3. Restructure Image Array
  4. Reshape Image Array
  5. Data Augmentation: to standardize all images into one common format, data augmentation should be applied to the images before training. The ImageDataGenerator class from the Keras preprocessing library can be used for this task; it generates batches of tensor image data with augmentation applied, and each of its parameters is applied across the train and test image data sets (a sketch follows below).
Data Augmentation
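The sketch below illustrates steps 1-5, assuming features and labels are NumPy arrays read back from the vectorized image table (labels: 0 = normal, 1 = pneumonia). The split ratio follows the list above; the augmentation parameters are representative values, and here augmentation is fitted on the training images only.

from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.image import ImageDataGenerator

IMG_SIZE = 150

# features, labels: NumPy arrays read back from the vectorized image table

# Steps 1-2: read the image arrays and split them 75% train / 25% test
x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=42)

# Steps 3-4: restructure and reshape into (samples, height, width, channels);
# pixel values are already scaled to [0, 1] by the vectorization step
x_train = x_train.reshape(-1, IMG_SIZE, IMG_SIZE, 1)
x_test = x_test.reshape(-1, IMG_SIZE, IMG_SIZE, 1)

# Step 5: data augmentation on the training images
datagen = ImageDataGenerator(
    rotation_range=30,
    zoom_range=0.2,
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True)
datagen.fit(x_train)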

6. Model Definition

TensorFlow Model Definition

In this use case, classifying images with or without pneumonia, the expected output of the model is binary: positive or negative, yes or no. Therefore, binary_crossentropy is used as the log loss function and sigmoid as the activation function in the output layer. If a multi-class classification model is being trained, where there are more than two classes, then categorical_crossentropy should be used as the log loss function and softmax as the activation function in the output layer.
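An illustrative definition along those lines is shown below; the layer sizes and counts are assumptions, while the sigmoid output and binary_crossentropy loss follow the discussion above.

from tensorflow.keras.layers import (BatchNormalization, Conv2D, Dense,
                                     Dropout, Flatten, MaxPooling2D)
from tensorflow.keras.models import Sequential

IMG_SIZE = 150

model = Sequential([
    Conv2D(32, (3, 3), activation="relu", input_shape=(IMG_SIZE, IMG_SIZE, 1)),
    BatchNormalization(),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation="relu"),
    BatchNormalization(),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation="relu"),
    Dropout(0.3),
    # Single sigmoid unit: probability of pneumonia (binary classification)
    Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])

model.summary()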

When the model is defined, it returns a TensorFlow model blueprint (summary) like the one below.

7. Model Fitting with Learning Rate reduction as Early stopping criteria

Model Fit and Early stopping on Learning rate reduction

The Keras ReduceLROnPlateau callback reduces the learning rate once the monitored metric stops improving for the configured patience, which we use as our early-stopping-style criterion.

Then fit the model on the training dataset with the number of epochs, the validation data, and the callbacks that were defined.

The number of steps per epoch is derived from the size of the training set and the batch size. If you are looking to hardcode a value instead, feel free to do so.
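Continuing the sketches above, the fit call looks roughly like this; the ReduceLROnPlateau settings, batch size, and epoch count are illustrative values.

from tensorflow.keras.callbacks import ReduceLROnPlateau

lr_reduction = ReduceLROnPlateau(
    monitor="val_accuracy",   # metric being watched
    patience=2,               # epochs without improvement before reducing the LR
    factor=0.3,               # multiply the learning rate by this factor
    min_lr=1e-6,
    verbose=1)

batch_size = 32
history = model.fit(
    datagen.flow(x_train, y_train, batch_size=batch_size),
    steps_per_epoch=len(x_train) // batch_size,   # train size / batch size
    epochs=200,
    validation_data=(x_test, y_test),
    callbacks=[lr_reduction])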

All of the above pipeline code is wrapped within a stored procedure and is registered in Snowflake. Check Snowpark for data science playbook to learn how to register a Stored procedure in Snowpark.

Training the Model as a Stored Procedure and Saving the Model File

Training the model in a stored proc and saving the model file

After registering the stored procedure for training, it can be executed from any API or with Snowflake Tasks as part of any data pipeline, since you treat it just like any other database object. This is where Snowflake is a huge differentiator compared to other data platforms.

Calling the Stored Procedure with Model Parameters
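For illustration, a stripped-down sketch of registering and calling such a training stored procedure could look like this; the procedure name, signature, and stage are assumptions, and the real procedure body wraps the full pipeline shown above.

from snowflake.snowpark import Session


def train_pneumonia_model(session: Session, epochs: int, model_name: str) -> str:
    # ... read the vectors, prepare the data, build and fit the model (see the
    # sketches above), then save the trained model file to an internal stage:
    # model.save("/tmp/" + model_name)
    # session.file.put("/tmp/" + model_name, "@models", auto_compress=False)
    return f"Trained {model_name} for {epochs} epochs"


session.sproc.register(
    func=train_pneumonia_model,
    name="train_pneumonia_model",
    packages=["snowflake-snowpark-python", "tensorflow", "scikit-learn"],
    is_permanent=True,
    stage_location="@models",
    replace=True)

# Run training on the Snowpark-optimized warehouse
session.use_warehouse("SNOWOPT_WH")
session.call("train_pneumonia_model", 200, "pneumonia_cnn.h5")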

If you look at the model definition, we can also capture how the model performed during training by writing the metrics to Snowflake table(s). This way, you can always go back and check or validate the model training. Further down, we use the saved model metrics (confusion matrix, F1 score, recall, precision), the model loss, and the validation accuracy per epoch in the Streamlit app.
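As a rough illustration (the table names MODEL_TRAINING_HISTORY and MODEL_TRAINING_METRICS are assumptions), the per-epoch history and hold-out metrics from the sketches above could be written to Snowflake from inside the stored procedure like this:

import pandas as pd
from sklearn.metrics import (confusion_matrix, f1_score, precision_score,
                             recall_score)

# Per-epoch history: loss, accuracy, val_loss, val_accuracy
history_df = pd.DataFrame(history.history)
history_df["EPOCH"] = history_df.index + 1
session.create_dataframe(history_df).write.save_as_table(
    "MODEL_TRAINING_HISTORY", mode="append")

# Hold-out metrics: confusion matrix, precision, recall, F1 score
y_pred = (model.predict(x_test) > 0.5).astype(int).ravel()
metrics_df = pd.DataFrame([{
    "F1_SCORE": f1_score(y_test, y_pred),
    "PRECISION": precision_score(y_test, y_pred),
    "RECALL": recall_score(y_test, y_pred),
    "CONFUSION_MATRIX": str(confusion_matrix(y_test, y_pred).tolist()),
}])
session.create_dataframe(metrics_df).write.save_as_table(
    "MODEL_TRAINING_METRICS", mode="append")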

Finally, training produces a model file that can be saved to an internal stage and deployed as a UDF for inference. Based on the accuracy of the model, different UDFs can be used for testing and inference.

Model Performance on training

As one example, we trained on 2,000 images with two different epoch counts, 100 and 200, and found better validation accuracy and F1 score on Iteration 1.

False positives (detecting pneumonia in normal cases) and false negatives (detecting normal in pneumonia cases) were lower for the model trained for 200 epochs than for the one trained for 100 epochs.

So we productionized the Iteration #1 model. In general, we observed that training for a higher number of epochs gave the deep learning model a better learning curve.

Model performance (accuracy, speed) when training on Snowpark

Visualizing the metrics on a completely different iteration, we see 87.36% validation accuracy. The history was captured in Snowflake table(s) and used as the source for the Streamlit app, where the visualizations were built with matplotlib and seaborn.

Model performance and Metrics

The confusion matrix, precision, and recall suggest the model performs really well when trained for 200 epochs. Detecting normal cases is somewhat better than detecting pneumonia, as the recall, which reflects the true positive rate, is higher for the Normal class. In this use case, I would imagine we must measure model performance based on the F1 score, since both false positives and false negatives are important concerns to address. I am NOT a doctor or a medical practitioner, so I will leave that decision to the discretion of qualified medical professionals.

Deploy the Model as a UDF in Snowflake for Inference

Once the model file is generated, we can deploy a UDF that uses the model file to detect pneumonia in new images. If you want to learn how to deploy a model saved in an internal stage as a UDF, refer to this Snowpark for data science playbook.

Deploy model and register as UDF
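A minimal sketch of that deployment follows; the stage path @models, the file name pneumonia_cnn.h5, and the UDF name predict_pneumonia are placeholders, and in practice the loaded model would be cached (for example with cachetools) instead of being reloaded on every call.

import sys

import numpy as np
from snowflake.snowpark import Session
from snowflake.snowpark.types import ArrayType, FloatType


def register_inference_udf(session: Session) -> None:
    # Ship the saved model file from the internal stage alongside the UDF
    session.add_import("@models/pneumonia_cnn.h5")

    def predict_pneumonia(image_vector: list) -> float:
        import os

        from tensorflow.keras.models import load_model

        # Locate the imported model file inside the UDF sandbox
        import_dir = sys._xoptions.get("snowflake_import_directory")
        model = load_model(os.path.join(import_dir, "pneumonia_cnn.h5"))

        # Reshape the stored pixel vector and return the pneumonia probability
        x = np.array(image_vector, dtype="float32").reshape(1, 150, 150, 1)
        return float(model.predict(x)[0][0])

    session.udf.register(
        func=predict_pneumonia,
        name="predict_pneumonia",
        return_type=FloatType(),
        input_types=[ArrayType(FloatType())],
        packages=["snowflake-snowpark-python", "tensorflow", "numpy"],
        is_permanent=True,
        stage_location="@models",
        replace=True)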

As new images become available, the UDF can predict the probability of pneumonia versus normal. This can be included in your data pipeline, surfaced in any app (like Streamlit), or exposed as an API.

Now comes the juicy part: here you can see the model predicting the probability of pneumonia or normal. It is up to the users to define the threshold on that probability (for example, ≤50% Normal, >50% Pneumonia) to confirm pneumonia or not.

Detect Pneumonia probability
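For example, the threshold could be applied from Snowpark roughly like this (the table and column names are assumptions):

from snowflake.snowpark.functions import call_udf, col, when

predictions = (
    session.table("IMAGE_VECTORS")
    .with_column("PNEUMONIA_PROB", call_udf("predict_pneumonia", col("IMAGE_VECTOR")))
    .with_column("PREDICTED_LABEL",
                 when(col("PNEUMONIA_PROB") > 0.5, "PNEUMONIA").otherwise("NORMAL"))
)
predictions.show()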

Similarly, this use case can be expanded further, for example to detect skin cancer or to detect cataracts from retinal images.

It can also be extended to any industry: classify images that have already been labeled, and leverage machine learning to predict labels on unseen images.

What we prove out through this use case is that Snowflake is NOT just a data warehouse platform that processes only structured or semi-structured data. It expands beyond data warehousing: it can read and vectorize unstructured data such as image files, and it can train neural network models using the power of Snowflake compute at scale, while at the same time simplifying model deployment as native SQL. All of this is possible because of the great vision behind this platform and the ever-expanding support from our customers and partners. Thrilled and excited to be part of such a great company.

I am happy to have collaborated and worked with our champion team: Venkat Sekar, @kesav Rayaprolu, and @Murali Gandhirajan.

List of Snowflake features used in building the use case

  1. External or Internal Stages to store image files
  2. Dynamic File Access [PrPr] to read unstructured data
  3. Snowpark for Python [GA] for data engineering
  4. Snowpark with Anaconda [GA] for machine learning
  5. Snowpark-optimized warehouse [GA] for scalability
  6. Streamlit for the UI


Karuna Nadadur
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

Karuna is a Sr. Data Cloud Architect at Snowflake with rich experience in data analytics and data science, helping you make insightful decisions for your business.