End-to-End Healthcare AI Pipelining: Part 2

The Cyft Supermarket

Lydia Skrabonja
Nov 7

This is the second part of a two-part series. If you want to know how we got to this point, read the first part.


The Supermarket

7. Supermarket

The supermarket, the inspiration for our pipeline, is a stack of all the Aisles and Events. Our first pass through the pipeline usually results in a one-aisle supermarket we’ve nicknamed “the farmer’s market”. Once you have a farmer’s market up and running, the other aisles can be added to it to form a fully fledged supermarket.

Once we have a supermarket, everything else can use it. Different models can be pulled from the same data in the same structure, and further analysis of those models’ results can be compared against data points from the supermarket, all without changing the supermarket at all. It also means that one person can continue building out the modeling work while another engineer adds more aisles. When those aisles have been added, they can be loaded into the models in the same way. Everything has a place; Marie Kondo would be proud.
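As a rough sketch of the idea (the table and column names here are hypothetical, not Cyft’s actual schema), stacking aisles into a supermarket can be pictured as concatenating long-format event tables that share one common shape:

```python
import pandas as pd

# Each "aisle" is a long-format event table with a shared schema:
# one row per (member, service_date, received_date, feature, value).
claims_aisle = pd.DataFrame({
    "member_id": [1, 1, 2],
    "service_date": ["2019-01-05", "2019-02-10", "2019-01-20"],
    "received_date": ["2019-01-20", "2019-02-25", "2019-02-01"],
    "feature": ["er_visit", "rx_fill", "er_visit"],
    "value": [1, 1, 1],
})
labs_aisle = pd.DataFrame({
    "member_id": [1, 2],
    "service_date": ["2019-03-01", "2019-03-05"],
    "received_date": ["2019-03-08", "2019-03-12"],
    "feature": ["a1c", "a1c"],
    "value": [7.2, 6.1],
})

# The "supermarket" is simply the stack of all aisles.
supermarket = pd.concat([claims_aisle, labs_aisle], ignore_index=True)
```

Because every aisle shares the same schema, adding a new aisle later is just another table in the stack; nothing downstream has to change.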


Shopping

8. Carts

9. Spark models

After the supermarket has been built, you can take a Cart and select the data you need. That way, if you want to model different cohorts, you can take out different carts. It also works if you want to have different models with different data inputs. The underlying supermarket doesn’t change.
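In spirit (again with illustrative column names, not the real schema), pulling a cart is just a read-only selection over the supermarket:

```python
import pandas as pd

# A toy "supermarket" with a cohort tag per row (illustrative only).
supermarket = pd.DataFrame({
    "member_id": [1, 1, 2, 3],
    "cohort": ["diabetes", "diabetes", "chf", "diabetes"],
    "feature": ["a1c", "er_visit", "bnp", "a1c"],
    "value": [7.2, 1, 450, 6.1],
})

# Each "cart" selects just the members and features one model needs;
# the underlying supermarket is never modified.
diabetes_cart = supermarket[supermarket["cohort"] == "diabetes"]
chf_cart = supermarket[supermarket["cohort"] == "chf"]
```

Two models with different cohorts or feature sets just take out two different carts from the same shelf.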

After selecting your feature ingredients, if the best next step is to build a model using Spark ML, that can be done right out of the cart. If those results are what’s going to be deployed, you can move directly into the Analysis section and skip all the deep-learning-specific sections. This structure doesn’t force you to use all the steps if they’re not useful to you. Otherwise, the evaluation metrics of the Spark ML-based model provide a great baseline for comparing against the deep-learning models. It’s possible to guess what kind of algorithm will produce the best results for a given dataset and outcome, but it’s impossible to know for sure. Having a baseline also helps us know whether it’s worth putting in the resources to build a finely tuned deep learning model.
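As a generic illustration of why a baseline matters (pure NumPy here, not the Spark ML code the pipeline actually uses), even a trivial majority-class predictor sets a floor that any fancier model has to beat before the extra effort is justified:

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)  # hypothetical binary outcome

# Baseline: always predict the majority class.
majority = int(y_true.mean() >= 0.5)
baseline_acc = (y_true == majority).mean()

# A deep learning model is only worth its training cost if it
# clearly beats this number on held-out data.
print(f"baseline accuracy: {baseline_acc:.3f}")
```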

Meal Prep

10. Meal prep (tensorize)

Deep learning models like their datasets formatted as multidimensional matrices called tensors, which is not how Spark likes them. To make the deep learning side happy, at this stage we tensorize the data we’ve pulled out of the carts. Because of the complexity of healthcare data and its various lags, the tensors generally come out in four dimensions: member, service date (the day the thing actually happened), received date (the day we learned of it), and feature. That way, we can use indexing and slicing to create train and test sets efficiently on the GPU. We keep service and received dates separate to account for the forms of lag we often see in healthcare data and to make sure we’re building models that mimic what will happen in the real world.
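A minimal sketch of that shape, in NumPy (the dimension sizes are made up, and a GPU framework would use the same indexing):

```python
import numpy as np

# Hypothetical dimensions: (member, service_date, received_date, feature).
n_members, n_service, n_received, n_features = 100, 30, 30, 12
tensor = np.zeros((n_members, n_service, n_received, n_features),
                  dtype=np.float32)

# Train/test split is just a slice along the member axis,
# so it's cheap and reproducible.
train = tensor[:80]
test = tensor[80:]

# Separating service and received dates lets you build an "as-of" view:
# only events *received* on or before day 20, regardless of when the
# underlying service actually happened.
as_of_view = tensor[:, :, :21, :]
```

That last slice is the point of the fourth dimension: the model trains on what was known at the time, not on data that hadn’t arrived yet.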


The Cool Stuff

11. Deep learning models

We’ve arrived! All the data is in a nice, clean, reproducible format, and we can finally get started on our fanciest models. These notebooks run on GPU instances that are spun up specifically for this purpose. They also have shared code repositories, so multiple people can work on the same model files at once through git. We put utility functions in other files to keep those notebooks from getting too long, and edit the notebooks in tandem. We try to keep our master model notebooks as lightweight as possible without being unnecessarily abstracted.

Slicing and Dicing

12. Analysis

While we’re building our models, we track a lot of performance metrics along the way. Even so, we’ve found there’s always interest in how the model performed on certain cohorts, or timelines, or a host of other dimensions that end up being used for customer discussions and presentations. To easily answer those questions, our modeling results can join back up with the supermarket, and the slicing and dicing follows a structure very similar to profiling. This is a really useful place to put all of that data munging, since the deep learning inputs hold embedded or encoded values that are not as intuitive to categorize by.
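The join-back step can be sketched like this (hypothetical tables and column names; the real supermarket has far more attributes to slice on):

```python
import pandas as pd

# Hypothetical model output and supermarket-derived member attributes.
predictions = pd.DataFrame({
    "member_id": [1, 2, 3, 4],
    "score": [0.91, 0.15, 0.67, 0.42],
    "actual": [1, 0, 1, 0],
})
attributes = pd.DataFrame({
    "member_id": [1, 2, 3, 4],
    "cohort": ["diabetes", "chf", "diabetes", "chf"],
})

# Join predictions back to the (unencoded) supermarket attributes,
# then slice performance by cohort.
joined = predictions.merge(attributes, on="member_id")
by_cohort = joined.groupby("cohort")[["score", "actual"]].mean()
```

Because the supermarket keeps human-readable attributes, the per-cohort breakdown uses names like “diabetes”, not whatever embedding the model saw.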


And Serve

13. Deployment

14. Delivery

15. Reporting

Once the final model has been set, the pipeline agreed upon, and the cadence established, it’s time to actually give people some results. In our case, that means some Deployment cleanup first. This can be calibrating the results of the model, translating some of the outputs, or blending two models together.
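One of those cleanup steps, sketched with made-up numbers (the blend weight and the 0–100 rescaling are illustrative choices, not Cyft’s actual deployment logic):

```python
import numpy as np

# Hypothetical scores from two models for the same three members.
scores_a = np.array([0.9, 0.2, 0.6])
scores_b = np.array([0.8, 0.4, 0.5])

# Blend the two models with a fixed weight (chosen on validation data),
# then rescale into a 0-100 "risk score" that's easier to communicate.
w = 0.7
blended = w * scores_a + (1 - w) * scores_b
risk_score = np.round(blended * 100).astype(int)
```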

Then, our output is generated in the Delivery section. In some cases, that’s a rank-ordered list, with columns agreed upon in advance to help slot the results into a workflow. Those extra columns also come from the supermarket, since all the information we need is already there in unencoded form. An attribute like name, which isn’t used by the model itself, is very useful for clinical teams to know.
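A rank-ordered delivery list, in miniature (names and columns here are invented; the real workflow columns are agreed upon with each customer):

```python
import pandas as pd

# Hypothetical final scores plus human-readable attributes
# pulled from the supermarket in unencoded form.
scored = pd.DataFrame({
    "member_id": [1, 2, 3],
    "name": ["A. Smith", "B. Jones", "C. Lee"],
    "risk_score": [42, 87, 65],
})

# Delivery, sketched: sort into a rank-ordered list with the
# columns the clinical workflow expects.
delivery = (
    scored.sort_values("risk_score", ascending=False)
    .reset_index(drop=True)
)
delivery["rank"] = delivery.index + 1
```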

Once that’s done, we also generate reports to let us, and everyone involved, see what is and isn’t working. These are usually PDF reports generated right after the lists themselves, drawing on any relevant data from the supermarket. It’s all in the same environment, using the same tools, so every metric can be traced all the way back to ingest if need be.

After all this setup and front-loaded work, our pipeline is now ready for continuous deployment. We can ingest data and run it through on whatever cadence is best for the project we’re working on. We usually even set up an automated trigger that runs the pipeline (aside from the analysis sections) when new data for that project hits our system.

In all, this is the pipeline that gets us from dataset (or database dump) to actual results, but it’s still a work in progress! Every project helps us get better and faster at what we do. Every improvement to our pipeline gives us more time to make better models. Healthcare data is hard, but the Cyft Supermarket makes it manageable.

Cyft

A.I. (actual intelligence) for better healthcare. www.cyft.com

Written by Lydia Skrabonja, Data Scientist at Cyft Inc