Should I Deploy Pre-Trained Models?

A project-side analysis of whether you should deploy a pre-trained model or wait until you have a custom dataset

Fernando Tadao Ito
birdie.ai
5 min read · Dec 8, 2020


By Emilie Morvant — Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=58812416

Whenever you embark on a new adventure in the land of ML, a common piece of equipment is a pre-trained model. For almost every task, there is a ready-made dataset that has spawned a model so sweet you can seemingly use it for everything.

It’s pretty common to build proofs of concept (or entire business models) on the foundation of pre-trained models. While that is a great way to start an exploratory dive into an unknown field, it is also a pretty big risk to carry one all the way to Production.

Why?

A pre-trained model is a collection of weights obtained by training a model to execute a task on a specific dataset. Word2Vec embeddings trained on Wikipedia, image embeddings trained on ImageNet, sentiment analysis models trained on Twitter data… Almost every ML task has something ready to download and deploy in a flash. Libraries such as spaCy, TensorFlow, Keras, and Transformers are filled with great, easily reusable pre-trained models.
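
To make the “download and deploy in a flash” point concrete, here is a minimal sketch using the Transformers pipeline API (the exact model it downloads by default is an implementation detail of the library and may change between versions):

```python
from transformers import pipeline

# Downloads a ready-made sentiment analysis model (weights + tokenizer)
# trained on someone else's data, usable in a couple of lines.
classifier = pipeline("sentiment-analysis")

print(classifier("The delivery was fast and the product works great!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```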

Most of those models come hand in hand with their metrics, more often than not showing that they are state-of-the-art. But those metrics only apply to that specific domain and task: a model that is great on its original benchmark may be merely adequate (or worse) on yours. The following article goes in-depth on these technical issues.
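
A quick way to check whether the advertised numbers hold up is to score the pre-trained model on a small sample labeled with your own data. A minimal sketch, where the example reviews and labels are hypothetical placeholders for your own domain:

```python
from transformers import pipeline

# Hypothetical hand-labeled sample from YOUR domain (e.g. product reviews).
samples = [
    ("Battery died after two days, very disappointed.", "NEGATIVE"),
    ("Does exactly what the listing promised.", "POSITIVE"),
    ("Meh. It works, I guess.", "NEGATIVE"),
    ("Great value for the price!", "POSITIVE"),
]

classifier = pipeline("sentiment-analysis")  # generic pre-trained model

# Count how many of our own examples the generic model gets right.
correct = sum(
    classifier(text)[0]["label"] == expected for text, expected in samples
)
print(f"Accuracy on our own sample: {correct / len(samples):.0%}")
```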

You need to understand that deploying a pre-trained model to execute a slightly different task may introduce classification errors that are not apparent in your first proof-of-concept tests and that can spill over into the final product. So the game is identifying how much those errors will affect you and how important that task is to you.

In other words: how much time do you want to spend on this task?

What’s the Impact of your PoC?

To decide whether you should deploy a pre-trained model straight to production, you need to know exactly how important the function it will serve is (impact), and how tolerant you are of errors in that specific task (error resilience).

Impact

There are tons of articles that address what “impact” means in Data Science projects, and most of those definitions are extensive and thorough. In this text, we adopt a shorter, simpler one: the impact of a project/model is measured by how much it shapes the product and how quickly it changes its face.

A task has high impact if its results are immediately noticeable in the final product, or if it is integral to an internal process: for instance, a recommendation model for an e-commerce site, or a named entity recognizer for a newspaper.

If the results of your task don’t add immediate value for your customers, it has low impact. Tools such as summarization models that help text annotators, or internal data validation models, may improve the efficiency of the overall product development process but are never seen by the public eye.

Error Resilience

You could compare error resilience to risk, since the latter is much more prominent in project management lingo. Here we use only perceived risk as our main metric: how noticeable is a classification error in the final product? A task is resilient to errors if it has a high tolerance for classification errors, be they false positives or false negatives.

  • A facial recognition model has low error resilience because you can’t afford to misidentify a person in a security application, just as stock brokering software cannot predict that a series of stocks will go up when in reality they are tanking.
  • Ancillary models such as annotation helper modules or feedback training (low-impact tasks) usually have high error resilience. Products that dilute the perception of errors also have high resilience: when you show a graph that aggregates thousands of classified data instances, errors won’t be noticed unless the model is really bad at its job.

What Should I Do?

So, let’s break down the circumstances of your project and how you should handle them:

  • Low impact: this task is sitting on the backlog and there’s no pressure on its development. If it has low error resilience, you can’t just slap a pre-trained model on it and expect good results. If it has high error resilience, you could simply deploy the model as-is… but if you have the time, why not do it the right way? Annotate a new dataset, spend some time on parameter tuning, and build a tailor-made model for the job.
  • High impact, low resilience: this is a groundbreaking feature that absolutely cannot tolerate false positives or negatives. The pre-trained weights you have can be deployed if and only if their metrics on your own task are satisfactory. Create a test dataset from your own data, measure the model against it, and tune decision thresholds to trade off false positives against false negatives (see the sketch after this list).
  • High impact, high resilience: just do it. If the errors that come from that external model aren’t affecting the product, slap that bad boy in your pipeline and move on! For instance: a simple word embedding model improving navigation in an e-commerce search bar, or sentiment analysis aggregation charts in social listening products. Fine-tuning the model can come later.
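
As a rough illustration of the “measure it, then tune the threshold” step from the second bullet, here is a minimal sketch. The scores and labels are hypothetical placeholders for your pre-trained model’s outputs on a test set labeled with your own data:

```python
import numpy as np

# Hypothetical positive-class scores from the pre-trained model on a small
# test set labeled with YOUR data, plus the corresponding true labels.
model_scores = np.array([0.91, 0.62, 0.35, 0.80, 0.15, 0.58, 0.97, 0.44])
true_labels  = np.array([1,    1,    0,    1,    0,    0,    1,    0   ])

def false_positive_rate(threshold):
    predictions = model_scores >= threshold
    fp = np.sum(predictions & (true_labels == 0))
    return fp / max(np.sum(true_labels == 0), 1)

def recall(threshold):
    predictions = model_scores >= threshold
    tp = np.sum(predictions & (true_labels == 1))
    return tp / max(np.sum(true_labels == 1), 1)

# Sweep thresholds and keep the lowest one whose false positive rate stays
# under the tolerance a low-resilience task can live with.
tolerance = 0.0
for threshold in np.arange(0.05, 1.0, 0.05):
    if false_positive_rate(threshold) <= tolerance:
        print(f"threshold={threshold:.2f} "
              f"recall={recall(threshold):.2f} "
              f"fpr={false_positive_rate(threshold):.2f}")
        break
```

The same sweep can be run in the other direction (capping false negatives) if missed positives are the costlier mistake for your product.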

Our Data Science Development team here at Birdie loves finding good pre-trained models, but we know that the true value of AI lies not in complex solutions trained on generic data: we have a veritable trove of texts and images from across the globe that we use to fine-tune all our models. Currently, our entire data enrichment pipeline is fine-tuned on reviews and product descriptions from e-commerce sites like Amazon and Best Buy, delivering sharp insights about user feedback to bridge the gap between big data and actionable items!
