How Do You Know that Your NLP Model Is Production Ready?

Let us start with a simple quiz. What do these tech news headlines have in common?

  1. Microsoft’s Twitter Chatbot Tay gets released
  2. Berkshire Hathaway stock price rises whenever Anne Hathaway is in the news
  3. How IBM Watson Overpromised and Underdelivered on AI Health Care

These NLP models were introduced with much fanfare, but behaved badly in real-world deployment. These were NOT student projects. They were big, ambitious projects developed at AI giants, and each had reportedly achieved human-level performance in in-house testing. So what caused them to fail in the real world? And how do NLP model builders decide when a model is ready for deployment?

What are some of the fallacious assumptions made by NLP model builders?

Well, if the assumptions that model builders typically make were enough to ensure success, why do so many NLP models fail in real-world deployment when they encounter data quite different from their historical test data? Data is your main challenge in real-world deployment, not models or algorithms! If you don't want to take my word for it, listen to what Andrej Karpathy says!


NLP models are built on ‘representative’ historical data, while real-world data keeps changing. Data quality is often taken for granted, without adequate validation. Test oracles are unavailable, or expensive and hard to build. And performance summaries (accuracy, precision, F1-score) on specific test datasets give no assurance of how the model will generalize to real-world data.
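
To make that last point concrete, here is a minimal sketch of my own (toy, made-up data and a generic scikit-learn pipeline, not anyone's production model): the same model and the same metric can give very different answers depending on whether you score it on a held-out slice of the historical data or on the kind of messy input real users actually send.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Hypothetical "historical" data: clean, well-formed English reviews.
train_texts = ["great movie", "loved the acting", "wonderful film", "fantastic direction",
               "terrible plot", "boring and slow", "awful script", "worst movie ever"]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = ["great acting", "wonderful plot", "boring film", "awful movie"]
test_labels = [1, 1, 0, 0]

# The kind of input real users actually send: slang, typos, emoji, abbreviations.
live_texts = ["gr8 movie!", "luv this movee", "I ❤️ this film", "total snooze fest"]
live_labels = [1, 1, 1, 0]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

# Two F1 scores for the same model; any gap between them is the generalization
# problem described above, hidden behind a single test-set number.
print("F1 on historical test set  :", f1_score(test_labels, model.predict(test_texts)))
print("F1 on real-world-style data:", f1_score(live_labels, model.predict(live_texts)))
```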

Some of these issues stem from the inherent variation and ambiguity of human language itself.

A single word can mean a hundred different things; a hundred different words can convey the same meaning.

A trivial sentence expressing positive sentiment about a movie can be written in so many different ways (a quick probe after this list makes the point concrete):

  1. I ❤️ this movie
  2. I love this flick
  3. I love this படம் (படம் is Tamil for “movie”)
  4. Movie Aacha Hai! (Hindi: “the movie is good”)
  5. IMO, Gr8 movie!
  6. If you wanted to burn your money, go!
  7. Luv this movee
  8. Arnie Killed it!
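
A quick, hypothetical probe (the vocabulary below is just a stand-in for whatever a real training corpus would produce) shows how little of these utterances a model trained on clean English text even recognizes:

```python
# Hypothetical probe: how much of each real-world-style utterance falls inside
# the vocabulary a clean-English training set would have produced?
known_vocab = {"i", "love", "this", "movie", "great", "good", "film",
               "bad", "boring", "money", "go", "killed", "it"}

variants = [
    "I ❤️ this movie",
    "I love this flick",
    "I love this படம்",
    "Movie Aacha Hai!",
    "IMO, Gr8 movie!",
    "If you wanted to burn your money, go!",
    "Luv this movee",
    "Arnie Killed it!",
]

for text in variants:
    tokens = [t.strip(",.!?").lower() for t in text.split()]
    coverage = sum(t in known_vocab for t in tokens) / len(tokens)
    print(f"{coverage:4.0%} of tokens in vocabulary | {text}")

# The low-coverage rows are exactly the utterances a clean English test set
# never exercises, so its accuracy numbers say nothing about them.
```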

Most current approaches for assessing NLP model readiness are ad hoc. An NLP model typically gets evaluated on a few specific test datasets, and its real-world readiness is judged from overall metrics such as accuracy or F1-score on that handful of datasets. Real-world utterances, however, are extremely diverse: different styles and tones, multilingual, code-mixed, transliterated, misspelled, colloquial, or slang/abbreviated, depending on the user’s educational, demographic, and socio-cultural background.

Given that test datasets cannot cover such diverse real-world utterances, NLP model builders also employ manual beta testing to cover input scenarios missing from the datasets used for performance validation. Even so, this evaluation approach is not adequate; it often leads to failures in real-world deployment and an iterative cycle of fail → fix → redeploy.
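
One way to make that step less ad hoc (sketched here purely as an illustration, not as anyone's standard recipe) is to automatically generate perturbed variants of the utterances you already have, such as slang substitutions, casing changes, and typos, and measure how often the model's prediction survives them. The `predict` callable and the perturbation rules below are placeholders you would swap for your own:

```python
import random

# Illustrative perturbations only; a real robustness suite would also cover
# code-mixing, transliteration, emoji, colloquialisms, and more.
SLANG = {"love": "luv", "great": "gr8", "movie": "movee", "you": "u"}

def swap_slang(text):
    return " ".join(SLANG.get(word.lower(), word) for word in text.split())

def drop_random_char(text, rng):
    if len(text) < 2:
        return text
    i = rng.randrange(len(text))
    return text[:i] + text[i + 1:]

def robustness_check(predict, texts, seed=0):
    """Fraction of utterances whose label is unchanged under simple perturbations.

    `predict` is whatever callable wraps your model: text -> label.
    """
    rng = random.Random(seed)
    stable = 0
    for text in texts:
        original = predict(text)
        perturbed = [swap_slang(text), text.upper(), drop_random_char(text, rng)]
        stable += all(predict(p) == original for p in perturbed)
    return stable / len(texts)

# Usage sketch with a throwaway stand-in "model"; plug in your real predictor.
if __name__ == "__main__":
    dummy_predict = lambda t: int("love" in t.lower() or "great" in t.lower())
    print(robustness_check(dummy_predict, ["I love this movie", "You will regret watching"]))
```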

If accuracy on test datasets and manual beta testing are NOT sufficient to guarantee readiness for real-world deployment, how can we assess models for production readiness?

Let us continue this discussion in the next post. Meanwhile, I would be very interested to hear your point of view on how you assess your NLP models for production readiness. Do keep the comments coming!

If you are interested in reading some papers on this topic:

  1. https://thegradient.pub/frontiers-of-generalization-in-natural-language-processing/
  2. https://thegradient.pub/nlps-clever-hans-moment-has-arrived/
  3. https://www.aclweb.org/anthology/P19-1459
  4. https://arxiv.org/abs/1908.07898
  5. https://vered1986.github.io/papers/breaking-nli-acl.pdf
