ThirdAI’s Foundation Models with Bounded Latency For Ultra-High-Speed Production Environments

Anshu · ThirdAI Blog · Apr 23, 2024 · 3 min read

In today’s rapidly evolving distributed production environments, real-time decision-making increasingly relies on foundational NLP models for accurate predictions. These environments demand ultra-high speed and stringent latency constraints, often requiring response times of less than 10 milliseconds on a single CPU core, even with wildly fluctuating input sizes. Additionally, AI pipelines must often operate in heterogeneous distributed environments where advanced computing resources like GPUs may not be available.

Such requirements are frequently observed in e-commerce search, cloud security, and almost all industries requiring edge AI, such as IoT (Internet of Things).

Figure: End-to-end classification latency (milliseconds) on a single CPU core for BERT, DistilBERT, and ThirdAI. The x-axis is the number of tokens, which in production can vary from a couple to several thousand. ThirdAI's latency stays below 10ms throughout. All models are tuned to the same accuracy.

ThirdAI is addressing the challenge head-on. We’re excited to showcase our specialized, purpose-built pre-trained foundational models. These models are designed to deliver accurate predictions with an end-to-end latency of less than 10 milliseconds on a single CPU core. Remarkably, even when processing up to 5000 tokens, our models maintain a latency under 10ms. The performance of these models during inference and fine-tuning remains consistent, regardless of processor heterogeneity.

For developers looking to harness the power of our technology, we provide a simple script to fine-tune ThirdAI's foundational model on any multi-lingual supervised classification dataset. This allows for ultra-fast prediction with accuracy levels comparable to those of well-known foundational models. For instance, on the intent classification task from the Curekart benchmark, the script achieves a classification accuracy of 83.87%, matching a fine-tuned BERT model on the same dataset.
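For illustration, here is a minimal fine-tuning sketch in the style of ThirdAI's public demos. The file names, column names, and class count are placeholders, and exact arguments may differ across thirdai versions; see the linked notebook at the end of this post for the exact workflow.

# A minimal sketch based on ThirdAI's public demos; placeholder file
# and column names, and exact arguments may vary by thirdai version.
from thirdai import bolt, licensing

licensing.activate("YOUR-LICENSE-KEY")  # placeholder key

# A text classification model: one text input column, one categorical
# target column.
model = bolt.UniversalDeepTransformer(
    data_types={
        "text": bolt.types.text(),
        "label": bolt.types.categorical(),
    },
    target="label",
    n_target_classes=28,  # placeholder: set to your dataset's class count
)

# Fine-tune on a supervised CSV with a "text,label" header row.
model.train("train.csv", epochs=5, learning_rate=0.001)

# Held-out accuracy, then single-sample inference.
model.evaluate("test.csv")
scores = model.predict({"text": "where is my order?"})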

Why won’t models from Hugging Face meet latency requirements?

Our analysis (the figure above) shows that even at just 512 tokens (its context limit), the latency of multilingual BERT, a widely used model for NLP classification, reaches 700ms. In stark contrast, ThirdAI's model consistently stays below 5ms, even at 5000 tokens: 10x the input size at 140x lower latency than BERT.

Even attempts to reduce latency with distilled models like DistilBERT have proven insufficient. DistilBERT only cuts latency to about 350ms at 512 tokens on a single core, which is still far too slow for many real-time applications. A Wayfair case study validates the need for much lower-latency models.
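Readers who want to reproduce the Hugging Face side of this comparison can use a rough single-core benchmark like the one below. The model choice, repetition count, and token length are illustrative, and absolute numbers will vary by hardware (the classification head here is randomly initialized, which does not affect latency).

import time
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

torch.set_num_threads(1)  # pin PyTorch to a single CPU core

name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)
model.eval()

text = "word " * 600  # long enough to hit the 512-token context limit

with torch.no_grad():
    # Warm-up run so one-time setup costs are excluded from the timing.
    model(**tokenizer(text, truncation=True, max_length=512, return_tensors="pt"))
    runs = 10
    start = time.perf_counter()
    for _ in range(runs):
        # End-to-end: tokenization plus the forward pass.
        inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
        model(**inputs)
    latency_ms = (time.perf_counter() - start) / runs * 1000

print(f"mean end-to-end latency at 512 tokens: {latency_ms:.1f} ms")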

The ThirdAI Difference: Why Do We Need a New Stack?

Clearly, the latency of current foundational models misses production requirements by more than 100x. Existing approaches like pruning, distillation, and quantization are post-processing techniques applied to an existing model, so they are limited in how drastically they can improve performance; the best case is typically around a 5x speedup. Moreover, since models are expected to be fine-tuned regularly, performance can fluctuate significantly whenever the weights of the foundational model are altered.

Foundation models built using the ThirdAI software stack follow a completely different design philosophy. They are built from the ground up, using dynamic sparsity, to be trained and served within any given latency budget on a single CPU core. Inference latency is bounded by design. Even during training and fine-tuning, gradient descent adjusts the weights so that the dynamically (contextually) sparse model achieves optimal accuracy. The complete model lifecycle of pre-training, fine-tuning, and deployment requires only a few commodity CPU cores, which significantly simplifies the pipeline from both the software and hardware perspectives.
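To build intuition for dynamic sparsity, here is a toy sketch in the spirit of LSH-based methods such as SLIDE, the research line behind ThirdAI. This is not ThirdAI's actual implementation; a production system uses learned hash tables and fused sparse kernels.

# Toy illustration of dynamic (contextual) sparsity. For each input, a
# cheap hash lookup picks a small subset of neurons, and only those are
# computed. Not ThirdAI's actual implementation.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, k = 256, 10_000, 64  # compute only k of 10,000 neurons

W = rng.standard_normal((d_out, d_in)).astype(np.float32)
b = np.zeros(d_out, dtype=np.float32)

# Random signed projections act as a cheap locality-sensitive hash:
# neurons whose weight vectors align with the input tend to score high.
P = rng.standard_normal((32, d_in)).astype(np.float32)
neuron_sketch = np.sign(W @ P.T)  # precomputed once per neuron

def sparse_forward(x: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Activate only the k neurons whose hash sketch best matches x."""
    x_sketch = np.sign(P @ x)
    # Hamming-style similarity between input sketch and neuron sketches.
    scores = neuron_sketch @ x_sketch
    active = np.argpartition(scores, -k)[-k:]
    # Dense cost is d_out * d_in multiplies; here we pay d_out * 32 for
    # hashing plus only k * d_in for the active neurons.
    return active, np.maximum(W[active] @ x + b[active], 0.0)

active, out = sparse_forward(rng.standard_normal(d_in).astype(np.float32))
print(f"computed {len(active)} of {d_out} neurons")

The point is that the per-input compute scales with k rather than the full layer width, which is how inference latency can stay bounded even as the model grows.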

Bottom line

Through our innovations, ThirdAI is setting new standards for latency and efficiency in AI models, ensuring that real-time decision-making in AI-driven environments is both fast and reliable.

Link to the Notebook: Demos/universal_deep_transformer/text classification/pretrained_models/main.ipynb at main · ThirdAILabs/Demos (github.com)


Anshu

Professor of Computer Science specializing in Deep Learning at Scale and Information Retrieval. Founder of ThirdAI. More: https://www.cs.rice.edu/~as143/