Learnings from Scaling TensorFlow to 300 million predictions per second

You read that right. 300 million predictions. Every second.

Devansh
Geek Culture
6 min read · Oct 11, 2021


Join 31K+ AI People keeping in touch with the most important ideas in Machine Learning through my free newsletter over here

Machine learning is changing a lot of fields, and advertising is one of the big ones. While companies like Google and Facebook are infamous for using big data to target personalized ads, there are many other players in this space. That should come as no surprise, since online advertising is estimated to be a hundred-billion-dollar industry.

We’re looking at some big numbers here

From a technical standpoint, this industry is an interesting blend of two fields: networking and machine learning. That blend presents a challenging set of requirements that have to be handled: high accuracy, constantly updating models, and very low latency. This makes it hard to apply the traditional approaches/models. The authors of the paper “Scaling TensorFlow to 300 million predictions per second” detail the challenges they ran into and how they tackled them, sharing their learnings from working at Zemanta.

Good background knowledge to help you understand the setting of this paper/work

Above is a passage from the authors. It explains what Zemanta is, how the service operates, and how the ad space is sold. The last bit detailing the use of machine learning in maximizing their KPIs is pretty interesting. Maybe a reader of this article will go on to work in this field (make sure to remember me lol).

Some context to the design choices made.

In this article, I will share the learnings that allowed the authors/team at Zemanta to serve 300 million predictions every second using the TensorFlow framework. As always, an annotated version of the paper will be at the end (the arXiv link is already shared). Make sure to read through it yourself. Which of these learnings did you find most interesting? Let me know in the comments. Feel free to reach out to me through social media if you want to discuss the paper in greater detail.

Learning 1: Simple Models are (still) King

This is something that people working in machine learning know very well. Going over AI news, you would be forgiven for thinking that machine learning equates to large models with complex pipelines. It’s no surprise that most beginners conflate machine learning with deep learning. They see the news about giant models like GPT-3 or ResNet, and they assume that to build great models, you need to know how to build these huge networks that take days to train.

Deep Learning has gained a lot of interest in recent years.

This paper presents the reality. As anybody who has worked in Machine Learning can attest, the following meme is accurate:

Simple models are easier to train, can be tested more quickly, don’t require as many resources, and generally don’t fall far behind in performance. Applying large models at scale would drive server/running costs up considerably. The authors reflect a similar sentiment in the following quote from the paper:

We additionally do not utilize GPUs for inference in production. At our scale, outfitting each machine with one or more top-class GPUs would be prohibitively expensive, and on the other hand, having only a small cluster of GPU machines would force us to transition to a service-based architecture. Given that neither option is particularly preferable and that our models are relatively small compared to state-of-the-art models in other areas of deep learning (such as computer vision or natural language processing), we consider our approach much more economical. Our use case is also not a good fit for GPU workloads due to our models using sparse weights.

Many companies don’t have racks of GPUs lying around that they can just use for training and inference. And for most of them, those GPUs are unnecessary. To paraphrase the authors: relatively small models are much more economical.

Learning 2: Don’t Overlook Lazy Optimizers

Sparse matrices are matrices where the values are largely 0. They are used to represent systems where there is limited interaction between pairs of components. For example, imagine an acquaintance matrix where the rows and columns correspond to people on the planet. The value at a particular index is 1 if the two people know each other and 0 if they don’t. This is a sparse matrix, since most people don’t know most other people in the world.
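As a toy illustration (my own, not from the paper), here is a minimal sketch of that acquaintance matrix using scipy.sparse: only the nonzero entries are stored, which is the whole point of a sparse representation.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy "acquaintance" matrix for 5 people: entry (i, j) is 1 if person i
# knows person j, 0 otherwise. Most entries are 0, so the matrix is sparse.
rows = [0, 1, 2, 3]   # person i
cols = [1, 0, 3, 2]   # knows person j
vals = [1, 1, 1, 1]

A = csr_matrix((vals, (rows, cols)), shape=(5, 5))

print(A.toarray())                                   # dense view: mostly zeros
print(A.nnz, "nonzeros out of", A.shape[0] * A.shape[1], "entries")
# Only the 4 nonzero entries are stored, instead of all 25.
```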

The data Zemanta was working with was sparse, which they traced to the fact that most of their features are categorical. Using the Adam optimizer increased running costs by a lot (around 50% more than Adagrad), while Adagrad, though cheap, performed noticeably worse. Fortunately, there was an alternative that gave great performance without being very expensive: LazyAdam.
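For reference, LazyAdam ships in the TensorFlow Addons package and is a drop-in replacement for Adam. Here is a minimal sketch, assuming TensorFlow 2.x with tensorflow_addons installed; the tiny embedding model is my own illustration, not Zemanta’s actual architecture.

```python
import tensorflow as tf
import tensorflow_addons as tfa  # provides LazyAdam

# Illustrative model with an embedding layer: gradients w.r.t. the embedding
# table are sparse (only the rows seen in a batch get gradients), which is
# exactly the situation where a lazy optimizer pays off.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10_000, output_dim=16),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# LazyAdam only updates optimizer state for the embedding rows that actually
# received gradients, instead of touching the whole table every step.
model.compile(optimizer=tfa.optimizers.LazyAdam(learning_rate=1e-3),
              loss="binary_crossentropy")
```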

Lazy evaluation is a well-established practice in software engineering. Lazy loading is often used in GUI/interactive platforms like websites and games. It is only a matter of time before lazy optimizers become established in machine learning, so keep an eye out for when that happens. If you’re looking for avenues to research in machine learning, this might be an interesting option.

Learning 3: Bigger Batches → Lower Computation Costs

This was completely surprising to me. “By diving deeply into TF, we realized that the computation is far more efficient (per example) if we increase the number of examples in a compute batch. This low-linear growth is due to TF code being highly vectorized. TF also has some overhead for each compute call, which is then amortized over larger batches. Given this, we figured that in order to decrease the number of compute calls, we needed to join many requests into a single computation.”

This was a new one for me. Bigger compute batches lead to lower computational cost per example? The quoted explanation boils down to two things: TF’s code is highly vectorized, and each compute call carries a fixed overhead that gets amortized over more examples, so per-example cost grows sub-linearly with batch size. The scale of the gain is surprising too: they halved their computational costs. The full results from optimizing like this were:

This implementation is highly optimized and is able to decrease the number of compute calls by a factor of 5, halving the CPU usage of TF compute. In rare cases that a batcher thread does not get CPU time, those requests will time out. However, this happens on fewer than 0.01% of requests. We observed a slight increase in the average latency — by around 5 ms on average, which can be higher in peak traffic. We put SLAs and appropriate monitoring into place to ensure stable latencies. As we did not increase the percentage of timeouts substantially, this was highly beneficial and is still the core of our TF serving mechanisms.

The slightly increased latency makes sense. To read exactly what they did, check out section 3.2. It’s a whole lot of networking stuff, and I’m not an expert there, but the results speak for themselves.
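The paper’s batcher lives in their custom Go-based serving stack, so the following is only a rough Python sketch of the core idea (my own toy model, not their implementation): joining many pending requests into a single compute call instead of calling the model once per request.

```python
import numpy as np
import tensorflow as tf

# Stand-in model with 8 input features; purely illustrative.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])

def predict_unbatched(requests):
    # One compute call per request: the fixed per-call overhead dominates.
    return [model(np.asarray([r], dtype=np.float32)).numpy()[0] for r in requests]

def predict_batched(requests):
    # Join all pending requests into a single compute call, then split the
    # results back out. The per-call overhead is amortized over the batch.
    batch = np.asarray(requests, dtype=np.float32)
    return list(model(batch).numpy())

requests = [np.random.rand(8) for _ in range(256)]
# Same predictions either way; the batched path just uses far fewer calls.
assert np.allclose(predict_unbatched(requests), predict_batched(requests), atol=1e-5)
```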

Closing

This paper is an interesting read. It combines engineering, networking, and machine learning. Furthermore, it provides insight into the use of Machine Learning in smaller companies where huge models and 0.001% improvements in performance are not crucial.

You can read my fully annotated paper here (available for free download):

Reach out to me

If this article got you interested in reaching out to me, then this section is for you. You can reach out to me on any of the platforms, or check out any of my other content. If you’d like to discuss tutoring, text me on LinkedIn, IG, or Twitter. If you’d like to support my work, use my free Robinhood referral link. We both get a free stock, and there is no risk to you. So not using it is just losing free money.

Check out my other articles on Medium: https://rb.gy/zn1aiu

My YouTube: https://rb.gy/88iwdd

Reach out to me on LinkedIn. Let’s connect: https://rb.gy/m5ok2y

My Instagram: https://rb.gy/gmvuy9

My Twitter: https://twitter.com/Machine01776819

My Substack: https://codinginterviewsmadesimple.substack.com/

Get a free stock on Robinhood: https://join.robinhood.com/fnud75
