Lessons Learned using GPUs in Search Ranking at Medium Scales

Andrew Yates @ Promoted
May 24, 2024

By: James Hill, ML Infra Engineer @ Promoted.ai

GPUs are the latest craze for AI and machine learning. We ported our L2 “second-stage” pClick and pConversion search and ads ranking models to GPUs. The conclusion: we now support GPU for inference. We can use it… we just usually don’t. It’s not economical except for our biggest customers (public company, public-facing search use case, >10M searches a day).
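
To make the “ported to GPU” part concrete, here is a minimal sketch of device-flexible inference in PyTorch. The PClickHead module, its feature dimension, and the insertion count are hypothetical placeholders rather than our production model; the point is only that the same scoring path can target a GPU when one is present and fall back to CPU otherwise.

```python
# Minimal sketch (not production code): a hypothetical pClick head scored on
# GPU when available, falling back to CPU otherwise.
import torch
import torch.nn as nn

class PClickHead(nn.Module):
    """Toy stand-in for a second-stage ranking head over precomputed features."""
    def __init__(self, num_features: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_features, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.net(x)).squeeze(-1)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = PClickHead().to(device).eval()

# One request: score every candidate insertion for a query in a single batch.
candidate_features = torch.randn(50, 256)  # e.g., 50 insertions
with torch.inference_mode():
    p_click = model(candidate_features.to(device)).cpu()
```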

Here’s what we learned.

We observed mediocre latency improvements (<5ms) and no cost improvements at our average insertion count for most customers. They simply don’t return enough items through complex enough models to benefit much. Only for our largest customers approaching CPU limits could we see modest infra cost improvements.
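
The insertion-count dependence is easy to see in a rough micro-benchmark like the sketch below, assuming a toy feed-forward head and made-up insertion counts; the numbers it prints are illustrative, not our measurements. At small batch sizes, kernel launch and host-to-device transfer overheads eat most of the GPU’s advantage.

```python
# Illustrative micro-benchmark sketch: model time per request as a function of
# insertion count (batch size), on CPU vs GPU. Not our production measurements.
import time
import torch
import torch.nn as nn

def make_model(num_features: int = 256) -> nn.Module:
    return nn.Sequential(
        nn.Linear(num_features, 128), nn.ReLU(), nn.Linear(128, 1)
    ).eval()

def time_request(model: nn.Module, device: torch.device,
                 num_insertions: int, repeats: int = 100) -> float:
    x = torch.randn(num_insertions, 256, device=device)
    with torch.inference_mode():
        for _ in range(10):  # warm-up
            model(x)
        if device.type == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(repeats):
            model(x)
        if device.type == "cuda":
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / repeats * 1e3  # ms per request

for device_name in ["cpu"] + (["cuda"] if torch.cuda.is_available() else []):
    device = torch.device(device_name)
    model = make_model().to(device)
    for n in (25, 100, 1000):  # typical vs. very large insertion counts
        print(f"{device_name:>4} n={n:4d}: {time_request(model, device, n):.2f} ms/request")
```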

For most search engineering teams asking us if we use GPU, I think “We support it” is a fine answer. If an ML expert asks, I think the best answer might be “For training or inference?” If they say inference, there are a few relevant talking points we can take from all the investigation. (If they don’t care about the difference, then that’s an awareness and education issue rather than a technical specification issue, and let’s focus on building trust in that case.)

  • Latency improvement from GPU is strongly tied to model architecture, and we prefer focusing on improving quality as long as the latency is “good enough”
  • The vast majority of GPU support in the common ML libraries is focused on training. For inference, you may spend more time on auxiliary work (even in C++) than actually running the model (see the sketch after this list)
  • Basic instances with GPU are ~3x the cost of even compute-optimized CPU instances
  • I didn’t do a cost-performance comparison (against an instance with 3x as much CPU) because I think it kind of misses the point. I think CPU would win here, but because of our model architecture, not because of how we’re using devices
  • Given the point above about auxiliary work, plus feature loading, Blender, the additional complexity of sponsored search, etc., we’d rather just have more CPU in general in Delivery
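
Here’s the sketch referenced above: a hedged decomposition of per-request time into auxiliary work (feature assembly plus host-to-device copy) versus the model forward pass. The assemble_features function and the 50-insertion batch are hypothetical stand-ins for the real feature-loading path, so only the shape of the split is meaningful.

```python
# Sketch: split per-request time into "auxiliary" work (feature assembly +
# host-to-device copy) versus the model forward pass itself.
import time
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 1)).to(device).eval()

def assemble_features(num_insertions: int) -> torch.Tensor:
    # Hypothetical stand-in for per-candidate feature loading/joining on the host.
    return torch.stack([torch.randn(256) for _ in range(num_insertions)])

aux_ms = model_ms = 0.0
with torch.inference_mode():
    for _ in range(100):
        t0 = time.perf_counter()
        x = assemble_features(50).to(device)  # auxiliary work
        if device.type == "cuda":
            torch.cuda.synchronize()
        t1 = time.perf_counter()
        model(x)                              # model forward pass
        if device.type == "cuda":
            torch.cuda.synchronize()
        t2 = time.perf_counter()
        aux_ms += (t1 - t0) * 1e3
        model_ms += (t2 - t1) * 1e3

print(f"aux: {aux_ms / 100:.2f} ms/request, model: {model_ms / 100:.2f} ms/request")
```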

We continue to prioritize increasingly complex models and a wider variety of model types. Having GPU support will help us expand in both scale and complexity. We don’t see inference compute as a problematic constraint in the foreseeable future, and doubly so for the very large use cases where GPUs would help most. GPUs won’t help small-to-medium-sized customers with limited traffic (<10M queries per day) and smaller models (still using GBDT directly or as a NN feature transform).
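
For readers unfamiliar with the “GBDT as a NN feature transform” pattern mentioned above, the sketch below shows the general technique: leaf indices from a gradient-boosted model become embedding inputs for a small neural head. This is an illustration of the pattern, not Promoted’s implementation; the data, model sizes, and the LeafEmbeddingHead name are placeholders.

```python
# Illustrative sketch of "GBDT as a NN feature transform": leaf indices from a
# gradient-boosted model are consumed as categorical features by a small NN head.
import numpy as np
import torch
import torch.nn as nn
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical training data standing in for click logs.
X = np.random.randn(1000, 20)
y = (np.random.rand(1000) > 0.5).astype(int)

gbdt = GradientBoostingClassifier(n_estimators=50, max_depth=3).fit(X, y)
# Leaf index reached in each tree, shape (n_samples, n_trees).
leaves = gbdt.apply(X)[:, :, 0].astype(np.int64)

num_trees = leaves.shape[1]
num_leaves = int(leaves.max()) + 1

class LeafEmbeddingHead(nn.Module):
    """Small NN head that treats each tree's leaf index as a categorical feature."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(num_leaves, 8)
        self.out = nn.Linear(num_trees * 8, 1)

    def forward(self, leaf_ids: torch.Tensor) -> torch.Tensor:
        e = self.embed(leaf_ids).flatten(1)  # (batch, num_trees * 8)
        return torch.sigmoid(self.out(e)).squeeze(-1)

head = LeafEmbeddingHead()
p_click = head(torch.from_numpy(leaves[:32]))  # scores for a batch of 32 items
```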

[Image: GPU hard at work predicting clicks in search]
