Search Model Serving Using PyTorch and TorchServe

Pankaj Takawale
Walmart Global Tech Blog
10 min read · Jan 23, 2023


Figure: Search Model Serving GPU Metrics

Walmart Search has embarked on the journey of adopting Deep Learning in the search ecosystem to improve search relevance. For our pilot use case, we served the computationally intensive BERT Base model at runtime, with the objective of achieving low latency and high throughput.

We built a highly scalable model serving platform on TorchServe to enable fast runtime inference for our evolving models. TorchServe provides the flexibility to support multiple execution modes (for example, eager mode and TorchScript).
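In TorchServe, a model is served through a handler that defines how requests are preprocessed, run through the model, and turned back into responses. Below is a minimal sketch of such a handler for a query-understanding BERT model; the class name, the model.pt TorchScript artifact, and the packaged tokenizer files are illustrative assumptions, not our production code.

    # query_handler.py -- a minimal sketch of a custom TorchServe handler.
    # Assumptions: the model archive contains a TorchScript-exported BERT
    # model ("model.pt") and Hugging Face tokenizer files; all names here
    # are illustrative.
    import torch
    from transformers import BertTokenizer
    from ts.torch_handler.base_handler import BaseHandler


    class QueryUnderstandingHandler(BaseHandler):
        def initialize(self, context):
            props = context.system_properties
            self.device = torch.device(
                "cuda:" + str(props.get("gpu_id"))
                if torch.cuda.is_available() and props.get("gpu_id") is not None
                else "cpu"
            )
            model_dir = props.get("model_dir")
            # Load the TorchScript artifact once per worker process.
            self.model = torch.jit.load(
                f"{model_dir}/model.pt", map_location=self.device
            )
            self.model.eval()
            self.tokenizer = BertTokenizer.from_pretrained(model_dir)
            self.initialized = True

        def preprocess(self, requests):
            # TorchServe hands the handler a batch; each item carries the
            # request payload under "data" or "body".
            queries = [r.get("data") or r.get("body") for r in requests]
            queries = [
                q.decode("utf-8") if isinstance(q, (bytes, bytearray)) else q
                for q in queries
            ]
            return self.tokenizer(
                queries, padding=True, truncation=True, return_tensors="pt"
            ).to(self.device)

        def inference(self, inputs):
            with torch.no_grad():
                out = self.model(inputs["input_ids"], inputs["attention_mask"])
            # Traced models may return a tuple; take the logits.
            return out[0] if isinstance(out, tuple) else out

        def postprocess(self, outputs):
            # One response entry per request in the batch.
            return outputs.softmax(dim=-1).tolist()

A handler like this is packaged into a .mar archive with torch-model-archiver and loaded by a torchserve process pointed at a model store.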

Evolution

One monolithic Search Query Understanding application was responsible for understanding the user's intent behind the search query. A single Java Virtual Machine (JVM)-hosted web application loaded and served multiple models, and experimental models were loaded into the same application. These models were large, and their computation was expensive.

With this approach, we faced the following limitations:

  • Inability to refresh a model with the latest version, or to add a new experimental model, without a full application deployment (TorchServe removes this constraint; see the sketch after this list)
  • Increased memory pressure on a single application
  • Slow startup time
  • With concurrent model execution, the realized performance gain was minimal due to CPU limitations
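The first limitation, in particular, goes away with TorchServe: models can be registered, promoted, and retired through its management API (port 8081 by default) while the server keeps running. A minimal sketch, assuming a local server; the archive name ("bert_qu_v2.mar"), model name ("bert_qu"), and versions are illustrative assumptions:

    # A sketch of a model refresh over TorchServe's management API,
    # with no application redeployment.
    import requests

    MGMT = "http://localhost:8081"

    # Register the new archive and spin up workers for it.
    requests.post(
        f"{MGMT}/models",
        params={
            "url": "bert_qu_v2.mar",   # archive sitting in the model store
            "model_name": "bert_qu",
            "initial_workers": 2,
            "synchronous": "true",
        },
    ).raise_for_status()

    # Route traffic to the new version, then retire the old one.
    requests.put(f"{MGMT}/models/bert_qu/2.0/set-default").raise_for_status()
    requests.delete(f"{MGMT}/models/bert_qu/1.0").raise_for_status()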
