Search Model Serving Using PyTorch and TorchServe

Pankaj Takawale
Walmart Global Tech Blog
10 min read · Jan 23, 2023


Figure: Search Model Serving GPU Metrics

Walmart Search has embarked on the journey of adopting Deep Learning in the search ecosystem to improve search relevance. For our pilot use case, we served the computationally intensive BERT Base model at runtime, with the objective of achieving low latency and high throughput.

We built a highly scalable model serving platform on TorchServe to enable fast runtime inference for our evolving models. TorchServe provides the flexibility to support multiple execution modes (for example, eager mode and TorchScript).
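In TorchServe, a model is served through a handler that defines how requests are preprocessed, run through the model, and turned back into responses. Below is a minimal sketch of such a handler for a query-understanding BERT model; the class name, the model.pt TorchScript artifact, and the packaged tokenizer files are illustrative assumptions, not our production code.

    # query_handler.py -- a minimal sketch of a custom TorchServe handler.
    # Assumptions: the model archive contains a TorchScript-exported BERT
    # model ("model.pt") and Hugging Face tokenizer files; all names here
    # are illustrative.
    import torch
    from transformers import BertTokenizer
    from ts.torch_handler.base_handler import BaseHandler


    class QueryUnderstandingHandler(BaseHandler):
        def initialize(self, context):
            props = context.system_properties
            self.device = torch.device(
                "cuda:" + str(props.get("gpu_id"))
                if torch.cuda.is_available() and props.get("gpu_id") is not None
                else "cpu"
            )
            model_dir = props.get("model_dir")
            # Load the TorchScript artifact once per worker process.
            self.model = torch.jit.load(
                f"{model_dir}/model.pt", map_location=self.device
            )
            self.model.eval()
            self.tokenizer = BertTokenizer.from_pretrained(model_dir)
            self.initialized = True

        def preprocess(self, requests):
            # TorchServe hands the handler a batch; each item carries the
            # request payload under "data" or "body".
            queries = [r.get("data") or r.get("body") for r in requests]
            queries = [
                q.decode("utf-8") if isinstance(q, (bytes, bytearray)) else q
                for q in queries
            ]
            return self.tokenizer(
                queries, padding=True, truncation=True, return_tensors="pt"
            ).to(self.device)

        def inference(self, inputs):
            with torch.no_grad():
                out = self.model(inputs["input_ids"], inputs["attention_mask"])
            # Traced models may return a tuple; take the logits.
            return out[0] if isinstance(out, tuple) else out

        def postprocess(self, outputs):
            # One response entry per request in the batch.
            return outputs.softmax(dim=-1).tolist()

A handler like this is packaged into a .mar archive with torch-model-archiver and loaded by a torchserve process pointed at a model store.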

Evolution

One monolithic Search Query Understanding application was responsible for understanding the user's intent behind the search query. A single Java Virtual Machine (JVM)-hosted web application loaded and served multiple models, and experimental models were loaded into the same application. These models were large, and their computation was expensive.

With this approach, we faced the following limitations:

  • Inability to refresh a model with the latest version, or to add a new experimental model, without a full application deployment (TorchServe removes this constraint; see the sketch after this list)
  • Increased memory pressure on a single application
  • Slow startup time
  • With concurrent model execution, the realized performance gain was minimal due to CPU limitations
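The first limitation, in particular, goes away with TorchServe: models can be registered, promoted, and retired through its management API (port 8081 by default) while the server keeps running. A minimal sketch, assuming a local server; the archive name ("bert_qu_v2.mar"), model name ("bert_qu"), and versions are illustrative assumptions:

    # A sketch of a model refresh over TorchServe's management API,
    # with no application redeployment.
    import requests

    MGMT = "http://localhost:8081"

    # Register the new archive and spin up workers for it.
    requests.post(
        f"{MGMT}/models",
        params={
            "url": "bert_qu_v2.mar",   # archive sitting in the model store
            "model_name": "bert_qu",
            "initial_workers": 2,
            "synchronous": "true",
        },
    ).raise_for_status()

    # Route traffic to the new version, then retire the old one.
    requests.put(f"{MGMT}/models/bert_qu/2.0/set-default").raise_for_status()
    requests.delete(f"{MGMT}/models/bert_qu/1.0").raise_for_status()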
