An intelligent shopping list based on the application of partitioning and machine learning algorithms

Nadia Tahiri, Ph.D.
Jul 23, 2019


https://www.researchgate.net/publication/334207093_An_intelligent_shopping_list_based_on_the_application_of_partitioning_and_machine_learning_algorithms

Abstract

A grocery list is an integral part of the shopping experience of many consumers. Several mobile retail studies of grocery apps indicate that potential customers place the highest priority on features that help them to create and manage personalized shopping lists. First, we propose a new deep learning model implemented in Python 3 that predicts which grocery products the consumer will buy again or will try to buy for the first time, and in which store(s) the purchase will be made. Second, we introduce a smart shopping template to provide consumers with a personalized weekly shopping list based on their shopping history and known preferences. As the explanatory variables, we used available grocery shopping history, weekly product promotion information for a given region, as well as the product price statistics.

Keywords: Machine Learning, Prediction, Long short-term memory, Convolutional Neural Network, Gradient Tree Boosting, F1, Python, Sklearn, Tensorflow

Introduction

A typical grocery retailer offers consumers thousands of promotions every week to attract more consumers and thus improve its economic performance [TTR16]. Walters and Jamil ([WJ02], [WJ03]) report that about 39% of all items purchased during a grocery shopping trip are weekly specials, and that about 30% of the consumers surveyed are very sensitive to product prices, buying more promotional items than regular ones. With the recent expansion of machine learning methods, including deep learning, it seems appropriate to develop methods that allow retailers to offer consumers attractive and cost-effective shopping baskets, as well as tools to create smart personalized weekly shopping lists based on purchase history, known preferences, and weekly specials available in local stores.

Figure: graphical illustration of the proposed model, intended to predict the content of the current grocery basket. At the first level of the model, the LSTM and NNMF networks were used. At the second level, the GBT model was applied. Finally, in the last step, the current grocery basket content was predicted using F1.

A grocery list is an integral part of the shopping experience of many consumers. Such lists serve, for example, as a reminder, a budgeting tool, or an effective way to organize weekly grocery shopping. In addition, several mobile retail studies indicate that potential customers place the highest priority on features that help them create and manage personalized shopping lists interactively ([NPS03], [SZA16]).

Problem statement and proposal

In this section, we present the problem statement and describe the considered machine learning architecture.
First, using the Canadian grocery shopping database `MyGroceryTour.ca` (see Figure 1), we partitioned consumers into classes based on their purchase histories. This classification was then used at the prediction stage. Since the real consumer data contained thousands of individual articles, we regrouped the products by category. A principal component analysis (linear and polynomial PCA [Jol11]) was first carried out to visualize the raw data and to select the number of principal components to use when partitioning consumers into classes. Efficient partitioning methods, such as K-means [Jai10] and X-means [PM+00], allowed us to determine the number of consumer classes, as well as the distribution of consumers among them. We used the Calinski-Harabasz cluster validity index [CH74] to determine the number of clusters in K-means. The Silhouette index [RPJ87] could also be used for this purpose.
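The cluster-selection step described above can be sketched with scikit-learn as follows. This is only an illustration under stated assumptions: the toy matrix (consumers by product category, with three planted groups), the helper name `pick_number_of_clusters`, the choice of two principal components, and the candidate range of k are all hypothetical and are not taken from the study.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import calinski_harabasz_score


def pick_number_of_clusters(features, k_values):
    """Run K-means for each candidate k and keep the k with the highest
    Calinski-Harabasz index (higher means tighter, better separated clusters)."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
        scores[k] = calinski_harabasz_score(features, labels)
    return max(scores, key=scores.get), scores


# Hypothetical toy data: 120 consumers x 5 product categories,
# generated around three made-up spending profiles.
rng = np.random.default_rng(0)
profiles = np.array([[8, 1, 1, 1, 1], [1, 8, 1, 1, 1], [1, 1, 1, 8, 8]], dtype=float)
X = np.vstack([p + rng.normal(scale=0.5, size=(40, 5)) for p in profiles])

# Linear PCA down to two components (an assumed choice) before clustering.
components = PCA(n_components=2).fit_transform(X)

best_k, scores = pick_number_of_clusters(components, range(2, 7))
print(best_k, {k: round(v, 1) for k, v in scores.items()})
```

Swapping `calinski_harabasz_score` for `sklearn.metrics.silhouette_score` would give the Silhouette-based variant mentioned above.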

Second, we developed a statistical model to predict which products previously purchased by a given consumer will be present in his/her next order. By using explanatory variables, such as available grocery shopping histories, information on the current promotions in stores of a given region, and commodity price statistics, we developed a machine learning model which is able to:

i. Predict which groceries the consumer will want to buy again or will try to buy for the first time, as well as in which store(s) (within the area they usually shop in) the purchase(s) will be made;
ii. Create a smart shopping list by providing the consumer with a weekly shopping list, customized based on his/her purchase history and known preferences.

This list may also include recommendations regarding the optimal quantity of every product suggested. We also calculate the consumer’s optimal weekly commute using the generalized traveling salesman algorithm (see Figure 2).
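To give an idea of what a generalized traveling salesman computation involves, here is a minimal brute-force sketch. It assumes, hypothetically, that stores are points in the plane, that they are grouped so that exactly one store per group must be visited, and that the trip starts and ends at home; the function name `shortest_tour` and the coordinates are made up, and the solver actually used in the study is not described here.

```python
from itertools import permutations, product
from math import dist


def shortest_tour(home, store_groups):
    """Brute-force generalized TSP: visit exactly one store from each group,
    starting and ending at home, and return the shortest route found."""
    best_route, best_length = None, float("inf")
    for choice in product(*store_groups):       # pick one store per group
        for order in permutations(choice):      # try every visiting order
            stops = [home, *order, home]
            length = sum(dist(a, b) for a, b in zip(stops, stops[1:]))
            if length < best_length:
                best_route, best_length = list(order), length
    return best_route, best_length


# Hypothetical coordinates: two groups of candidate stores.
route, length = shortest_tour((0, 0), [[(1, 0), (5, 5)], [(2, 0), (0, 6)]])
print(route, length)
```

Exhaustive search is only practical for a handful of stores; it is used here because it makes the "one store per group" constraint easy to see.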

An F1-statistic maximization algorithm [NCLC12] (see the Statistics section), based on dynamic programming, was used to achieve objective (i). This will be of major interest to retailers and distributors. A deep learning method [GBC16], based on Recurrent Neural Networks (RNN) and Convolutional Neural Networks (CNN), both implemented using the TensorFlow library [HLYX18], was used to achieve objective (ii). This can provide significant benefits to consumers.
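The idea behind F1 maximization for objective (i) can be illustrated with a simplified sketch. It assumes that the model outputs independent purchase probabilities per product, ranks products by probability, and uses a small dynamic program to compute the expected F1 of each top-k basket. This is not the exact algorithm of [NCLC12]; the function names and the probabilities are hypothetical.

```python
def count_distribution(probs):
    """dist[j] = probability that exactly j of the independent events occur
    (built one event at a time by dynamic programming)."""
    dist = [1.0]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for j, mass in enumerate(dist):
            nxt[j] += mass * (1.0 - p)      # event does not occur
            nxt[j + 1] += mass * p          # event occurs
        dist = nxt
    return dist


def expected_f1(selected, rest):
    """Expected F1 when the products with probabilities `selected` are
    predicted and those with probabilities `rest` are left out."""
    k = len(selected)
    total = 0.0
    for tp, p_tp in enumerate(count_distribution(selected)):
        for fn, p_fn in enumerate(count_distribution(rest)):
            denom = k + tp + fn             # = 2*TP + FP + FN
            f1 = 1.0 if denom == 0 else 2.0 * tp / denom
            total += p_tp * p_fn * f1
    return total


def best_basket(probabilities):
    """Try every top-k prefix of the ranking and keep the best expected F1."""
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    best_k, best_score = 0, -1.0
    for k in range(len(ranked) + 1):
        score = expected_f1([p for _, p in ranked[:k]], [p for _, p in ranked[k:]])
        if score > best_score:
            best_k, best_score = k, score
    return [name for name, _ in ranked[:best_k]], best_score


# Hypothetical predicted probabilities for one consumer.
basket, score = best_basket({"milk": 0.9, "bread": 0.8, "kiwi": 0.1})
print(basket, round(score, 4))
```

The point of the exercise is that the basket size is chosen by the expected F1 rather than by a fixed probability threshold.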

Our prediction problem can be reformulated as a binary prediction task: given a consumer, the history of his/her previous purchases, and a product with its price history, predict whether or not this product will be included in the consumer's grocery list. Our approach applies generative models to process the existing data (the first-level models) and then uses the internal representations of these models as features of the second-level models. RNNs and CNNs were used at the first learning level, and feed-forward neural networks were used at the second learning level. Thus, given a user u and the user's purchase history order_{t-h:t} (h > 0), we predict the probability that product i is included in the current shopping basket order_{t+1} of u.
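The binary formulation can be made concrete with a small sketch that turns one user's chronological baskets into (product, features, label) rows. The two features (`freq`, `in_last`) and the function name `build_examples` are hypothetical stand-ins chosen for illustration; the study itself uses learned representations from the first-level models rather than these hand-made features.

```python
def build_examples(orders, h):
    """orders: chronological list of baskets (sets of product ids) for one user.
    For each target basket, use the h baskets before it as history and emit one
    binary example per product seen in that history."""
    examples = []
    for t in range(h, len(orders)):
        window, target = orders[t - h:t], orders[t]
        candidates = set().union(*window)   # only previously bought products
        for product in sorted(candidates):
            examples.append({
                "product": product,
                # share of history baskets that contain the product
                "freq": sum(product in basket for basket in window) / h,
                # 1 if the product was in the most recent basket
                "in_last": int(product in window[-1]),
                # binary label: is the product in the basket to predict?
                "label": int(product in target),
            })
    return examples


# Hypothetical history of three baskets for one user.
history = [{"milk", "bread"}, {"milk"}, {"milk", "eggs"}]
rows = build_examples(history, h=2)
for row in rows:
    print(row)
```

Note that "eggs" produces no row here, because it never appears in the history window.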

Dataset

In this section, we discuss the details of our synthetic and real datasets, the latter obtained from our website MyGroceryTour.ca.

Features

All features used in our study are presented below:

- user_id: the user ID. We anonymized all data used in our study.
- order_id: unique number of the basket.
- store_id: unique number of the store.
- distance: distance to the store.
- product_id: unique number of the product. We tested our model with 1,000 products only (out of 49,684 products), which belonged to 5 out of the 24 available categories, i.e., `Fruits-Vegetables`, `Pasta-Flour`, `Organic Food`, `Beverages`, and `Breakfast`; the rest of the categories were not considered in our tests.
- category_id: unique category number for a product.
- reorder: equal to 1 if the product has been ordered by this user in the past, 0 otherwise.
- special: discount percentage, by interval, applied to the product price at the time of purchase.
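As an illustration of how one row with these features might be represented, and how the reorder flag could be derived from chronological purchases, consider the sketch below. The field types, the distance unit, the example values, and the helper `add_reorder_flags` are assumptions made for this example; it also assumes each product appears at most once per basket.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PurchaseRecord:
    user_id: int       # anonymized user ID
    order_id: int      # basket number
    store_id: int
    distance: float    # distance to the store (unit assumed: km)
    product_id: int
    category_id: int
    reorder: int       # 1 if this user bought the product before, else 0
    special: float     # discount percentage at purchase time


def add_reorder_flags(rows):
    """rows: chronological tuples (user_id, order_id, store_id, distance,
    product_id, category_id, special). Returns records with `reorder` filled in."""
    seen = set()       # (user_id, product_id) pairs already purchased
    records = []
    for user_id, order_id, store_id, distance, product_id, category_id, special in rows:
        key = (user_id, product_id)
        records.append(PurchaseRecord(user_id, order_id, store_id, distance,
                                      product_id, category_id, int(key in seen), special))
        seen.add(key)
    return records


# Hypothetical purchases: user 1 buys product 100 twice, user 2 once.
records = add_reorder_flags([
    (1, 10, 3, 2.5, 100, 7, 0.0),
    (1, 11, 3, 2.5, 100, 7, 15.0),
    (2, 12, 4, 1.0, 100, 7, 0.0),
])
print([r.reorder for r in records])
```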

In total, we processed the data of 1,374 users (i.e., consumers). Among them, we had 374 real users and 1,000 users whose behavior was generated following the distribution of real users (see Figure 3) and the consumer statistics available in the report by Statistics Canada (2017). The product category was available for each product, so it was one of the explanatory variables used in the model. In total, we considered 5 (of 24) product categories. The current version of our model does not allow a new product to be bought by the user (i.e., every user can only buy products that were present in at least one of his/her previous shopping baskets). We only considered real users having a sufficient number of previous shopping baskets available (>50 baskets). The average basket size was also used to predict the content of the current basket for each user.

Two types of features, categorical and quantitative variables, were present in our data. Only the distance and special features were quantitative; the rest were categorical. To handle the large-scale categorical features, we applied a hashing scheme, using the LabelEncoder class of the scikit-learn package in Python (version 3).
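For readers unfamiliar with LabelEncoder, the following minimal sketch shows what it does to a categorical column. The store identifiers are made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical column (store identifiers as strings).
store_ids = ["store_B", "store_A", "store_C", "store_A"]

encoder = LabelEncoder()
codes = encoder.fit_transform(store_ids)   # one integer per distinct value

print(list(encoder.classes_))              # distinct values, in sorted order
print(list(codes))                         # integer code for each input row
print(list(encoder.inverse_transform(codes)))  # back to the original strings
```

LabelEncoder assigns codes by sorted order of the distinct values rather than by hashing, and the scikit-learn documentation describes it as intended for target labels; `OrdinalEncoder` and `FeatureHasher` are the library's counterparts for feature columns, should a different encoding be preferred.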

See

https://www.researchgate.net/publication/334207093_An_intelligent_shopping_list_based_on_the_application_of_partitioning_and_machine_learning_algorithms

http://conference.scipy.org/proceedings/scipy2019/nadia_tahiri.html
