Grocery Item’s Nil Pick prediction using a Machine Learning Model

Item Substitutes play a key role in customer satisfaction when the ordered item is unavailable and hence are an integral part of the customer order.

Abhinandan Sonvane

Published in

Walmart Global Tech Blog

15 min read · Nov 12, 2020

How it works…

eCommerce platforms have taken shopping to a whole different level: from the comfort of their couch, customers can choose from thousands of items and have them delivered to their doorstep. But this is far easier said than done.

eCommerce order fulfillment is a complex process that involves collaboration between multiple teams, including supply chain, stores/warehouses, logistics, and payments, to name a few. Let’s discuss how an Order materializes into a physical package/shipment.

An Order consists of various items belonging to different categories offered by the eCommerce platform. To fulfill an Order, an employee picks these items from different sections of the store/warehouse, puts them in a trolley/basket, and finally prepares a delivery shipment. Technically, the order fulfillment journey starts with the Store Picker (the term for the employee), who downloads the order details listing the items to be picked from the store or warehouse. This is followed by the generation of an optimized path (known as a Pick Walk) that minimizes the time needed to collect all the ordered items. This is where Item Substitutions come into play, and they form the core of the problem that we intend to discuss.

Problem Statement…

During the pick walk, an Item may not be available because it is Out of Stock or its packaging has quality concerns. In such cases, the Picker falls back on the Substitutes of the Item to replace the originally ordered item. But sometimes even the Substitute does not qualify for picking, for one of the following reasons:

1. The substitute is inappropriate compared to the original item.

Photo reference https://images.app.goo.gl/fnwj6DKTgsqpvj6X7

2. The substitute has gone out of stock.

3. The substitute has a volumetric issue, i.e. it may not fit in the Tote/Trolley used for picking.

Photo Reference https://www.flickr.com/photos/schuminweb/12231549264/in/photostream/

This leads to either a Nil Pick (i.e. neither the Item nor its Substitute could be picked, so it is not fulfilled in the Order) or a Manual Pick (the Picker chooses an item based on their own judgment and due diligence). Such scenarios have led to:

Increase in Customer Rejection — as the Customer didn’t like what we offered as a Substitute.
Increase in the overall fulfillment time — as the manually picked substitute might need approval from the Store Manager.

Proposed Solution…

Machine Learning is one of the most sought-after tools in the tech industry these days, with applications ranging from self-driving cars to virtual assistants. We would like to use Machine Learning to predict Nil Pick and Manual Pick for an Item based on the causes discussed above. We start by establishing a hypothesis, namely that Nil Pick can be predicted from the counts of its causes, and then use a Machine Learning Model to test it quantitatively.

Hypothesis verification will involve:

Processing and analysis of the Store data.
Creation of a data pipeline and development of a regression-based machine learning model for the prediction of Nil Pick. A similar approach can be followed to predict Manual Pick.
Execution of the Model to interpret the results.

Quandary of selecting Machine Learning Model

Three classes of artificial neural networks are commonly used in AI:

Feedforward Neural Networks/Multilayer Perceptron (MLPs)
Convolutional Neural Networks (CNNs)
Recurrent Neural Networks (RNNs)

Feedforward neural networks are the first type of artificial neural networks to have been created and can be considered as the most commonly used ones today. These neural networks are called feedforward neural networks because the flow of information through the network is unidirectional without going through loops.

Feedforward neural networks can further be classified into single-layered networks or multi-layered networks, based on the presence of intermediate hidden layers.

The number of layers depends on the complexity of the function that needs to be learned. A single-layered feedforward neural network consists of only two layers of neurons (input and output) with no hidden layers in between. A multi-layered perceptron has one or more hidden layers between the input and output layers, allowing for multiple stages of information processing.

Use MLPs For:

Tabular datasets
Classification prediction problems
Regression prediction problems

Pros

Highly flexible and can be applied to a varied dataset like tabular, image, and text to name a few.
Appropriate for regression prediction problems where a real-valued quantity is predicted given a set of inputs.

Cons

As it is a fully connected model, the total number of parameters can grow very large, which can create redundancy and inefficiency in the higher layers.

The number of weights is the sum, over each pair of adjacent layers, of the number of perceptrons in one layer multiplied by the number of perceptrons in the next: (# in layer 1 × # in layer 2) + (# in layer 2 × # in layer 3) + … and so on, plus one bias per perceptron.

Not suitable for image datasets, as it disregards spatial information.
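To make the parameter growth of a fully connected model concrete, here is a minimal sketch in plain Python. The helper name count_mlp_parameters is ours, and the layer sizes assume the three numeric input features and the 200/100 hidden layers described later in this post, with a single output.

```python
def count_mlp_parameters(layer_sizes):
    """Count the weights and biases of a fully connected network.

    layer_sizes lists the number of units per layer, input layer first.
    """
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out  # one weight per connection between the two layers
        total += fan_out           # one bias per unit in the receiving layer
    return total


# Assumed shape: 3 input features, hidden layers of 200 and 100 nodes, 1 output.
print(count_mlp_parameters([3, 200, 100, 1]))  # 21001
```

Doubling both hidden layers to 400 and 200 nodes roughly quadruples the count, because the dominant term is the product of the two hidden layer sizes.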

Convolutional neural networks have, ever since their conception, been associated almost exclusively with computer vision applications. That’s because their architecture is specifically suited for performing complex visual analyses.

The convolutional neural network architecture is defined by a three-dimensional arrangement of neurons, instead of the standard two-dimensional array.

Image Reference https://images.app.goo.gl/JaNc7quRC6CJRV61A

The first layer in such neural networks is called a convolutional layer. Each neuron in the convolutional layer only processes the information from a small part of the visual field. The convolutional layers are followed by rectified linear units (ReLU), which enable the CNN to handle complicated information and aid in classification/recognition.

Use CNNs For:

Object recognition applications like machine vision and self-driving vehicles.
Document classification used in sentiment analysis and related problems.

Recurrent neural networks (RNNs), as the name suggests, involve the recurrence of operations in the form of loops. They are more complicated than feedforward networks and can perform tasks beyond basic image recognition, such as modeling sequences.

Image Reference https://images.app.goo.gl/FUW6XCinS3uL9i5W6

Use RNNs For:

Sequence prediction problems.
Speech and Text prediction.
Natural Language generation.

Recurrent neural networks are not appropriate for tabular and image datasets. They are difficult to train and have only a very short-term memory, which limits their functionality. To overcome the memory limitation, a newer form of RNN known as the LSTM (Long Short-Term Memory) network is used.

After comparing the above-discussed machine learning models, we opted to build a predictive Multi-layered Perceptron for our Nil Pick prediction, as it best fits our problem statement for the following reasons:
1. Our training dataset is comma-separated Store data related to Bad Substitutions and Volumetric issues. Neither CNNs nor RNNs are well suited to a CSV/tabular dataset.
2. We intend to build a predictive model. CNNs are apt for image analysis use-cases and RNNs for sequence prediction problems. Also, RNNs and LSTMs have been tried on forecasting problems, but have in some cases been outperformed by simple MLP-based regression models applied to the same data.

TensorFlow Docker Setup

TensorFlow start-up Docker command

$ docker run -it -p 8888:8888 -p 6006:6006 -v <NOTEBOOK_PATH_ON_LOCAL_SYSTEM>:/tf/notebooks tensorflow/tensorflow:latest-jupyter

Install pandas in the Docker container

$ pip install --trusted-host pypi.org --trusted-host files.pythonhosted.org pandas

Store Data Analysis

Data from different stores are collected to form the Training and Evaluation datasets. The data is a weekly count of Bad Substitutions, Volumetric issue instances, and Nil Pick.

1. pandas DataFrames: Read the CSV data into pandas DataFrames.

2. Scatter Plot: The most important part of interpreting data is being able to visualize trends and correlations between dataset features, which in our case include Bad Substitutions, Volumetric issues, and Nil Pick. We do this through data plots, which allow us to easily discover interesting patterns/correlations in the dataset and decide whether the dataset is suitable for training a machine learning model.

We create data plots using the pyplot API of Matplotlib. One of the most common plots for data analysis is the 2-D scatter plot. It’s used for plotting the relationship between a numeric dependent feature (Y-axis) and a numeric independent feature (X-axis). Nil Pick and Manual Pick are the dependent variables for our use-case.

Nil Pick correlation with Bad-Substitution and Container-404 — Figure 1. Nil Pick Scatter Plots

3. Interpreting Plots: After creating the dataset plots, we analyze them to determine whether Nil Pick has any correlation with the Bad Subs or Container Not Fit features. The main thing to look out for in the plots is non-uniform distributions. A non-uniform distribution, such as a normal distribution or a multi-modal distribution, suggests that the Bad Subs and/or Container Not Fit features can potentially be used by a machine learning model to predict Nil Pick. Hence, the correlation between these features (Figure 1) makes a strong case for building a machine learning model to predict Nil Pick.
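A scatter plot shows correlation visually; a single number can back up what the eye sees. The sketch below computes the Pearson correlation coefficient in plain Python. It is an illustration rather than part of our pipeline, and the weekly counts in it are made up for the example.

```python
import math


def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(spread_x * spread_y)


# Made-up weekly counts, for illustration only (not real store data).
bad_subs = [10, 20, 30, 40, 50]
nil_pick = [60, 85, 100, 130, 150]
print(round(pearson_correlation(bad_subs, nil_pick), 3))
```

A value close to +1 or -1 indicates a strong linear relationship, while a value near 0 suggests the feature carries little linear signal for the label.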

Data Processing

In this section, we intend to build the input pipeline for training and evaluating the Nil Pick machine learning model.

The input pipeline represents how the data will be passed into the model for each step of training or evaluation. Since training the model requires thousands of steps, the input pipeline must be as efficient as possible.

The store dataset we created was stored in a pandas DataFrame store_dataset. Since the DataFrame is not the most efficient data storage for the input pipeline, we’ll need to perform additional processing to create a more efficient solution.

Splitting the Store Dataset

There are two main components in creating a machine learning model: training and evaluation. Training is the foundation of machine learning, but the evaluation is just as important. Model evaluation gives us a concrete idea of just how good the model is after training, and it allows us to compare the performances for different configurations of the model.

Now, how do we decide the amount of data to use for training and evaluation? Using more data in training would potentially improve the model’s performance, but the smaller evaluation set would limit how accurate our evaluation is. On the other hand, having a larger evaluation set would give us more confidence in the accuracy of our evaluation, but it might limit the amount and diversity of the data available for training.

The right split depends on the use-case, so choose prudently. We chose a 90–10 split, meaning that the training set comprises 90% of the final dataset while the evaluation set comprises 10%.
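A 90–10 split can be sketched in a few lines of plain Python. The function name split_dataset, the fixed seed, and the use of a list of rows instead of a DataFrame are assumptions made for this illustration.

```python
import random


def split_dataset(rows, train_fraction=0.9, seed=0):
    """Shuffle a copy of the rows, then cut it into training and evaluation parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cutoff = round(len(shuffled) * train_fraction)
    return shuffled[:cutoff], shuffled[cutoff:]


rows = list(range(100))  # stand-in for 100 dataset rows
train_rows, eval_rows = split_dataset(rows)
print(len(train_rows), len(eval_rows))  # 90 10
```

Shuffling before cutting matters: if the rows are ordered by store or by week, an unshuffled split would evaluate the model on data unlike anything it saw in training.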

TensorFlow Example Object

To optimize the input pipeline, we want to convert each DataFrame row into a TensorFlow Example object. By using Example objects in the input pipeline, we’re able to efficiently feed the data into a machine learning model.

Write Example Data to TensorFlow Records

Now that we’ve completed the function to convert each DataFrame row into an Example object, we can create the efficient input pipeline storage for both the training and evaluation sets. The data storage will be in the form of TFRecords files, which hold serialized Example objects.

The write_tfrecords function writes the data from a given DataFrame into a TFRecords file. It uses the create_example function to convert each row of the dataset into an Example object. Each Example object is then serialized and written into the TFRecords file.

We then used the write_tfrecords function to write the training set’s serialized Example data into a TFRecords file called train.tfrecords and the evaluation set’s serialized Example data into a TFRecords file called eval.tfrecords. These files will then be used in the input pipeline for the machine learning model.

Serialize TFRecords

The data is stored as serialized Example objects in TFRecords files. To efficiently parse the Example object in the input pipeline, we need to create an Example spec.

The Example spec is a Python dictionary, mapping feature names to FixedLenFeature objects. For our Store dataset, each of the FixedLenFeature objects has the shape (). This is because each feature contains a single value per data observation.

For both training and evaluation, we require the data to be labelled to calculate the loss for our machine learning model. Since our model is trained to predict Nil Pick, we use the 'Nil Pick' feature as the label for each data observation.

We then parse feature data for a single Example (which represents data for one DataFrame row) using the tf.parse_single_example function.

Training and Evaluation TFRecords Dataset for the Model

We are finally ready to create TensorFlow datasets from the TFRecords files for both training and evaluation.

The TFRecords datasets contain serialized Example objects. Using the Example spec and feature parsing functions, we then convert each serialized Example to a tuple containing the Example’s feature data and label for both training and evaluation data.

The TFRecords dataset’s map function allows us to apply the parsing function (parse_fn) to each serialized Example in the dataset. Since the parse_features function takes in two arguments, and map can only be used on functions with one argument, we use a single argument lambda function to wrap around parse_features.
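The lambda-wrapping trick is easiest to see outside TensorFlow. In this plain Python stand-in, Python's built-in map plays the role of the dataset's map function, and the body of parse_features is a hypothetical CSV-line parser rather than our actual TFRecords parsing code.

```python
def parse_features(serialized_row, example_spec):
    """Hypothetical two-argument parser: split a CSV line and cast each field."""
    values = serialized_row.split(",")
    return {
        name: cast(value)
        for (name, cast), value in zip(example_spec.items(), values)
    }


# Stand-in spec: maps each feature name to the type used to parse it.
example_spec = {"Store": int, "Containter_404": float, "Bad_Subs": float}
serialized_rows = ["1,4,12", "2,7,9"]

# map passes one argument per element, so the lambda supplies example_spec itself.
parsed = list(map(lambda row: parse_features(row, example_spec), serialized_rows))
print(parsed[0])  # {'Store': 1, 'Containter_404': 4.0, 'Bad_Subs': 12.0}
```

functools.partial(parse_features, example_spec=example_spec) would work equally well in place of the lambda.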

Also, we apply uniform random shuffling with the buffer size of 100000 and configure dataset batches (both for training and evaluation), so that each training/evaluation step contains multiple data observations.
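To show what buffered shuffling and batching do to a stream of observations, here is a conceptual sketch in plain Python. It imitates the idea behind the dataset's shuffle and batch steps but is not the TensorFlow implementation, and the names shuffle_with_buffer and make_batches are ours.

```python
import random


def shuffle_with_buffer(items, buffer_size, seed=None):
    """Yield items in random order using a fixed-size buffer."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # fill the buffer first
            continue
        index = rng.randrange(buffer_size)
        yield buffer[index]      # emit a random buffered item...
        buffer[index] = item     # ...and put the incoming item in its slot
    while buffer:                # input exhausted: drain the buffer in random order
        index = rng.randrange(len(buffer))
        buffer[index], buffer[-1] = buffer[-1], buffer[index]
        yield buffer.pop()


def make_batches(items, batch_size):
    """Group consecutive items into lists of at most batch_size elements."""
    current = []
    for item in items:
        current.append(item)
        if len(current) == batch_size:
            yield current
            current = []
    if current:
        yield current            # final, possibly smaller, batch


shuffled = shuffle_with_buffer(range(10), buffer_size=4, seed=1)
for batch in make_batches(shuffled, batch_size=3):
    print(batch)
```

A buffer at least as large as the dataset gives a full uniform shuffle, which is why a large value such as 100000 is a safe choice for a dataset of modest size. With a buffer of 1 the order does not change at all.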

Model Input Layer

Our dataset now contains the feature data and label for each observation. We have to convert this feature data into an input layer for the machine learning model. To do that, we first need to set up the necessary feature columns.

Numeric feature columns are used for the numeric data in our dataset, i.e. the quantifiable data. Three features contain numeric data: 'Store', 'Containter_404', and 'Bad_Subs'.

The reason we create feature columns for each of the input data features is so that we can easily make the input layer vector for the machine learning model.

Our machine learning model follows the standard MLP architecture. This means that it is made up of multiple fully connected layers, where each hidden layer uses ReLU activation and the final layer uses no activation. The input layer for the MLP consists of a batch of data observations from the input pipeline.

NOTE: Larger models (i.e. more hidden layers and nodes) have a higher potential to make more accurate predictions, but they can also take longer to train and have a higher chance of overfitting. It’s good to experiment with different model sizes, so we can ultimately choose the best model.

For our MLP model, we started with 2 hidden layers. The first hidden layer contains 200 nodes, while the second contains 100.

We initialize the hidden layers in the constructor of the SubstitutionModel class.
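The architecture described above (fully connected layers, ReLU on the hidden layers, no activation on the output) can be sketched as a forward pass in plain Python. This is an illustration and not the SubstitutionModel class itself: the random weight initialization, the helper names, and the sample feature values are all assumptions.

```python
import random


def relu(values):
    """Rectified linear unit: negative activations become zero."""
    return [max(0.0, value) for value in values]


def dense(inputs, weights, biases):
    """Fully connected layer; weights[j] holds the incoming weights of unit j."""
    return [
        sum(w * x for w, x in zip(unit_weights, inputs)) + bias
        for unit_weights, bias in zip(weights, biases)
    ]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (an arbitrary choice for this sketch)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def mlp_forward(features, layers):
    """Apply ReLU after every hidden layer and no activation after the last one."""
    activations = features
    for weights, biases in layers[:-1]:
        activations = relu(dense(activations, weights, biases))
    weights, biases = layers[-1]
    return dense(activations, weights, biases)


rng = random.Random(0)
sizes = [3, 200, 100, 1]  # 3 input features, hidden layers of 200 and 100, 1 output
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]
prediction = mlp_forward([1.0, 4.0, 12.0], layers)  # made-up feature values
print(len(prediction))  # 1
```

The output is a list with a single real number, which is what a regression model needs. The untrained weights make the value itself meaningless until training adjusts them.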

Classification vs Regression

There are two main use-cases of MLPs in the industry: classification and regression.

Classification refers to predicting a class for a data observation, given its feature data.
Regression refers to predicting a real number value for a data observation, given its feature data.

For our problem statement, we intend to leverage the regression function to predict the Nil Pick based on the Bad Subs and Volumetric Issue features.

Regression loss

For regression models, two main loss functions can be used to train the model: mean absolute error (MAE) and mean squared error (MSE). Mean absolute error takes the average absolute difference between the labels (in our case, the actual value for the Nil Pick) and the model’s predicted values. Mean squared error takes a similar approach, but uses the squared difference rather than the absolute difference.

The MSE (also known as L2 loss) amplifies large error values (e.g. a difference of 1000) and shrinks fractional error values (e.g. a difference of 0.01), due to the squaring operation. Since we’re predicting Nil Pick values that range from 50 to 200, the MAE (also known as L1 loss) is preferred to avoid unnecessary error amplification.
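The difference between the two losses is easy to check by hand. The sketch below implements both in plain Python; the labels and predictions are made-up values in the 50 to 200 range, chosen so that one prediction is far off.

```python
def mean_absolute_error(labels, predictions):
    """Average of the absolute differences between labels and predictions."""
    return sum(abs(l - p) for l, p in zip(labels, predictions)) / len(labels)


def mean_squared_error(labels, predictions):
    """Average of the squared differences between labels and predictions."""
    return sum((l - p) ** 2 for l, p in zip(labels, predictions)) / len(labels)


# Made-up Nil Pick values: the individual errors are 10, 0 and 60.
labels = [100.0, 150.0, 200.0]
predictions = [110.0, 150.0, 140.0]

print(mean_absolute_error(labels, predictions))  # (10 + 0 + 60) / 3
print(mean_squared_error(labels, predictions))   # (100 + 0 + 3600) / 3
```

The single error of 60 contributes 3600 of the 3700 total squared error, so under MSE it dominates the loss, while under MAE it counts in proportion to its size.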

Using the feature columns function from the Model Input Layer section, we created the input layer for our model. The model’s input layer is just a vector that comes from combining all the numeric feature values in the Store dataset.

The tf.feature_column.input_layer function allows us to easily convert a dictionary of parsed feature values and a list of feature columns into an input layer for the model.

TensorFlow EstimatorSpec with Training Mode

There are three phases to completing the machine learning model: training, evaluation, and prediction. With TensorFlow, we can easily bundle the three phases into a single function using EstimatorSpec objects for each phase.

The EstimatorSpec object has three modes corresponding to the three phases:

Training: tf.estimator.ModeKeys.TRAIN
Evaluation: tf.estimator.ModeKeys.EVAL
Prediction: tf.estimator.ModeKeys.PREDICT

We then create and return the EstimatorSpec object for model training.

The function returns a tf.estimator.EstimatorSpec initialized with mode as the required argument, and with loss and train_op passed as keyword arguments.

global_step keeps track of the total number of training steps taken across multiple training runs. It is obtained by calling tf.train.get_or_create_global_step with no arguments.

To minimize the model’s loss during training, we used the Adam optimization method via the AdamOptimizer object.

adam is a tf.train.AdamOptimizer initialized with no arguments, i.e. with its default hyperparameters.

TensorFlow EstimatorSpec with Evaluation and Prediction Mode

Evaluation Mode: When evaluating the model, we use mean absolute error as the metric. This is because our goal is to get the model’s Nil Pick predictions as close to the actual labels as possible, which is equivalent to minimizing the mean absolute error between predictions and labels.

Prediction Mode: For the prediction mode in the regression function, we initialized and returned an EstimatorSpec object containing a dictionary with the model's predictions. The model's predictions need to be in a 2-D tensor format, with the shape (batch_size, 1).

Using the 1-D tensor version (which was used in calculating the loss) will result in an indexing error when making predictions on a TFRecords dataset.

Regression Model with Estimator Object

The entire regression model, from training to evaluation to predictions, can be encapsulated in a single Estimator object. The Estimator object is initialized with the regression function, as well as a few keyword arguments.

One of the keyword arguments is model_dir, which represents the path to the directory that contains the model's checkpoints. The checkpoints are how we save and restore the model's parameters for training, evaluation, and making predictions.

Another keyword argument we used is config, which specifies a custom configuration for the model. For our regression model, the only custom configuration we set was the logging frequency i.e. how frequently the model will log the loss and global step values to the screen during training.

Training with the Estimator

We then trained the model using the train.tfrecords file we created in the Training and Evaluation TFRecords Dataset for the Model section. The Estimator object contains a train function that was used to train the model.

The train function's only required argument is a function that takes in no input arguments. This function should set up the input pipeline for the model training.

In our case, it returned the training dataset using the get_traing_data function from the Training and Evaluation TFRecords Dataset for the Model section.

Evaluating with the Estimator

We trained the model long enough for the loss to show signs of convergence (for our 2 hidden layer MLP model, this was around 2M training steps).

We evaluate with the Estimator in almost the same way we train, with the main difference being that we use the evaluate function rather than the train function.

The evaluation dataset is contained in the eval.tfrecords file. The batch size used for creating the evaluation TFRecords dataset only affects evaluation speed. A larger batch size can speed up evaluation, although you have to make sure the batch is small enough to fit in memory.

We used a batch size of 20 in our evaluation; see the get_eval_data function defined in the Training and Evaluation TFRecords Dataset for the Model section.

Predictions with the Estimator

After continuous training and evaluation of the Nil Pick prediction model, we tried predicting the Nil Pick values for the test dataset.

Using the Estimator object's predict function, we made predictions on the unlabeled dataset one observation at a time (i.e. a batch size of 1). The predict function returned a generator object, which we converted into a list of the prediction values.

Model Graph

Figure 6. Graph depicting the conceptual view of the ML Model

Final SubstitutionModel Class

Conclusion

We experimented with the number of nodes in the 2 hidden layers of the model, training for more than 400K iterations.

Figure 7. Model Loss for different hidden layer configurations

After multiple runs of the model, it was found that the 2 hidden layers with 200 and 100 nodes had better accuracy in predicting the Nil Picks. This exercise validates our hypothesis of Nil Pick prediction based on the quantitative distribution of its causes across the Stores.

…and in the end, while developing any machine learning model, it is of utmost importance to identify and weigh the different deep learning methodologies currently in trend and to experiment with them. These techniques may not be an ideal fit for our use-case on their own, but, given a large training dataset, they may still help solve our problem statement with some quirks (hybrid models).