Shopping e-commerce products by example through deep learning

Christos Rountos
May 9, 2018


In this blog post we present our work at DeepLab on a mobile e-commerce application for object classification with deep learning. A user captures a photo of e-commerce content with a mobile phone, and the application suggests similar products from shopping sites. The implementation is based on a convolutional neural network (CNN) trained on e-commerce images. The model is implemented with Tensorflow and integrated into an iPhone, from which we can recognize e-commerce images through live video. In the next sections, we present the training process with the respective data, practical details of integrating the CNN into the mobile device, and results of the complete application, which offers an impressive shopping experience.


E-commerce is the activity of buying or selling products online. The aim of this blog post is the composition of a complete e-commerce mobile application using deep learning. In particular, we use a training pipeline to obtain a deep learning model for e-commerce product classification and then integrate this model into a mobile application. Tensorflow provides a platform to build a pipeline for training our CNNs and integrating them as a Tensorflow application for Android or iOS, as depicted in Fig.1.

Several works are related to similar pipelines. Shopping assisted by deep learning is an appealing field, as shown by recent works (Shopping by example and Matching Street Clothing photos) and powerful applications (E-commerce powerful applications 2017).

In the following section, we present the use-case for the pipeline of training and integration processes, analyzed further in later parts.

Figure 1. Pipeline for integrating a deep learning tensorflow application into a mobile device.

Use case

The ultimate goal is the development of an application which recognizes e-commerce images through live video, in order to be used for shopping by example. To this end, we present the end-to-end processes of:

  • Training a CNN for e-commerce image classification (first pipeline of Fig.2).
  • Integrating the trained model into an iPhone (second pipeline of Fig.2).
  • Developing a complete application for shopping by example (third pipeline of Fig.2).
Figure 2. End-to-end process for a shopping application. It consists of 3 consecutive pipelines. From top to bottom: 1) Train a CNN model, 2) Import the trained model into the mobile device, 3) Video application prompting to an e-commerce site.

E-commerce image classification

We start with the first pipeline mentioned above, namely product classification for e-commerce sites. The French e-commerce site C-discount has already explored product classification based on text descriptions. In this work, we employ a model that classifies products according to their image content (Fig.3). For this purpose, in what follows, we present:

  • The dataset used for training and evaluation.
  • The selected model.
  • The experimental results.
Figure 3. Model capable of recognizing products’ category from image content.

The dataset consists of 12,371,293 training images and 3,095,082 test images with 180x180 resolution, and 5270 labels (i.e., product categories), as provided by the Kaggle challenge. Several samples of the e-commerce data are illustrated in Fig.4.

Figure 4. E-commerce samples based on C-discount.

Two important details are worth mentioning: a) The provided data are organized as products, each of which consists of 1 to 4 images. An example is included in Fig.5, where the product Desktop is related to three images. This information is important because we can exploit the image similarity per product during training.

Figure 5. Multiple views that correspond to the same product. We exploit image similarity among examples of each product during training resulting in increased performance.

b) Class hierarchy: Many classification problems are based on a class hierarchy, which means that a sample corresponds to a range of labels from more general to more specific, as depicted in Fig.6. In our case, the final 5270 classes are based on four levels of hierarchy, and the classes per level are 49, 483, 5263 and finally 5270.

Figure 6. Example of three levels of class hierarchy (3–6–9 classes). In our case, there are four levels (49–483–5263–5270 classes).
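To make the hierarchy concrete, here is a minimal sketch of how a final 5270-way label can be resolved to its parent label at each of the four levels. The class ids and lookup tables are hypothetical, not taken from the actual dataset:

```python
# Hypothetical lookup tables mapping a final class id to its parent id at
# each coarser level of the hierarchy (49, 483 and 5263 classes respectively).
LEVEL_1_PARENT = {4012: 7}
LEVEL_2_PARENT = {4012: 131}
LEVEL_3_PARENT = {4012: 4010}

def hierarchy_path(fine_class):
    """Return the labels of a sample at all four hierarchy levels."""
    return (LEVEL_1_PARENT[fine_class],
            LEVEL_2_PARENT[fine_class],
            LEVEL_3_PARENT[fine_class],
            fine_class)

print(hierarchy_path(4012))  # (7, 131, 4010, 4012)
```

The auxiliary outputs described below each predict one entry of this tuple.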

The model used for training was Resnet-101, but for the final integration into mobile we chose Resnet-50 without losing much accuracy. In Fig. 7, we present the modifications integrated into Resnet for exploiting the class hierarchy. Specifically, we added three auxiliary layers in order to have 3 additional outputs according to the levels of hierarchy.

Figure 7. Resnet with three auxiliary layers, one for each level of hierarchy.

The modification integrated into Resnet for exploiting the image similarity was the averaging of the feature maps per product before they enter the last fully connected layer, as depicted in Fig.8. In the case of a sample product consisting of 3 images, we average the [3, 6, 6, 2048] feature map, resulting in a feature map with dimensions [6, 6, 2048] that enters the final fully connected layer.

Figure 8. Resnet with feature averaging per product before the final FC layer.
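As a minimal, framework-free sketch of this averaging step (plain Python lists stand in for tensors; in the real model this happens on [N, 6, 6, 2048] feature maps inside the network):

```python
def average_product_features(maps):
    """Average N per-image feature maps of shape [H][W][C] into one [H][W][C] map."""
    n = len(maps)
    h, w, c = len(maps[0]), len(maps[0][0]), len(maps[0][0][0])
    return [[[sum(m[i][j][k] for m in maps) / n for k in range(c)]
             for j in range(w)]
            for i in range(h)]

# Three tiny 1x1x2 "feature maps" standing in for the three images of one product:
maps = [[[[1.0, 2.0]]], [[[3.0, 4.0]]], [[[5.0, 6.0]]]]
print(average_product_features(maps))  # [[[3.0, 4.0]]]
```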

An overview of the main experiments is presented in Fig.9. The image-by-image methods (columns 1 to 4) are based on models trained with individual images as training samples. The last, product-by-product method (column 5) is based on a model trained with products (multiple images per product) as training samples. The accuracy reported is on the test set.

By exploiting image similarity per product, the accuracy increased by 3% (method 4 vs. method 5). Exploiting class hierarchy did not yield accuracy gains, so we preferred the unmodified, less complex Resnet-101. The training process was computationally demanding because of the large dataset as well as the 5270 classes.

Figure 9. Methods 1 to 4 correspond to image by image training and method 5 corresponds to product by product training. The lightest and fastest model for mobile is Resnet-50 (method 3).

Although we can exploit information cues such as class hierarchy and image similarity per product during training, the integration of such a model into a mobile device is not trivial. So we used the third model, which is lighter and faster, for the mobile integration process. We trained Resnet-50 from scratch on 2 GPUs with batch size 128 and a step learning rate schedule: 0.1 for epochs 1 to 4, 0.01 for epochs 5 to 6, 0.001 for epochs 7 to 8 and 0.0001 for epochs 9 to 10.
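The step schedule above can be sketched as a simple function of the (1-indexed) epoch:

```python
def learning_rate(epoch):
    """Step learning-rate schedule used for the from-scratch Resnet-50 run."""
    if epoch <= 4:
        return 0.1
    if epoch <= 6:
        return 0.01
    if epoch <= 8:
        return 0.001
    return 0.0001

# The rate for each of the 10 epochs:
print([learning_rate(e) for e in range(1, 11)])
```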

In the next parts, we analyze the mobile integration processes and we further present coding details about problems that one might encounter when importing a CNN into an ios device.

Integration and coding details

In the following sections, we explain in detail the processes for the integration of a trained CNN into an ios device. The coding steps and the encountered problems are analyzed, before importing the tensorflow application for live video recognition of e-commerce products into the mobile. We first present the procedures for getting the model graph and after that, the processing steps of this graph for getting the final integration model.

Exporting the model architecture

In this section, we present the problems we confronted when producing the model graph. Firstly, what do we mean by model graph? Tensorflow provides a way of exporting the node names of a Tensorflow graph and the respective operations as follows:

tf.train.write_graph(sess.graph_def, directory_to_save_model, model_name, as_text=False)

Exporting the model architecture correctly is important, because the next steps are based on this file. It is also a bit tricky and we need to pay attention during this step. First of all, we don't recommend exporting the model using the same training script! Training scripts are usually complicated and may leave nodes in the Tensorflow graph that are not used for inference. We suggest creating a new, simple script with a placeholder as input that forward-passes the network and gets the prediction. Afterwards, tf.train.write_graph can be used to get a simple model graph without redundant nodes.

Another important fact is the preprocessing. We can view the node names of a network using the inception example provided by Tensorflow. The inception graph needs to be downloaded:

mkdir -p ~/graphs
curl -o ~/graphs/inception5h.zip \
  https://storage.googleapis.com/download.tensorflow.org/models/inception5h.zip \
  && unzip ~/graphs/inception5h.zip -d ~/graphs/inception5h

Now we export the node names and operations of the tensorflow_inception_graph.pb file into a text file to view them, as follows:

gf = tf.GraphDef()
gf.ParseFromString(open('tensorflow_inception_graph.pb', 'rb').read())
with open(txt_file, 'w') as f:
    for n in gf.node:
        f.write(n.name + ' => ' + n.op + '\n')

We can observe that after the input placeholder there are no preprocessing nodes; the placeholder enters the network directly. The reason is that the final scripts, written in Objective-C, take care of the preprocessing. So, the aim is to export our model architecture without any preprocessing.

Moreover, one needs to pay attention to the node names of the input and output, because we will need them in the next steps. Most tutorials suggest viewing them in Tensorboard, but you can just use the above code and view them in a txt file. In our case, the input placeholder is called input_1 and the output softmax node softmax_1.
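Given the `name => op` text file produced above, picking out the input and output nodes can be automated with a small helper (a sketch assuming the exact `name => op` line format shown earlier):

```python
def find_io_nodes(lines):
    """Return (placeholder names, softmax names) from 'name => op' lines."""
    pairs = [line.strip().split(' => ') for line in lines if ' => ' in line]
    inputs = [name for name, op in pairs if op == 'Placeholder']
    outputs = [name for name, op in pairs if op == 'Softmax']
    return inputs, outputs

# Hypothetical excerpt of the exported txt file:
lines = ['input_1 => Placeholder', 'conv1/Conv2D => Conv2D', 'softmax_1 => Softmax']
print(find_io_nodes(lines))  # (['input_1'], ['softmax_1'])
```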

Another important practical issue is batch normalization. Tensorflow suggests training Resnet-50 with fused batch normalization for better performance. But the mobile integration process does not support this kind of batch normalization, so the model was trained with fused=True and the architecture was then exported with fused=False (and not None). The accuracy was not affected in our experiments.

To sum up, a good pipeline for getting a correct model graph is the following:

  • Separate script for a simple inference without preprocessing and redundant nodes in the Tensorflow graph.
  • Specification of input-output names for the inference.
  • Use of nodes that are supported for integration (in our case regular batch normalization instead of fused).
  • Viewing of node names and operations.

After completing the exportation of the model graph, it needs to be processed through the Tensorflow pipeline for mobile, as described in the next parts.

Final model

In the previous section, we mentioned the most important issues that we may face while exporting the first model graph file. The next steps are relatively simple, but we still need to pay attention to specific details. We first clone the Tensorflow repo:

git clone https://github.com/tensorflow/tensorflow.git

The next step is the freezing process for importing the weights through the checkpoint file into the model graph.


It is fine to remain on the master branch after cloning. We move to the root of the cloned Tensorflow repo and freeze the training checkpoint into the model graph:

python tensorflow/python/tools/freeze_graph.py \
  --input_graph=path-to-model-graph \
  --output_graph=path-to-frozen-model \
  --input_checkpoint=path-to-checkpoint-file \
  --output_node_names=softmax_1 \
  --input_binary=true

This way, a new file is created with the trained weights frozen into our model. As one can observe, the size of the exported frozen graph is bigger than that of the previous graph, which contains the nodes without the weights. Tensorflow provides a quantization process, as described in the next section, for getting a graph of small size for efficient integration.

Quantization (optional)

Most tutorials suggest quantizing the frozen model. With the quantization process, the model gets almost four times smaller, so it can be imported into the mobile device more easily than a bigger model. Inference is also faster because of 8-bit computations instead of the 32-bit ones used during training. Let's see how we can produce a quantized model:

bazel build tensorflow/tools/graph_transforms:transform_graph
bazel-bin/tensorflow/tools/graph_transforms/transform_graph \
  --in_graph=path-to-frozen-model \
  --out_graph=path-to-quantized-model \
  --inputs=input_1 \
  --outputs=softmax_1 \
  --transforms='add_default_attributes strip_unused_nodes(type=float, shape="1,180,180,3") remove_nodes(op=Identity, op=CheckNumerics) fold_constants(ignore_errors=true) fold_batch_norms fold_old_batch_norms quantize_weights quantize_nodes strip_unused_nodes sort_by_execution_order'
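The core idea behind the quantize_weights transform is linear 8-bit quantization: each float weight is mapped to an integer in [0, 255] using the min and max of the tensor. A minimal sketch of the idea (not the actual Tensorflow implementation):

```python
def quantize(weights):
    """Map float weights to 8-bit ints plus the (min, scale) needed to recover them."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255.0 or 1.0  # avoid a zero scale for constant tensors
    return [round((w - lo) / scale) for w in weights], lo, scale

def dequantize(q, lo, scale):
    """Recover approximate float weights from their 8-bit representation."""
    return [lo + v * scale for v in q]

q, lo, scale = quantize([-1.0, 0.0, 1.0])
print(q)  # [0, 128, 255]
print(dequantize(q, lo, scale))  # approximately [-1.0, 0.0, 1.0]
```

Storing one byte per weight plus two floats per tensor is what yields the roughly fourfold size reduction mentioned above.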

Despite the above benefits, this process did not work in our case for Resnet-50 (we did not use the native Tensorflow Resnet), and it is optimized only for specific models. So this step was skipped; the provided inception graph is not quantized either.

In the next part, the memory mapping procedure is described, for getting a graph which handles memory efficiently during inference.

Memory mapping

Attempting to integrate the frozen model into the mobile device resulted in memory errors due to the size of the graph. Resnet-50 is not a very deep CNN, but in our case the size was large because of the number of outputs (5270 classes). The solution was the memory-mapping transformation of our model with this script:

bazel build tensorflow/contrib/util:convert_graphdef_memmapped_format
bazel-bin/tensorflow/contrib/util/convert_graphdef_memmapped_format \
  --in_graph=path-to-frozen-model \
  --out_graph=path-to-memmapped-model

The exported graph is optimized for lower memory usage on the mobile device. The following section presents the installations needed in order to proceed with the final coding steps.


Installation

Now the memory-mapped model can be imported into the mobile device. We explain the integration process on a MacBook running macOS High Sierra. If you do not have Xcode, you can install it by:

xcode-select --install

After the Xcode installation, you need to copy the iOS camera example folder from the cloned Tensorflow repo and then modify it according to your needs. All these steps were done using a fabfile, but it is worth having a look at them. Now, inside your camera folder, you need to install the pods as follows:

pod install

The last step before importing the model into the iOS device is making the necessary modifications to the inference script written in Objective-C.

Code modifications

Before opening the Xcode workspace, we modified the preprocessing code inside the camera example's inference source. In our case, the memory-mapping flag is set to true, because we used the convert_graphdef_memmapped_format script previously:

const bool model_uses_memory_mapping = true;

We also modified the preprocessing values, for getting the same preprocessed image as the training process:

const int wanted_input_width = 180;
const int wanted_input_height = 180;
const int wanted_input_channels = 3;
const float input_r_mean = 123.68f;
const float input_g_mean = 116.78f;
const float input_b_mean = 103.94f;
const std::string input_layer_name = "input_1";
const std::string output_layer_name = "softmax_1";

Additionally, the following snippet needs to be modified according to the preprocessing used during training. In our case, we used the VGG preprocessing, so we subtract a different mean value from each channel without dividing by the standard deviation. So, the following snippet of code is converted:

for (int c = 0; c < wanted_input_channels; ++c) {
    out_pixel[c] = (in_pixel[c] - input_mean) / input_std;
}

Into this:

out_pixel[2] = in_pixel[2] - input_r_mean;
out_pixel[1] = in_pixel[1] - input_g_mean;
out_pixel[0] = in_pixel[0] - input_b_mean;

Attention needs to be paid here, since the image is received in BGR channel order. Of course, we also have to replace the labels text file with our own labels file. The following section describes the importing procedure through Xcode.
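For reference, here is the same VGG-style preprocessing as a small Python sketch. The mean values match the constants above, and the pixel tuple ordering mimics the BGR frames delivered by the camera:

```python
# Per-channel means used by VGG-style preprocessing (given in R, G, B order).
R_MEAN, G_MEAN, B_MEAN = 123.68, 116.78, 103.94

def preprocess_bgr_pixel(pixel):
    """Subtract the per-channel mean from a (B, G, R) pixel; no std division."""
    b, g, r = pixel
    return (b - B_MEAN, g - G_MEAN, r - R_MEAN)

# A pixel that happens to equal the channel means maps to all zeros:
print(preprocess_bgr_pixel((103.94, 116.78, 123.68)))  # (0.0, 0.0, 0.0)
```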

Importing the model

We connect the iPhone to the MacBook and open the Xcode workspace:

open tf_camera_example.xcworkspace

Final steps:

  • Import your Apple account into Xcode.
  • Choose your connected iPhone as the device.
  • In the General settings, update the bundle identifier name.
  • Choose yourself as the team.

Now we are ready to build and run. The model will be imported automatically into the mobile device. If you run the app for the first time, you will need to trust it in the general settings of your iPhone.

From a software engineering point of view, we have a folder called mobile in the working repo, and the camera folder is used for the model. The Tensorflow repo can be cloned as a submodule in case a new model is created. This way, we can make easy integration attempts right after training. It is really beneficial to use a fabfile for downloading the packages you need and running the integration (except the Xcode process) with one command. In the next section, live-video results for e-commerce image recognition are presented.

Integration results

As one can notice, the images of Fig.10 were captured through live video, and the integrated model is still able to recognize real-world e-commerce products. This is worth mentioning because the model is trained on simple e-commerce images (as seen in Fig.4), most of which have a white background, which means that the performance could be boosted further with a real-world training dataset.

Figure 10. Video captures from Tensorflow ios app recognizing real-world e-commerce images with Resnet-50


Conclusion

In this tutorial, we presented a pipeline for exploiting deep learning in an e-commerce application that helps users shop by image snapshots or video (Fig.11). After completing the deep learning part, the overall construction of a shopping application is a matter of iOS development. From the coding point of view, we gained a lot of insight through the problems we had with the integration of Resnet-50 into the iPhone.

Tensorflow has also introduced tf-lite for integrating CNNs, but there are still a lot of improvements to be made, so for now we use the provided graph, which is trained on the Imagenet challenge.

The most important margins of improvement are:

  • The construction of a complete iOS application that recognizes an object and prompts the user to the corresponding e-commerce site with products from the same category.
  • More experiments for improving the existing image-by-image model.
  • Experiments using real-world e-commerce datasets for performance boosts.
  • App modifications for integrating the multi-view product-by-product model, with its accuracy gains.
Figure 11. Prompting the user to similar products in a shopping site

In the future, we plan to make use of the newest tf-lite pipeline that Tensorflow provides; the Resnet model we used is not well supported yet, so we can integrate it through tf-lite once support matures.

Finally, the integration of a model into an iOS device proved to be a really good base for the composition of a complete application that provides an outstanding shopping experience! The end-to-end pipeline is illustrated in Fig.12.

Figure 12. Final pipeline of integration processes and shopping by example. The interconnection of training-inference pipelines is the process of freezing and memory mapping. The output enters the integration pipeline for TF-app.


References

  1. TF-mobile (link)
  2. Kaggle challenge (link)
  3. Pete Warden’s blog (link)
  4. E-commerce powerful applications 2017 (link)
  5. Resnet CNN Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun 2015. (pdf)
  6. Shopping by example Ashish Kare, Hiranmay Ghosh, Jaideep Shankar Jagannathan 2009. (pdf)
  7. Matching Street Clothing photos M. Hadi Kiapour, Xufeng Han, Svetlana Lazebnik, Alexander C. Berg, Tamara L. Berg 2015. (pdf)

