Mercari’s Image Classification Experiment Using Deep Learning


Hi, my name is Takuma Yamaguchi. I am a software and machine learning engineer at Mercari.

These days, Artificial Intelligence (AI) is a very popular buzzword. We also often see terms such as Deep Learning and Deep Neural Networks, both of which are subsets of AI and machine learning. In this post, I would like to share our image classification experiment using deep learning.

Neural Network Winter

Deep learning is a variation of neural network techniques. At the 7th International Conference on Document Analysis and Recognition (ICDAR 2003) held in Edinburgh, Scotland, Simard et al. (Microsoft Research) said in their paper Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis that:

After being extremely popular in the early 1990s, neural networks have fallen out of favor in research in the last 5 years. In 2000, it was even pointed out by the organizers of the Neural Information Processing System (NIPS) conference that the term “neural networks” in the submission title was negatively correlated with acceptance.

(I actually attended this conference as a student and I made a presentation on digit detection and recognition.)

At that time, many researchers were attracted to other algorithms, such as the Support Vector Machine (SVM).

What Brought Deep Learning Back?

Some researchers, such as Yann LeCun, Geoff Hinton, Yoshua Bengio, and Andrew Ng, continued to study neural networks. Thanks to their achievements, algorithms based on deep learning have achieved better results than other algorithms in many tasks and competitions, including ILSVRC2012.

I really like this talk of theirs about the struggles during the Neural Network Winter: Deep Learning Gurus Talk about History and Future of Machine Learning.

After seeing what this group of researchers achieved, many others started using deep learning techniques again.

Deep learning requires large amounts of data and huge computational resources. Breakthroughs in deep learning algorithms, coupled with the latest hardware improvements, have made it more practical than ever before.

This TED talk by Fei-Fei Li (director of Stanford’s Artificial Intelligence Lab and Vision Lab) helps to further explain why large amounts of data are needed for AI.

(I visited Stanford University two years ago, but unfortunately I wasn’t able to meet her…)

Image Classification

One of the practical applications of deep learning is image classification/object recognition. We prepared a data set of 1 million images within Mercari, taken from 1,000 categories (so 1,000 images per category). We used 90% of the images for training and the other 10% for evaluation.
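The 90/10 split described above could be sketched as follows. This is only an illustration, not the script we actually used: the function name, the fixed random seed, the toy category names, and the file paths are all assumptions. Splitting inside each category keeps the 900/100 ratio for every category.

```python
import random
from collections import defaultdict


def split_per_category(samples, eval_fraction=0.1, seed=0):
    """Split (path, category) pairs into train and evaluation lists, per category."""
    by_category = defaultdict(list)
    for path, category in samples:
        by_category[category].append(path)

    rng = random.Random(seed)  # fixed seed (an assumption) so the split is reproducible
    train, evaluation = [], []
    for category in sorted(by_category):
        paths = sorted(by_category[category])
        rng.shuffle(paths)
        n_eval = int(round(len(paths) * eval_fraction))
        evaluation += [(p, category) for p in paths[:n_eval]]
        train += [(p, category) for p in paths[n_eval:]]
    return train, evaluation


# Toy data: 3 hypothetical categories with 1,000 image paths each.
samples = [("img/%s/%04d.jpg" % (c, i), c)
           for c in ("sneakers", "earphones", "fruits")
           for i in range(1000)]
train, evaluation = split_per_category(samples)
print("%d training images, %d evaluation images" % (len(train), len(evaluation)))
```

With 1,000 images in each category, every category contributes 900 training images and 100 evaluation images.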

Sample Images


We conducted our image classification experiment using TensorFlow. TensorFlow is Google’s open source machine learning library. It’s not just for neural networks and can be used for a variety of other machine learning tasks.

We used the Inception-v3 model, a powerful image classification algorithm based on deep neural networks. In order to train the model from scratch, we needed data in the TFRecord format. A script for converting images to TFRecord format data is included in the repository.

ImageNet is a common academic image data set for image classification. The data set is described in the TED talk above. The Inception-v3 model also uses this data set as a training example. Although we didn’t depend on ImageNet, we followed the procedure described in the README. The model can be applied to any image data set, and the scripts work as-is for 1,000 or fewer categories, each category having around 1,000 images.

And, even if your data has more categories, all you have to do is change the number of categories/images in

Then, you can run the training script.

# Build the model. Note that we need to make sure the TensorFlow is ready to
# use before this, as this command will not build TensorFlow.
bazel build inception/imagenet_train
# run it
bazel-bin/inception/imagenet_train --num_gpus=1 --batch_size=32 --train_dir=/tmp/imagenet_train --data_dir=/tmp/imagenet_data

Environment & Parameters

In recent years, it has become more common to use GPUs for machine learning. It is possible to train the model without GPUs, but it may take several months to obtain practical results.

Because of GPU memory limitations, the batch size for training the Inception-v3 model had to be 32 or less in our single-GPU environment (AWS EC2 p2.xlarge, with one Tesla K80 GPU). In general, a larger batch size leads to better results.

One of the comments in states:

# With 8 Tesla K40’s and a batch size = 256, the following setup achieves
# precision@1 = 73.5% after 100 hours and 100K steps (20 epochs).

Based on this comment, we used a p2.8xlarge instance, which has 8 Tesla K80 GPUs, and set the batch size to 256.

bazel-bin/inception/imagenet_train --num_gpus=8 --batch_size=256 --train_dir=/tmp/imagenet_train --data_dir=/tmp/imagenet_data
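To relate steps, batch size, and epochs, here is a small worked calculation. It assumes the ImageNet (ILSVRC2012) training set has about 1.28 million images, and that our training set is the 900,000 images mentioned earlier; the function names are only for illustration.

```python
def epochs(steps, batch_size, num_train_images):
    """Number of passes over the training set after the given number of steps."""
    return steps * batch_size / float(num_train_images)


def steps_for_epochs(num_epochs, batch_size, num_train_images):
    """Number of steps needed to pass over the training set num_epochs times."""
    return num_epochs * num_train_images / float(batch_size)


IMAGENET_TRAIN_IMAGES = 1281167  # ILSVRC2012 training set size
MERCARI_TRAIN_IMAGES = 900000    # 90% of 1 million images

imagenet_epochs = epochs(100000, 256, IMAGENET_TRAIN_IMAGES)
mercari_steps = steps_for_epochs(20, 256, MERCARI_TRAIN_IMAGES)

print("ImageNet: 100K steps at batch size 256 = %.1f epochs" % imagenet_epochs)
print("Mercari: 20 epochs at batch size 256 = %.0f steps" % mercari_steps)
```

The first figure matches the "100K steps (20 epochs)" in the comment. Because our training set is smaller than ImageNet, the same number of epochs needs fewer steps (roughly 70K).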


One of the features of TensorFlow is TensorBoard. This allows us to monitor and check the training status and models through a web browser.

Sample of the Model

Some Metrics

Training Loss

We stopped the training at 90K steps due to time constraints. If we had kept training, the training loss would have continued to improve little by little, and the final results would likely have been better.

2016-12-21 07:40:47.901352: step 89910, loss = 5.92 (141.0 examples/sec; 1.816 sec/batch)
2016-12-21 07:41:06.331693: step 89920, loss = 5.59 (144.3 examples/sec; 1.774 sec/batch)
2016-12-21 07:41:25.166112: step 89930, loss = 6.59 (109.3 examples/sec; 2.341 sec/batch)
2016-12-21 07:41:43.155784: step 89940, loss = 5.45 (147.1 examples/sec; 1.740 sec/batch)
2016-12-21 07:42:01.680773: step 89950, loss = 6.84 (145.6 examples/sec; 1.759 sec/batch)
2016-12-21 07:42:20.002877: step 89960, loss = 6.98 (144.5 examples/sec; 1.772 sec/batch)
2016-12-21 07:42:38.857091: step 89970, loss = 6.56 (142.1 examples/sec; 1.801 sec/batch)
2016-12-21 07:42:56.732429: step 89980, loss = 6.04 (142.7 examples/sec; 1.794 sec/batch)
2016-12-21 07:43:14.753710: step 89990, loss = 6.13 (142.8 examples/sec; 1.793 sec/batch)
2016-12-21 07:43:33.722591: step 90000, loss = 6.48 (148.0 examples/sec; 1.729 sec/batch)

Since 90K steps took around 2 days with K80s, while the comment quoted earlier reports 100 hours for 100K steps with K40s, the K80 may be about 2x faster than the K40 for this workload.
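The same comparison can be made from the log lines above. This is a rough estimate only: it assumes the last ten logged steps are representative of the whole run, and it takes the K40 figure (100 hours for 100K steps) from the quoted comment rather than from our own measurement.

```python
# sec/batch values copied from the ten log lines above (8 x K80, batch size 256).
k80_sec_per_batch = [1.816, 1.774, 2.341, 1.740, 1.759,
                     1.772, 1.801, 1.794, 1.793, 1.729]

k80_mean = sum(k80_sec_per_batch) / len(k80_sec_per_batch)
k80_hours_for_90k = 90000 * k80_mean / 3600.0

# From the quoted comment: 100 hours for 100K steps with 8 x K40, batch size 256.
k40_sec_per_batch = 100 * 3600 / 100000.0

speedup = k40_sec_per_batch / k80_mean

print("K80: %.3f sec/batch, about %.1f hours for 90K steps" % (k80_mean, k80_hours_for_90k))
print("K40: %.1f sec/batch" % k40_sec_per_batch)
print("Estimated speedup: %.2fx" % speedup)
```

Under these assumptions the estimate comes out at about 46 hours for 90K steps and a speedup just under 2x, which is consistent with the rough figures above.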


# Build the model. Note that we need to make sure the TensorFlow is ready to
# use before this, as this command will not build TensorFlow.
bazel build inception/imagenet_eval
# run it
bazel-bin/inception/imagenet_eval --checkpoint_dir=/tmp/imagenet_train --eval_dir=/tmp/imagenet_eval

Finally, we got:

precision @ 1 = 0.4332 recall @ 5 = 0.7033
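As we understand these two numbers, "precision @ 1" is the fraction of evaluation images whose true category is the top prediction, and "recall @ 5" is the fraction whose true category appears among the top five predictions. The following is a minimal pure-Python sketch of that idea, not the evaluation script's actual code; the function name and the toy scores are assumptions.

```python
def top_k_hit_rate(scores, labels, k):
    """Fraction of examples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / float(len(labels))


# Toy example: 3 images, 4 classes (made-up scores and labels).
scores = [
    [0.6, 0.3, 0.1, 0.0],  # true label 0: top-1 hit
    [0.2, 0.5, 0.3, 0.0],  # true label 2: top-1 miss, top-2 hit
    [0.1, 0.2, 0.3, 0.4],  # true label 0: miss for both
]
labels = [0, 2, 0]

top1 = top_k_hit_rate(scores, labels, k=1)
top2 = top_k_hit_rate(scores, labels, k=2)
print("top-1: %.3f, top-2: %.3f" % (top1, top2))
```

In our experiment, this means the correct category was the top prediction for about 43% of the evaluation images and was among the top five for about 70%.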


The accuracy was worse than we had expected…

Here are some possible reasons for why we got such a result.

Some categories are very similar, such as “Men > Shoes > Sneakers” and “Women > Shoes > Sneakers.” Some varieties of clothing for men and women also tend to look alike.

Moreover, some categories such as “Tickets” cannot be recognized without OCR (Optical Character Recognition). OCR is needed, for example, to classify tickets for events featuring Japanese artists or foreign artists, as well as bus and train tickets.

Sample Results

The scripts in tensorflow/models/inception are designed for batch training and evaluation. When we want to use a trained model for non-batch image classification tasks, we only need to write about 20 lines of code.

import tensorflow as tf
from inception import inception_model as inception
from inception import image_processing

# FLAGS.image_size is defined by the image_processing module.
FLAGS = tf.app.flags.FLAGS

# Read and preprocess a single JPEG in evaluation mode (no training-time distortions).
image = image_processing.image_preprocessing(tf.read_file('/path/to/jpg'), bbox=[], train=False)
image = tf.reshape(tf.cast(image, tf.float32), shape=[1, FLAGS.image_size, FLAGS.image_size, 3])

# Build the network and convert the logits into per-class scores.
logits, _ = inception.inference(image, num_classes=1001)
scores = tf.nn.softmax(logits)
top_k = tf.nn.top_k(scores, k=5)

# Restore the moving averages of the trained variables from the latest checkpoint.
variable_averages = tf.train.ExponentialMovingAverage(inception.MOVING_AVERAGE_DECAY)
variables_to_restore = variable_averages.variables_to_restore()
saver = tf.train.Saver(variables_to_restore)

with tf.Session() as sess:
    ckpt = tf.train.get_checkpoint_state('/path/to/train/dir')
    saver.restore(sess, ckpt.model_checkpoint_path)
    top_k_values, top_k_indices = sess.run(top_k)
    print('Label IDs: ', top_k_indices)
    print('Scores: ', top_k_values)

imagenet_eval does not report classification scores. Since it is helpful to know the confidence of each classification, we added the line scores = tf.nn.softmax(logits). This value is actually calculated inside the Inception model, but it is not returned.
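For readers unfamiliar with softmax, here is a minimal pure-Python sketch of what that function computes: it turns raw logits into positive scores that sum to 1. The logits below are made up for illustration, and subtracting the maximum is a common numerical-stability trick rather than something specific to our setup.

```python
import math


def softmax(logits):
    """Convert raw logits into scores that are positive and sum to 1."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


example_logits = [2.0, 1.0, 0.1]  # made-up values for three classes
example_scores = softmax(example_logits)
print(["%.3f" % s for s in example_scores])
```

The ranking of the classes does not change. Only the scale does, which is what lets us read the values as confidence scores.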

After this, we get the classification results for the image.

('Label IDs: ', array([[ 64, 202, 206, 292, 600]], dtype=int32))
('Scores: ', array([[ 0.57124287, 0.37565342, 0.00791241, 0.0067259 , 0.00576101]], dtype=float32))
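To read these results, the label IDs have to be mapped back to category names. A minimal sketch follows, assuming that label ID 0 is reserved and real categories are numbered from 1 in the order they were listed at training time (which would be consistent with num_classes=1001 above). The category names here are hypothetical placeholders, not our real category list.

```python
def build_label_map(category_names):
    """Map label IDs to names, assuming ID 0 is reserved and categories start at 1."""
    return {i + 1: name for i, name in enumerate(category_names)}


# Hypothetical placeholder names standing in for the 1,000 real categories.
category_names = ["category_%04d" % i for i in range(1, 1001)]
label_map = build_label_map(category_names)

# Values copied from the output above (the outer list is the batch of size 1).
top_k_indices = [[64, 202, 206, 292, 600]]
top_k_values = [[0.57124287, 0.37565342, 0.00791241, 0.0067259, 0.00576101]]

results = [(score, label_map[label_id])
           for label_id, score in zip(top_k_indices[0], top_k_values[0])]
for score, name in results:
    print("%.3f  %s" % (score, name))
```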

We applied this trained model to some images.


score — category

0.310 — Women > Shoes > Sneakers

0.296 — Men > Shoes > Sneakers

0.067 — Sports > Other Sports > Basketball

0.045 — Sports > Other Sports > Athletics

0.041 — Babies & Kids > Shoes > Sneakers

0.173 — Electronics > Audio Equipments > Earphones

0.124 — Electronics > TV/Video Equipments > Cables

0.113 — Electronics > Audio Equipments > Cables

0.098 — Hobbies > Video Games > Game Consoles

0.072 — Electronics > Beauty & Health > Hair Irons

0.585 — Hobbies > Toys & Stuffies > Stuffies

0.107 — Babies & Kids > Toys > Music Boxes

0.095 — Babies & Kids > Toys > Rattles

0.028 — Women > Accessories > Key Rings

0.019 — Hobbies > Toys & Stuffies > Character Goods

0.663 — Others > Groceries > Fruits

0.300 — Others > Groceries > Vegetables

0.005 — Home > Annual Events > Gifts

0.002 — Others > Groceries > Processed Foods

0.002 — Others > Groceries > Others

0.665 — Cars & Motorcycles > Cars > Cars (non-Japanese)

0.192 — Cars & Motorcycles > Cars > Cars (Japanese)

0.055 — Cars & Motorcycles > Cars > Catalogs

0.008 — Cars & Motorcycles > Motorcycles > Motorcycles

0.006 — Cars & Motorcycles > Cars > Car Parts (Japanese)

Looks like something we can use!


I hope you enjoyed reading about our image classification experiment.

Thanks to TensorFlow and other OSS, we didn’t need to write any new code. As you can see, knowledge of machine learning is not necessarily required for a simple image classification task. Nowadays, all we need are large amounts of labeled data and huge computational resources for image classification tasks.

Deep learning methodology is helping to dramatically improve not only image classification, but also other machine learning applications such as speech recognition, natural language processing, and so on. However, deep learning still has some problems, e.g. the huge computation time and training difficulties.

Currently, I am interested in model compression of deep neural networks, since deep models usually have a huge number of parameters. If they could be represented by fewer parameters, deep learning would be even more useful.

For now, I hope that we will be able to provide a better user experience through machine learning!
