Predicting Sneaker Resell With Deep Learning

Tony Zhang
Published in The Startup
6 min read · Sep 19, 2020
A DCGAN learning to generate unique footwear silhouettes

TL;DR: I used a pre-trained VGG16 model to predict resell prices for popular sneakers (or any footwear, for that matter!), reaching a final test loss of 34k (MSE), which corresponds to an average prediction error of $184, or ~30%. GAN-generated images at the end.

Part 0: Introduction

The night before any sneaker release, resellers and enthusiasts gather in online forums and chat groups to discuss the potential resell value of the next day's drop. From Air Jordans to Yeezys to any big-name collaboration, you can count on sneaker reselling to play a major part in current streetwear culture. The problem, however, is that there is no reliable method to gauge resell value accurately and quantitatively: most predictions are based on factors like the level of hype or similar past releases. And although the sneaker community has become quite proficient (most of the time) at predicting whether a sneaker will have resell value at all, the actual resell price is still anyone's guess.

As somebody who loves working with data and has long been a sneaker enthusiast, I decided to use machine learning to take some of the guesswork out of sneaker resell predictions.

My Approach: Using data from popular fashion and resell platforms (StockX, Farfetch, etc.), I wanted to train a deep learning model to perform price regression on sneaker images, with sale price as the label. The first task was to crawl the web to scrape image and price data for a variety of sneakers (and men's footwear in general), then use this data to train a CNN to perform price regression.

I also had some fun generating unique footwear silhouettes with GANs, to see what a computer picks up as the distinguishing features between cheaper (<$500) and more expensive (>$500) shoes.

Part 1: The Data

Using custom web crawlers for StockX and Farfetch, I collected a total of ~13,000 (mostly unique) sneaker images along with their prices: 863 from StockX, comprised mostly of sneakers, and 12,034 from Farfetch, covering sneakers as well as general men's footwear. A random sample of the data can be seen here:

Sample of the training image data

As you can see, the dataset is a good mix of sneakers, high-fashion footwear, and everyday footwear. Next, let's look at the price labels associated with the images:

Raw data statistics

Prior to any data cleaning, we can see that the price labels are heavily right-skewed, with a few listings going as high as $38,538; from a quick statistical check, prices over $1,367 count as outliers in this particular dataset. My initial decision was to keep these prices, since they represented actual resell transactions, and perhaps the model would learn to predict even the highest resells. However, in my first training attempts the model converged on average errors of $600+, which is worse than simply predicting the mean every time. After cleaning the data and removing the outlier prices:
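The article only says "a quick statistical check," but the $1,367 cutoff is consistent with the standard 1.5×IQR boxplot fence; treating that rule (and the function names below) as my assumption, the filtering step can be sketched in plain Python:

```python
def iqr_upper_fence(prices, k=1.5):
    """Upper outlier fence: Q3 + k * (Q3 - Q1), the standard boxplot rule."""
    s = sorted(prices)

    def quantile(q):
        # Linear interpolation between the two nearest order statistics.
        pos = q * (len(s) - 1)
        lo, hi = int(pos), min(int(pos) + 1, len(s) - 1)
        return s[lo] + (pos - lo) * (s[hi] - s[lo])

    q1, q3 = quantile(0.25), quantile(0.75)
    return q3 + k * (q3 - q1)

def drop_price_outliers(pairs, k=1.5):
    """Keep only (image, price) pairs at or below the upper fence."""
    cutoff = iqr_upper_fence([p for _, p in pairs], k)
    return [(img, p) for img, p in pairs if p <= cutoff], cutoff
```

Applied to the full scraped dataset, this kind of fence is what removed the 537 highest-priced pairs.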

Clean data statistics

By removing those 537 image/price pairs, the shape of the distribution becomes much clearer, albeit still right-skewed, and overall training performance improved significantly (more on this later), so I am comfortable dropping ~4% of my data for a large boost in prediction accuracy. Furthermore, I think the argument can be made that most sneakers resell for under $1,000 (based on my data), so it is more valuable to train a model that predicts strongly within this range than one that predicts poorly over a larger price range.

Part 2: The Model

After an exhaustive process of hyperparameter tuning and model testing, I found that VGG16 converged to the lowest validation loss among comparable models, including VGG19, ResNet (10/50/101), DenseNet121, and InceptionV3. The single most important factor in improving the model was the quality of the dataset: without removing the 537 outlier prices, the model converged at a validation loss of 140k (approximately $374 in error):

Model learn curve before removing outliers, convergence at ~140k

After removing these labels, however, I achieved convergence at 35k ($187 error), with a test loss of 34k, representing an average prediction error of $184, or ~30%:

Model learn curve after removing outliers, convergence at ~35k

In terms of other parameters, I used a learning rate of 0.001 with an LR scheduler to help with convergence, and a batch size of 64 (as high as I could go without the GPU exploding). My train/validation/test split was 70:15:15, and I used MSE as the regression loss, with the model converging at ~60 epochs. Neither batch normalization nor unfreezing the pre-trained weights showed any significant improvement in validation or test error on my dataset.
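These hyperparameters can be wired up roughly as follows. The learning rate, scheduler, batch size, and MSE loss come from the text; the choice of Adam and of a StepLR schedule are my assumptions, since the article does not name them:

```python
import torch
import torch.nn as nn

def make_training_setup(model, lr=1e-3, step_size=20, gamma=0.1):
    """Optimizer + LR scheduler + loss matching the hyperparameters above."""
    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=lr)
    scheduler = torch.optim.lr_scheduler.StepLR(
        optimizer, step_size=step_size, gamma=gamma)
    return optimizer, scheduler, nn.MSELoss()

def train_one_epoch(model, loader, optimizer, loss_fn):
    """One pass over the training set; returns the mean per-sample MSE."""
    model.train()
    total = 0.0
    for images, prices in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images), prices)
        loss.backward()
        optimizer.step()
        total += loss.item() * images.size(0)
    return total / len(loader.dataset)
```

Calling `scheduler.step()` once per epoch decays the learning rate on the StepLR schedule as training proceeds.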

Now let's use the model to perform some price inferences! All of the shoes tested were released after the training data was collected, so there are no concerns about data leakage.

Starting with the Air Jordan 4 Union LA collaboration in the "Off Noir" colorway, released on Aug. 29, 2020, my model predicted the price to be:

Not quite hitting the mark, but definitely within an acceptable range. Now let's look at a less hyped shoe, the Jordan 7 Retro "Greater China," released on Sept. 5, 2020:

This test wouldn’t be complete without at least one Yeezy, so let’s look at one of the newer Yeezy models, the Adidas YZY QNTM:

Conclusion: I used a pre-trained VGG16 model to predict prices for popular sneakers (or any footwear, for that matter!), obtaining a final test loss of 34k, which represents an average prediction error of $184, or ~30%. Yes, 30% is higher than I hoped, especially for resellers looking to maximize profit; however, I believe this data-driven approach is still more reliable and accurate than guessing, and can point people in the right direction when making purchasing decisions.

Part 3: Fun with GANs!!

Since I had collected a large set of sneaker images, I thought it would be interesting to train a GAN to generate unique sneaker silhouettes. In particular, I wanted to see the distinguishing features between cheaper (<$500) and more expensive (>$500) shoes. Here's what the generated images converged to after 220 epochs:
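The article does not show the GAN architecture, but the header animation is described as a DCGAN. A standard DCGAN generator for 64×64 RGB images looks like the sketch below; the layer widths and the 100-dimensional noise vector are the usual DCGAN defaults, my assumption rather than the author's exact setup:

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """DCGAN generator: 100-d noise vector -> 3x64x64 sneaker image."""

    def __init__(self, nz=100, ngf=64):
        super().__init__()
        self.main = nn.Sequential(
            # Project the 1x1 noise "image" up to a 4x4 feature map.
            nn.ConvTranspose2d(nz, ngf * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(ngf * 8),
            nn.ReLU(True),                                   # 4x4
            nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 4),
            nn.ReLU(True),                                   # 8x8
            nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 2),
            nn.ReLU(True),                                   # 16x16
            nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf),
            nn.ReLU(True),                                   # 32x32
            nn.ConvTranspose2d(ngf, 3, 4, 2, 1, bias=False),
            nn.Tanh(),                                       # 64x64 in [-1, 1]
        )

    def forward(self, z):
        return self.main(z)
```

Training one such generator per price bucket (cheap vs. expensive) is what lets the two grids of silhouettes be compared side by side.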

Within the cheaper category, we see more silhouettes associated with general athletic shoes, ranging from runners to skate shoes to basketball sneakers. Nike and Adidas dominate a large portion of this category, and the GAN was able to pick up on the Nike swoosh and the Adidas three stripes. Lastly, shoes in this category appear to be more colorful.

The most obvious feature of the expensive category is the use of leather, as the majority of the shoes appear to be either dress shoes or leather boots. Colors are also more muted and neutral, consisting mostly of black, white, and beige. It's interesting to note that the two white sneakers in the third row resemble the popular Gucci sneaker silhouette. I also note that the dataset for this category contained about 2,000 fewer images than the cheaper-sneaker dataset, which may explain why some of its silhouettes are less well defined.

Source code and data available on my Github:

Webcrawlers | Price Estimator | GAN

MSc. Student @ Tsinghua | Data Science Enthusiast