eBay Parsing to Gather Data for Network Training

David Morley
Intel Student Ambassadors
Jul 26, 2019 · 7 min read

The goal of this project was to explore Python web scraping frameworks and outline steps that can be taken to gather data from online retailers for training neural networks. Originally I had resolved to use pure scraping and no API calls, as I thought it would be more fun, more generalizable, and more exciting, but as you'll see (spoiler alert), that didn't exactly happen.

For this project specifically, I chose to investigate the price data for Dell OptiPlexes with i7-3770s. This was not an arbitrary choice: it's an excellent piece of old hardware, and I already had my eye on one for a potential Linux server, so it seemed like a great target to pursue in training our model. I chose a very specific computer subset so our model would have fewer variables to account for, and thus a better chance of reaching reasonable results with a small amount of data. Once the data was gathered, the plan was to train a basic neural network to predict how long an item would take to sell based on factors such as RAM, hard drive size, SSD capability, and pricing.
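To make the end goal concrete, here is a minimal sketch of the kind of model described above, using scikit-learn's MLPRegressor. The feature columns, sample values, and network shape are all illustrative assumptions on my part, not the final trained model.

```python
# A minimal time-to-sell regressor sketch (hypothetical data, not real listings)
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Feature columns: RAM (GB), hard drive size (GB), has SSD (0/1), price (USD)
X = np.array([
    [8,   500, 0, 120.0],
    [16,  256, 1, 180.0],
    [8,  1000, 0, 135.0],
])
# Target: days the listing took to sell (made-up values)
y = np.array([12.0, 3.0, 9.0])

# Scale the mixed-unit features so the network trains reliably
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

model = MLPRegressor(hidden_layer_sizes=(16, 8), max_iter=2000, random_state=0)
model.fit(X_scaled, y)

# Predict time-to-sell for a new hypothetical listing
new_listing = scaler.transform([[16, 500, 1, 150.0]])
print(model.predict(new_listing))
```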

What are the benefits?

Training a network to predict how fast something will sell has some direct practical applications. A seller on eBay could use such a model to estimate how long it would take to sell an item at the average price, and to weigh the amount of time they're willing to wait against the amount of profit they make on the item. It also isn't unreasonable to think there's a correlation between how quickly something sells and how good a deal it is. This certainly isn't always true, but on average it seems like a fairly reasonable generalization. Such a model could therefore be used to flag potential "good deals" and either alert the buyer or purchase the items automatically (admittedly a much riskier, yet exciting, line of thinking).

Initial Approach:

The initial plan was to grab all the data from eBay sold listings, but surprisingly, nowhere on the page is the posted date listed. I thought this would just be a matter of clicking a button or changing a setting, but much to my chagrin I was unable to find a way to display this information. I did, however, stumble upon a website called watchcount.com which, when given an eBay item ID, could return this information for me.

Thus the initial code plan was this:

  • load page
  • make product IDs appear somehow??? (couldn't find a way to do this with URL parameters)
  • grab data from watchcount.com
  • save to file / do additional parsing

To load the page I found a standard Python request would do, and I decided on BeautifulSoup for the web scraping, as I had heard of the framework before and thought it would be worth a try. Running natively on Linux made setup easy: after a few pip install commands, and after discovering that Python 3 requires a different version of BeautifulSoup (the beautifulsoup4 package) to run correctly, I had the initial code up and running.
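As a rough sketch of that first step, the setup looks something like the following; the search URL parameters are illustrative guesses at an eBay sold-listings query, not taken from my actual code.

```python
# pip install requests beautifulsoup4  (bs4 is the Python 3-compatible version)
import requests
from bs4 import BeautifulSoup

url = "https://www.ebay.com/sch/i.html"
# Illustrative query parameters for a sold-listings search
params = {"_nkw": "dell optiplex i7-3770", "LH_Sold": "1", "LH_Complete": "1"}
headers = {"User-Agent": "Mozilla/5.0"}  # some sites reject the default UA

response = requests.get(url, params=params, headers=headers, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
print(soup.title.get_text())
```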

DOM Inspector to Find the Relevant Selector

From there I used the DOM inspector to grab the relevant class attributes and called the Python equivalent of document.querySelectorAll(querySelectorString) to get the links and titles of the items I cared about. I then opened each page, parsed it for the product ID, and stored that so it could be used to fetch the history data from watchcount.com.
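Continuing the sketch above, BeautifulSoup's select() is the closest equivalent to document.querySelectorAll(). The class names below are stand-ins for whatever the DOM inspector shows at the time; eBay's markup changes periodically, so treat them as placeholders.

```python
# Grab each result's title and link with CSS selectors (placeholder class names)
items = soup.select("li.s-item")
for item in items:
    link = item.select_one("a.s-item__link")
    title = item.select_one("h3.s-item__title")
    if link and title:
        print(title.get_text(strip=True), link["href"])
```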

Initial Results:

On 50 items my initial program took around 1 minute and 15 seconds to run, a number I wasn't very happy about, but considering the vast inefficiency of having to load a new page for each URL, not all that surprising. Out of curiosity I decided to see how much multithreading would improve performance, and with 10 threads I was able to get the time down to around 9 seconds on my quad-core processor. I was very excited about how much threading had improved performance, but realized this probably wasn't the most scalable solution.
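The threading itself can be done with a small pool of workers; here's a sketch of the approach, where fetch_item_page is a placeholder for the per-URL parsing routine described above.

```python
from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0"}

def fetch_item_page(url):
    # Fetch one item page and return its parsed soup; the product-ID
    # extraction described above would happen here
    response = requests.get(url, headers=HEADERS, timeout=10)
    return BeautifulSoup(response.text, "html.parser")

urls = []  # fill with the item links gathered from the search page

# 10 worker threads, matching the setup described above
with ThreadPoolExecutor(max_workers=10) as pool:
    soups = list(pool.map(fetch_item_page, urls))
```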

The results!

Single Threaded Performance
Multithreaded Performance

After adding the code to fetch data from watchcount.com, I quickly ran into an additional problem. My 10 threads were happily chugging away when suddenly numerous errors appeared! Surprised, I navigated to the webpage to see a message I certainly didn't want to see: the page is receiving too much traffic, try again in a minute. After considering VPNs and other ways around this problem, I decided it wasn't worth it and I needed another solution.

Issues: Sometimes eBay will recommend a similar item instead of the item you actually navigated to, which would cause my code to fail to retrieve the ID. Since I decided to change my approach, I only printed an error message in this circumstance, but it should be fixable by checking whether the grabbed attribute is null and, if so, recursively calling the same function on the link that takes you back to the item you selected, as sketched below.
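A sketch of that fix, with placeholder selectors for the item-ID element and the "go back to the original listing" link, plus a depth cap so the recursion can't run away:

```python
import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0"}

def get_product_id(url, depth=0):
    if depth > 2:
        return None  # give up rather than recurse forever
    response = requests.get(url, headers=HEADERS, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    tag = soup.select_one("#descItemNumber")  # placeholder selector for the item ID
    if tag is not None:
        return tag.get_text(strip=True)
    # No ID found: eBay probably showed a similar item instead. Follow the
    # link back to the original listing (placeholder selector) and retry.
    original = soup.select_one("a.original-listing-link")
    if original is not None:
        return get_product_id(original["href"], depth + 1)
    return None
```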

Problematic Case vs General Case:

Problematic Case
General Case

Second Attempt:

On the second try I decided that one of the main bottlenecks of the program was fetching the ID number, and I really needed to do that faster. After looking for values I could put in a URL for a GET request, and failing to find any, I decided to use Selenium, a browser automation framework with Python bindings that lets the user actually interact with elements on the page (click, move the mouse, etc.). Using Selenium I selected a drop-down menu, checked "show ID", and was then able to use BeautifulSoup to quickly extract titles and other fields.
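A sketch of that Selenium step, with placeholder locators standing in for the real IDs pulled from the DOM inspector:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

driver = webdriver.Firefox()  # or webdriver.Chrome()
driver.get("https://www.ebay.com/sch/i.html?_nkw=dell+optiplex+i7-3770")

# Open the customization drop-down and tick the "show item IDs" option
# (placeholder locators; substitute the real ones from the inspector)
driver.find_element(By.ID, "customize-menu").click()
driver.find_element(By.ID, "show-item-ids-checkbox").click()

# Once the page re-renders, hand the HTML to BeautifulSoup as before
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()
```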

Unfortunately, this didn't solve the website traffic problem, and I decided I really should just bite the bullet and use an API call for this portion of the task.

Issues: Local pickup items don't have the same identifier as items with a shipping cost or free shipping, so I had to write some additional code to handle this special case (see the sketch after the images below).

Normal Shipping Case
Potentially Problematic Local Pickup Case
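A sketch of how that special case might be handled, with placeholder class names for the shipping and local-pickup elements:

```python
from bs4 import BeautifulSoup

def get_shipping_text(item):
    # Listings with a shipping price use one element (placeholder class name)...
    shipping = item.select_one("span.s-item__shipping")
    if shipping is not None:
        return shipping.get_text(strip=True)  # e.g. "Free shipping" or "+$15.00 shipping"
    # ...while local-pickup listings use a different one
    if item.select_one("span.s-item__localDelivery") is not None:
        return "Local pickup"
    return "unknown"

# Tiny usage example with stand-in markup
html = '<li class="s-item"><span class="s-item__localDelivery">Local Pickup</span></li>'
item = BeautifulSoup(html, "html.parser").select_one("li.s-item")
print(get_shipping_text(item))  # -> "Local pickup"
```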

Third Attempt:

In the third attempt I restructured the code into a more elegant class-based structure and worked on extracting the RAM, HDD size, and SSD capability of the computers listed. Of course, sellers don't have a consistent way of formatting listings, so the initial trivial regex expressions I tried were not very helpful. After looking at a few failure cases I was able to narrow them down and come up with expressions that worked around 99% of the time, which was good enough for me (nothing we can do about the people who just list long strings of numbers or leave out RAM data entirely, like come on).

The Regex expressions that were used
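The exact expressions are in the image above; as a stand-in, here is an illustrative reconstruction of the idea: case-insensitive patterns that tolerate the most common title formats.

```python
import re

# A typical well-formatted title (made up for illustration)
title = "Dell OptiPlex 7010 i7-3770 16GB RAM 256GB SSD 1TB HDD"

ram = re.search(r"(\d{1,3})\s*GB\s*(?:RAM|DDR3?)", title, re.IGNORECASE)
ssd = re.search(r"(\d+(?:\.\d+)?)\s*(GB|TB)\s*SSD", title, re.IGNORECASE)
hdd = re.search(r"(\d+(?:\.\d+)?)\s*(GB|TB)\s*(?:HDD|HD\b|Hard\s*Drive)", title, re.IGNORECASE)

print(ram.group(1) if ram else None)     # "16"
print(ssd.group(1, 2) if ssd else None)  # ("256", "GB")
print(hdd.group(1, 2) if hdd else None)  # ("1", "TB")
```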

The current version of the code works fairly well, though the eBay data doesn't go back quite as far as I'd hoped, so I'm only using the first 3 threads (not the 1 thread per 200 results I was planning). There's still some more optimization to be done to ensure threading is always successful (opening multiple windows can cause Selenium to attempt to grab an element before the page has loaded), but it seems to work most of the time, and almost always in the single-threaded case.
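One way to close that race (my assumption, not what the current code does) is Selenium's explicit waits, which block until the element actually exists instead of grabbing it immediately after the page loads:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("https://www.ebay.com/sch/i.html?_nkw=dell+optiplex+i7-3770")

# Wait up to 10 seconds for the result items (placeholder selector) to
# appear in the DOM before trying to grab them
items = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "li.s-item"))
)
print(len(items))
driver.quit()
```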

With this ready to go, all I needed was one eBay API call and we were done. I happily filled out my eBay developer account application on a Saturday, only to discover it would need 1 business day to process. Thrilling! Imagine my surprise when, on Tuesday, my account was finally processed and I discovered that it was … rejected due to inconsistent data (even though this is the email and phone number I use for everything :( ). I'm still waiting on eBay to get back to me about resolving this issue, but once they do, I'll post some results for the neural network and how I trained it.

Until then, happy coding!


David Morley
Intel Student Ambassadors

UCLA student currently interested in artificial intelligence and control theory; fascinated with natural language processing.