Needle in the hay

Clémence Lévecque
Published in Scalia · Sep 27, 2017

As retail shifts online, competition is now just a click away. Retailers therefore have to step up their game, and part of that means offering full and accurate product data. Unfortunately, data collection is an 80/20 game where the missing 20% takes about 80% of your time. Some data points can take ages to find, which is very frustrating. On this quest for missing data, retailers end up crawling the whole web manually by themselves, which turns out to be time-consuming and yields poor-quality results.

At Scalia we’ve developed a smart web scraping feature which crawls the web automatically to look for your missing product attributes. We have experimented with and iterated on various techniques for months, and we still are. Below are a few of the tricks we still use.

First, it must be said that plenty of web scraping libraries already exist in many programming languages (for instance BeautifulSoup in Python). They are very handy tools: you read the HTML page, identify the tag, class, or id that surrounds the required data, and tell the algorithm to retrieve the data from that precise spot. Some scrapers don’t even require any coding skills. Scrapper.io is one of them: you simply select the zones where you want to retrieve the data and it does all the coding by itself. This way, you can retrieve any data you want: text, links to images or other pages, or even other tags. Yet regular web scraping is great for quantity, not for quality. You usually end up with a large amount of raw data that you need to analyze manually. That’s where we add the smart layers.
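For the classic approach, a minimal BeautifulSoup sketch looks like the following. The URL and the selectors (a product-title class, a description id, a main-image class) are hypothetical placeholders; the right ones depend entirely on the target page.

```python
# A minimal sketch of classic scraping with BeautifulSoup.
# The URL, tags, classes, and ids below are hypothetical
# placeholders; real selectors depend on the target page.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/product/123").text
soup = BeautifulSoup(html, "html.parser")

# Retrieve data from the precise spot identified by tag, class, or id.
name = soup.find("h1", class_="product-title").get_text(strip=True)
description = soup.find("div", id="description").get_text(strip=True)
image_url = soup.find("img", class_="main-image")["src"]

print(name, description, image_url)
```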

First smart layer: Leveraging Google search results

We scrape over 70 websites and expect to reach over 100 by the end of the year. But if we had to go through each and every one of them for every SKU we get, we would be highly inefficient. Hence, the scraping always starts with a basic automated Google search. Google being the ultimate crawler, for each product we check which websites show up on the first page of results and focus our efforts on those only.
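In code, this first layer boils down to something like the sketch below. The search_google helper and the SUPPORTED_SITES list are assumptions for illustration: in practice the search step would be backed by a proper search API, and the real site list is our own.

```python
from urllib.parse import urlparse

# Sites our scrapers know how to parse (illustrative subset,
# not the real list).
SUPPORTED_SITES = {"zalando.fr", "asos.com", "laredoute.fr"}

def search_google(query: str) -> list[str]:
    """Hypothetical helper: first-page result URLs for a query.
    Stubbed here; in practice backed by a search API."""
    return [
        "https://www.zalando.fr/some-product.html",
        "https://www.smallblog.example/review",
        "https://www.asos.com/some-product",
    ]

def sites_to_scrape(brand: str, product_name: str) -> set[str]:
    """Scrape only the supported sites Google ranks on page one."""
    urls = search_google(f"{brand} {product_name}")
    domains = {urlparse(u).netloc.removeprefix("www.") for u in urls}
    return domains & SUPPORTED_SITES

print(sites_to_scrape("AcmeBrand", "linen blazer"))
# -> {'zalando.fr', 'asos.com'}
```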

Second smart layer: Cross-referencing through our central taxonomy

Our algorithm lets us match category trees and attribute lists all the time. By doing so, we have managed to compile one of the largest taxonomy assets in the e-commerce landscape. We leverage this asset to classify raw scraping results. Hence, even with more than 17 potential naming variations of the description field, we can easily group them all together and push the data back into the exact place you want.
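Conceptually, this cross-referencing reduces to a synonym map from observed field names to canonical taxonomy attributes. Here is a toy version, with illustrative variations rather than the full list we maintain:

```python
# Illustrative synonym map: observed field names -> canonical attribute.
# The variations shown are examples, not the 17+ we actually encounter.
FIELD_SYNONYMS = {
    "description": "description",
    "desc": "description",
    "product_description": "description",
    "long_description": "description",
    "details": "description",
}

def normalize_fields(raw: dict) -> dict:
    """Map raw scraped fields onto the central taxonomy,
    dropping anything we cannot classify."""
    out = {}
    for key, value in raw.items():
        canonical = FIELD_SYNONYMS.get(key.strip().lower())
        if canonical:
            out[canonical] = value
    return out

print(normalize_fields({"Desc": "Slim-fit cotton shirt", "sku": "A123"}))
# -> {'description': 'Slim-fit cotton shirt'}
```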

Looking for missing data on websites

Third smart layer: Looking only for the data points you are missing

At Scalia, we never start our scraping from scratch. We always work from some sort of raw import from our customers. Based on this import, we identify the missing attributes and look only for those. By doing so, we skip all the tags you already have. This streamlines our output to exactly the data we want, saves us a lot of time, and keeps us from being spotted by most scraping blockers.
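Identifying the gap can be sketched as a set difference between the attributes a category should carry and the ones the import actually fills. The required schema below is hypothetical:

```python
# Hypothetical required schema for a product category.
REQUIRED_ATTRIBUTES = {"name", "description", "color", "material", "care"}

def missing_attributes(sku_record: dict) -> set[str]:
    """Return only the attributes we still need to go and find."""
    present = {k for k, v in sku_record.items() if v not in (None, "")}
    return REQUIRED_ATTRIBUTES - present

record = {"name": "Linen blazer", "color": "navy",
          "material": "", "description": None}
print(missing_attributes(record))
# -> {'description', 'material', 'care'}
```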

Fourth smart layer: Digging into unstructured data

As fast fashion becomes mainstream, more and more retailers have flattened their HTML structure and inject several data points into one big tag. That’s why we apply a second scanner, based on Natural Language Processing, to this specific field to check whether the attributes we are looking for are buried inside. On certain websites, this complementary filtering increases our scraping results by an extra 61%.
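Here is a lightweight, pattern-based sketch of the idea. Our actual scanner relies on NLP models, but the principle is the same: scan the one big unstructured field for whatever attributes are still missing. The regex patterns are illustrative only:

```python
import re

# Illustrative patterns for attributes buried in one big description
# field; a real pipeline would rely on a trained NLP model instead.
PATTERNS = {
    "material": re.compile(
        r"(\d{1,3}%\s*(?:cotton|wool|polyester|linen))", re.I),
    "care": re.compile(
        r"(machine wash(?:able)?(?: at \d+°C)?|dry clean only)", re.I),
}

def extract_from_text(text: str, wanted: set[str]) -> dict:
    """Scan unstructured text for the attributes we are still missing."""
    found = {}
    for attr in wanted:
        pattern = PATTERNS.get(attr)
        if pattern:
            match = pattern.search(text)
            if match:
                found[attr] = match.group(1)
    return found

blob = "Slim-fit shirt in 100% cotton. Machine wash at 30°C."
print(extract_from_text(blob, {"material", "care"}))
# -> {'material': '100% cotton', 'care': 'Machine wash at 30°C'}
```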

All of these techniques have been consolidated and embedded into our “Smart Web Scraping” feature. It allows retailers to enrich their catalogs and fill in their missing data by scraping many different websites. The results are presented by source, and each of our customers can pick whichever sources they want to consolidate. To help, we rate our sources according to the relevance and trustworthiness of their results, although retailers remain free to weigh things differently and sort the sources however they like. Finally, the output tells the retailer how many attributes could be filled with new values.
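To picture the per-source output, here is a toy data structure: each candidate value keeps its source and a rating, and consolidation defaults to the best-rated value per attribute. The records and scores are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    attribute: str
    value: str
    source: str
    rating: float  # relevance/trustworthiness score, 0..1 (illustrative)

# Invented per-source results for one SKU.
candidates = [
    Candidate("material", "100% cotton", "zalando.fr", 0.92),
    Candidate("material", "cotton blend", "smallshop.example", 0.41),
]

# Default consolidation keeps the best-rated value per attribute;
# retailers can sort and pick differently.
best = {}
for c in sorted(candidates, key=lambda c: c.rating, reverse=True):
    best.setdefault(c.attribute, c)

print({a: (c.value, c.source) for a, c in best.items()})
# -> {'material': ('100% cotton', 'zalando.fr')}
```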

Smart scraping feature at Scalia

Thanks to this smart scraping, consumers get a complete overview of the items they are browsing and can make their choice more easily. In fact, it appears that users spend only a few minutes on a website and will skip to another one if they don’t quickly find what they want (the source). Making it easy to give users what they want satisfies them and benefits the retailer.

Scalia is a SaaS platform which helps lifestyle and fashion brands share their product data with retailers and other business partners. Thanks to machine learning algorithms, we consolidate, enrich, standardize and synchronize product data so that anyone can access it in the format they want, offering flexibility, control and consistency to the whole ecosystem.
