Price Intelligence in e-commerce

Sethukumar Ramachandran
Quinbay
Dec 23, 2022

For a typical e-commerce establishment that sells millions of products (or Stock Keeping Units, SKUs), it is important to optimise revenue and become profitable over time. To do so, one needs to know several characteristics of a product, such as its top-performing category, the number of orders placed, the locations from where it is purchased, seasonality-linked purchase behaviour, the pricing of similar products on competitor websites, and the SKUs sold under specific promotional events like a flash sale or campaign. It is also important to baseline these observations against competitors in order to position the right product at the right price for the right location.

To enable this, one needs to crawl, analyse, monitor and track various e-commerce websites and make educated price changes at speed and scale. This requires identifying the marketplace and the competitors, monitoring the key information at regular intervals and making appropriate decisions.

Price crawling provides the raw data necessary to gather this intelligence, which helps companies strategically price their products and services based on market conditions and improve profitability.

Key Performance Indicators (KPI)

Key Performance Indicators (KPIs) are quantifiable indicators of progress towards an intended result. For the price intelligence platform, KPIs can be defined as follows.

· Coverage — Measured as the percentage of configured SKUs that find a potential match on a competitor site. For example, you want to crawl 100 SKUs, which are configured at the crawler level. If you find 80 of them on the competitor’s website, then the Coverage is 80%. This also means that 20 SKUs are NOT discoverable on the competitor’s website.

· Accuracy — Measured as the percentage of crawled products that are mapped correctly. Continuing the example above, out of the 80 SKUs that were crawled, let us say 40 SKUs are mapped correctly. Then the Accuracy is 50%.

· Speed — Measured as the time taken to complete one cycle of price intelligence, from data gathering through analytics, per SKU. Usually, it is reported as “X” SKUs per hour.
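
Expressed as formulas, these KPIs reduce to simple ratios. Below is a minimal Python sketch of the three definitions, using the illustrative counts from the examples above; the function names are assumptions for this article, not part of any platform code.

```python
# Minimal sketch of the three KPIs as defined above; the counts used below are
# the illustrative figures from the text, not results from a real crawl.

def coverage(configured_skus: int, matched_skus: int) -> float:
    """Percentage of configured SKUs that found a potential match on a competitor site."""
    return 100.0 * matched_skus / configured_skus

def accuracy(matched_skus: int, correctly_mapped_skus: int) -> float:
    """Percentage of matched (crawled) SKUs that were mapped correctly."""
    return 100.0 * correctly_mapped_skus / matched_skus

def speed(processed_skus: int, elapsed_hours: float) -> float:
    """Throughput of one full price-intelligence cycle, in SKUs per hour."""
    return processed_skus / elapsed_hours

print(coverage(100, 80))      # 80.0  -> 80% coverage
print(accuracy(80, 40))       # 50.0  -> 50% accuracy
print(speed(3_000_000, 72))   # ~41,667 SKUs per hour (3 million SKUs in ~3 days)
```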

Process Flow for Price Intelligence

· Competitor Discovery — This is the phase where we identify the competitors against whom benchmarking needs to be done, for example Amazon, Flipkart, Alibaba etc., from whom we need competitor information.

· Crawling — Using the identified competitor details, different crawlers are implemented to crawl and scrape product information from competitors’ websites and store it in a datastore as raw product information.

· Attribute extraction — The collected raw data is enriched with specific product attributes like name, variant, brand, weight, dimension, colour etc. In other words, attributes of the product that have the potential to be used as search terms for our purpose are extracted from the raw data and stored for further use.

· Data quality check — In many cases, this is a manual process where a few samples are verified to ascertain whether the data extracted is of the desired quality for further processing. The first KPI (Coverage) is measured during this phase. To illustrate, for a particular category, if the coverage is less than 70%, then the data quality is NOT good enough for further processing. Usually, a root-cause analysis (RCA) is done here and the crawling and scraping process is fine-tuned.

· Mapping — The focus is to match target attributes against the configured ones. As an example, for a particular category, you can set the success criteria as an 80% match between target and configured attributes. This is an automated process, typically handled by a Regular Expression matching algorithm (a sketch of such a matcher follows this list). Mapped output is stored in a database for further processing.

· Analysis — The second KPI (Accuracy) is measured as part of the analytics performed on the mapped data. This is a manual process (to be automated later) carried out by sampling the mapped output for accuracy (approximately 1% of the total SKUs in a batch). As an example, for a particular batch, if the accuracy is less than 80%, then the entire batch is NOT considered for further processing. Usually, an RCA is done and the matching algorithm is fine-tuned.

· KPI collection and dashboard creation — Here all the KPIs (Coverage, Accuracy and Speed) are gathered and a dashboard is created for each category.
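
To make the Mapping step concrete, here is a minimal sketch of a regular-expression-based matcher with an 80% success threshold. The attribute names and patterns are illustrative assumptions, not the platform’s actual configuration.

```python
import re

# Hypothetical configured attributes for one SKU (illustrative patterns only).
CONFIGURED = {
    "brand":  r"\bacme\b",
    "name":   r"\bwireless\s+mouse\b",
    "colour": r"\bblack\b",
    "weight": r"\b90\s*g\b",
}

def match_score(configured: dict, crawled_text: str) -> float:
    """Percentage of configured attribute patterns found in the crawled product text."""
    text = crawled_text.lower()
    hits = sum(bool(re.search(pattern, text)) for pattern in configured.values())
    return 100.0 * hits / len(configured)

crawled = "ACME Wireless Mouse, colour: Black, weight 90 g, USB receiver included"
score = match_score(CONFIGURED, crawled)
print(score, "-> mapped" if score >= 80 else "-> rejected")  # 100.0 -> mapped
```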

Challenges faced and their resolution

The price intelligence platform faces various challenges, ranging from the sheer number of SKUs, the speed and accuracy of matching comparable products across sites and discovering products on competitor sites, to handling huge amounts of data and being blocked by bot detectors. Some of these issues are discussed here along with resolution strategies.

SKU size — For any e-commerce organisation, the number of sellable SKUs runs into millions at any given point in time. It is a huge challenge to cover all the SKUs in a given time window, make sense of the data and take action. This calls for a multi-threaded crawler, attribute extraction and mapping architecture. By making this change to the basic architecture, we improved the speed from a few thousand SKUs per week to a few million SKUs per week. This also included fine-tuning the infrastructure (such as the CPU and memory of the pods) to cater for the multi-threaded architecture.
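
The crawl stage of such a multi-threaded setup could look roughly like the sketch below, assuming a hypothetical fetch_product() that scrapes one SKU; this is an illustration of the idea, not the platform’s code.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_product(sku_id: str) -> dict:
    # Placeholder: the real crawler would issue the HTTP request, parse the
    # response and return raw product information for this SKU.
    return {"sku": sku_id, "raw": "..."}

def crawl_batch(sku_ids: list, workers: int = 32) -> list:
    """Crawl a batch of SKUs concurrently; the pool size is tuned to the pod's CPU and memory."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch_product, sku): sku for sku in sku_ids}
        for future in as_completed(futures):
            try:
                results.append(future.result())
            except Exception:
                # Failed SKUs are picked up in a later pass rather than blocking the batch.
                continue
    return results
```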

Volume and Velocity check — Crawling is treated as a legitimate search on competitor websites, something like an SEO search. However, if it goes beyond a threshold, it can be detected as a bot and blocked. This puts restrictions on how many SKUs can be crawled and at what frequency, which affects the speed.

To overcome this issue, we employed a proxy infrastructure, or proxy pool. Each crawler request goes out from a different source IP, thereby controlling the volume and velocity. Another strategy is to introduce a random delay between subsequent requests from the same IP, to ensure the request pipeline is not throttled at the competitor site. We also randomised the proxy IP picked up by each crawler thread to ensure even usage of the public APIs exposed by competitor websites.
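
A minimal sketch of proxy rotation with a random inter-request delay is shown below; the proxy URLs and delay bounds are illustrative assumptions.

```python
import random
import time
import requests

# Illustrative proxy pool; real endpoints would come from the proxy infrastructure.
PROXY_POOL = [
    "http://proxy-1.example.com:8080",
    "http://proxy-2.example.com:8080",
    "http://proxy-3.example.com:8080",
]

def polite_get(url: str) -> requests.Response:
    """Fetch a page through a randomly chosen proxy, pausing briefly between requests."""
    proxy = random.choice(PROXY_POOL)          # spread requests across source IPs
    time.sleep(random.uniform(1.0, 5.0))       # random delay so one IP is not throttled
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)
```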

Non-crawlable links, broken links and unresponsive websites

These are identified as part of our data quality check, and the affected SKUs are marked as “Not under consideration”. These issues are beyond our control and hence are excluded from the Coverage KPI calculation.

Out-of-stock products

These are also handled as part of the data quality check. Out-of-stock products are recognised in diverse ways on different competitor sites; on some, an automatic redirect at the competitor site leads to a different variant of the same product. When we flag an SKU as “out of stock”, it is removed from the Accuracy KPI calculation.
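
A small sketch of how such status flags might feed the KPI denominators follows; the field names are assumptions for illustration, not the platform’s actual schema.

```python
# Hypothetical per-SKU flags from the data quality check.
records = [
    {"sku": "A1", "not_crawlable": False, "out_of_stock": False, "matched": True,  "correct": True},
    {"sku": "A2", "not_crawlable": True,  "out_of_stock": False, "matched": False, "correct": False},  # broken link
    {"sku": "A3", "not_crawlable": False, "out_of_stock": True,  "matched": True,  "correct": False},  # redirected variant
    {"sku": "A4", "not_crawlable": False, "out_of_stock": False, "matched": True,  "correct": False},
]

# Coverage ignores SKUs marked "Not under consideration" (non-crawlable, broken, unresponsive).
coverage_base = [r for r in records if not r["not_crawlable"]]
coverage = 100.0 * sum(r["matched"] for r in coverage_base) / len(coverage_base)

# Accuracy ignores out-of-stock SKUs among the matched ones.
accuracy_base = [r for r in coverage_base if r["matched"] and not r["out_of_stock"]]
accuracy = 100.0 * sum(r["correct"] for r in accuracy_base) / len(accuracy_base)

print(f"coverage={coverage:.0f}%  accuracy={accuracy:.0f}%")  # coverage=100%  accuracy=50%
```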

API (Application Programming Interface) upgrades

Each competitor exposes public APIs that can be queried much like typical SEO searches. However, these APIs sometimes get upgraded or start requiring additional parameters, which in turn has to be handled by the crawlers on a regular basis. This requires constant maintenance of the crawler logic, the search attributes used, and so on.

Crawler Customisation

For the same search term, we get different outputs from different competitors based on their unique SEO optimisation criteria. Sometimes the output is paginated, sometimes it is not, and there is no standardisation on this. Product attributes like brand name, product name etc. are not standardised either; sometimes a brand name is interchanged with the product name and vice versa. This calls for customisation of the crawlers for each category and each competitor to get the best results.
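
Such customisation can be captured as per-competitor, per-category configuration. The sketch below uses hypothetical competitor names and settings to illustrate the idea.

```python
from dataclasses import dataclass

@dataclass
class CrawlerConfig:
    competitor: str
    category: str
    paginated: bool          # some competitors paginate search results, others do not
    brand_before_name: bool  # some sites swap brand and product name in the listing title
    results_per_page: int

# Hypothetical tuned settings per (competitor, category) pair.
CONFIGS = {
    ("site-a.example.com", "electronics"): CrawlerConfig("site-a.example.com", "electronics", True, True, 24),
    ("site-b.example.com", "electronics"): CrawlerConfig("site-b.example.com", "electronics", False, False, 50),
}

def config_for(competitor: str, category: str) -> CrawlerConfig:
    """Pick the customised configuration for a competitor/category pair."""
    return CONFIGS[(competitor, category)]
```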

Quality of Data

It is quite common to see different descriptions for the same SKU on different websites, as there is no standard followed when building the product catalogue. It is also common to see the same SKU appearing in multiple categories. This results in poor accuracy.

To overcome this, we employed a multi-layered approach. If the coverage is low in the first pass, we change the search criteria and re-run the algorithm on the failed subset of the data. Similarly, if the accuracy is low in the first pass, we fine-tune the mapping algorithm and run it again on the rejected data.

We also used other signals, such as image matching with Artificial Intelligence (AI)/Machine Learning (ML) algorithms, which resulted in better coverage and accuracy. However, running AI/ML on millions of SKUs is a slow and costly process. Hence, running AI/ML only on the failed data subset helped us improve both speed and accuracy.
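
The layered flow can be summarised as a simple control loop. The matcher functions below are hypothetical stand-ins for the regex-based mapper and the image-matching model, not actual platform code.

```python
def map_batch(skus, text_matcher, image_matcher, coverage_target=0.8):
    """Two-layer matching: cheap text/regex pass first, AI/ML image matching only on failures."""
    mapped, unmapped = [], []
    for sku in skus:
        (mapped if text_matcher(sku) else unmapped).append(sku)

    if skus and len(mapped) / len(skus) < coverage_target:
        # Second layer: escalate only the failed subset to the expensive
        # image-matching model, keeping the AI/ML cost proportional to failures.
        still_unmapped = []
        for sku in unmapped:
            (mapped if image_matcher(sku) else still_unmapped).append(sku)
        unmapped = still_unmapped

    return mapped, unmapped
```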

Learning and Outcome

Out of several million SKUs, we started with a subset of 3 million SKUs spread across various categories. We observed that 20% of these SKUs were either non-discoverable (broken links, unresponsive websites etc.) or out of stock.

Of the 2.4 million SKUs in the 2nd iteration, we found that there were category mismatches. We also found that crawling based on the category gave better results than a blind crawl. We further needed to fine-tune the search criteria, which brought the coverage to ~66% (against a target of 80%).

For the ~1.6 million SKUs in the 3rd iteration, we needed to change the mapping criteria per category, so that in some categories the accuracy went well beyond 90% while in others it hovered around 85%. This is a tedious manual sampling process, and we decided to sample ~1% of the data in each category.

Overall, we started with 3 million SKUs but, with the coverage and accuracy issues above, we could get accurate data for only about 1.44 million SKUs. From a business point of view, about 50% of the SKU data is available at the end of the cycle. The cycle used to take close to 4 weeks, but with all the changes described above it now takes about 3 days for 3 million SKUs.
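
Putting the three iterations together, the funnel works out roughly as follows; this is a back-of-the-envelope reconstruction from the figures quoted above.

```python
# Rough reconstruction of the SKU funnel from the figures quoted above.
configured      = 3_000_000
after_discovery = configured * 0.80        # 20% non-discoverable or out of stock -> 2.4M
after_coverage  = after_discovery * 0.66   # ~66% coverage in the 2nd iteration   -> ~1.6M
after_accuracy  = 1_600_000 * 0.90         # ~90% accuracy in the 3rd iteration   -> 1.44M
print(after_accuracy / configured)         # 0.48 -> roughly half of the original SKUs
```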

Further, the output of this exercise is fed to category managers who will fine-tune the price as per their needs. It also means the historical data must be preserved and retrieved at scale for further analysis.

Other applications

The crawler, attribute extractor and mapper are decoupled components and hence can be used for different purposes. Once you have mapped the SKU data to the accuracy level you want (say, 85%), you need not go through all the phases of the process flow described above. Instead, you can crawl only and extract the price. This is typically referred to as a Price Refresh.
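
Because the stages are decoupled, a price refresh can simply skip the extraction and mapping stages. A minimal sketch, with the stage functions as hypothetical placeholders passed in by the caller:

```python
def full_cycle(skus, crawl, extract_attributes, map_skus):
    """Run all three stages: crawl -> attribute extraction -> mapping."""
    raw = crawl(skus)
    attrs = extract_attributes(raw)
    return map_skus(attrs)

def price_refresh(already_mapped_skus, crawl):
    """The SKU-to-competitor mapping is already trusted (e.g. >= 85% accurate), so only re-crawl prices."""
    return crawl(already_mapped_skus)
```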

We can also acquire a variety of SKU-specific attributes, compare them to our own dataset, and improve the catalogue as a result (if needed). This is typically known as Pristine, or adding pristine data to the catalogue.

By crawling rival websites, we can acquire a competitive perspective and learn about location-wise and seasonal purchasing patterns. Capabilities such as 2-hour delivery can then be developed by building a hyper-local presence.

Instead of crawling every category, identify the top Gross Merchandise Value (GMV) category, the top-selling category, or a collection of a few such categories. Fine-tune the entire engine (crawling, attribute extraction, mapping) for this set of top-focus SKUs, say 1 million SKUs. A price refresh can then be done every week to see the changes and adopt different strategies based on the results.
