Unsupervised Attribute Extraction for Online Listings

Debajyoti (Deb)
Prosus AI Tech Blog
9 min read · Oct 14, 2020

In this post, I will talk about my project on developing an unsupervised approach to extracting attributes from online listings, done in collaboration with OLX Group, part of Prosus. The OLX Group operates a network of online trading platforms in over 40 countries, building market-leading classifieds marketplaces that empower millions of people to buy, sell, and create prosperity in local communities. I will cover:

  • The motivation for solving this problem
  • The approach that we undertook
  • The results that we obtained

Motivation

Customers on OLX often have specific requirements for the products they’re trying to buy online, for example a certain RAM specification for a laptop or a particular car brand. Such product specifications are called attributes. Each attribute can take one or more values: if colour were an attribute, its values could be red, blue, or green. Similarly, a laptop can have 4GB, 8GB, or 16GB of RAM.

The colour-coded text refers to values, and the legend corresponds to attributes.

In e-commerce marketplaces, merchants often invest time in creating detailed and structured descriptions, leveraging information from the item's manufacturer. In online classifieds such as OLX, by contrast, product descriptions are far less structured. This is because sellers are most often consumers themselves rather than professional businesses, and because every product is in a sense unique (since it is secondhand). For example, in the figure below, the product description on an e-commerce marketplace (right) is more structured than on an online classifieds platform (left).

Left: Online Classified, Right: E-commerce Marketplace

Benefits of having more structured information

At OLX we are continually improving the process of collecting data about listed products and becoming more granular in the information that we collect. Millions of listings are posted on OLX every day, and each listing has a unique product description. Extracting attributes from listed products gives us more structured information, which helps buyers make an informed choice and ensures that the item they buy lives up to their expectations. More structured product descriptions also enable OLX to provide:

  • More relevant search results and recommendations: more consistent product descriptions enable the search system to retrieve items based on attribute-level information, for example returning results that match the brand or color of a product. More relevant products can also be recommended to users based on the products they have explored and the attributes associated with them.
  • Automatic segmentation of products: attribute-level information helps define a taxonomy for products, as it allows comparison with related products and improves sales forecasts.

Attribute extraction and named-entity recognition

Extracting the attributes and values in an automated way helps maintain a balance between providing structured information to the buyers, without burdening the seller with strict requirements.

The goal of attribute extraction is to represent the product description as a collection of attributes and values. We want to process the title and description of a product and obtain attribute-value information.

End-to-End process for attribute extraction
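As a toy illustration of the target representation (the listing text and attribute names below are invented for the example), attribute extraction should turn free-form title and description text into attribute-value pairs:

```python
listing = {
    "title": "2007 Mercedes Benz C200, 6 speed",
    "description": "Available in blue colour. Papers in order.",
}

# Desired output of attribute extraction: each attribute mapped to the
# value(s) mentioned in the text.
extracted = {
    "automaker": ["mercedes benz"],
    "color": ["blue"],
}
```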

The most popular approach to attribute extraction is to build a named-entity recognition (NER) model. The task of NER is to find each mention of a named entity in the text (the values) and label its type (the attribute).

Supervised machine learning-based systems have been the most successful in the NER task. However, they require correct annotations in large quantities for training, and manually annotating text is labour-intensive and needs domain expertise. One approach to this problem is to label small amounts of data and apply Transfer Learning. However, our task is specific to extracting attribute-values from classifieds, and pre-trained NER models identify generic entity types (person, date, location, geopolitical entity, organisation, etc.); hence Transfer Learning is not suitable. So we decided to build a solution that constructs labelled data for the NER model in a completely unsupervised way.
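To illustrate the mismatch, this is roughly what a generic pre-trained pipeline returns on a listing-like sentence. The sketch uses spaCy's en_core_web_sm model (downloaded separately), and the exact entities it predicts may vary:

```python
import spacy

# One-off setup: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("2007 Mercedes Benz C200 available in blue, price is a bargain")
print([(ent.text, ent.label_) for ent in doc.ents])
# Typically something like [('2007', 'DATE'), ('Mercedes Benz C200', 'ORG')]:
# generic entity types, not attributes such as color or automaker.
```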

Challenges

While trying to build the NER model, we encountered two major challenges.

  • Challenge 1: Constructing pairs of attributes and values. Manually constructing labeled data is time-consuming and not scalable; hence it was out of the question. Obtaining third-party data for attributes and values (by scraping the web) is an option. However, in this case, we would have to define the attributes for each category ourselves, which would again be time-consuming and not scalable. Hence, we needed to find a way to construct the knowledge base automatically.
  • Challenge 2: Labelling data in an unsupervised way. Polysemous or ambiguous words need to be labelled based on context, which makes labelling data automatically difficult. The image below highlights this issue.
Labeling ambiguous words automatically is a challenge.

As there was no off-the-shelf solution available for the above problem, we developed our own.

Our approach

A high-level diagram of the system

Tackling the above challenges required a three-part solution:

  1. Firstly, we create, in an unsupervised way, a knowledge base consisting of attribute-value pairs. This lets us annotate values and assign attributes to them, thereby overcoming the first challenge.
  2. Next, we apply word sense disambiguation to obtain labelled data without manual intervention. This helps us overcome the second challenge.
  3. Lastly, we build a NER model that detects spans of text (values) and labels them with entities (attributes).

Step 1 - Creating a knowledge base (KB)

We want to create a knowledge base with a list of attribute-value pairs that we will use for labelling. We assume that words similar to each other are used in the same context. Hence, we decide to cluster words used in similar contexts and label each cluster. The image below summarises the process of creating the knowledge base.

Pipeline for creating Knowledge Base

Here are the steps we apply to create the knowledge base:

  • Data pre-processing: this step reduces variation in the text, giving the rest of the pipeline the best chance of working correctly. We perform standard text-cleaning operations such as case standardization, punctuation removal, lemmatization, and stopword removal (a sketch covering this and the following steps appears after this list).
  • Text representation (embeddings): for tasks such as classification or clustering, we need to encode text as vectors that downstream systems can consume. The table below summarises the text representation options we explored for our use case. We want our embeddings to be contextualised, with a separate, fixed vector per meaning, which is what Adaptive Skip-gram provides.
Overview of word embedding algorithms explored for our use-case

The Adaptive Skip-gram (AdaGram) model, which extends the original Skip-gram, automatically learns the required number of prototypes for each word using a Bayesian nonparametric approach. With AdaGram, we have separate vectors for the different meanings (senses) of a single word. The image below shows the nearest neighbors of the word apple used in two different contexts: as an electronics brand and as a fruit.

Nearest neighbors of Apple, Left: Apple as an electronics product, Right: Apple as a fruit
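The original AdaGram implementation is a Julia package, with Python ports available. As a rough illustration of the underlying idea, the snippet below trains ordinary single-sense skip-gram vectors with gensim (4.x API) on a made-up toy corpus and queries nearest neighbours; AdaGram differs in that it keeps several such vectors per word, one for each learned sense.

```python
from gensim.models import Word2Vec

# Toy corpus: in practice this would be the tokenised, preprocessed listings.
sentences = [
    ["apple", "iphone", "64gb", "excellent", "condition"],
    ["fresh", "apple", "fruit", "organic", "farm"],
    ["samsung", "galaxy", "phone", "128gb"],
]

# sg=1 selects the skip-gram training objective (gensim 4.x parameter names).
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

# With single-sense vectors, "apple" gets one embedding mixing both meanings;
# AdaGram would instead learn separate vectors for the phone and fruit senses.
print(model.wv.most_similar("apple", topn=3))
```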
  • Identify words of interest: since attributes and values describe a product, nouns and adjectives are the most relevant parts of speech. We use the NLTK library to identify words of interest via part-of-speech (POS) tagging. We do not consider numbers, because a numeric value can have many variations and could be associated with multiple attributes.
Considering only nouns and adjectives as words of interest.
  • Text Clustering: we experiment with several algorithms for grouping words used in similar contexts: hierarchical agglomerative clustering, centroid-based methods such as (spherical) k-means, and density-based methods such as DBSCAN and HDBSCAN. We use cosine distance as the distance metric and the silhouette score as the evaluation metric for cluster quality; hierarchical agglomerative clustering performed best among these options and was chosen.
  • Cluster Labelling: we assign each cluster a broader meaning in an unsupervised way using BabelNet, which is essential for our approach. This is where hypernyms come to the rescue. What we do is simple: we extract the hypernyms of each cluster element from BabelNet and compute the cosine similarity between the cluster elements and each hypernym (assuming the hypernym also occurs in our dataset). This gives us candidates for the most semantically associated hypernym, which becomes the cluster label.
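Putting these steps together, the sketch below shows one way to implement the knowledge-base pipeline with NLTK and scikit-learn. It assumes a pre-computed embedding lookup word_vectors (in our system these would be AdaGram sense vectors) and a list of hypernym candidates already retrieved from BabelNet; the helper names and library choices are illustrative, not the exact production code.

```python
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import cosine_similarity

# One-off downloads:
# nltk.download("punkt"); nltk.download("stopwords")
# nltk.download("wordnet"); nltk.download("averaged_perceptron_tagger")

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    """Lowercase, drop punctuation and stopwords, lemmatize."""
    tokens = nltk.word_tokenize(text.lower())
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
    return [lemmatizer.lemmatize(t) for t in tokens]

def words_of_interest(tokens):
    """Keep only nouns and adjectives; numbers are skipped on purpose."""
    return [w for w, tag in nltk.pos_tag(tokens) if tag.startswith(("NN", "JJ"))]

def cluster_words(words, word_vectors, n_clusters):
    """Group words used in similar contexts with agglomerative clustering
    over cosine distances, and report the silhouette score."""
    X = np.vstack([word_vectors[w] for w in words])
    clusterer = AgglomerativeClustering(
        n_clusters=n_clusters,
        affinity="cosine",   # `metric="cosine"` in scikit-learn >= 1.2
        linkage="average",
    )
    labels = clusterer.fit_predict(X)
    print("silhouette:", silhouette_score(X, labels, metric="cosine"))
    return labels

def rank_hypernyms(cluster_members, hypernym_candidates, word_vectors):
    """Rank hypernym candidates (e.g. retrieved from BabelNet) by their
    average cosine similarity to the members of one cluster."""
    member_vecs = np.vstack([word_vectors[w] for w in cluster_members])
    scores = {}
    for h in hypernym_candidates:
        if h in word_vectors:  # the hypernym must also occur in our dataset
            sims = cosine_similarity(member_vecs, word_vectors[h].reshape(1, -1))
            scores[h] = float(sims.mean())
    return sorted(scores.items(), key=lambda kv: -kv[1])
```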

Step 2 - Labelling data

After constructing the knowledge base, we label data (product descriptions in our case). As mentioned earlier, it is essential to resolve ambiguity before labelling the data.

Word Sense Disambiguation (WSD): WSD is an open problem concerned with identifying which sense of a word is used in a sentence. The AdaGram model can infer the meaning of an input word given its surrounding words: its WSD module returns a probability distribution over the senses of the target word.

The sense ID with the highest probability is selected as the relevant sense, and the cluster name associated with that sense is assigned as the attribute. Once we have obtained labelled data in this way, we can train a NER model for attribute extraction from listings.

Word Sense Disambiguation in action
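As a minimal sketch of this labelling step, assume an AdaGram-style model that exposes a disambiguate(word, context) method returning a probability distribution over sense IDs (the Python ports of AdaGram offer something along these lines, but the exact interface here is an assumption), plus a knowledge-base mapping from (word, sense ID) to an attribute name:

```python
import numpy as np

def label_token(word, context, adagram_model, kb):
    """Pick the most probable sense of `word` given its surrounding words,
    then look up the attribute assigned to that sense in the knowledge base.

    `adagram_model.disambiguate` and the (word, sense_id) -> attribute map `kb`
    are assumed interfaces for illustration, not a specific library's API.
    """
    sense_probs = adagram_model.disambiguate(word, context)  # e.g. [0.05, 0.9, 0.05]
    sense_id = int(np.argmax(sense_probs))
    return kb.get((word, sense_id))  # attribute name, or None if not in the KB

# Hypothetical usage on the title "apple iphone 11 64gb":
# label_token("apple", ["iphone", "64gb"], model, kb) -> "brand"
```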

Step 3 - Named-entity recognition (NER)

We want to detect spans of text (values) and label them with entities (attributes). We use the spaCy library in Python to implement our NER model.

Tagging sequence: the training data for a NER model needs to be provided in a specific tagged format. This is closely related to text chunking, where spans of text are labelled using contextual information. To create labelled data for NER, we follow the IOB tagging scheme.

  • I — an inner token of a multi-token entity
  • O — a non-entity token
  • B — the first token of a multi-token entity

For example, consider the sentence foo foo no no bar, where foo foo and bar are the entities we want to tag. The sequence would be encoded as foo-B, foo-I, no-O, no-O, bar-B.
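To make the training step concrete, here is a minimal sketch in the spaCy 2.x style that was current when this project was done (spaCy 3 replaced this with config-driven training). The two training examples are invented, and the annotations use character offsets, which spaCy converts to tag sequences like the IOB encoding above internally:

```python
import random
import spacy

# Character-offset annotations: (start, end, label) for each value span.
TRAIN_DATA = [
    ("1999 yellow audi a4 avant full house",
     {"entities": [(5, 11, "color"), (12, 16, "automaker")]}),
    ("mercedez benz c200 available in blue color",
     {"entities": [(0, 8, "automaker"), (32, 36, "color")]}),
]

nlp = spacy.blank("en")            # start from an empty pipeline
ner = nlp.create_pipe("ner")       # spaCy 2.x API; v3 uses nlp.add_pipe("ner")
nlp.add_pipe(ner)
for _, ann in TRAIN_DATA:
    for _, _, label in ann["entities"]:
        ner.add_label(label)

optimizer = nlp.begin_training()
for epoch in range(30):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, ann in TRAIN_DATA:
        nlp.update([text], [ann], sgd=optimizer, drop=0.3, losses=losses)

doc = nlp("blue audi a4 for sale")
print([(ent.text, ent.label_) for ent in doc.ents])
```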

Results

Clusters and cluster labels:

Below we provide two examples of the clusters and their corresponding candidate cluster labels that we obtain by using our cluster labelling approach.

Left: Cluster of car brands, Right: Hypernyms associated with the cluster
Left: Cluster of colors, Right: Hypernyms associated with the cluster

In cluster 1, for car brands, we see semantically related hypernyms such as automaker and company among the top-10 results. Similarly, for cluster 2, the hypernym colour shows up among the top-10 candidate hypernyms. We can conclude from these results that a relevant broader term (hypernym) can be determined using our approach.

NER results

Below we provide some results from the NER model built to detect car brands and colors.

Input-> Lee Cooper Women's Analog Rose Gold Case Blue Strap With Blue Dial - Lc06299.499
Output-> color: blue
color: blue
----
Input-> 2007 Mercedez Benz c200 6 speed. Available in blue color. Car has new lic. Papers in order. Price is a bargain!!! To view please call or whatsupp
Output-> automaker: mercedez
color: blue
----
Input-> 1999 yellow audi a4 1999 audi a4 2.4 v6 avant full house aircon powersteering electric window leather seat
Output-> color: yellow
automaker: audi
automaker: audi
----
Input-> Curtain second hand beige white bossed curtain leaf pattern really big beautiful
Output-> color: white

Final comments

Developing a system for attribute extraction brings several benefits, such as more relevant search results and recommendations aligned with the user’s requirements. The approach also scales across different categories of products.

Extracting numeric or boolean values is beyond the scope of this approach, as numeric values can be used in numerous contexts, which makes it difficult to determine the attribute.

A significant advantage of this approach is that the attributes are defined automatically, which reduces manual intervention.

This work was done as a part of my summer internship programme at the Eindhoven University of Technology with Prosus. Please feel free to ask questions / provide suggestions in the comments section or reach out to us at datascience@prosus.com. Lastly, I would like to thank the Prosus AI team for supporting me in the decision-making process at every stage of development and making this internship a rich learning experience.

I want to thank Nishikant Dhanuka, Dogu Araci, Liesbeth Dingemans, Dmitri Jarnikov from the Prosus AI team, and Alexey Grigorev from OLX Group for their suggestions and help in editing. I am incredibly thankful to Piyush Dawande and Pratiksha Surpuriya for their help in making the visuals.
