Keyword extraction in e-commerce
Raj Shah, AI team
Keyword extraction is an automated process that analyses large amounts of text to identify and extract the most relevant and important words. It has long been one of the most crucial steps in creating successful content for e-commerce websites and digital marketplaces. Traditionally discussed in the context of SEO, keyword extraction and keyword research help identify relevant keywords that could generate increased traffic and high-quality leads for a brand or business. The better optimised the content on a website, the higher it ranks in search results, which leads to higher click-through rates. In e-commerce, searching on an online marketplace is akin to searching on Google. Identifying the keywords that customers use when searching for products online enables brands to create content around those keywords to target and engage their audience. This is critical to improving the discoverability of products, directing visitors to the right products through higher click-through rates, and capturing the right sales through higher conversions. Keywords can also help build brand awareness and establish brand credibility among a target audience, and they may help both retailers and brands understand how the market perceives their products. Ultimately, keyword extraction and keyword research can be used in many ways, but primarily they are used to improve organic visibility in the search engines of online marketplaces.
There are many different keyword extraction methods, some more powerful than others. In this blog, we will discuss both traditional statistical methods and deep learning methods of solving this Natural Language Processing (NLP) problem. Each has its own strengths and weaknesses, which we will try to address as comprehensively as possible.
Statistical Techniques
1. TF-IDF
Term Frequency-Inverse Document Frequency (TF-IDF) considers not only how frequent a keyword is within a document but also how rare it is across all documents, i.e., it estimates keyword importance in a document relative to the entire corpus. Its methodology can be described as –
i. Pre-processing, where we may make text lowercase, split the document into sentences, and remove stop words.
ii. Candidate generation, where we generate n-gram candidate phrases (w) without punctuation from each sentence.
iii. Candidate scoring, where we compute the following:
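- Term frequency TF(w) – how often candidate w occurs in the document, usually normalised by the number of candidates in the document.
- Inverse document frequency IDF(w) – how rare w is across the corpus, commonly the logarithm of the total number of documents divided by the number of documents containing w.
- Final candidate score TF-IDF(w) = TF(w) × IDF(w)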
iv. Final ranking, where we sort by descending order of TF-IDF(w) scores and select the top-N candidates as our top-N keywords.
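The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the stop word list, the regular-expression tokeniser, and the function names `candidates` and `tfidf_keywords` are assumptions made for the example, and the IDF used is the plain logarithmic variant (libraries often apply smoothing). It also assumes the document being scored is itself part of the corpus, so every candidate appears in at least one document.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "for", "with", "of", "in", "is", "to"}


def candidates(text, max_n=2):
    """Steps i-ii: lowercase, split into sentences, drop stop words, build n-grams."""
    grams = []
    for sentence in re.split(r"[.!?]+", text.lower()):
        tokens = [t for t in re.findall(r"[a-z0-9]+", sentence) if t not in STOP_WORDS]
        for n in range(1, max_n + 1):
            grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams


def tfidf_keywords(doc, corpus, top_n=3, max_n=2):
    """Steps iii-iv: score every candidate of `doc` against `corpus`, then rank.

    Assumes `doc` is one of the documents in `corpus`.
    """
    doc_counts = Counter(candidates(doc, max_n))
    total = sum(doc_counts.values())
    corpus_sets = [set(candidates(d, max_n)) for d in corpus]
    scores = {}
    for w, count in doc_counts.items():
        tf = count / total                           # term frequency
        df = sum(1 for s in corpus_sets if w in s)   # documents containing w
        idf = math.log(len(corpus) / df)             # inverse document frequency
        scores[w] = tf * idf
    # Highest TF-IDF first, keep the top-N candidates.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


corpus = [
    "Red running shoes for men.",
    "Blue running shoes for women.",
    "Leather wallet for men.",
]
print(tfidf_keywords(corpus[2], corpus, top_n=5))
```

In this toy corpus, "men" appears in two of the three documents, so it receives a lower score than "leather" or "wallet", which appear in only one.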
TF-IDF is fast, as it relies mainly on counting statistics for keywords, and is also language-independent. However, it has its own drawbacks –
- It assumes word frequency provides independent evidence of similarity.
- It requires a corpus.
- It does not include semantics / semantic similarities between words.
2. RAKE
Rapid Automatic Keyword Extraction (RAKE) emphasises word frequency and co-occurrence (bi-gram and greater). Its methodology can be described as –
i. Pre-processing, where we may remove stop words and punctuation.
ii. Candidate generation, where we split the document at the stop word and punctuation positions identified above. Words that occur consecutively between stop words or punctuation are considered candidates.
iii. Candidate scoring, where we compute the following:
- Frequency freq(w) – frequency of individual words within candidates.
- Degree deg(w) – measure word co-occurrence, i.e., favour words occurring often in longer candidates.
- Word score score(w) = deg(w) / freq(w) – identify words occurring more frequently in longer candidates than individually.
- Final candidate score score(kw) – the sum of score(w) over the words w in candidate kw.
iv. Final ranking, where we sort by descending order of final scores and select top-N candidates as our top-N keywords.
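The RAKE steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the stop word list is a tiny hypothetical one, punctuation is detected with a basic regular expression, and the function name `rake_keywords` is invented for the example. Here deg(w) is taken as the total length of the candidates containing w, which counts the word's co-occurrence with itself.

```python
import re
from collections import defaultdict

# Tiny illustrative stop word list (an assumption; RAKE needs a rich one in practice).
STOP_WORDS = {"a", "an", "and", "the", "for", "with", "of", "in", "is", "to"}


def rake_keywords(text, top_n=3):
    # Steps i-ii: split at punctuation, then at stop words; the runs of
    # consecutive words left in between are the candidate phrases.
    phrases = []
    for fragment in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for token in fragment.split():
            if token in STOP_WORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(token)
        if current:
            phrases.append(current)

    # Step iii: word frequency, word degree, word score, candidate score.
    freq = defaultdict(int)
    deg = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            deg[word] += len(phrase)  # co-occurrences, including the word itself
    word_score = {w: deg[w] / freq[w] for w in freq}
    scores = {" ".join(p): sum(word_score[w] for w in p) for p in phrases}

    # Step iv: highest score first, keep the top-N candidates.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


text = ("Wireless noise cancelling headphones with a long battery life, "
        "and a compact charging case for travel.")
print(rake_keywords(text))
```

Because every word score grows with candidate length, the four-word candidate ranks first here, which also illustrates why a weak stop word list tends to produce overly long keywords.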
RAKE is domain-independent but language-dependent, and also has its own drawbacks –
- Relies heavily on a rich stop word list, without which it may produce long texts as candidates.
- Actual keywords containing stop words may be missed.
3. YAKE
Yet Another Keyword Extractor (YAKE) does not need any linguistic features (NER, POS tags, etc.) to extract keywords from documents. Its methodology can be described as –
i. Pre-processing, where we split sentences using spaces and special characters as delimiters, and remove phrases that contain punctuation marks or that begin or end with a stop word.
ii. Candidate generation, where we
- Specify maximum length of candidates required: n
- Use a sliding window to select i-grams for all i ∈ {1, …, n}
iii. Candidate scoring, where we compute the following:
- Casing – placing emphasis on acronyms and words beginning with uppercase letters.
- Position – placing emphasis on words present at the beginning of the document.
- Frequency – counts of words normalised by the mean count plus one standard deviation.
- Relatedness – quantify how related a candidate is to its context by comparing it with other words found on its left and right.
- Different – quantify how often a candidate occurs in different sentences.
- Score – calculate the word score using the five features defined above.
- Final candidate score.
iv. Post-processing, where we prune the list of candidates to remove near-duplicates and/or multiple variations by comparing Levenshtein distances between target and accepted list of candidates. Once computed, we remove those having small distances.
v. Final ranking, where we sort by ascending order of final scores (in YAKE, a lower score indicates a more relevant keyword) and select top-N candidates as our top-N keywords.
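The pruning in step iv, followed by the ascending ranking in step v, can be sketched as below. This is only an illustration of the idea: the function names, the `min_distance` threshold of 0.3, the length normalisation, and the example scores are all assumptions made for the sketch and are not taken from YAKE itself.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def prune_near_duplicates(scored, min_distance=0.3, top_n=5):
    """Keep the best-scoring candidates, skipping near-duplicates.

    `scored` is a list of (candidate, score) pairs where a LOWER score is
    better. `min_distance` is a hypothetical threshold on the edit distance
    normalised by the longer string's length.
    """
    accepted = []
    for cand, score in sorted(scored, key=lambda kv: kv[1]):  # ascending
        is_duplicate = any(
            levenshtein(cand, kept) / max(len(cand), len(kept)) < min_distance
            for kept, _ in accepted
        )
        if not is_duplicate:
            accepted.append((cand, score))
        if len(accepted) == top_n:
            break
    return accepted


# Made-up scores purely for illustration.
scored = [
    ("running shoes", 0.02),
    ("running shoe", 0.03),
    ("leather wallet", 0.05),
    ("shoes", 0.10),
]
print(prune_near_duplicates(scored))
```

Here "running shoe" is dropped because it is only one edit away from the already accepted "running shoes", while "shoes" survives because it differs from it by eight edits.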
YAKE is domain-independent and language-independent and does not require a corpus, but also has its own drawbacks –
- In long texts with multiple topics, keywords that reappear later in the document under a different topic or context may be missed. This is due to the positional information that YAKE attaches to words: later occurrences are given a lower weight than earlier ones.
- Actual keywords containing stop words may be missed.
Deep Learning Techniques
1. KeyBERT
KeyBERT leverages large language model embeddings (here, BERT) and cosine similarity to extract keywords most similar to the document. Its methodology can be described as –
i. Document-level representation, where we utilise BERT to generate a document-level embedding.
ii. Candidate generation and representation, where we use a count vectorizer (or similar) to select n-grams and use BERT to generate embeddings of these n-grams.
iii. Candidate scoring, where we calculate the similarity between the document embedding d and each candidate embedding c using cosine similarity: cos(d, c) = (d · c) / (‖d‖ ‖c‖)
iv. Optionally, increase the diversity of keywords using Max Sum Similarity (MSS) and/or Maximal Marginal Relevance (MMR).
v. Final ranking, where we sort by descending order of final scores and select top-N candidates as our top-N keywords.
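The scoring, diversification, and ranking steps above can be sketched without any deep learning dependency if we assume the embeddings have already been computed. In the example below, the two-dimensional vectors are hand-made stand-ins for BERT embeddings, and the function names and the `diversity` parameter are assumptions made for the illustration rather than the KeyBERT library's own API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def rank_by_similarity(doc_vec, cand_vecs, top_n=3):
    """Steps iii and v: score candidates against the document and rank them."""
    scores = {c: cosine(doc_vec, v) for c, v in cand_vecs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


def mmr(doc_vec, cand_vecs, top_n=3, diversity=0.5):
    """Step iv: Maximal Marginal Relevance.

    Greedily picks the candidate that is relevant to the document but not
    too similar to the candidates already selected.
    """
    remaining = dict(cand_vecs)
    selected = []
    while remaining and len(selected) < top_n:
        def mmr_score(c):
            relevance = cosine(doc_vec, remaining[c])
            redundancy = max(
                (cosine(remaining[c], cand_vecs[s]) for s in selected),
                default=0.0,
            )
            return (1 - diversity) * relevance - diversity * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        del remaining[best]
    return selected


# Toy 2-d vectors standing in for real document and candidate embeddings.
doc_vec = [1.0, 1.0]
cand_vecs = {
    "running shoes": [1.0, 0.9],
    "running shoe": [1.0, 0.8],
    "trail": [0.2, 1.0],
}
print(rank_by_similarity(doc_vec, cand_vecs))
print(mmr(doc_vec, cand_vecs, top_n=2))
```

Plain similarity ranking places the two near-identical candidates at the top, whereas MMR with `top_n=2` keeps "running shoes" and then prefers "trail" over "running shoe", trading a little relevance for diversity.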
KeyBERT is domain-independent and language-independent, but has the following drawbacks –
- Relies on modern language models — this can result in slow performance when using large language models.
- Relies heavily on vectorizers to generate candidates. When coupled with stop word removal, these vectorizers can generate candidates that do not exist in the document.
- Often requires additional help in the form of MSS / MMR to increase diversity of candidates.
In conclusion, keyword research and extraction have changed the face of modern e-commerce, and machine learning has played a large part in that change. Algorithms that quickly and accurately identify the most relevant and important words within a large amount of text have transformed the process of creating and optimising e-commerce content. Identifying these keywords in real time with minimal human intervention, and generating highly targeted content from them, helps e-commerce websites improve their search rankings, discover new trends in their markets, generate sales more effectively, and improve the overall experience of their visitors. With potent tools such as machine learning and NLP, the future of e-commerce is bright.