Capturing UGC semantics for SEO

Nikita Mathur
Published in IndiaMARTlab · 8 min read · Dec 25, 2018

Using NLP to capture important keywords from user generated content

Semantics is powerful because..

At IndiaMART, we seek to connect buyers & sellers. We need not only to capture the wide variety of keyword searches by potential buyers, but also to record the semantic importance of these keywords.

It is essential to utilize the keywords internet users are searching for (UGC, User Generated Content) and to update our content accordingly for better SEO, so that we consistently rank high on SERPs. We need to convert these keywords, along with our existing keywords, into useful content through our category pages, FAQ section, meta content, etc.

Why is keyword usage a challenging task?

SEO is not just search engine optimization but also “Something Extremely Obscure”

Since our keywords are long tail and the terms are used in varying contexts, we need to target every keyword through the most relevant channel.

A few examples of keywords for the Air Compressor category are as follows.

Classifying a keyword into the one (or more) SEO channels where it can be used (FAQ section, meta content, product specifications, product names, alternate names for products, etc.) is a challenge. The sources of keywords are explained in detail below.

Sources of keywords identified

  1. Webmaster — Keywords buyers/sellers use to reach the IndiaMART website, obtained using the webmaster tool
  2. Product Display Page — Name of a supplier’s product page
  3. Product Groups — Group names used in IndiaMART seller catalogs to group related products
  4. Keywords searched on IndiaMART website — Keywords typed into the search bar on the IndiaMART website
  5. Related keyword search — Keywords that appear at the bottom of a Google results page when a keyword is searched. See an example in the following image

Even better..Google up something & just scroll down..

Channels to utilize keywords

These are the channels through which we can utilize the UGC effectively; they are the output of this experiment.

  1. New product name
  2. A new alternate name for a product
  3. Important product specification
  4. Meta content
  5. FAQ section
  6. Geographic hubs (cities, states, countries)
  7. Generic stop-words

Choosing the right tool

A common problem in Natural Language Processing (NLP) is capturing the context in which a word has been used. Words with the same spelling and pronunciation (homonyms) can be used in multiple contexts, and a potential solution to this problem is computing word representations.

FastText is a library created by the Facebook Research team for efficient learning of word representations and sentence classification. It has gained a lot of traction in the NLP community and is a possible substitute for the gensim package, which provides word-vector functionality. The difference is that word2vec treats every single word as the smallest unit whose vector representation is to be found, whereas fastText assumes a word to be formed of character n-grams: for example, with n = 3, sunny is composed of the subwords sun, unn and nny (fastText also adds word-boundary symbols), where n can range from 1 up to the length of the word. Because fastText learns vectors for these sub-parts of words, words such as “supervised”, “supervise” and “supervisor” all get similar vector representations, even if they tend to appear in different contexts. This feature improves learning on heavily inflected languages.
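To make the subword idea concrete, here is a minimal sketch (plain Python, not fastText's actual implementation) of how a word decomposes into character n-grams, including the boundary symbols `<` and `>` that fastText wraps around each word:

```python
def char_ngrams(word, minn=3, maxn=6):
    """Enumerate the character n-grams fastText would extract for a word.
    fastText first wraps the word in boundary symbols '<' and '>'."""
    w = f"<{word}>"
    grams = []
    for n in range(minn, maxn + 1):
        for i in range(len(w) - n + 1):
            grams.append(w[i:i + n])
    return grams

# "sunny" and "sunnier" share subwords such as "sun" and "unn",
# so their learned vectors stay close.
print(char_ngrams("sunny", minn=3, maxn=3))  # ['<su', 'sun', 'unn', 'nny', 'ny>']
```

A word's vector is then the sum of the vectors of its n-grams, which is why rare and inflected forms still get sensible representations.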

Since the keywords we want to target are long tail and used in varying contexts, fastText serves the purpose.

Summary of the process formed

This process is done category-wise and is divided into 2 phases; two sets of data are used — training data & testing data.

Phase 1

Training data — divided into 7 buckets with different labels (product category name, product specification, product category alternate name, geographic hub name, FAQ section, meta content, generic stop-words).

Examples of training data are as follows:

These labels are the channels through which the keywords are to be utilized for SEO.

Testing data — Keywords collected from the sources already mentioned above.

The model is first trained on these 7 labels. Each keyword is then matched with one of the 7 labels.

Phase 2

Training data is more detailed, as all the labels are extended to include a specific product name, generic stop-word, etc. in the label itself. For example — __label__product__air_compressor, __label__faq_how.

The number of labels increases in this phase, which allows more specific tagging of the keywords. The model is trained with these labels, after which the testing is done. Keywords matched with a unique label in phase 1 are further matched to the specific label of that kind in phase 2. A probability cut-off is decided for each of the 7 phase-1 labels, above which the keyword-to-label matching is observed to be correct; these thresholds were decided by reviewing the matches manually. The filtered keywords are then used for SEO via the channels they matched with.
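The per-label cut-off step can be sketched as follows. Note that the threshold values and the prediction structure here are purely illustrative, not the ones actually tuned in this experiment:

```python
# Hypothetical per-label probability cut-offs (illustrative values only;
# the real ones were decided by manual review of the matches).
CUTOFFS = {
    "__label__prod": 0.80, "__label__spec": 0.75, "__label__alt": 0.70,
    "__label__faq": 0.80, "__label__stop-words": 0.90,
    "__label__city": 0.85, "__label__meta": 0.75,
}

def filter_matches(predictions):
    """Keep only keyword->label matches whose predicted probability
    clears the cut-off decided for that label."""
    kept = {}
    for keyword, (label, prob) in predictions.items():
        if prob >= CUTOFFS.get(label, 1.0):
            kept[keyword] = label
    return kept

preds = {
    "air compressor price": ("__label__meta", 0.81),
    "compressor in delhi": ("__label__city", 0.60),  # below cut-off, dropped
}
print(filter_matches(preds))  # {'air compressor price': '__label__meta'}
```

Only the keywords that survive this filter move on to the SEO channels they matched with.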

A little complicated? Read it again..

Final process flowchart

Experimenting with the model..explained in detail..

You might want to read a little about machine learning along with this article….

Phase 1 — To tag all keywords by 1 of the 7 basic labels [product category name, product specification, product category alternate name, geographic hub, FAQ section, meta content, generic stop-words]

1. Gathering the data

The experiment was performed on the Air Compressor group, comprising:

  1. Industrial Air Compressors
  2. Reciprocating Compressor
  3. Screw Air Compressor

Making training data & testing data..

Training data

Two sets of training data were made: phase 1 training data & phase 2 training data.

Phase 1 training data is very basic. Seven labels were created over the existing text corpora for the Air Compressor cluster. These 7 labels are:

  1. Product label : __label__prod
  2. Specification label : __label__spec
  3. Alternate name label : __label__alt
  4. FAQ label : __label__faq
  5. Generic stop-words label : __label__stop-words
  6. Geographic hub label : __label__city
  7. Meta related terms label : __label__meta

In this phase, the name of a product, spec, etc. is not present in the label; it is present only in the corpus. For example: __label__spec brand atlas copco

Phase 2 training data is more detailed in terms of labels, as the name of the product/spec is encoded in the label as well.

For example — __label__spec_brand brand atlas copco
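The two formats can be generated from the same underlying record; here is a small sketch (the helper function and field names are illustrative, not from the actual pipeline):

```python
def training_lines(field, value_name, text):
    """Build the same example in phase-1 and phase-2 fastText format.
    Phase 1 keeps the label coarse; phase 2 encodes the entity in it."""
    phase1 = f"__label__{field} {text}"
    phase2 = f"__label__{field}_{value_name} {text}"
    return phase1, phase2

p1, p2 = training_lines("spec", "brand", "brand atlas copco")
print(p1)  # __label__spec brand atlas copco
print(p2)  # __label__spec_brand brand atlas copco
```

Keeping both formats derived from one source avoids the two phases drifting out of sync as the corpora grow.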

Testing data

151 keywords were collected from 5 keyword sources (Product Groups, Google Analytics, Webmaster, Related keywords, IndiaMART internal search keywords) for evaluation.

This collection of keywords is the testing data. As explained above, these keywords are the input data for this funnel.

2. Preparing the data

The 151 keywords and the existing corpus were cleaned using the steps below:

  1. Converted all the text to lower case
  2. Trimmed the data
  3. Removed the most recurring term — Compressor — from the testing & training data using a Python stop-word removal script
  4. Removed all special characters
  5. Filtered the 151 keywords on the basis of Google search volume/month (for India)

3. Training the model

The model was trained on 7 unique labels (product, alt names, specs, meta, hubs, generic stopwords, FAQs).

The model was trained with fastText using the hyperparameter values minn 3, epoch 50, lr 1, wordNgrams 1.

  • minn — minimum length of a character n-gram
  • lr — the learning rate controls how “fast” the model is updated during training: this parameter controls the size of the update applied to the parameters of the model
  • wordNgrams — maximum length of a word n-gram

4. Testing & evaluation

Testing the 151 keywords on the model trained with 7 labels gave vague matches between keywords & labels, as minn 3 led to underfitting.

Few examples of data tested

Parameter tuning — altered minn from 3 to 5 & epoch from 50 to 20, keeping all other parameters constant. The model was trained again.

Few examples

5. Improved result

~90% accuracy was achieved in matching the right labels with the keywords.

Phase 2 — Matching keywords that have matched with 7 labels to a more specific entity.

For example -

Gathering the data

Testing data — same 151 keywords as in phase 1.

Training data:

  • 201 unique product categories of Air Compressor cluster
  • Air compressor standard & user generated specs
  • Air compressor Hindi/English alternate names
  • Custom list of meta related keywords, generic stopwords, geographic hubs for this cluster

Data was prepared as in phase 1.

The model was trained with minn 5, epoch 20, lr 1, wordNgrams 1 when matching keywords with products, alternate names & specs.

The model was trained with maxn 0, epoch 20, lr 1, wordNgrams 1 when matching keywords with meta-related keywords, generic stopwords & hubs.

Testing & evaluation:

  • Keywords matched with every unique label were tested for the specific label of their kind
  • The specific match received was in sync with the ideal match in 90% of the cases

Result & Discussions

The following confusion matrix was used to compute precision & recall and to check the performance of this model.

Precision: of the instances predicted, how many were correctly predicted?

Precision = [True Positive/(True Positive +False Positive)]

True Positive + False Positive = Total Predicted

Recall: of the actual instances, how many were correctly captured?

Recall = [True Positive/(True Positive+False Negative)]

True Positive + False Negative = Total Actual
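The two definitions above translate directly into code. The counts below are hypothetical, chosen only to show the arithmetic, and are not the experiment's actual confusion-matrix values:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical example: 135 keywords tagged correctly,
# 10 tagged wrongly, 6 missed entirely.
p, r = precision_recall(tp=135, fp=10, fn=6)
print(round(p, 3), round(r, 3))  # 0.931 0.957
```

High precision with lower recall means the matches we keep are trustworthy but some usable keywords are missed, which is exactly the trade-off the insights below discuss.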

Insights

False negatives & false positives are highly critical, as they can lead to errors in the output that go unnoticed when dealing with a large number of keywords from multiple sources at once.

The number of predicted negatives needs to be reduced to avoid manual work.

Important specs identified based on search volume and count of keywords matched

  • Power (HP)
  • Brand
  • Type

Important content for FAQ section identified

  • Air compressor working principle
  • Reciprocating air compressor vs other

Conclusion

  • As of now, fastText is the best-suited tool for identifying the right usage of user generated keywords for building SEO & identifying key/config specifications
  • Manual effort is close to zero
  • Accuracy is as high as 90%
  • Once the model is prepared & the right parameters are identified (minn, maxn, epoch, lr, wordNgrams values), the time consumed to run the process horizontally on all the categories of products (apparel, home decor, construction machinery etc.) is very low

Way Ahead

Phase 3

  • To include additional sources of keywords from SEO tools and internal channels
  • Scale across all categories on Indiamart

Phase 4

  • To experiment using other NLP models and techniques

Interested in making a career out of it? Reach me at nikita.mathur@indiamart.com
