Computational Linguistics for SEO

Malte Landwehr
13 min read · Mar 21, 2015


Apart from “organic/earned links” and “good content”, there are a couple of other factors Google might consider in the future to rank websites. In this article, I take a look at current scientific research in computational linguistics and interpret the results through the eyes of an SEO.

Even though the website of the Conference on Intelligent Text Processing and Computational Linguistics (CICLing) looks like a bad joke, it features highly interesting content that sparks many ideas for possible current and future ranking factors. I picked 19 scientific research papers that were accepted at CICLing 2013 and examined them with respect to brand, content, SEO tools, website architecture and authorship.

Explanation

The following citations will follow this structure:

Title
relevant quote
Authors

Brand

Topic-oriented Words as Features for Named Entity Recognition:
Topic-oriented words are often related to named entities and can be used for Named Entity Recognition.
Zhang, Cohn and Ciravegna [Springer]

Zhang et al. are not trying to sell relationships between brands and keywords as something new. They assume the existence of such relationships as established knowledge that does not require further explanation. What does that have to do with SEO?

If you search for the term email on google.com, one of the top 5 results is login.yahoo.com. But the term email is nowhere to be found: not once does it say email in either the title or the body. And even the top 10 anchor texts (measured by Majestic) do not contain the term email.

But Google does recognize Yahoo! as an entity and knows that there is a Wikipedia article on the subject (en.wikipedia.org/wiki/Yahoo!). And in that article, the term email is used 10 times!

Additionally, there are probably millions of websites out there that link to login.yahoo.com and mention email somewhere near that link. Actually, I don't just think so, I know:

This is called co-occurrence, and Rand Fishkin has covered the concept well. Yahoo is just one of many examples of this.
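To make the concept concrete, here is a minimal sketch (my own toy illustration, not a method from any of the cited papers) that counts in how many documents a brand and a keyword occur within a few tokens of each other:

```python
def cooccurrence_counts(documents, brand, keyword, window=10):
    """Count documents in which `brand` and `keyword` occur
    within `window` tokens of each other."""
    hits = 0
    for doc in documents:
        tokens = doc.lower().split()
        brand_pos = [i for i, t in enumerate(tokens) if t == brand]
        kw_pos = [i for i, t in enumerate(tokens) if t == keyword]
        # a single close pair is enough to count the document as a hit
        if any(abs(b - k) <= window for b in brand_pos for k in kw_pos):
            hits += 1
    return hits

docs = [
    "check your email at yahoo mail today",
    "yahoo acquired a startup last week",
    "i switched my email provider to yahoo",
]
print(cooccurrence_counts(docs, "yahoo", "email"))  # → 2
```

A real system would of course normalize tokens, resolve entity mentions and work across billions of documents, but the underlying signal is this simple.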

Attention! Co-occurrence is not limited to the content of websites. Co-occurrence of a brand and a keyword can happen in any context. Searches are probably playing a big role, too! Just take a look at Google autocomplete:

In this case, both the content of websites and searches are contexts. But there are many more possible contexts in which brands and keywords can co-occur. For example:

  • emails
  • instant messages
  • videos (via speech-to-text)

When you are trying to establish a brand (or fake one :-D), make sure it is mentioned together with relevant keywords in as many contexts as possible. Relevant keywords in this case are those keywords that occur together with the major brands in your sector in various contexts.

Extracting phrases describing problems with products and services from Twitter messages:
Automatic extraction of problem descriptions from Twitter data. The descriptions of problems are factual statements as opposed to subjective opinions about products/services.
Gupta [PDF]

Gupta shows techniques that extract problems with products from tweets. The important part is that the algorithm allows for a distinction between objective and subjective issues!

Learning: Google might be able to tell whether you are doing Real Company Shit that your customers actually like.

Google might be able to recognize when people are talking about your brand in a negative way and have legitimate problems with your services. Theoretically, such analysis could include Gmail, Google Talk, Google+, Google Voice and everything said or written through the Chrome browser, Android and Chrome OS.

Entity Linking by Leveraging Extensive Corpus and Semantic Knowledge:
Linking entities in free text to the referent knowledge base entries, namely, entity linking is attractive because it connects unstructured data with structured knowledge. […] Furthermore, we propose a novel model for the entity linking, which combines contextual relatedness and semantic knowledge. Experimental results on two benchmark data sets show that our proposed approach outperforms the state-of-the-art methods significantly.
Guo, Qin, Liu and Li

Another team of scientists who care about mentions of entities in texts. And they are linking similar entities! At this point, I want to point to my Yahoo example again: try to build (attract) the same signals the big brands in your market are getting!

Content

Cross-Lingual Projections vs. Corpora Extracted Subjectivity Lexicons for Less-Resourced Languages:
Subjectivity tagging is a prior step for sentiment annotation.
Saralegi, Vicente and Ugarteburu [Springer]

Saralegi et al. are not coming up with a method to determine subjectivity in texts; they assume the existence of such algorithms to be common knowledge! We can deduce that Google is able to differentiate between an objective product description and a subjective report. When you write texts, think beforehand about whether the result should be objective or subjective.

Automatic distinction between natural and automatically generated texts using morphological and syntactic information:
Our work lies in the field of automatic metrics for assessing text quality. […] to distinguish normal texts written by man, on one hand, from automatically generated texts or automatically processed and intentionally damaged natural texts, on the other hand.
Tsinman, Dyachenko, Petrochenkov and Timoshenko [PDF]

Tsinman et al. are able to identify automatically generated texts with a precision of 96%. So don't try to automatically generate texts based on tables of product data!

Discursive Sentence Compression:
A method for Automatic Text Summarization by deleting intra-sentence discourse segments. First, each sentence is divided into elementary discourse units and then, less informative segments are deleted.
Molina, Torres-Moreno, Sanjuan, Cunha and Martínez [PDF]

Molina et al. are able to shorten texts. Their algorithm identifies (and subsequently deletes) text passages which are less informative.

What can we learn from this?

  1. Yes, it is possible to identify content that does not contain relevant information for the reader.
  2. No, it is not a good idea to write a 400-word description for every product in your shop without thinking about your readers “because you need content to fight Panda”.

Automatic Text Simplification in Spanish: A Comparative Evaluation of Complementing Components:
In this paper we present two components of an automatic text simplification system for Spanish, aimed at making news articles more accessible to readers with cognitive disabilities.
Drndarevic, Stajner, Bott, Bautista and Saggion [PDF]

If Drndarevic et al. are able to automatically generate a simplified version of a given text, then Google is probably able to recognize the difficulty of your texts as well. Write content with a complexity that fits your target audience!

Text Simplification for People with Cognitive Disabilities: A Corpus-based Study:
This study addresses the problem of automatic text simplification in Spanish for people with cognitive disabilities.
Stajner, Drndarevic and Saggion

Stajner et al. deliver another example for automated text simplification. If automated text simplification is a thing, automated measuring of text difficulty must be a thing, too!
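If simplification can be automated, measuring difficulty certainly can. As a toy illustration (my own crude proxy, not the authors' system), even average sentence length and average word length already separate easy from hard text:

```python
import re

def readability_proxy(text):
    """Crude text-difficulty proxy: average sentence length (in words)
    and average word length (in characters). Higher values suggest
    harder text; real systems use far richer linguistic features."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    avg_sentence_len = len(words) / len(sentences)
    avg_word_len = sum(len(w) for w in words) / len(words)
    return avg_sentence_len, avg_word_len

simple = "The cat sat. The dog ran. We saw it all."
complex_ = ("Notwithstanding considerable methodological heterogeneity, "
            "the aforementioned investigations demonstrate substantial convergence.")
print(readability_proxy(simple))   # short sentences, short words
print(readability_proxy(complex_)) # much higher on both measures
```

Classic readability formulas (Flesch, and similar) are essentially weighted combinations of exactly these two numbers.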

Automatic Detection of Outdated Information in Wikipedia Infoboxes:
Not all the values of infobox attributes are updated frequently and accurately. In this paper, we propose a method to automatically detect outdated attribute values in Wikipedia. The achieved accuracy is 77%.
Tran and Cao [PDF]

I presented the dumbed down version of this approach at the German 50 Leute. 100 Steaks. conference (50 people. 100 steaks) in 2013 as a trick for linkbuilding on Wikipedia. Back in 2009 Adar et al. wrote about this in Information Arbitrage Across Multi-lingual Wikipedia (PDF).

Now Tran and Cao presented a version that is outright genius. And they documented it in such a way that you can copy it!

The basic idea is this: They take information from a Wikipedia infobox and perform a couple of Google searches in order to determine whether this piece of information is accurate or not.
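That verification idea can be sketched in a few lines. This is my own simplification, with `snippets` standing in for the text of search results about the entity; Tran and Cao's actual pipeline is far more elaborate:

```python
import re

def value_is_current(attribute, infobox_value, snippets):
    """Crude infobox check: does the stored value still appear in
    snippets that talk about the attribute? Returns None when the
    snippets offer no evidence either way."""
    pattern = re.compile(re.escape(infobox_value), re.IGNORECASE)
    mentioning = [s for s in snippets if attribute.lower() in s.lower()]
    if not mentioning:
        return None  # attribute not discussed at all
    return any(pattern.search(s) for s in mentioning)

snippets = [
    "Jane Doe is the current CEO of the company.",
    "The firm announced Jane Doe as CEO last spring.",
]
print(value_is_current("CEO", "Jane Doe", snippets))    # → True
print(value_is_current("CEO", "John Smith", snippets))  # → False (likely outdated)
```

The paper's 77% accuracy suggests that even with web-scale noise, this kind of cross-checking works surprisingly well.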

This work is listed under “content” because if facts on Wikipedia can be verified, then Google can verify facts in your content as well. This is especially critical when you put a text online and then don’t change it over the years.

Learning: Google might prefer accurate information and penalize articles with outdated facts.

Automatic Glossary Extraction from Natural Language Requirements:
We present a method for the automatic extraction of a glossary from unconstrained natural language requirements. We introduce novel linguistic techniques in the identification of process nouns, abstract nouns and auxiliary verbs. The intricate linguistic classification and the tackling of ambiguity result in superior performance of our approach over the base algorithm.
Dwarakanath, Ramnani and Sengupta [IEEE]

Just think about it: you take a long document, feed it into the software from Dwarakanath et al., and get a glossary of the most relevant terms! If you are an SEO, I don’t need to explain what can be done with this. Just a simple idea:

  1. Pick a huge category on Wikipedia
  2. Remove all articles with length < 500 words
  3. Feed all remaining articles in the glossary maker
  4. Put the results on {category}-glossary.com and use appropriate markup
  5. ???
  6. $$$
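Step 3 needs a “glossary maker”. As a deliberately naive stand-in for Dwarakanath et al.'s system (my own sketch; their approach uses linguistic classification of process nouns, abstract nouns and auxiliary verbs), frequency alone already surfaces plausible candidates:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "for", "on", "that"}

def glossary_candidates(text, top_n=5):
    """Naive glossary extraction: the most frequent non-stopword
    terms of length >= 4. Real systems add POS tagging, noun-phrase
    chunking and ambiguity handling on top of this."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) >= 4)
    return [term for term, _ in counts.most_common(top_n)]

text = ("The crawler fetches pages. The crawler respects robots.txt. "
        "Each fetched page is parsed and the parser extracts links. "
        "Links feed back into the crawler queue.")
print(glossary_candidates(text, 3))  # "crawler" dominates, then "links"
```

The gap between this toy and the paper is exactly the “intricate linguistic classification” the authors describe.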

Facet-Driven Blog Feed Retrieval:
The faceted blog distillation task retrieves blogs that are not only relevant to a query but also satisfy an interested facet. The facets under consideration are opinionated vs. factual, personal vs. official and in-depth vs. shallow. Experimental results show that our techniques are not only effective in finding faceted blogs but also significantly outperform the best known results over both collections.
Jia and Yu [PDF]

Jia and Yu build an information retrieval system that not only identifies blog articles relevant to a search term. They take another step and check whether the articles can satisfy the searcher’s intent. To achieve that, they assign each text to the following categories:

  • opinionated or factual
  • personal or official
  • superficial or in-depth

They then combine these facets in a scoring formula. Please check their work (linked above) for details!

Depending on the search context (for example, derived from a user’s search history, their Google+ profile and current location), Google could prefer documents from one category or another. Again: when you write texts for the internet, make sure you know beforehand which of the above-mentioned characteristics you need!

Predicting Subjectivity Orientation of Online Discussion Threads:
Topics discussed in online forum threads can be subjective seeking personal opinions or non-subjective seeking factual information. Hence, knowing subjectivity orientation of threads would help in satisfying users’ information needs more effectively. Experimental results on two popular online forums demonstrate the effectiveness of our methods.
Biyani, Caragea and Mitra [PDF]

If Biyani et al. are able to recognize whether a discussion is dominated by factual information or personal opinion, then Google can probably do that as well. Depending on your website’s intention, you might want to consider moderating comments and forum posts more strongly. The magic word is community management!

SEO Tools

Analyzing the Sense Distribution of Concordances Obtained by Web As Corpus Approach:
Some authors have proposed using the Internet as a source of corpora […] based on information retrieval-oriented web searchers. This work analyzes the linguistic representativeness of concordances obtained by different relevance criteria based web search engines. Sense distributions in concordances obtained by web search engines are, in general, quite different from those obtained from the reference corpus.
Saralegi and Gamallo [Springer]

Saralegi and Gamallo demonstrate that text corpora based on documents retrieved through Google searches are not necessarily representative when you look at their linguistic properties. This means that it might not be enough to look at the first 100 Google results for “insurance” in order to perform text analysis like WDF-IDF or n-grams.
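For intuition, here is what such an analysis looks like in its simplest form: a toy TF-IDF computation over a hand-made three-document “corpus” (my own sketch, not the tooling any SEO suite actually ships):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Simple TF-IDF weight per term per document:
    (term frequency in doc) * log(N / document frequency)."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # document frequency: one count per doc
    n = len(docs)
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: (tf[t] / len(tokens)) * math.log(n / df[t])
                        for t in tf})
    return weights

docs = [
    "insurance quotes for car insurance",
    "cheap car insurance online",
    "best restaurants in berlin",
]
w = tf_idf(docs)
print(w[2]["restaurants"])  # high: rare term in a short document
```

Saralegi and Gamallo's warning applies here: if the documents you feed in are a skewed sample, every weight that comes out is skewed too.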

N-Gram-based Recognition of Threatening Tweets:
We investigate to what degree it is possible to recognize tweets in which the author indicates planned violence.
Oostdijk and Halteren [Springer]

That Oostdijk and Halteren are trying to identify tweets containing an intent for violence has nothing to do with online marketing; it is only listed as another example of the power of n-grams. By the way: how do you think the NSA analyzes text messages and Facebook chats? ;-)

Site architecture

Single-Document Keyphrase Extraction in Multiple-Document Topical Keyphrase Extraction:
Here, we address the task of assigning relevant terms to thematically and semantically related sub-corpora and achieve superior results compared to the baseline performance. […] were considered better in more than 60% of the test cases.
Berend and Farkas [PDF]

You want to automatically generate tags or meta keywords for thousands of documents? Then you should take a look at this approach by Berend and Farkas!

A knowledge-base oriented approach for automatic keyword extraction from single documents:
A generic approach for keyword extraction from documents. The features we used are generic and do not depend strongly on the document structure. We show that it improves the global process of keyword extraction.
Jean-Louis, Gagnon and Charton [PDF]

Jean-Louis et al. show another approach to automating the creation of tags or meta keywords.

Don’t Use a Lot When Little Will Do — Genre Identification Using URLs:
In this work we build a URL based genre identification module for Sandhan, a search engine which offers search in tourism and health genres in more than 10 different Indian languages. While doing our experiments we work with different features like words, n-grams and all grams. Using n-gram features we achieve classification accuracies of 0.858 and 0.873 for tourism and health genres respectively.
Priyatam, Iyenger, Perumal and Varma [PDF]

Priyatam et al. are using n-grams to determine a website’s topic just by looking at the URL. They don’t have to crawl the website at all; not even a header request. And they still achieve over 85% accuracy!

Never forget that even with all its server power, Google has an interest to minimize unnecessary requests in order to save costs. In case Google uses this approach: Put the most relevant keywords in the URL!
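The core trick can be sketched with character n-grams and a nearest-profile match. This is a crude stand-in for the paper's classifier, with made-up example URLs, not their actual model:

```python
from collections import Counter

def char_ngrams(url, n=3):
    """Character n-grams from a URL, lowercased, scheme stripped."""
    url = url.lower().split("://")[-1]
    return Counter(url[i:i + n] for i in range(len(url) - n + 1))

def classify(url, profiles):
    """Assign the genre whose training n-gram profile overlaps most
    with the URL's n-grams (Counter & Counter keeps shared grams)."""
    grams = char_ngrams(url)
    return max(profiles, key=lambda g: sum((grams & profiles[g]).values()))

# Toy "training" profiles built from one URL each; a real system
# would aggregate thousands of labeled URLs per genre.
profiles = {
    "tourism": char_ngrams("http://example.com/travel/hotels-in-goa"),
    "health": char_ngrams("http://example.com/health/diabetes-symptoms"),
}
print(classify("http://example.com/travel/goa-beaches", profiles))  # → tourism
```

Note that the classifier never fetches anything: the URL string alone carries the signal, which is exactly why this is so cheap at Google's scale.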

Authorship

The Use of Orthogonal Similarity Relations in the Prediction of Authorship:
Recent work on Authorship Attribution (AA) proposes the use of meta characteristics to train author models. The meta characteristics are orthogonal sets of similarity relations between the features from the different candidate authors. […] we achieve consistent improvement of prediction accuracy.
Sapkota, Solorio, Montes-Y-Gómez and Rosso [PDF]

Sapkota et al. did not come up with the idea of matching texts to authors. They simply suggest an alternative process. If we assume Google uses similar algorithms, it might be a bad idea to mark up texts from multiple writers with a single Google+ profile. Additionally, texts from a single writer should not be published with authorship from multiple Google+ profiles. In both cases Google might recognize foul play and take away the authorship markup in SERPs for the domain or the Google+ accounts.

Syntactic Dependency-based N-grams: More Evidence of Usefulness in Classification:
Sn-grams differ from traditional n-grams in the manner of what elements are considered neighbors. In case of sn-grams, the neighbors are taken by following syntactic relations in syntactic trees, and not by taking the words as they appear in the text. Sn-grams can be applied in any NLP task where traditional n-grams are used. We describe how sn-grams were applied to authorship attribution. Obtained results are better when applying sn-grams.
Sidorov, Velázquez, Stamatatos, Gelbukh and Chanona-Hernández [PDF]

N-grams are an integral part of many of the scientific papers discussed in this article. Just take a look at the Google Ngram Viewer if you still doubt how awesome n-grams are! Now Sidorov et al. suggest that some n-gram-based algorithms will yield even better results when performed on sn-grams!
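For readers unfamiliar with the distinction: plain word n-grams take adjacent surface tokens, while sn-grams follow edges in a dependency parse. A minimal sketch of the former (the latter requires a syntactic parser and is beyond a few lines):

```python
def word_ngrams(text, n=2):
    """Plain word n-grams: neighbors are adjacent surface tokens.
    Sn-grams would instead follow syntactic-tree edges, so e.g. a verb
    pairs with its object even when other words sit between them."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(word_ngrams("the quick brown fox", 2))
# → [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
```

Sidorov et al.'s point is that the syntactic variant captures an author's style more robustly than surface adjacency does.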

Furthermore, the work by Sidorov et al. is another example of automatic text-to-author mapping. See above for the resulting consequences.

Constraints

Obviously, the conclusions I drew from these research results are rather exaggerated. Many of these algorithms have been implemented for just one language and were only tested on a few datasets. Additionally, just because someone somewhere is engaging in this research, Google does not necessarily know or care about it. However, all cited papers went through a peer review process, meaning that they were approved by researchers who work in the same field. And CICLing is not just any conference: it is B ranked, among the top 25% most downloaded at Springer, counted by Google Scholar as one of the 6 most important computational linguistics conferences, and listed among the top 8 NLP conferences by impact factor at ArnetMiner.

I know that the “Yahoo! Mail is ranking for email” example is not a very good one. I do my SEO work almost exclusively in Germany and did not want to perplex you with a German example. Unfortunately I did not find a suitable English example, so Yahoo Mail had to do, even though Mail is very similar to email and the yahoo.com domain has enough trust, authority and backlinks to rank for anything. Feel free to leave a comment if you stumble upon a better example!

One other note: [T]o prevent quotes from […] becoming illegible, […] I quoted in a relatively [free] manner without labeling this each time.

Conclusion

I hope I was able to convince you that computational linguists are engaging in a couple of topics that might impact SEO over the next couple of years. Sadly, not many SEOs care about such scientific approaches and almost no one takes the time to blog about such things. If I was able to motivate some of you to change this, writing this article was worth it.

Full disclosure: The German version of this article was released in January 2013 over at SISTRIX.
