TF*IDF and How it Works with SEO
Since you’re reading this post, you may already be familiar with tf*idf (at least to some extent).
There are plenty of well informed articles, and even some tools, out there that are worth your attention.
But what matters more is the research and the math, and I think understanding how these fit together is worthy of a new post.
There are a lot of folks much smarter than me building natural language processing engines at a much higher level than I’m capable of, but that’s not what I’m here to talk to you about.
Instead I want to show you what we’ve built and how we’re using it. If you just want to go grab some tf*idf data and not muck about reading the rest of this post, you can go ahead and do that by clicking the button below.
The current version is built to support English only, but we have plans to add other languages in the future.
This post isn’t just meant to show off a beta demo of our shiny new tool, though. I’m hoping it starts a conversation around optimizing content for SEO by focusing on topic relevance.
Considerations Beyond Information Architecture
Crazy as it may seem, I’ve found that topic modeling and optimizing content to speak to specific concepts hold more sway with Google these days than even URL and information architecture in some respects.
I know, I know: it’s heresy coming from someone who has long preached the importance of IA as the foundation for any high-performance website, but Google’s approach to ranking pages based on topical relevance and intent has changed.
And most (if not all) of these beautiful database-driven libraries are not only accessible but freely available for us to use, process, and build upon.
So What’s a Technical SEO To Do?
Take advantage, of course.
We’ve built a tool that analyzes the term population and frequency of the top 20 organic ranking URLs in Google for your input keyword and spits out the tf*idf calculation for each term. Then, if you so choose, it scores those terms against your target URL and/or a sample of content from your document.
The purpose of this is to see how your current content uses the terms that appear on the pages Google has deemed worthy of a top ranking for the same target keyword: what topics and concepts are represented, and how often these terms appear (or don’t appear) in the overall document population.
From here you can adjust your content to include more of the terms Google may be expecting to see, at the frequency Google may be expecting to see them.
If you haven’t yet built the page or created the content, that’s fine too. Just don’t set a target URL and instead run the report for a keyword to see what topics you should be addressing in your content.
What is TF*IDF
Tf*idf stands for term frequency times inverse document frequency.
Tf*idf is a numerical statistic used in information retrieval to represent how important a specific word or phrase is to a given document.
Wikipedia goes on to define tf*idf as:
The tf-idf value increases proportionally to the number of times a word appears in the document, but is often offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.
Tf*idf is often used as the term-weighting step in latent semantic indexing (or LSI), a natural language processing (NLP) technique that allows systems to rank documents based on relevance against a specific term or topic.
The goal of this approach is to make sense of a population of unstructured content: to score what each document is about and how strongly it represents that topic or concept versus other documents in the sample population.
Ultimately, the purpose is to allow machines to understand what pages are about.
LSI came about as a way to work around the 2 most challenging constraints of using boolean logic for keyword queries: multiple words that have similar meanings (synonymy) and words that have more than one meaning (polysemy).
This approach places weights on terms in a document as determined by 3 factors:
Term Frequency
How often does the term appear in this document?
The more often, the higher the weight. A field containing five mentions of the same term is more likely to be relevant than a field containing just one mention.
The term frequency is calculated as follows:
tf(t in d) = √frequency
Inverse Document Frequency
How often does the term appear in all documents in the collection? The more often, the lower the weight.
Common terms like “and” or “the” contribute little to relevance, as they appear in most documents, while uncommon terms like “future” or “SEO” help us zoom in on the most interesting documents.
The inverse document frequency is calculated as follows:
idf(t) = 1 + log ( numDocs / (docFreq + 1))
Field-Length Norm
How long is the field? The shorter the field, the higher the weight. If a term appears in a short field, such as a title field, it is more likely that the content of that field is about the term than if the same term appears in a much bigger body field.
The field length norm is calculated as follows:
norm(d) = 1 / √numTerms
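To make the three formulas above a little more concrete, here is a minimal Python sketch of each one. The function names are my own, the logarithm is assumed to be the natural log (the formula above doesn’t say which base), and multiplying the three factors together at the end is only an illustrative assumption about how they might be combined, not a description of any particular search engine.

```python
import math


def term_frequency(frequency: int) -> float:
    """tf(t in d) = sqrt(frequency): more mentions, higher weight."""
    return math.sqrt(frequency)


def inverse_document_frequency(num_docs: int, doc_freq: int) -> float:
    """idf(t) = 1 + log(numDocs / (docFreq + 1)); natural log is assumed here."""
    return 1 + math.log(num_docs / (doc_freq + 1))


def field_length_norm(num_terms: int) -> float:
    """norm(d) = 1 / sqrt(numTerms): shorter fields, higher weight."""
    return 1 / math.sqrt(num_terms)


# Hypothetical numbers: a term that appears 4 times in a 100-term field
# and is found in 9 documents out of a collection of 1,000.
tf = term_frequency(4)
idf = inverse_document_frequency(1000, 9)
norm = field_length_norm(100)

# Assumption for illustration only: combine the factors by multiplying them.
weight = tf * idf * norm
print(tf, idf, norm, weight)
```

Notice how the square root dampens term frequency: quadrupling the number of mentions only doubles the tf value.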
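To see the core idea with simple numbers, set the square root and the field-length norm aside and use the plain textbook definitions: tf as the term’s share of the words in a document, and idf as the base-10 log of the total number of documents divided by the number of documents containing the term.
Consider a document containing 100 words wherein the word “SEO” appears 3 times.
The term frequency (i.e., tf) for “SEO” is then (3 / 100) = 0.03.
Now, assume we have 10 million documents and the word “SEO” appears in one thousand of these. Then, the inverse document frequency (i.e., idf) is calculated as log10(10,000,000 / 1,000) = 4.
Thus, the tf*idf weight is the product of these quantities: 0.03 * 4 = 0.12.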
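The arithmetic in this example can be checked with a few lines of Python. This is only a sketch: it assumes the simplified definitions the example uses (tf as a share of the words in the document, idf as a base-10 logarithm), and the variable names are made up for illustration.

```python
import math

words_in_document = 100
occurrences_of_term = 3
total_documents = 10_000_000
documents_with_term = 1_000

# Simplified definitions, matching the worked example in the text.
tf = occurrences_of_term / words_in_document
idf = math.log10(total_documents / documents_with_term)
tf_idf = tf * idf

print(round(tf, 2), round(idf, 2), round(tf_idf, 2))  # 0.03 4.0 0.12
```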
An N-Gram is a sequence of N words that appear next to each other within a given population of text.
These are computed as part of analyzing the topics contained within a document, typically by sliding a window forward one word at a time, though you can move X words forward when applying calculations to more complex data sets.
For the purposes of calculating tf*idf, terms are usually calculated as unigrams (one word terms), bigrams (2 word terms), or trigrams (you guessed it, 3 word terms).
As an example, take the sentence “The SEO needs more links to rank the page.” The bigrams would be:
- The SEO
- SEO needs
- needs more
- more links
- links to
- to rank
- rank the
- the page
So in this example there are 8 bigrams: a sentence of 9 words yields 9 - 2 + 1 = 8 two-word terms. If we were instead to look at the trigrams of this same sentence, they would be:
- The SEO needs
- SEO needs more
- needs more links
- more links to
- links to rank
- to rank the
- rank the page
So with N=3, the total drops to 7 n-grams (9 - 3 + 1 = 7).
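The sliding window described above can be sketched in a few lines of Python. The `ngrams` helper is a hypothetical name of my own, and it assumes the simplest possible tokenization (splitting on whitespace), which a real tool would likely replace with something more careful.

```python
def ngrams(text: str, n: int) -> list[str]:
    """Slide a window of n words across the text, one word at a time."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "The SEO needs more links to rank the page"
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)

print(bigrams)
print(trigrams)
print(len(bigrams), len(trigrams))  # 8 7
```

A text with fewer than n words simply returns an empty list, since there is no room for the window.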
When it comes to computationally processing natural language (especially for SEO), topics seem to be best represented by bigrams and trigrams, so it’s important to understand the distinction.
Why Are TF*IDF and LSI Important for SEO
An over-simplified answer is that these techniques are among the building blocks of search engines, and of how Google scores your pages and associates them with keywords related to the document’s content.
Another way to think about this is Google has billions of pages to crawl and score for relevance on topics that surround a user’s submitted query.
In order to return results Google needs to rank these documents based on relevance.
Not all of the documents will contain all of the terms relevant to the query, and some terms are more important than others. The relevance score of a document depends, at least in part, on the weight of each term that appears in the document.
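As a rough illustration of that last point, the sketch below ranks a tiny made-up collection of documents for a query by summing the tf*idf weight of each query term found in each document. It is a toy under stated assumptions (the simplified tf and base-10 idf definitions, a hypothetical `score` function, and four invented documents), and not a description of how Google actually scores pages.

```python
import math
from collections import Counter


def score(query_terms, document_tokens, corpus):
    """Sum tf * idf over the query terms that appear in the document."""
    counts = Counter(document_tokens)
    total = 0.0
    for term in query_terms:
        if counts[term] == 0:
            continue  # a missing term contributes nothing
        tf = counts[term] / len(document_tokens)
        doc_freq = sum(1 for doc in corpus if term in doc)
        if doc_freq == 0:
            continue  # the document is not part of this corpus
        idf = math.log10(len(corpus) / doc_freq)
        total += tf * idf
    return total


# Invented mini-corpus, already tokenized into words.
corpus = [
    "seo needs links".split(),
    "links and more links".split(),
    "content optimization for seo".split(),
    "recipes for dinner".split(),
]
query = ["seo", "links"]

# Document indexes, most relevant first.
ranked = sorted(
    range(len(corpus)),
    key=lambda i: score(query, corpus[i], corpus),
    reverse=True,
)
print(ranked)  # [0, 1, 2, 3]
```

The first document wins because it contains both query terms, even though the second mentions “links” more often.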
What Do The Results Look Like?
So while running these types of calculations is actually pretty straightforward, it’s not really as simple as just adding and dividing up a bunch of term and word counts.
Instead it’s best to lean on some of the open-source libraries (like this one written in Python) and hook these up to some HTML crawlers to process this data for you more accurately.
You’ll want to run a target URL for a target keyword and, more importantly, you’ll need a population to run your URL against, so that you have something to analyze the results against and make sense of them.
We use the top 20 organic ranking pages on Google for the target keyword, and then scrape all the HTML on those pages, strip out the header, footer, nav, and common stop words, and then calculate the tf*idf on the remaining document corpus.
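To give a feel for the steps just described, here is a simplified Python sketch. It is not our tool’s actual code. It assumes the page text has already been fetched and had its header, footer, and nav removed, it uses a tiny made-up stop word list, it reuses the idf formula from earlier in this post, and averaging each bigram’s weight across the pages is my own illustrative choice.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "or", "the", "to"}


def bigrams(text):
    """Lowercase the text, drop stop words, and pair up the remaining words."""
    words = [w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOP_WORDS]
    return [" ".join(pair) for pair in zip(words, words[1:])]


def corpus_tf_idf(pages):
    """Return {bigram: mean tf*idf across pages}, with idf = 1 + ln(N / (df + 1))."""
    per_page = [Counter(bigrams(page)) for page in pages]
    # Number of pages each bigram appears on.
    doc_freq = Counter(term for counts in per_page for term in counts)
    scores = Counter()
    for counts in per_page:
        total = sum(counts.values())
        for term, count in counts.items():
            idf = 1 + math.log(len(pages) / (doc_freq[term] + 1))
            scores[term] += (count / total) * idf / len(pages)
    return dict(scores)


# Stand-ins for the cleaned-up text of top-ranking pages.
pages = [
    "Content optimization is the key to content marketing.",
    "A guide to content marketing and content optimization.",
    "Title tag tips for content optimization.",
]
weights = corpus_tf_idf(pages)
top_terms = sorted(weights, key=weights.get, reverse=True)[:3]
print(top_terms)
```

Note that because stop words are removed before pairing, a bigram can join two words that were separated by a stop word in the original text.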
So for a quick example I’m going to take a look at the tf*idf weights for the keyword “content optimization” as a term, on its own:
If I mouse over each of the individual bigram bars, I can see the approximate term weight for each across this population of URLs:
As you can see, the approximate term frequency for “content marketing” across Google’s current top 20 ranking URLs for “content optimization” is 1.97%.
If you want to specifically review how a page you’re building to target this keyword and related topics stacks up against the current top-ranking URLs on Google, then set a “target URL.”
From there, we’re able to highlight specific terms appearing at varying weights across the 2 document sets, and start to identify where there are variances between your target URL and the term frequency among the top 20 ranking pages.
This can be a URL on your site OR one of your competitors’ URLs, so for the purposes of this example I picked an article on KaiserTheSage.com that is currently ranking in position #2, and ran it through the tool:
So in the KaiserTheSage.com article, you can immediately see that they’re using the terms “high quality,” “search engine,” “organic traffic,” “keyword phrase,” “start with,” “title tag,” and “alt tag” more than the average across the other top ranking URLs (but not by much).
It’s also worth noting how representative Kaiser’s (Jason Acidre’s) post is of the most used bigram phrases across the top 20 ranking post corpus.
Please Note: you will notice some variance between the 2 data runs; the first without a target URL and the second above with a target URL. This is due to bigram normalization between the 2 corpus sets with and without the target URL.
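The comparison between a target URL and the ranking pages can be sketched roughly as follows. This is an assumption-laden illustration and not how our tool is implemented: the `variances` function is a hypothetical name, the terms are assumed to be bigrams that have already been extracted, the numbers are invented, and the “variance” here is simply the target page’s term frequency minus the average frequency across the other pages.

```python
from collections import Counter


def term_frequencies(terms):
    """Map each term to its share of all terms in one document."""
    counts = Counter(terms)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def variances(target_terms, corpus_terms_per_page):
    """Return {term: target frequency minus mean frequency across the corpus}."""
    target = term_frequencies(target_terms)
    pages = [term_frequencies(terms) for terms in corpus_terms_per_page]
    vocabulary = set(target).union(*pages)
    return {
        term: target.get(term, 0.0) - sum(p.get(term, 0.0) for p in pages) / len(pages)
        for term in vocabulary
    }


# Invented bigram lists standing in for a target page and two ranking pages.
target = ["title tag", "title tag", "organic traffic", "content marketing"]
ranking_pages = [
    ["title tag", "content marketing", "content marketing", "link building"],
    ["content marketing", "organic traffic", "link building", "link building"],
]

diff = variances(target, ranking_pages)
overused = sorted(term for term, delta in diff.items() if delta > 0)
underused = sorted(term for term, delta in diff.items() if delta < 0)
print(overused)   # ['organic traffic', 'title tag']
print(underused)  # ['content marketing', 'link building']
```

Terms in the “underused” list are the ones the ranking pages lean on more heavily than the target page does, which makes them natural candidates to review first.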
What To Do With This Data
Ideally you would have an editor with a live view so you could rework your content to better build out the focus of topics and terms that Google is expecting to see…
Our tool in its current form doesn’t do this, but this is exactly what we’re working on, so it’s coming.
In the meantime, you can at least grab the tf*idf weights for a target keyword and target URL to see how they stack up against the current top 20 ranking URLs on Google for that keyword.
From here you can make an effort to adjust your content/page to better represent the terms and corresponding frequencies appearing across the current ranking URLs, to do a better job of presenting the content and topics that Google is currently rewarding.
The empirical data, references, and research used to inform the creation of this post is thanks to: