[archive] Discovering the Unknown Relevant Keywords

Milan Stankovic, PhD
Milstan’s Old Blog
Mar 13, 2011

This is a reprint of a post originally published on my blog on 13 March 2009.

Research approaches to keyword suggestion have been around for quite some time. The need to help users choose keywords for tagging, web search and similar tasks led to the development of a number of ways to suggest relevant keywords. Today, with the advent of web advertising, finding relevant keywords has taken on a completely new dimension: suggesting keywords no longer means just helping the user navigate the Web, but also driving relevant visitors to your Web page. More and more services offer to suggest relevant keywords that cost less in advertising campaigns and that can pull in more traffic. However, there is an important dimension that those approaches have been missing, and that significantly improves the way we discover new relevant keywords: their meaning. In this blog post I talk about how we use this dimension for our keyword discovery needs at hypios, and report on the interesting results we have had.

hyProximity

The existing keyword suggestion approaches rely on (a) co-occurrence of terms in text corpora; (b) co-occurrence in search results; (c) controlled taxonomies such as the Open Directory Project (ODP), and controlled vocabularies such as WordNet. Approaches (a) and (b) both offer quite limited potential for discovering unknown keywords, as they are based on co-occurrence. In other words, they look at terms that someone else has already used in combination with your initial terms, and suggest them. This does not allow you to discover terms that are rarely used in combination with your initial terms but that are very close in meaning. This is important, as the language we use on the Web is highly dependent on our own community of practice or thought. Going beyond the terms used by people similar to us is very difficult if we rely solely on co-occurrence. Approaches of type (c) have more potential, as they do not use co-occurrence-based statistics but rely on taxonomies and vocabularies. However, ODP is a Web directory, and thus the relations between its terms are defined by Web browsing practice. There might be semantic relations between terms that are not commonly browsed together, and these would not appear in ODP. WordNet, on the other hand, is more oriented toward finding synonyms, and remotely related terms fall outside its scope.

For these reasons, we have turned to a Semantic Web-based approach that uses DBpedia, a Semantic Web version of Wikipedia, to discover relevant terms. In DBpedia, terms (concepts) are grouped into categories by their meaning. As such, this source of encyclopedic knowledge should enable the discovery of keywords that are semantically related but that an average user might not even know about.

Our system uses the distance between two terms in the graph of DBpedia semantic concepts to calculate their semantic relatedness, which we call hyProximity. The shorter the distance in the graph, the higher the hyProximity. The more links the two concepts share, the higher the hyProximity.
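To make the two ingredients above (graph distance and shared links) more concrete, here is a minimal Python sketch. It is not the actual hyProximity formula, which this post does not spell out. The tiny graph, the function names and the way the two signals are combined (`1 / (1 + distance)` plus the number of shared neighbours) are all assumptions made purely for illustration.

```python
from collections import deque

# Toy, hand-made concept graph (hypothetical; NOT real DBpedia data).
# Each concept maps to the set of concepts it is directly linked to.
GRAPH = {
    "Solar_cell": {"Photovoltaics", "Semiconductor"},
    "Photovoltaics": {"Solar_cell", "Renewable_energy"},
    "Semiconductor": {"Solar_cell", "Silicon"},
    "Silicon": {"Semiconductor"},
    "Renewable_energy": {"Photovoltaics", "Wind_power"},
    "Wind_power": {"Renewable_energy"},
    "Jazz": set(),  # isolated concept, unreachable from the others
}


def shortest_distance(graph, start, goal):
    """Breadth-first search: number of hops from start to goal, or None."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def shared_links(graph, a, b):
    """Number of concepts that both a and b link to directly."""
    return len(graph.get(a, set()) & graph.get(b, set()))


def toy_proximity(graph, a, b):
    """Illustrative score only (an assumption, not the hyProximity formula):
    grows as the distance shrinks and as the shared links increase."""
    dist = shortest_distance(graph, a, b)
    if dist is None:
        return 0.0  # unreachable concepts are treated as unrelated
    return 1.0 / (1 + dist) + shared_links(graph, a, b)


if __name__ == "__main__":
    for other in ("Silicon", "Wind_power", "Jazz"):
        print(other, toy_proximity(GRAPH, "Solar_cell", other))
```

In this toy graph, "Silicon" ends up closer to "Solar_cell" than "Wind_power" does, because it is fewer hops away and shares a neighbour with it, while the isolated "Jazz" scores zero.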

Case Study

We have used hyProximity in our own use case at hypios, and have obtained very interesting results. Our standard procedure when we have a new innovation problem on hypios is to take the keywords related to the problem and look for experts in our giant, cross-domain base of 900,000 experts. Finding keywords that are relevant to the problem but do not appear in the problem text is important for reaching relevant experts in the most diverse domains, who might be able to bring an innovative solution. We used hyProximity to obtain additional keywords for expert search, and compared those keywords with what we get from the AdWords Keyword Tool for the same inputs.

We identified 1802 experts using the keywords directly present in the problem text, 2849 experts with hyProximity keywords, and 2061 experts using the keywords from the AdWords Keyword Tool. The most interesting phenomenon is that the overlap between the experts identified by hyProximity and by AdWords keywords is very low. Finally, we measured the interest expressed by the identified experts (through their response to our e-mails). The response rate obtained in the hyProximity group was 10% greater than with the AdWords keywords, and 19% greater than with the keywords present directly in the text.
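For readers who want to run a similar comparison on their own data, here is one possible way to quantify the overlap between two groups of identified experts. The post does not say which measure we used; the Jaccard index below is an assumption, and the expert IDs are made-up placeholders rather than our real data.

```python
def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical expert IDs, for illustration only.
hyproximity_experts = {"e1", "e2", "e3", "e4", "e5"}
adwords_experts = {"e5", "e6", "e7", "e8"}

overlap = jaccard(hyproximity_experts, adwords_experts)

# Experts reached only through the hyProximity keywords.
only_hyproximity = hyproximity_experts - adwords_experts

if __name__ == "__main__":
    print(f"overlap: {overlap:.3f}")
    print(f"found only via hyProximity: {len(only_hyproximity)}")
```

A value close to 0 means the two keyword sets lead to almost entirely different experts, and a value close to 1 means they lead to nearly the same ones.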

This result leads to the conclusion that there is a significant number of semantically related keywords that fall completely outside the scope of co-occurrence-based keyword suggestion approaches. If you trust non-semantic keyword suggestion approaches to give you all the relevant keywords, you are missing out on a lot of relevant traffic.

We are preparing a research publication and a public beta version of our tool, and will be disclosing more experiences with using semantic technologies for keyword discovery soon.

Originally published at web.archive.org on March 13, 2011.

Milan is a Parisian Tech Founder. PhD in Computer Science from Sorbonne. Startup made and sold. Making computers better companions to humans. http://milstan.net