Automatic ontology generation, part 1
As I have written before, Upwork uses ontology in many places on the website, including semantic search, browse paths, SEO and a few others. We used to generate ontology graphs manually: a team of ontologists curated taxonomies for different categories, producing more or less occupation-specific hierarchies. While that approach has many benefits, it also has some hard-to-address issues. One is coverage: we don’t know what share of profiles and jobs is covered by our taxonomies, or whether there are whole occupations in our marketplace with little or no representation in the ontology. Another is keeping up: in some occupations terminology changes very quickly, so our ontologists are always playing catch-up.
To overcome these limitations I decided to implement automated ontology discovery and building based on real data. The goal is to automatically produce candidate ontology updates on a regular basis and let ontologists quickly curate them and add them to the production ontology with minimal process and human involvement. I decided to take the approach described in “A Semantic Approach for Extracting Domain Taxonomies from Text” by Meijer, Frasincar and Hogenboom as a basis. I found that approach appealing for several reasons. The team used unsupervised methods to build the ontology, which is exactly what our situation requires: our goal is to produce updates and bootstrap “empty” occupations quickly, with later improvement provided by the ontology team. The original software was written in Java, which gave us a high degree of assurance that the process can be scaled by parallelization. Also, most software development at Upwork is done in Java, so we can assume a regular software developer will be able to understand and maintain the code we produce.
In this post I’ll describe the functionality we implemented in the first stage of the project. In the next post, I will share the results.
The article I referenced above describes the steps its authors took to build an ontology automatically. I decided to implement a simplified version of the process as the first iteration. Our process includes 3 simple steps (and a preparatory parsing to nouns): filtering terms, building relations between the remaining terms, and checking the results against the existing ontology.
Here we are looking for the terms relevant to a specific business domain. At Upwork, when a new user creates a profile or posts a job, the UI asks the user to select the main category the profile (or job) belongs to. That’s how we find the business domain of the original document. The first filter is the Domain Pertinence (DP) filter (the name comes from the authors of the article). It determines how specific the term is to the given business domain.
$DP(D_i, t) = \frac{freq(t \mid D_i)}{\max_{j \ne i} freq(t \mid D_j)}$, where $t$ is the term we are filtering, $D_i$ is the current domain and $D_j$ is any other domain. In layman’s terms, we divide the frequency of the term in a given domain by the maximum frequency of that term across all domains except this one. If the term is so specific to the given domain that it has no presence in any other domain, I set the divisor to one. We eliminate from further evaluation all terms whose DP value falls below the 30% mark. Then we calculate the Domain Consensus (DC) filter.
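Before moving on to DC, here is a minimal sketch of the DP calculation in Python (chosen for brevity; it is not our Java implementation). The function name, the input layout and the sample counts are assumptions made for illustration, and raw term counts stand in for frequencies.

```python
from collections import Counter


def domain_pertinence(freq_by_domain, domain):
    """Return {term: DP} for the terms of one domain.

    freq_by_domain maps a domain name to a Counter of term counts
    (hypothetical layout, used only for this illustration).
    """
    scores = {}
    for term, freq in freq_by_domain[domain].items():
        # Highest count of the term in any *other* domain (Counter gives 0 if absent).
        other_max = max(
            (counts[term] for name, counts in freq_by_domain.items() if name != domain),
            default=0,
        )
        # A term that appears in no other domain gets a divisor of one.
        scores[term] = freq / (other_max if other_max > 0 else 1)
    return scores


# Made-up counts for two domains.
freq_by_domain = {
    "design": Counter({"anime": 40, "logo": 90, "java": 2}),
    "software": Counter({"java": 200, "logo": 10}),
}
dp = domain_pertinence(freq_by_domain, "design")
```

With these made-up counts, “java” scores far lower in “design” than “logo” or “anime”, because it is much more frequent in “software”.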
The goal of this filter is to find out how popular the term $t$ is across the documents $d_k$ of domain $D_i$. nfreq is the normalized frequency of term $t$ in document $d_k$, calculated as the term’s frequency in that document divided by the maximal frequency of that term in any document of any domain. The filter penalizes terms with a high frequency in individual documents while rewarding terms that occur in more documents of a domain.
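The DC formula itself is not reproduced in this post. As I read the referenced article, it is an entropy-style sum, $DC(D_i, t) = -\sum_{d_k \in D_i} nfreq(t, d_k) \cdot \log(nfreq(t, d_k))$; treat that form as an assumption in the sketch below. The function name, arguments and sample numbers are hypothetical.

```python
import math


def domain_consensus(term_freq_per_doc, max_freq_any_doc):
    """Entropy-style DC of one term in one domain (assumed formula, see lead-in).

    term_freq_per_doc: count of the term in each document of the domain.
    max_freq_any_doc: highest count of the term in any single document of
    any domain, used as the normalizer described in the text.
    """
    dc = 0.0
    for freq in term_freq_per_doc:
        if freq == 0:
            continue  # a document without the term contributes nothing
        nfreq = freq / max_freq_any_doc
        dc -= nfreq * math.log(nfreq)
    return dc


# Same total number of occurrences, distributed differently over four documents.
spread = domain_consensus([1, 1, 1, 1], max_freq_any_doc=4)
concentrated = domain_consensus([4, 0, 0, 0], max_freq_any_doc=4)
```

Under this assumed formula, the term spread evenly over four documents scores higher than the term packed into a single document, which matches the behavior described above.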
As the last filter we calculate the Summary Filter (SF) as a linear combination of normalized DC, normalized DP and a constant k. We set k to 0.02 if the term appears in the title of any document in that domain, on the assumption that terms found in a title are more important; for terms not present in a title, k is 0. We chose the value of k based on the meta-parameter optimization done by the authors of the original publication. I suspect that k is very much corpus-dependent, but parameter optimization is something we can do at a later stage, once we have verified that the product works reasonably well and provides good updates for our ontology. The current value of k should be reasonably good for a wide range of text data.
$SF(D_i, t) = 0.4 \cdot norm(DP(D_i, t)) + 0.6 \cdot norm(DC(D_i, t)) + k$. For normalization we divide the filter’s value for term $t$ in domain $D_i$ by the maximal value of that filter in that domain. We eliminate all terms whose SF value falls below the 40% mark.
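A minimal Python sketch of the SF step follows. The function name, the dictionaries and the sample scores are hypothetical. The sketch also assumes that the 40% mark is an absolute cutoff of 0.4 on the SF value (it could just as well be read as a percentile), and that both maxima are non-zero.

```python
def summary_filter(dp, dc, title_terms, k=0.02, cutoff=0.4):
    """Combine per-domain DP and DC scores into SF and keep terms at or above cutoff.

    dp and dc map the same terms to their DP and DC values in one domain.
    """
    max_dp = max(dp.values())
    max_dc = max(dc.values())
    sf = {}
    for term in dp:
        # Terms seen in a document title of this domain get the small bonus k.
        bonus = k if term in title_terms else 0.0
        sf[term] = 0.4 * dp[term] / max_dp + 0.6 * dc[term] / max_dc + bonus
    # Assumption: the "40% mark" is interpreted as SF >= 0.4.
    kept = {term: score for term, score in sf.items() if score >= cutoff}
    return sf, kept


# Made-up DP and DC values for three terms in one domain.
dp_scores = {"anime": 8.0, "logo": 4.0, "invoice": 1.0}
dc_scores = {"anime": 1.0, "logo": 2.0, "invoice": 0.2}
sf, kept = summary_filter(dp_scores, dc_scores, title_terms={"logo"})
```

In this toy example “invoice” is dropped, while “anime” and “logo” survive to the relation-building step.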
For the remaining terms we build relations using the subsumption method, which is based on the terms’ co-occurrence. Intuitively, if two terms are specific to a domain and one shows up only (or mostly) in the presence of the other, while the other occurs in more documents than just those containing the first, then the second term subsumes the first: it is the “wider” term and the first is the “narrower” one. For example, in the “Art and Illustration” domain “anime” can be a “wider” concept for “chibi”, “fanart” and so on. The formula we used is: $P(x \mid y) \ge t$, $P(y \mid x) < t$, where x is the wider term, y is the narrower term and t is a threshold value I set to 0.4. Unlike the authors of the original article, we also established a minimal number of documents the terms need to appear in. Upwork has a very large number of profiles and job posts compared to the set of documents used in the original article. Since the ontology influences search and matching, we would like to avoid terms specific to a single profile or job post, for a variety of reasons (including bloat).
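The subsumption test can be sketched in Python as below. The function name, the input layout (a set of document ids per term) and the sample data are hypothetical, and `min_docs=2` is a placeholder rather than the value we actually use. Probabilities are estimated from document counts.

```python
def subsumption_pairs(docs_by_term, threshold=0.4, min_docs=2):
    """Return (wider, narrower) term pairs found by the subsumption test.

    docs_by_term maps a term to the set of ids of documents containing it.
    x is treated as wider than y when P(x|y) >= threshold and P(y|x) < threshold.
    """
    # Drop terms that appear in too few documents (min_docs is a placeholder value).
    terms = [t for t, docs in docs_by_term.items() if len(docs) >= min_docs]
    pairs = []
    for x in terms:
        for y in terms:
            if x == y:
                continue
            both = len(docs_by_term[x] & docs_by_term[y])
            p_x_given_y = both / len(docs_by_term[y])
            p_y_given_x = both / len(docs_by_term[x])
            if p_x_given_y >= threshold and p_y_given_x < threshold:
                pairs.append((x, y))
    return pairs


# Made-up document ids for four terms.
docs_by_term = {
    "anime": {1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
    "chibi": {1, 2, 3},
    "fanart": {4, 5, 11},
    "invoice": {20},
}
pairs = subsumption_pairs(docs_by_term)
```

Here “anime” comes out as the wider term for both “chibi” and “fanart”, and “invoice” is ignored because it appears in a single document. The pairwise loop is quadratic in the number of terms, so at real scale it would need to be restricted to terms that actually co-occur.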
Lastly, we check whether the terms and relations we found already appear in the existing ontology. That check serves several goals. We calculate the coverage of the ontology against the terms from the new crop of profiles or other documents. We calculate the recall value to see if it’s time to recalibrate the meta-parameters. We find places to plug in newly found terms based on various similarity criteria. And we simplify the job of our ontology team by taking away part of the manual labor. Finally, we rejoice :)
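For the curious, the coverage and recall numbers can be sketched in Python as follows. The definitions are my assumptions for this illustration: coverage is the share of discovered terms already present in the ontology, and recall is the share of ontology terms that the pipeline rediscovered. The function name and the sample terms are hypothetical, and matching is by lowercase string only.

```python
def compare_with_ontology(discovered_terms, ontology_terms):
    """Return (coverage, recall, new_candidates) under the assumed definitions."""
    discovered = {term.lower() for term in discovered_terms}
    known = {term.lower() for term in ontology_terms}
    overlap = discovered & known
    # Assumed definition: share of discovered terms the ontology already has.
    coverage = len(overlap) / len(discovered) if discovered else 0.0
    # Assumed definition: share of ontology terms the pipeline found again.
    recall = len(overlap) / len(known) if known else 0.0
    # Terms the ontologists would be asked to review.
    new_candidates = sorted(discovered - known)
    return coverage, recall, new_candidates


coverage, recall, new_candidates = compare_with_ontology(
    ["anime", "chibi", "fanart", "logo"],
    ["Anime", "Logo", "Illustration"],
)
```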
Reference: Kevin Meijer, Flavius Frasincar, Frederik Hogenboom. “A Semantic Approach for Extracting Domain Taxonomies from Text.”