Automatic Ontology Generation, Part 2: Results
In the previous post (link), I discussed a process for automatic ontology generation. In this post, I will write about the results of applying that process to Upwork’s set of active freelancer profiles. A later post will cover the software we used and discuss various implementation issues. For now:
Because CoreNLP (a text analysis software toolkit) is difficult to understand and control, we ran the automatic ontology generation software (the generator, for short), which implements the process I described in the earlier post, in two different modes: using the part-of-speech (POS) tagger only, and using the dependency parser in addition to the POS tagger. The dependency parser provides the ability to recognize compound nouns. For example, using the POS tagger only, “Java developer” will come across as two independent nouns, “Java” and “developer”. The dependency parser adds a link between them that we can take advantage of.
In the table below I use the terms described in the previous post. As a reminder, “DP” stands for “domain pertinence filter”. The “summary filter” depends on the DP value, the DC (“domain consensus”) value and a small coefficient.
The top row lists the results of POS tagger parsing; the bottom row lists the dependency parser’s results. The dependency parser produced significantly more terms from the same documents because it counts each compound term in addition to the terms that make up the compound, provided those parts also occur on their own in the text corpus. Continuing the “Java developer” example, both parsers will find “Java” and “developer”, but the dependency parser will also find “Java developer” as a compound, giving 3 terms instead of 2.
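The difference in term counts can be illustrated with a minimal sketch. It does not call CoreNLP; the tagged tokens and the compound link below are hand-written stand-ins for parser output, and the function names `pos_only_terms` and `dependency_terms` are hypothetical, not part of the generator.

```python
from typing import List, Set, Tuple

# Hypothetical, hand-written parser output for the phrase "Java developer".
# Each token is (word, POS tag); each compound link is
# (modifier_index, head_index) into the token list.
tokens: List[Tuple[str, str]] = [("Java", "NNP"), ("developer", "NN")]
compound_links: List[Tuple[int, int]] = [(0, 1)]


def pos_only_terms(tokens: List[Tuple[str, str]]) -> Set[str]:
    """POS-only mode: every noun is an independent term."""
    return {word for word, tag in tokens if tag.startswith("NN")}


def dependency_terms(
    tokens: List[Tuple[str, str]],
    compound_links: List[Tuple[int, int]],
) -> Set[str]:
    """Dependency mode: the same nouns, plus one term per compound link."""
    terms = pos_only_terms(tokens)
    for modifier, head in compound_links:
        # Assumes the modifier precedes the head, as in "Java developer".
        terms.add(f"{tokens[modifier][0]} {tokens[head][0]}")
    return terms


print(sorted(pos_only_terms(tokens)))
print(sorted(dependency_terms(tokens, compound_links)))
```

In this toy case the POS-only mode yields two terms and the dependency mode yields three, which is the effect behind the larger term count in the bottom row.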
It’s pretty clear that the dependency parser produced more useful results. It discovered approximately twice as many relations, in about half as many domains, as the POS tagger alone. Recall is significant for both the relations and the terms we discovered; however, just as we expected, some domains have much better coverage than others in our existing ontology. We can confidently bootstrap a significant number of domains that haven’t seen upgrades in a while. For example, for the “Voice Talent” domain we discovered approximately 400 terms definitely worth including in the ontology. Among these terms the software discovered several new relations; however, looking at the discovered relations (see the picture below for the “Voice Talent” domain) makes it very clear that the software also missed quite a few relations that should be there.
For example, there are relations “voice” -> “voice actress” and “voice actor” -> “audiobook narration”, but there isn’t one for “voice” -> “voice actor” or for “voice actress” -> “audiobook narration”. Experimenting with the threshold that defines a relation is one of our nearest goals. The current threshold of 0.4 is probably too high, so some relations are getting lost. However, at the start of this effort we set out to discover the highest quality data we could. That way, we thought, we would see at a glance whether the approach is worth spending more time on. We could always lower the thresholds once we had a baseline. Meanwhile, our ontologists are working on adding the discovered terms to the ontology proper. Now that we’re confident that this approach is worth pursuing, stay tuned for more updates on ontologies!