Ontology goodness measurement

Anton V Goldberg
Published in Geek Culture
May 13, 2022


At the end of the day, ontologies and knowledge graphs (some discussion on what these are at Upwork) don’t exist in empty space. They exist as part of some business function and are created to serve certain goals. At Upwork we use knowledge graphs to inform query understanding and semantic match. We also use them as a basis for browsing, to help users discover various items of interest (such as worker profiles or job openings), and we ask users to select appropriate values for their profiles (freelancers), job openings (employers) and other documents. It’s very hard to measure the impact of an ontology change on search and match. All the metrics you can get from the logs or from business impact are influenced by any number of factors, and a change in the ontology, unless it’s fairly dramatic, isn’t going to produce a noticeable result. Value selection is different. Although we have a machine learning model that sometimes suggests “skills” and “occupations” (present in the knowledge graph) to a user, the user makes the final determination, and the model itself is trained on the knowledge graph. This makes a comparison between the user’s choices and the knowledge graph far more direct and informative.

We asked ourselves how to measure the goodness of the knowledge graph in this environment and came up with the following procedure. First, we analyze the text of the documents (profiles and job openings), looking for any knowledge graph labels mentioned there. We call the discovered attributes “adopted” to differentiate them from the attributes “assigned” directly by the user. Then we find all the categories that the adopted attributes roll up to and select the top 3, ordered by decreasing number of attributes. Finally, we calculate the relevance metrics on a per-document basis. We use 3 metrics: relative position, precision, and recall. Relative position we calculate using the following formula:

[Formula: relative position’s calculation]

Where T is the list of the top 3 adopted categories, and ac is the assigned category. For example, if the assigned category has index 2 in the list of adopted categories (the least popular of the top 3 adopted categories), Rp will be ⅓. For index 0 (the top category), Rp will be 1.
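The adoption, roll-up and relative position steps can be sketched in Python. This is a minimal sketch under stated assumptions, not Upwork's implementation: the toy `LABEL_TO_CATEGORY` mapping, the naive substring matching and all function names are hypothetical. Note that the two worked examples (index 0 gives 1, index 2 gives ⅓) fit both a linear form, (|T| - i) / |T|, and a reciprocal form, 1 / (i + 1). The sketch assumes the linear form, and also assumes Rp is 0 when the assigned category is not among the top 3.

```python
from collections import Counter

# Hypothetical toy knowledge graph: attribute label -> category.
LABEL_TO_CATEGORY = {
    "python": "Software Development",
    "django": "Software Development",
    "sql": "Data Science",
    "pandas": "Data Science",
    "copywriting": "Writing",
    "logo design": "Design",
}


def adopted_attributes(text):
    """Return knowledge graph labels mentioned in the text (naive substring match)."""
    lowered = text.lower()
    return {label for label in LABEL_TO_CATEGORY if label in lowered}


def top_categories(attributes, k=3):
    """Roll attributes up to categories and keep the k most populated ones."""
    counts = Counter(LABEL_TO_CATEGORY[a] for a in attributes)
    # Sort by descending attribute count; break ties by name for determinism.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [category for category, _ in ranked[:k]]


def relative_position(assigned_category, top):
    """Assumed linear form: (|T| - index) / |T|, and 0 if the category is absent."""
    if assigned_category not in top:
        return 0.0
    return (len(top) - top.index(assigned_category)) / len(top)


text = "Python and Django developer, some SQL, pandas and copywriting."
adopted = adopted_attributes(text)
top = top_categories(adopted)
rp_top = relative_position(top[0], top)      # assigned category is the top one
rp_last = relative_position(top[-1], top)    # assigned category is the third one
rp_missing = relative_position("Design", top)  # assigned category not in top 3
```

In a real pipeline the label matching would likely need tokenization and synonym handling; substring matching is used here only to keep the sketch short.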

We defined Precision as the number of attributes that exist in both the adopted and the assigned sets of skills, divided by the size of the assigned set. We defined Recall as the number of attributes that exist in both the adopted and the assigned sets, divided by the size of the adopted set. Thus Precision tells us what share of the user-selected attributes are also in the discovered set, and Recall tells us how similar the contents of the assigned and adopted sets are. Then we calculate mean values for all 3 metrics across all profiles and openings, by document type. We don’t want to mix profiles and openings for many reasons, such as the different kinds of users creating them, the dissimilar selection processes and UI, etc.
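The two set-based metrics and the per-type averaging can be sketched as follows. This is an illustrative sketch that follows the definitions in the text; the document dictionaries, the function names and the choice to return 0 for an empty set are assumptions, not something the original specifies.

```python
from collections import defaultdict
from statistics import mean


def precision_recall(assigned, adopted):
    """Definitions from the text: overlap / |assigned| and overlap / |adopted|."""
    overlap = len(assigned & adopted)
    # Assumption: an empty set yields 0 rather than an error.
    precision = overlap / len(assigned) if assigned else 0.0
    recall = overlap / len(adopted) if adopted else 0.0
    return precision, recall


def mean_by_document_type(documents):
    """Average precision and recall separately for each document type."""
    scores = defaultdict(list)
    for doc in documents:
        scores[doc["type"]].append(precision_recall(doc["assigned"], doc["adopted"]))
    return {
        doc_type: (mean(p for p, _ in pairs), mean(r for _, r in pairs))
        for doc_type, pairs in scores.items()
    }


# Hypothetical documents; profiles and openings are never mixed in one average.
documents = [
    {"type": "profile", "assigned": {"python", "sql"},
     "adopted": {"python", "django", "pandas", "sql"}},
    {"type": "profile", "assigned": {"python", "copywriting"},
     "adopted": {"python"}},
    {"type": "opening", "assigned": {"sql"},
     "adopted": {"sql", "pandas"}},
]
means = mean_by_document_type(documents)
```

Grouping by document type before averaging keeps the profile and opening populations separate, as the text recommends.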

There are many metric combinations that indicate various issues with a category. For example, low Rp (below ⅓) indicates that users for some reason select a category that shouldn’t be selected based on the skills actually present in the documents of that category. With a single document that can be attributed to one user’s behavior, but when aggregated across several hundreds or thousands of documents it might indicate a misleading category name. Perhaps there are several categories with similar names, or there is just no category that reflects a certain set of skills and the users are taking a wild guess. Low Recall indicates that the users select attributes that are rarely discovered in the text. Coupled with low Rp, that tells us the users select the wrong category and then their choice is limited to the attributes within that category. Low Precision might be a sign of a really wide category with a lot of attributes, one that could benefit from splitting into several smaller categories.

There is also a very interesting case of categories with high values for all metrics and a high total number of documents. As I mentioned above, we use the knowledge graph to inform semantic search. Such a combination points to a category whose documents are virtually indistinguishable by semantic means. The best thing to do is to analyze the content of such a category and define more attributes, perhaps splitting the category in the process.
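The combinations described above can be expressed as a small rule set. This is only a sketch: the ⅓ cutoff for Rp comes from the text, while the `low`, `high` and `many_documents` thresholds and the function name are hypothetical values chosen for illustration.

```python
def diagnose_category(rp, precision, recall, n_documents,
                      low_rp=1 / 3, low=0.3, high=0.8, many_documents=1000):
    """Map a category's mean metrics to the issues described in the text.

    Only low_rp is taken from the article; the other thresholds are assumptions.
    """
    findings = []
    if rp < low_rp:
        findings.append("misleading or missing category")
    if recall < low:
        findings.append("selected attributes rarely found in text")
    if rp < low_rp and recall < low:
        findings.append("wrong category limits attribute choice")
    if precision < low:
        findings.append("category may be too wide; consider splitting")
    if min(rp, precision, recall) >= high and n_documents >= many_documents:
        findings.append("documents hard to distinguish; add attributes")
    return findings


wrong_category = diagnose_category(rp=0.2, precision=0.9, recall=0.1, n_documents=500)
crowded = diagnose_category(rp=0.9, precision=0.9, recall=0.9, n_documents=5000)
healthy = diagnose_category(rp=0.5, precision=0.5, recall=0.5, n_documents=10)
```

The rules are deliberately independent, so one category can collect several findings at once.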

As far as Upwork-specific results go, we discovered quite a few categories that need work. While many categories suffered in both profiles and openings, many others were specific to just one document type. Apparently the most affected category across both profiles and openings is language tutoring. Guys, help is coming!