# Discovering similarity among prospective ontology terms

In earlier posts I discussed how we select prospective terms in freelancer profiles and job postings, then establish parent-child (subsumption) relations (part1, part2, part3). Discovering term similarity is the next step we implemented in Upwork’s automatic ontology update. We chose the approach proposed by [1] as the basis for our method. The authors of that article proposed calculating similarity as a function of several components: the distance (shortest path, l) between the terms in WordNet and the depth (h) of their first common parent. The intuition here is fairly straightforward: the closer the terms are to each other, the more similar their meanings. If the terms are in the same synset (in WordNet, a synset is a grouping of synonymous words that express the same concept), the distance between them is the shortest possible and they are synonyms. The farther apart they are, the less similar their meanings. For example, the similarity between “gem” and “jewel” is higher than between “gem” and “rock”.

The intuition on the depth (h) is that more specific terms reside at deeper levels of the WordNet hierarchy. For example, consider the following segment of the WordNet 3.1 hierarchy. The terms “bicycle” and “scooter” are semantically richer than “vehicle” and “dolly”, hence the similarity between the former pair is higher than between the latter, despite the length of the shortest path being the same.

*S* = f1(l) · f2(h)

We calculate similarity *S* as a product of two functions f1 and f2, where α is a parameter scaling the shortest-path contribution and β is a parameter scaling the depth contribution. In our calculation, we take α = 0.2 and β = 0.6. Using the example tree above, the distance between both pairs (scooter/bicycle and vehicle/dolly) is 2. We measure the depth as the distance from the top of the hierarchy to the first common parent. Generally speaking, WordNet’s hierarchy for nouns starts from the term “entity”, which is the root of all noun trees. In our example, if we take “transport” as the root, the depth of vehicle/dolly is 0 (“transport” is their first common parent and the root of the hierarchy). The depth of the scooter/bicycle pair is 2 — the distance from the root (“transport”) to the first common parent (“wheeled vehicle”).
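The post does not spell out f1 and f2, but with α and β used as above, the functional forms proposed in [1] are an exponential decay over the path length and a tanh-shaped saturation over the depth. A minimal sketch under that assumption:

```python
import math

# Parameters quoted in the post; the exponential/tanh forms below are
# our reading of [1], not code from the article.
ALPHA = 0.2  # scales the shortest-path contribution
BETA = 0.6   # scales the depth contribution

def similarity(l: float, h: float) -> float:
    f1 = math.exp(-ALPHA * l)   # decays as terms get farther apart
    f2 = math.tanh(BETA * h)    # grows as the common parent gets deeper
    return f1 * f2

# Example tree from the post: both pairs are 2 apart, but
# scooter/bicycle meet at depth 2 while vehicle/dolly meet at depth 0.
print(round(similarity(2, 2), 4))  # scooter/bicycle -> 0.5588
print(round(similarity(2, 0), 4))  # vehicle/dolly   -> 0.0
```

Note how the depth term dominates here: a common parent at the very root of the hierarchy drives the similarity to zero regardless of path length.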

While that is a really useful formula by itself, we needed more. It turns out the terms we uncovered after analysing texts of actual user profiles and job descriptions are often compound: made up of more than one word. Our analysis also uncovered many thousands of terms missing from our ontology. While we could compare each term to all the others, the computation would clearly be rather expensive. We would also likely get a lot of false positives with terms whose meaning depends on the business domain. Such terms can be similar to some terms in one business domain and to quite different terms in another. For example, “client” can be close to “buyer” in the “Social Media And Marketing” domain and to “node” in the “Software Development” domain.

The original article indicated that compound terms are typically too specific to be actually similar to anything. That’s likely true in many areas of knowledge, but for Upwork’s documents we found it to be incorrect. For the purpose of searching our profiles or job openings there are plenty of mutually replaceable compound terms. To be able to compare compound terms we use the following approach. To calculate the similarity *S* of two terms we begin by creating vectors whose length is the word count of the longer term. For example, if we compare the terms (uncovered by methods I described in earlier posts) “accounting advisor” and “bookkeeper”, we create vectors of length 2. Then we take one of the terms as the starting point and calculate pairwise similarity between the words of the starting term and the words of the other one. We find the highest similarity and exclude the used words from further consideration, repeating until all words are consumed. As the result clearly depends on the order of the vectors, we repeat the procedure starting with the second vector. For the example of “accounting advisor” and “bookkeeper”, both vectors are [0.8187, 0]. That’s because “accounting” fits “bookkeeper” best and consumes it, so “advisor” has no word to calculate a similarity to and its similarity defaults to 0. When we calculate the reverse, the similarity from “bookkeeper” to “accounting advisor”, “bookkeeper” consumes “accounting” and there is no second word in that term, so the second member of the vector again defaults to 0. Then we multiply the resulting vectors’ norms and normalize the product to the range 0 to 1 by dividing it by the vectors’ length (size).
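The procedure above can be sketched as follows. The `word_sim` stub stands in for the WordNet-based word similarity; the 0.8187 value for accounting/bookkeeper comes from the post, and unknown word pairs simply default to 0:

```python
import math
from itertools import product

def greedy_vector(words_a, words_b, word_sim, n):
    """Greedily pair the most similar words, consuming each word once.

    Returns a vector of length n: pair similarities, padded with 0s
    for words left without a partner.
    """
    a, b = list(words_a), list(words_b)
    sims = []
    while a and b:
        wa, wb = max(product(a, b), key=lambda p: word_sim(*p))
        sims.append(word_sim(wa, wb))
        a.remove(wa)   # each word is consumed at most once
        b.remove(wb)
    return sims + [0.0] * (n - len(sims))

def compound_similarity(term_a, term_b, word_sim):
    wa, wb = term_a.split(), term_b.split()
    n = max(len(wa), len(wb))
    # The greedy matching depends on which term starts, so run it
    # from both sides, as described in the post.
    va = greedy_vector(wa, wb, word_sim, n)
    vb = greedy_vector(wb, wa, word_sim, n)
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return norm(va) * norm(vb) / n   # normalize to the range 0..1

# Stub word similarity for the post's example (hypothetical lookup table).
PAIRS = {frozenset(("accounting", "bookkeeper")): 0.8187}
def word_sim(w1, w2):
    return 1.0 if w1 == w2 else PAIRS.get(frozenset((w1, w2)), 0.0)

print(round(compound_similarity("accounting advisor", "bookkeeper",
                                word_sim), 3))  # -> 0.335
```

Both vectors come out as [0.8187, 0], so the result is 0.8187² / 2 ≈ 0.335, matching the number quoted below.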

The approach has obvious limitations: the best possible similarity between a two-word term and a one-word term is 0.5 (0.335 for the example above), no matter how close the terms are semantically. However, for our purpose of automatically detecting at least some terms of similar meaning for Upwork’s ontology, the calculation works really well.

To make the calculation faster and decrease false positives, we calculated similarity only between terms in the same business domain. The time complexity of the calculation is O(n²), where n is the number of terms, so limiting the similarity search to terms within business domains significantly decreases the total calculation time.
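The savings are easy to see with rough numbers (the domain count here is purely illustrative, not a figure from the post):

```python
# All-pairs comparison over n terms vs. per-domain comparison, assuming
# the same n terms split evenly across k domains: k * (n/k)^2 = n^2 / k.
n = 9000                       # on the order of the term counts discussed
k = 30                         # hypothetical number of business domains
all_pairs = n * (n - 1) // 2
per_domain = (n // k) * (n // k - 1) // 2
print(all_pairs, k * per_domain)  # ~40.5M vs ~1.3M pair comparisons
```

With an even split, partitioning into k domains cuts the pairwise work by roughly a factor of k.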

# Results

After some experimentation, we settled on a similarity threshold of 0.8187. The calculation found a bit fewer than 9000 similar terms, the vast majority of them two-word terms. Of these 9000, approximately 2700 terms were variations of each other — words used in reverse order or with a space inside. Of the rest, the percentage of false positives was close to 1%. However, these false positives were terms that only a person well-versed in the given business domain would recognize as such. Considering that some source documents (job postings) were created by “clients”, who aren’t necessarily well-versed in the domain they need help with (for example, a client might need help creating a website while their main business is far from that), we aren’t sure we should remove such false positives from the list of words with similar meaning. For example, our software found “software technologist” to be very similar to “software programmer”, and “office automation” to be close to “business automation”.

On our “to do” list are further improvements to the formula for similarity of compound terms. We are also experimenting with updating search results based on the uncovered lists of similar words. Now we can iterate on the results quickly.