Combining Ontology And Machine Learning To Improve Match

Anton V Goldberg
Published in Geek Culture · Feb 17, 2022


Ultimately, Upwork needs to solve one problem: matching employers to employees. The particulars of the match change with the intent of the user and the nature of the job. For example, the algorithms used to match clients to freelancers differ from the algorithms used to help freelancers find jobs. However, all of the current algorithms are based on matching “skills”.


Here I need to talk a bit about the structure of the ontology used by Upwork (see previous posts for more details). Let us think of the ontology as a set of connected taxonomies, where each taxonomy resembles a tree with a single root node and multiple branches. The lowest level of the ontology (the leaf nodes) consists of “skills”. Skills are assembled into “skill groups”. The structure of the higher levels of the ontology isn’t relevant to this post. One can think of “skills” as the actual skills needed to perform a job or practiced by a freelancer. For example, “Java” (the programming language) and “Video Editing” are skills. Skills and skill groups are connected by a many-to-many relation: a single skill can be a member of multiple groups, and a single group includes many skills. For example, “Video Editing” is a member of the group “Media Editing & Production”, which also includes “Camera Operation”, “Audio Conversion” and a few other skills. We create skill groups for various purposes. A skill group can represent a service that can be performed for a job (“copywriting services”), group things with a similar purpose (“software product”), or serve other functions.
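The many-to-many relation between skills and skill groups can be pictured as two indexes pointing in opposite directions. Below is a minimal Python sketch under stated assumptions: the dictionary layout and the function name `invert` are illustrative, not Upwork’s actual data model, and the group “Video Services” and the skill “Motion Graphics” are hypothetical additions used only to show one skill belonging to two groups.

```python
from collections import defaultdict

# Hypothetical slice of the ontology: skill group -> member skills.
# "Media Editing & Production" and its members come from the text;
# "Video Services" and "Motion Graphics" are invented for illustration.
GROUP_TO_SKILLS: dict[str, set[str]] = {
    "Media Editing & Production": {"Video Editing", "Camera Operation", "Audio Conversion"},
    "Video Services": {"Video Editing", "Motion Graphics"},
}


def invert(group_to_skills: dict[str, set[str]]) -> dict[str, set[str]]:
    """Build the reverse index (skill -> groups) of the many-to-many relation."""
    skill_to_groups = defaultdict(set)
    for group, skills in group_to_skills.items():
        for skill in skills:
            skill_to_groups[skill].add(group)
    return dict(skill_to_groups)


SKILL_TO_GROUPS = invert(GROUP_TO_SKILLS)

# One skill, several groups:
print(sorted(SKILL_TO_GROUPS["Video Editing"]))
# ['Media Editing & Production', 'Video Services']
```

Keeping both directions makes later steps cheap: a query’s predicted skills can be turned into their skill groups with one lookup per skill.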

Now I need to talk about search queries, because ultimately this is all about matching jobs to the people who can do them, and about how skill groups can participate in the matching process. Any self- (and user-)respecting website devotes an inordinate amount of attention to a user interface designed to help users produce better search queries or narrow down search results. Upwork’s website is no exception. However, when we looked at the most popular search queries, we realized that a very high percentage of them are not very specific, for example “podcast” or “devops”. It became very clear that any attempt to map such queries to skills and then match those skills to the skills associated with jobs, profiles, Catalog projects, or anything else produces results that are unstable (they change with minimal changes in the algorithm) and often unhelpful. We needed another approach.


To emulate Upwork’s skill-mapping algorithms, I decided to use a word2vec model. A major component of Upwork’s algorithm is a set of ML models that use a different vectorization technique but are built on the same continuous-bag-of-words principles. I chose word2vec because it’s really simple to work with, thanks to the gensim toolkit. Its behavior is well known, and there is a huge model pre-trained by Google. I reasoned that the latter would help me surface any training-related issues. I then ran all of Upwork’s skills through the Google-trained word2vec model. An interesting detail surfaced in the process: despite being trained on 100 billion words, the Google-trained model shows its age. In the 10 years since the model was released, a lot of new technologies (and corresponding skills) have sprung up, and the model couldn’t locate about 25% of the skill set.
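The text does not say how multi-word skills such as “Video Editing” were turned into single vectors, so the sketch below assumes one common approach: average the vectors of the skill’s tokens, and treat a skill as “not located” when any of its tokens is missing from the model’s vocabulary. The function names are hypothetical, and a toy dictionary stands in for the real model; with gensim, `vectors` could instead be the object returned by `KeyedVectors.load_word2vec_format(path, binary=True)`, which supports the same `in` and `[]` lookups.

```python
from typing import Optional

Vector = list[float]


def embed_phrase(phrase: str, vectors) -> Optional[Vector]:
    """Average the word vectors of a phrase's tokens.

    `vectors` is anything supporting `token in vectors` and `vectors[token]`.
    Assumption: a phrase counts as "not located" if any token is out of vocabulary.
    """
    tokens = phrase.split()
    if not tokens or any(t not in vectors for t in tokens):
        return None
    total = [0.0] * len(vectors[tokens[0]])
    for t in tokens:
        for i, x in enumerate(vectors[t]):
            total[i] += x
    return [x / len(tokens) for x in total]


def embed_skills(skills: list[str], vectors) -> tuple[dict[str, Vector], list[str]]:
    """Split skills into embedded ones (skill -> vector) and missing ones."""
    embedded: dict[str, Vector] = {}
    missing: list[str] = []
    for skill in skills:
        vec = embed_phrase(skill, vectors)
        if vec is None:
            missing.append(skill)
        else:
            embedded[skill] = vec
    return embedded, missing


# Toy 2-d "model"; real word2vec vectors have hundreds of dimensions.
toy_vectors = {"Java": [1.0, 0.0], "Video": [0.0, 1.0], "Editing": [0.0, 0.5]}
skills = ["Java", "Video Editing", "Kubernetes"]
embedded, missing = embed_skills(skills, toy_vectors)
print(embedded["Video Editing"])  # [0.0, 0.75]
print(missing)  # ['Kubernetes']
print(f"coverage: {len(embedded) / len(skills):.0%}")  # coverage: 67%
```

The share of skills in `missing` is the kind of figure behind the “about 25%” above. Whether to lowercase tokens first depends on the model’s vocabulary; this sketch leaves them as-is.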

Model generation and skills’ processing

I stored the embeddings for skills from Google’s word2vec model in a separate model file. Then I used that file to calculate embeddings for queries and find the most similar skills. I calculated prediction accuracy by comparing the skills predicted by the similarity criterion to the skills hand-picked for each query. I compared the top 3 skills per query, and across all queries the accuracy was about 20%. That is, the total number of correctly predicted skills divided by the total number of hand-picked skills was ~0.2, which is fairly low. If P_i is the set of the top 3 predicted skills and H_i the set of the top 3 hand-picked skills for query i, and N is the total number of queries, then the accuracy A is expressed as:

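$$A = \frac{\sum_{i=1}^{N} |P_i \cap H_i|}{\sum_{i=1}^{N} |H_i|} = \frac{1}{3N} \sum_{i=1}^{N} |P_i \cap H_i|$$

Accuracy calculation formula (|H_i| is always 3)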
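The retrieval step and the accuracy formula can be sketched in a few lines of Python. This is a minimal illustration, not the author’s code: it assumes cosine similarity as the similarity criterion (the usual choice for word2vec embeddings), assumes the query has already been embedded (for example, by averaging its word vectors), and uses made-up skill vectors and hand-picked sets.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_skills(query_vec: list[float], skill_vecs: dict[str, list[float]], k: int = 3) -> list[str]:
    """Return the k skills whose embeddings are most similar to the query embedding."""
    return sorted(skill_vecs, key=lambda s: cosine(query_vec, skill_vecs[s]), reverse=True)[:k]


def accuracy(predicted: list[set[str]], hand_picked: list[set[str]]) -> float:
    """A = sum_i |P_i & H_i| / sum_i |H_i| over all N queries."""
    hits = sum(len(p & h) for p, h in zip(predicted, hand_picked))
    total = sum(len(h) for h in hand_picked)
    return hits / total if total else 0.0


# Made-up 2-d skill embeddings.
skill_vecs = {
    "Audio Editing": [1.0, 0.1],
    "Podcast Production": [0.9, 0.3],
    "Voice Acting": [0.6, 0.6],
    "DevOps": [0.0, 1.0],
}
preds = top_k_skills([1.0, 0.2], skill_vecs)
print(preds)  # ['Audio Editing', 'Podcast Production', 'Voice Acting']

# Two queries: 2 of 3 and 1 of 3 hand-picked skills predicted, so A = 3 / 6.
predicted = [set(preds), {"DevOps", "Voice Acting", "Audio Editing"}]
hand_picked = [
    {"Audio Editing", "Podcast Production", "Audio Mixing"},
    {"DevOps", "CI/CD", "Kubernetes"},
]
print(accuracy(predicted, hand_picked))  # 0.5
```

Summing hits and hand-picked sizes across all queries before dividing matches the definition in the text (total correct over total hand-picked), rather than averaging per-query ratios.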

As a next step, I trained a word2vec model on a representative set of Upwork’s freelancer profiles. Then I generated embeddings for all skills just as before, but using the newly trained model instead of Google’s. This model also missed a significant number of skills, albeit about half as many as Google’s model missed. Accuracy calculated by the same formula was also around 20%. The models’ predictions differed (i.e., the skills predicted for a query by Google’s model weren’t the same as those predicted by the model trained on Upwork’s documents), but the overall accuracy didn’t change significantly. As an interesting tidbit, the size of the Upwork-trained model (as a vectors file on disk) is less than 30MB, while Google’s model is well above 1GB.
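One way to quantify the observation that the two models predicted different skills is to measure, per query, how much their top-3 sets overlap. The sketch below uses Jaccard similarity for that; the metric, the function names, and the sample predictions are illustrative assumptions, not something the experiment reported. (Training the profile-based model itself could be done with gensim’s `Word2Vec(sentences=..., sg=0)`, where `sg=0` selects the CBOW architecture; the text does not give the parameters used.)

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union; 1.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_agreement(preds_a: list[set[str]], preds_b: list[set[str]]) -> float:
    """Average per-query Jaccard overlap between two models' predicted skill sets."""
    scores = [jaccard(a, b) for a, b in zip(preds_a, preds_b)]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical top-3 predictions from two models for the same two queries.
google_preds = [
    {"Audio Editing", "Voice Acting", "Sound Design"},
    {"DevOps", "Docker", "AWS"},
]
profile_preds = [
    {"Audio Editing", "Podcast Production", "Audio Mixing"},
    {"DevOps", "Docker", "Terraform"},
]
print(f"{mean_agreement(google_preds, profile_preds):.2f}")  # 0.35
```

Low agreement combined with similar accuracy is exactly the pattern described above: the models are wrong in different ways.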

So far, the modeling is consistent with what we already knew about the instability of skill predictions. The next part is about making better predictions. I reasoned that although the models predict different skills for the same queries, these skills should be in approximately the same ballpark. Upwork’s way of expressing ballparks is the skill groups I mentioned above. In other words, the skills predicted by the model and the hand-picked skills should be members of the same skill groups. To test this hypothesis, I retrieved the skill groups for the predicted and hand-picked skills per query, then calculated their intersection. If P_i is the set of predicted skill groups and H_i the set of hand-picked skill groups for query i, and N is the total number of queries, then the accuracy A is expressed as:

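$$A = \frac{\sum_{i=1}^{N} |P_i \cap H_i|}{\sum_{i=1}^{N} |H_i|}$$

Accuracy calculation formula (P_i and H_i are sets of skill groups)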

The accuracy measured this way was approximately 30%, which is 10 percentage points (absolute) or 50% (relative) better than the accuracy of skill prediction.
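The skill-group comparison can be sketched by mapping each query’s predicted and hand-picked skills to the union of their groups, then applying the same ratio to the group sets. This is a minimal sketch under assumptions: the index, the function names, and the groups “Databases”, “Backend Development” and “Frontend Development” are hypothetical. It illustrates the point that predicting MySQL when PostgreSQL was hand-picked is a miss at the skill level but a hit at the group level.

```python
def to_groups(skills: set[str], skill_to_groups: dict[str, set[str]]) -> set[str]:
    """Union of all skill groups that any of the given skills belongs to."""
    groups: set[str] = set()
    for skill in skills:
        groups |= skill_to_groups.get(skill, set())
    return groups


def group_accuracy(
    predicted: list[set[str]],
    hand_picked: list[set[str]],
    skill_to_groups: dict[str, set[str]],
) -> float:
    """A = sum_i |P_i & H_i| / sum_i |H_i|, with P_i, H_i as sets of skill groups."""
    hits = total = 0
    for p_skills, h_skills in zip(predicted, hand_picked):
        p_groups = to_groups(p_skills, skill_to_groups)
        h_groups = to_groups(h_skills, skill_to_groups)
        hits += len(p_groups & h_groups)
        total += len(h_groups)
    return hits / total if total else 0.0


# Hypothetical skill -> groups index.
skill_to_groups = {
    "MySQL": {"Databases"},
    "PostgreSQL": {"Databases"},
    "Redis": {"Databases"},
    "Django": {"Backend Development"},
    "Flask": {"Backend Development"},
    "React": {"Frontend Development"},
}
predicted = [{"MySQL", "Flask", "React"}]
hand_picked = [{"PostgreSQL", "Django", "Redis"}]

# No skill matches exactly, but both hand-picked groups are covered.
print(len(predicted[0] & hand_picked[0]))  # 0
print(group_accuracy(predicted, hand_picked, skill_to_groups))  # 1.0
```

Because a query’s group set can be larger or smaller than three, the denominator here is not fixed at 3N as it is for skills.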


Clearly, absolute accuracy numbers in the range of 20–30% are not great. However, the jump they show when moving from skills to skill groups clearly warrants a lot of attention. Intuitively, it’s easy to interpret: people who use unspecific queries don’t really care whether the database their backend developers are going to use is MySQL or Postgres, as long as there is a database. It’s also important that a very small, undertrained (in terms of training-set size and data preparation) domain-specific model did as well (or as badly) as a really large generic model.