“We all travel the Milky Way together, trees and men.” — John Muir

How a text recommendation system can be the functional core of a product

Bogdan Mursa
Sortlist Engineering
8 min read · Dec 11, 2019


This article aims to show how a text recommendation system can become the central pillar of the matching processes in a product.

The main actor of the discussion is a recommendation system built with the Word2Vec framework and reinforced with concepts from Complex Network Theory, whose goal is to provide accurate matches between marketing projects and marketing agencies.

The volume of information flowing through the web increases with every new website that appears. It would be a cliché to point out that companies of any scale understand the importance of the information their products capture: most of them already follow a data-driven approach when making business decisions.

“Data is the new oil” — the mantra of the current technological era.

There are countless applications that can use data to model a desired feature: computer vision, deep learning and natural language processing are just a few domains where data plays a significant role. These applications are usually built to automate business processes or to improve the user experience of product features.

This blog post is an overview of how designing such an application improves the user experience of products that have text data as a primary source of information. Whether for matching, information retrieval or classification purposes, a properly built text recommendation system can be a powerful toolset for all these challenging tasks. My aim is to show how such a solution can be built and used, but do not expect a step-by-step tutorial. I found it more valuable to approach the subject from an architectural perspective, only summarising the low-level technical aspects, and to focus on the text recommendation system as a tool that can be integrated into multiple areas of a product.

What you will read in the upcoming minutes represents my personal experience designing such a text recommendation system for Sortlist, a company whose product helps businesses find the best marketing agencies in the world for their marketing projects. A marketing project can be thought of as a briefing that contains information such as budget, location, desired skills, desired interests and many others. Most of this information can be handled with simple filtering operations, but one type of requirement falls into the chaotic world of fuzziness: skills.

Most of the time companies do not know exactly what they are looking for from a technical perspective: a briefing can be written either by a small shop owner who wants a website to promote their products, or by an SEO expert looking for a team of developers to refactor a website in line with the newest SEO trends.

Considering this quirk of customer behavior, the goal was to design a technical proposal that could serve as a general solution for multiple use cases on the platform, for both matching and recommendation features.

Generally, text is interpretable only by human beings, as its features are strongly tied to linguistic aspects. Hence we need a way to transpose a text into a structure that computers can understand as well. Natural language processing offers a solution to this problem through word embedding, a technique that represents any given word as an N-dimensional vector of features. Such a vector encodes the position of the word in a space with the same number of dimensions.

Figure: a 2-dimensional projection of a vector space trained on texts about world countries and their capitals.

To simplify the concept, we can make an analogy with the geographic coordinate system, which allows us to pinpoint any location on Earth with a 2-dimensional vector of coordinates (latitude and longitude). Moreover, using this representation it is possible to compute the distance between any two coordinates.

Keeping this analogy in mind, it is possible to compute distances between any two points in higher-dimensional spaces as well, namely N-dimensional spaces. Putting our imagination to work, we can picture how words in an N-dimensional space end up closer to or farther from each other depending on their linguistic similarity. Ok, pause: is it only me, or is it impossible to visualize an N-dimensional space? Right, the human brain can comprehend only three dimensions, so we will have to treat the N dimensions as an abstract mathematical space.
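
To make the idea of distances concrete, here is a toy sketch with made-up 4-dimensional vectors (not the output of any trained model); cosine similarity is one common way of measuring how close two word vectors are:

```python
# Toy illustration: similar skills point in similar directions in the space,
# and cosine similarity quantifies that closeness (the vectors are invented).
import numpy as np

vectors = {
    "google_ads": np.array([0.81, 0.10, 0.05, 0.44]),
    "sea":        np.array([0.78, 0.14, 0.09, 0.40]),
    "pottery":    np.array([0.02, 0.91, 0.63, 0.10]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means the same direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(vectors["google_ads"], vectors["sea"]))      # close to 1
print(cosine_similarity(vectors["google_ads"], vectors["pottery"]))  # much lower
```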

To create the N-dimensional space, namely the vector space, we need a training set and a framework able to embed the words from the training set into that space. A crucial subject that I feel is treated superficially in most text recommendation tutorials is the preparation of the training set. We mentioned that each word is transposed into a vector of features, which can be seen as its N-dimensional fingerprint, but we never mentioned how that list of features is computed. The simplest way to explain it is that the features capture the context in which a word appears in the training set. Hence the same word can have different “neighbors” in a vector space modeled on marketing texts than in a vector space created from literature texts, for example. This behavior is even more prominent for words specific to a given domain. Long story short, it is strongly recommended to use a training set with content specific to the field where the model will be used.

Once a proper training set is available, it is recommended to follow a sequence of preprocessing steps, among which: word normalization (lemmatisation or stemming), n-gram detection (important for entities made up of multiple words), removal of non-alphanumeric characters and many others. The processed text can then be fed into a word embedding tool that trains the vector space, the state of the art being Word2Vec, a shallow two-layer network proposed by Tomas Mikolov. There are various implementations of it; personally, I used Python’s gensim library, which has a friendly interface (compute the similarity between two words, compute the top N similar words for a given word, perform arithmetic operations between word vectors, etc.).
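
As a rough illustration of that pipeline, here is a minimal sketch using gensim; the tiny corpus, the hyper-parameters and the resulting skill names are purely illustrative, and lemmatisation is skipped for brevity:

```python
# Minimal preprocessing + training pipeline with gensim (illustrative corpus).
from gensim.utils import simple_preprocess
from gensim.models.phrases import Phrases, Phraser
from gensim.models import Word2Vec

raw_docs = [
    "Google Ads and SEA campaigns for e-commerce websites",
    "SEO audit and content marketing for a small shop website",
    "Social media marketing and Google Ads management",
]

# 1. Tokenize, lowercase and drop non-alphanumeric characters.
tokenized = [simple_preprocess(doc) for doc in raw_docs]

# 2. Detect frequent bigrams such as "google_ads" (n-gram detection).
bigram = Phraser(Phrases(tokenized, min_count=1, threshold=1))
sentences = [bigram[doc] for doc in tokenized]

# 3. Train the vector space (the parameter is called `size` in gensim < 4).
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

# 4. Query the trained space; on a real corpus the neighbours become meaningful.
print(model.wv.most_similar("marketing", topn=3))
print(model.wv.similarity("google_ads", "seo"))
```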

Going back to my case: although I managed to obtain a vector space that was satisfying in terms of marketing terminology accuracy, I was still not fully happy with the results. I wanted to visualize my creation. Because we are facing N dimensions, there is no easy way to produce a plot or any other visual projection of the space. A standard solution would be principal component analysis (PCA), a dimensionality-reduction technique, but I wanted something more visually expressive, something that could also facilitate navigation in the space.
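
For completeness, the PCA route would look roughly like the sketch below; it assumes the `model` trained in the previous snippet, gensim >= 4, scikit-learn and matplotlib:

```python
# Project the trained word vectors down to 2 dimensions and plot them.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

words = list(model.wv.index_to_key)          # vocabulary of the trained model
coords = PCA(n_components=2).fit_transform(model.wv[words])

plt.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y))
plt.title("2-D PCA projection of the skill vector space")
plt.show()
```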

After some time spent looking for a candidate, I realized that I already had a solution the whole time: the area I study intensively for my Ph.D., Complex Network Theory, a field that models real-world systems as topological structures similar to mathematical graphs. The beauty of these structures lies in the virtually unlimited depth of analysis that the field’s framework allows. So without thinking twice, I modeled the vector space as a complex network and, with the help of an amazing visualization tool, obtained Milkyway: a galaxy where each planet represents a skill and an edge between two planets reflects the distance between the corresponding skills in the N-dimensional vector space.
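
In code, building such a network could look like the sketch below, which continues from the trained `model` of the earlier snippets; the 0.6 similarity threshold and the use of the full vocabulary as node list are illustrative choices, not the production setup:

```python
# Turn the vector space into a graph: skills become nodes, and two skills are
# connected when their similarity in the trained space exceeds a threshold.
import networkx as nx

skills = list(model.wv.index_to_key)   # in practice, the platform's skill list
threshold = 0.6

milkyway = nx.Graph()
milkyway.add_nodes_from(skills)

for i, a in enumerate(skills):
    for b in skills[i + 1:]:
        similarity = model.wv.similarity(a, b)
        if similarity > threshold:
            # Keep the similarity as the edge weight for later analysis.
            milkyway.add_edge(a, b, weight=float(similarity))

print(milkyway.number_of_nodes(), milkyway.number_of_edges())
```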

Using both the trained Word2Vec space and Milkyway (the complex network), we were able to create a remarkable tool that can be applied in most areas of the core business product. Combining the Word2Vec interface for computing similarities with a handful of properties studied in Complex Network Theory, such as communities, shortest paths, motifs and open triangles, it was possible to quickly develop product features that had a significant impact on the user experience and on the precision of the results obtained in different processes.
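
As a hint of what those properties look like in practice, here is a sketch using networkx on the `milkyway` graph built above; the two skill names are illustrative and assumed to exist and be connected in the graph:

```python
# Communities, shortest paths and open triangles on the skill graph.
from itertools import combinations
from networkx.algorithms.community import greedy_modularity_communities
import networkx as nx

# Communities: groups of densely connected skills.
communities = greedy_modularity_communities(milkyway)

# Shortest path between two skills (assumed to be present and connected).
path = nx.shortest_path(milkyway, source="google_ads", target="seo")

# Open triangles around a skill: pairs of its neighbours that are not directly
# connected to each other, hinting at skills that are indirectly related.
def open_triangles(graph, node):
    return [
        (a, b)
        for a, b in combinations(graph.neighbors(node), 2)
        if not graph.has_edge(a, b)
    ]

print(len(communities), path, open_triangles(milkyway, "google_ads")[:5])
```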

Having Milkyway and its visual representation as a complex network made it easier to cooperate closely with the business people and identify useful applications of the model together. The following paragraphs discuss a few areas where the model was used:

Suggestions: One of the key processes on the platform is when users select a list of skills, either as a company filling in a project briefing or as an agency building a profile according to its expertise. Because users have different levels of knowledge in the field, it is sometimes hard for them to choose the right skills. With the proposed tool it was possible to develop a suggestion system that receives as input a list of skills (those chosen by the user at any given moment) and returns a list of suggested skills. Knowing that similar skills are grouped into so-called communities, it was possible to return precise answers by combining the vector space distances with the shortest paths and open triangles of the input skills in the modeled complex network.
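
A hypothetical sketch of how such a suggestion function could combine the two signals is shown below; `suggest_skills`, its scoring and the chosen cut-offs are illustrative, not the production implementation:

```python
# Rank candidate skills by combining vector-space similarity with graph proximity.
import networkx as nx

def suggest_skills(selected, model, graph, topn=5):
    """Return skills close to the user's current selection."""
    scores = {}
    for skill in selected:
        # Signal 1: nearest neighbours in the Word2Vec vector space.
        for candidate, similarity in model.wv.most_similar(skill, topn=20):
            scores[candidate] = scores.get(candidate, 0.0) + similarity
        # Signal 2: skills reachable in one or two hops in the network,
        # which covers shortest paths and open triangles around the selection.
        hops = nx.single_source_shortest_path_length(graph, skill, cutoff=2)
        for candidate, distance in hops.items():
            if distance > 0:
                scores[candidate] = scores.get(candidate, 0.0) + 1.0 / distance
    # Never suggest something the user has already picked.
    ranked = sorted((c for c in scores if c not in selected),
                    key=scores.get, reverse=True)
    return ranked[:topn]

# Example call (the skill name is illustrative):
# suggest_skills(["google_ads"], model, milkyway)
```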

Matching: Using the same principles as for suggestions, the tool is able to match the skills of a project briefing (A) with the skills of an agency (B) by applying an intersection procedure between space A and space B. In this way, the matching algorithm becomes smart enough to handle complicated combinations of skills. For example, if someone writes a project briefing that requires an agency with experience in Google Ads, then besides a straightforward match with agencies that have Google Ads in their profile, the model will also match any agency that lists SEA among its skills, SEA being a marketing skill closely related to Google Ads; the same principle applies to any other skill found in the proximity of Google Ads. Having more skills to match means the results will be more precise, as the overlap of the intersections becomes smaller.
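
The intersection idea could be sketched as follows; the function names, the one-hop expansion radius and the scoring are again illustrative assumptions rather than the production algorithm:

```python
# Expand both skill sets with their graph neighbourhood, then score the overlap.
import networkx as nx

def expand(skills, graph, cutoff=1):
    """A skill set together with everything at most `cutoff` hops away."""
    expanded = set(skills)
    for skill in skills:
        expanded.update(
            nx.single_source_shortest_path_length(graph, skill, cutoff=cutoff)
        )
    return expanded

def match_score(briefing_skills, agency_skills, graph):
    """Fraction of the briefing's expanded skill set covered by the agency."""
    briefing_space = expand(briefing_skills, graph)
    agency_space = expand(agency_skills, graph)
    return len(briefing_space & agency_space) / len(briefing_space)

# Example: an agency listing "sea" can still match a briefing asking for
# "google_ads" when the two skills are neighbours in the Milkyway graph.
# match_score(["google_ads"], ["sea"], milkyway)
```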

Marketing: Combining information retrieved from Milkyway with Google search volumes, Sortlist is able to automatically generate landing pages for the most relevant and trending skills at any given moment. This has a significant impact on the platform’s traffic, leading to an increase in both demand and supply, and therefore making every type of user on the platform happy.

These were just a few example applications of the tool we call Milkyway, which already stands as an important asset for the Sortlist product. I hope this article gave you a better understanding of how such a system can be used in real products and of the benefits it brings to the processes it is integrated into.

TL;DR 😇
Interested in joining the product & engineering team of a fast-growing tech start-up and working with top-notch technology? Sortlist is always looking for talented people to join us in Belgium or Romania.

Check out our positions here:


Bogdan Mursa, Sortlist Engineering. AI specialist. PhD student and Complex Networks researcher with a focus on network motifs. Hobbies: Chaos Theory and basketball.