How we applied Natural Language Processing in Trivia Games

What opportunities do Data Science teams have to explore natural language processing techniques in the gaming trivia industry?

Published in etermax technology · 5 min read · Mar 4, 2022

By Mailén Pellegrino, Data Scientist at etermax.

We are used to hearing about lifetime value models to predict how profitable a player is, or churn models to be able to carry out retention actions on users, but how can natural language processing be applied in the world of gaming? Specifically, how can we apply it in our trivia games?

Natural language processing, or NLP, is a branch of artificial intelligence focused on enabling computers to process and analyze human language. In our trivia games, we have millions of questions and answers in various languages to explore.

An example: question similarity

One of the problems that we can find in our question factory is duplication. To avoid it, we have a duplicate question detector: a model that processes questions in order to determine how similar they are to others.

As in any machine learning project, the first step is preprocessing. This stage includes two very important steps. On one hand, lemmatization, which reduces each word to its most basic form (for example, “located” becomes “locate” and “continents” becomes “continent”). On the other hand, the removal of so-called stop words such as “which”, “in” and “is”, which are essentially connecting words. We also perform normalization tasks, such as punctuation removal, and tokenization, where we split the text into tokens (in our case, words).
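As a minimal sketch of this pipeline, the snippet below normalizes, tokenizes, removes stop words and lemmatizes a question. The lemma map and stop-word list are tiny illustrative stand-ins; in practice a library such as spaCy or NLTK would handle lemmatization and stop words properly.

```python
import re

# Toy lemma map and stop-word list, for illustration only.
LEMMAS = {"located": "locate", "continents": "continent"}
STOP_WORDS = {"which", "in", "is", "the", "on", "what"}

def preprocess(question: str) -> list[str]:
    # Normalization: lowercase and strip punctuation.
    text = re.sub(r"[^\w\s]", "", question.lower())
    # Tokenization: split into word tokens.
    tokens = text.split()
    # Stop-word removal, then lemmatization of the remaining tokens.
    return [LEMMAS.get(t, t) for t in tokens if t not in STOP_WORDS]

print(preprocess("Which continent is Norway located in?"))
# → ['continent', 'norway', 'locate']
```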

The modeling starts with two hard rules, mainly due to the high computational cost of comparing every question against every other. First, for two questions to be considered duplicates, they must belong to the same category (sports, geography, entertainment, art, history or science) and the same type of question (text or image). The second decision we made was that the answers had to be exactly the same. This brings us to the second stage of any NLP model: we need to create numeric (and/or categorical) variables from our text.
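These two hard rules act as a pruning step before any expensive comparison. A sketch of that filtering, with hypothetical question records (the field names here are illustrative, not the actual schema):

```python
from itertools import combinations

# Hypothetical question records; field names are illustrative.
questions = [
    {"id": 1, "category": "geography", "qtype": "text", "answer": "Europe",
     "text": "Which continent is Norway located in?"},
    {"id": 2, "category": "geography", "qtype": "text", "answer": "Europe",
     "text": "On which continent is Denmark located?"},
    {"id": 3, "category": "geography", "qtype": "text", "answer": "Asia",
     "text": "Which continent is Japan located in?"},
]

def candidate_pairs(questions):
    # Only compare questions that share category, type and exact answer;
    # this prunes the quadratic all-vs-all comparison.
    for a, b in combinations(questions, 2):
        if (a["category"], a["qtype"], a["answer"]) == \
           (b["category"], b["qtype"], b["answer"]):
            yield a["id"], b["id"]

print(list(candidate_pairs(questions)))  # → [(1, 2)]
```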

There are several ways to obtain numerical features from text data. The method we chose is text vectorization (that is, creating a vector for each question) using the GloVe and fastText embeddings.
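A common way to build a question vector from word embeddings is to average the vectors of its words. The sketch below uses toy 3-dimensional vectors to keep it self-contained; real GloVe or fastText embeddings have hundreds of dimensions and are loaded from pretrained files (for example with gensim).

```python
import numpy as np

# Toy 3-dimensional word vectors; real embeddings are pretrained and
# have far more dimensions.
WORD_VECTORS = {
    "continent": np.array([0.9, 0.1, 0.0]),
    "locate":    np.array([0.2, 0.8, 0.1]),
    "norway":    np.array([0.1, 0.2, 0.9]),
    "denmark":   np.array([0.1, 0.3, 0.8]),
}

def vectorize(tokens):
    # Question vector = mean of its word vectors
    # (out-of-vocabulary tokens are ignored).
    vecs = [WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS]
    return np.mean(vecs, axis=0)

q = vectorize(["continent", "norway", "locate"])
```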

Embeddings are word representations produced by models trained on large amounts of text (for example, Wikipedia, Google News or Twitter) that use neural networks to encode words as dense numeric vectors, typically with between 100 and 500 dimensions. Additionally, they seek to reflect the semantic similarity that words have in the language. For example, “king” and “man” will have similar vectors, as will “woman” and “queen”, or “queen” and “king”.

Figure Source: http://veredshwartz.blogspot.sg

To compare the questions we use the cosine distance. This metric, which measures the similarity between two vectors, gives us a value between 0 and 1: the closer the value is to 0, the more similar the vectors are. Cosine similarity works the other way around: the closer it is to 1, the more similar the vectors are.
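The relationship between the two measures can be sketched directly: cosine distance is one minus cosine similarity, computed from the dot product of the two vectors.

```python
import numpy as np

def cosine_distance(u, v):
    # Cosine distance = 1 - cosine similarity;
    # 0 means the vectors point in the same direction.
    sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return 1.0 - sim

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a
c = np.array([-1.0, 0.5, 0.0])  # orthogonal to a

print(cosine_distance(a, b))  # ≈ 0.0 → very similar
print(cosine_distance(a, c))  # ≈ 1.0 → dissimilar
```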

Initially we did a first test using only the embeddings. However, in the validation stage we found high rates of false positives for certain particular cases.

We found that when two questions ask where different countries of the same continent are located (so they have exactly the same answer), the embeddings generate high rates of false positives, because they consider “Norway” and “Denmark” to be similar elements.

In a second iteration, we combined the embeddings with the Term Frequency-Inverse Document Frequency (TF-IDF) technique: we not only take into account each word's value in the chosen embedding, but also how frequently that word appears across all our questions. Thus, words that are repeated a lot carry less weight in the final vectorization.

For example, the words “continent” or “locate” are likely to carry less weight than “Denmark” and “Norway”. In this way, we achieved a large reduction in the false positive rate.
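One way to realize this idea is to replace the plain mean of word vectors with an IDF-weighted average, so that words appearing in many questions contribute less. A self-contained sketch with toy data (the smoothed IDF formula and the tiny corpus are illustrative assumptions, not the production setup):

```python
import math
import numpy as np

# Toy corpus of preprocessed questions and toy word vectors; real
# vectors would come from GloVe/fastText.
corpus = [
    ["continent", "norway", "locate"],
    ["continent", "denmark", "locate"],
    ["continent", "japan", "locate"],
]
WORD_VECTORS = {
    "continent": np.array([0.9, 0.1, 0.0]),
    "locate":    np.array([0.2, 0.8, 0.1]),
    "norway":    np.array([0.1, 0.2, 0.9]),
    "denmark":   np.array([0.1, 0.3, 0.8]),
}

def idf(word, docs):
    # Smoothed inverse document frequency: rarer words get higher weight.
    df = sum(word in d for d in docs)
    return math.log((1 + len(docs)) / (1 + df)) + 1

def tfidf_vectorize(tokens, docs):
    # IDF-weighted average: ubiquitous words like "continent" contribute
    # less than distinctive words like "norway".
    weights = [idf(t, docs) for t in tokens if t in WORD_VECTORS]
    vecs = [WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS]
    return np.average(vecs, axis=0, weights=weights)

# "norway" appears in one question, "continent" in all three.
assert idf("norway", corpus) > idf("continent", corpus)
```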

Finally, based on the category of each question and the manual validation process that we carry out, we choose a similarity threshold beyond which two questions are considered duplicates, leading to the elimination of one of them.

It’s worth noting that for questions with images, in some cases we need to compare the images as well, using image embeddings! But we will see that some other time…

Conclusion

It is very important for data science teams working in the trivia game industry to know where and how natural language processing can be applied. The projects that can be done are very varied and are not limited to the detection of duplicate questions.

In the case of etermax, which is present in more than 180 countries, we have players who speak different languages, and thanks to the automatic translations we do, we can constantly grow our base of questions. In the same vein of content generation, we have our own automatic question generator, along with models that assign questions to existing topics (such as the Harry Potter topic) and that generate new topics via question clustering. Finally, there are also techniques to improve the quality of existing questions, such as grammar and spell checkers and inappropriate language detection.
