Text Mining with Evolutionary Algorithms
Problem Outline
In order to stay at the forefront of a highly competitive market, it is often necessary to keep track of what your global competitors are doing. Unfortunately, this information is not easily found: it requires significant effort to gather from online sources and then extract only the relevant parts. The client was spending a significant amount of time manually searching online to discover new competitor offerings, both to gain a better understanding of the market space and to target potential new leads. This manual searching is so time intensive that it could only be done on a monthly basis, meaning that much of the gathered information was already stale by the time it was used. A further challenge came from the fact that the client operated in multiple markets across the world: not only was the search space incredibly large, but the information was rarely in English alone, requiring a multi-language team to find all relevant market information.
Solution
The task, therefore, was to build a Proof of Concept system capable of automatically gathering market information from the web on a daily basis, predicting whether that information was relevant across multiple languages, and finally presenting it back to the client in a way that would integrate well with their workflow, all within a 3-week window.
This article explains the system at a high level before diving into the custom AI model that was created for this task.
System Overview
The proposed system for solving the client’s problem can be seen below, with its core functionalities being the following:
- Gathering Data
- Predicting & Sorting Data from most to least relevant
- Presenting results
- Retraining model based on user interactions
Data Gathering
The first step is to set up several scrapers that collect raw text from multiple online sources such as social media, search engines & news sites. These scrapers either look directly at competitor social media posts or search for information related to the competitors on the wider internet. This scraping process is repeated daily for all competitors in all languages, and the results are fed into a database where they are stored for use by the rest of the system.
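The article doesn’t detail the exact sources or APIs used, but a minimal sketch of what one such daily scraper might look like is below, assuming a hypothetical search endpoint (`https://example.com/search`) and a local SQLite store; the real system would target actual social, search & news sources.

```python
import sqlite3
from datetime import date

import requests

DB_PATH = "market_intel.db"  # hypothetical local store

def init_db(conn: sqlite3.Connection) -> None:
    # One row per scraped item; the raw text is cleaned later in the pipeline.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS raw_items (
               scraped_on TEXT, competitor TEXT, language TEXT, raw_text TEXT
           )"""
    )

def scrape_news(competitor: str, language: str) -> list[str]:
    # Placeholder endpoint, standing in for whichever search/news/social
    # source a real scraper would target.
    resp = requests.get(
        "https://example.com/search",
        params={"q": competitor, "lang": language},
        timeout=10,
    )
    resp.raise_for_status()
    return [hit["text"] for hit in resp.json().get("results", [])]

def daily_run(competitors: list[str], languages: list[str]) -> None:
    # Repeated once a day for every competitor in every language.
    conn = sqlite3.connect(DB_PATH)
    init_db(conn)
    for competitor in competitors:
        for language in languages:
            for text in scrape_news(competitor, language):
                conn.execute(
                    "INSERT INTO raw_items VALUES (?, ?, ?, ?)",
                    (date.today().isoformat(), competitor, language, text),
                )
    conn.commit()
    conn.close()
```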
Text Pre-processing
A key step in any text modelling task is to apply pre-processing steps to the raw text in order to clean & standardise it before it can be analysed or modelled.
The reason pre-processing is so important is that models are often unable to recognise that two words hold the same meaning. Word pairs such as “Win” & “win” or “win” & “won” would all be considered separate words even though the base meaning is identical, so standardising them ensures that the model does not split its understanding across multiple forms of the same word. Alongside this, removing “stopwords” (common words such as “a”, “an”, “the”, etc.) ensures that only the words that carry the meaning of the sentence are kept; otherwise any model would quickly become overwhelmed by common words that add no information about what the text is about.
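As an illustration, here is a minimal pre-processing function using NLTK (an assumption; the article doesn’t name the library used). It lowercases, strips punctuation, removes English stopwords and stems each token; for the multi-language case, per-language stopword lists and stemmers would be swapped in.

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)  # one-time corpus download

STOPWORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(raw_text: str) -> list[str]:
    # Lowercase so "Win" and "win" collapse into a single token,
    # and keep only alphabetic runs, dropping punctuation and digits.
    tokens = re.findall(r"[a-z]+", raw_text.lower())
    # Drop stopwords ("a", "an", "the", ...), then stem so regular
    # inflections like "winning" reduce to "win". Irregular pairs such
    # as "win"/"won" would need a lemmatizer rather than a stemmer.
    return [STEMMER.stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("The team is winning new contracts"))
# ['team', 'win', 'new', 'contract']
```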
AI Model Development
The first step of the modelling process was to transform the raw text into a format the model could understand. To do this, a list of the 150 most frequently used words in Relevant items and a list of the 150 most frequently used words in Non-relevant items were created. These two lists were then merged and used as a “word presence filter” of sorts.
As seen in the diagram above, the raw text is passed through the filter to determine whether each vocabulary word is present within the document. This standardises all possible inputs into a fixed-length representation that is ready to be used to train a model.
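A sketch of how this word presence filter could be built and applied, in plain Python (the names here are illustrative, not the project’s actual code):

```python
from collections import Counter

TOP_N = 150  # per-class vocabulary size used in the project

def build_vocabulary(relevant_docs, non_relevant_docs):
    # Each doc is a list of pre-processed tokens (see previous step).
    top_relevant = [w for w, _ in Counter(
        t for doc in relevant_docs for t in doc).most_common(TOP_N)]
    top_non_relevant = [w for w, _ in Counter(
        t for doc in non_relevant_docs for t in doc).most_common(TOP_N)]
    # Merge the two lists into one de-duplicated, ordered vocabulary.
    return sorted(set(top_relevant) | set(top_non_relevant))

def encode(doc_tokens, vocabulary):
    # Binary "word presence" vector: 1 if the vocab word appears, else 0.
    present = set(doc_tokens)
    return [1 if word in present else 0 for word in vocabulary]

vocab = build_vocabulary(relevant_docs=[["price", "launch"]],
                         non_relevant_docs=[["weather"]])
print(vocab)                            # ['launch', 'price', 'weather']
print(encode(["launch", "today"], vocab))  # [1, 0, 0]
```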
The Model
The core of an Evolutionary Algorithm can be thought of as a parallel to real-world evolution amongst organisms, where only the fittest in an environment survive long enough to reproduce and pass on their genetic information. Rather than actual organisms with genetic information, we instead create virtual organisms whose genes represent the importance/weight of each word in predicting whether or not a piece of text is relevant.
What this means is that each of the Evolutionary Algorithm’s solutions acts as a simple scoring model: as seen above, if a word is present, its weight is added to the total, and if the total reaches a certain threshold the model predicts the incoming text as “Relevant”. Now that we have a way to predict whether a piece of text is “Relevant”, the next step is to let the Evolutionary Algorithm run and optimise the weights for us!
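In code, such a scoring model is just a weighted sum compared against a threshold; a minimal sketch (names and values illustrative):

```python
def predict_relevant(presence_vector: list[int],
                     weights: list[float],
                     threshold: float) -> bool:
    # Genes = one weight per vocabulary word. Only words that are
    # present in the document contribute their weight to the score.
    score = sum(w for present, w in zip(presence_vector, weights) if present)
    return score >= threshold

# e.g. two of three vocab words present, weights favouring the second:
print(predict_relevant([1, 1, 0], weights=[0.1, 0.9, -0.5], threshold=0.8))
# True, since 0.1 + 0.9 = 1.0 clears the 0.8 threshold
```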
The basic way the Evolutionary Algorithm works can be broken down into the following steps:
0. Create a random population of solutions
1. Initialise the new population (the random one on the first cycle, thereafter the offspring from the previous cycle)
2. Assess each solution’s prediction performance
3. Select a % of the top-performing solutions and discard the rest
4. Cross over the genetic information of the top-performing solutions
5. Randomly mutate the resulting population
6. Repeat steps 1–5
While an improvement might not be seen every cycle, the overall effect of this process is that the genes coding for the best solutions are propagated forward each generation and improved upon through the cross-over of well-performing parents & mutation. By the end of the experiment, a highly competitive set of word weightings is produced, and the best-performing model is taken forward as the final prediction model for the production environment.
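Putting the steps together, a compact sketch of the loop might look like the following. The hyper-parameters (population size, top fraction, mutation rate, generations) are assumptions for illustration; the article doesn’t state the actual values. The `fitness` argument scores a weight vector, for example with the F1-based function described in the next section.

```python
import random

POP_SIZE = 100        # candidate weight vectors per generation (assumed)
TOP_FRACTION = 0.2    # share of best solutions kept for breeding (assumed)
MUTATION_RATE = 0.05  # chance that any single gene is perturbed (assumed)
GENERATIONS = 200     # number of evolution cycles (assumed)

def evolve(fitness, n_genes):
    # Step 0: random initial population, one weight ("gene") per vocab word.
    population = [[random.uniform(-1, 1) for _ in range(n_genes)]
                  for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        # Steps 2-3: score every solution, keep only the top performers.
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: max(2, int(POP_SIZE * TOP_FRACTION))]
        # Steps 4-5 (and 1, for the next cycle): breed a full new
        # population via cross-over and random mutation.
        population = []
        while len(population) < POP_SIZE:
            mum, dad = random.sample(parents, 2)
            child = [random.choice(genes) for genes in zip(mum, dad)]
            child = [g + random.gauss(0, 0.1)
                     if random.random() < MUTATION_RATE else g
                     for g in child]
            population.append(child)
    # Return the best solution from the final generation.
    return max(population, key=fitness)
```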
Prediction Performance Evaluation
As mentioned in the high-level steps of the Evolutionary Algorithm above, a key component of the process is assessing the prediction performance of each potential solution. There are many ways this could be measured, such as accuracy, but it was decided to use the F1 metric instead as it delivers a better indication of the model’s actual prediction performance.
The reason F1 is a better metric for evaluating model performance here is that it is computed as a balance between two other metrics:
Precision
What % of records that were predicted “Relevant” are actually “Relevant”?
Recall
What % of actual “Relevant” records did it find?
What this means is that the F1 score expresses both how reliable the model is when it flags a record as relevant and how many of the relevant records it finds across the entire dataset. By using this metric we get a more balanced view of actual prediction performance.
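Concretely, F1 is the harmonic mean of the two: F1 = 2 × (Precision × Recall) / (Precision + Recall). A sketch of a fitness function built on it, matching the scoring model above (variable names are illustrative):

```python
def f1_fitness(weights: list[float], threshold: float,
               encoded_docs: list[list[int]], labels: list[bool]) -> float:
    # Predict each document with the weighted-sum model from above.
    preds = [sum(w for present, w in zip(doc, weights) if present) >= threshold
             for doc in encoded_docs]
    tp = sum(1 for p, y in zip(preds, labels) if p and y)
    fp = sum(1 for p, y in zip(preds, labels) if p and not y)
    fn = sum(1 for p, y in zip(preds, labels) if not p and y)
    # Precision: what % of records predicted "Relevant" actually are.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: what % of the actually "Relevant" records were found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Harmonic mean: high only when precision AND recall are both high.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```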
Cross-over
In Evolutionary Algorithms, cross-over is the process of combining the genes of two parents to create a new generation of offspring. The way genes are selected from the parents can vary, but here the simplest method was implemented: each gene is taken from a randomly chosen parent, as seen in the diagram below.
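This gene-from-a-random-parent scheme is commonly known as uniform cross-over; a minimal sketch:

```python
import random

def uniform_crossover(parent_a: list[float],
                      parent_b: list[float]) -> list[float]:
    # Each gene in the child is copied from one parent, picked at random.
    return [random.choice(genes) for genes in zip(parent_a, parent_b)]

parent_a = [0.8, -0.2, 0.5, 0.1]
parent_b = [0.3, 0.6, -0.4, 0.9]
print(uniform_crossover(parent_a, parent_b))
# e.g. [0.8, 0.6, 0.5, 0.9], each position taken from one of the parents
```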
Results
While the system was only a proof of concept, and therefore has plenty of room to grow as the models are continually retrained and improved, the results already show a significant improvement from the old system to the newly proposed one.
As the improvements above show, on every measure the new system makes market information easier to gather and sort than the previous method. But the most important improvement is that the system can now be run on a daily instead of monthly basis. This single improvement allows the client to react immediately to important market developments rather than waiting until the end of the month, giving them a competitive advantage.