Imitating natural selection for feature trimming

Optimising feature selection with genetic algorithms — an easy-to-use Python script

Roberto Félix Patrón, PhD
Just Eat Takeaway-tech
9 min read · Apr 3, 2023


Introduction

At Just Eat Takeaway.com, improving our data-based models is part of our DNA. Recommender systems are an important part of our connection with our users. To make the best restaurant recommendations possible, a long list of features has been defined and is used to provide suggestions via a horizontal carousel. These features determine which content appears in the carousel. For instance, one might suggest breakfast restaurants (see image below) in the morning or coffee shops in the afternoon.

In the world of data science, features can grow exponentially over time, creating what we call a “feature explosion”. Among these, there are legacy or market-specific features, and some may exhibit a high degree of correlation with one another. Each feature is tied to a particular data pipeline, which contributes to the complexity and maintenance costs of the system. It is therefore essential to periodically review and evaluate which features are worth keeping. Reducing the number of features can also lead to improved model performance.

Here lies the problem: imagine you have 100 features and you want to reduce them to, let’s say, 30 features. Using the combination formula below:

import math

# Calculating the possible combinations nCr
n = 100 # total number of objects in the set
r = 30 # number of choosing objects from the set
nCr = math.factorial(n) / (math.factorial(r) * math.factorial(n - r))
print(f"The total possible combinations without repetitions is: {nCr}")

A total of 2.94e+25 possible combinations! If you want to train your recommender and test that amount of combinations, it will take you quite some time. That’s where optimisation techniques can help machine learning.
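As a side note, Python 3.8+ also provides math.comb, which computes the same binomial coefficient exactly using integer arithmetic:

import math

# Exact equivalent of the nCr computed above
print(f"The total possible combinations without repetitions is: {math.comb(100, 30):.2e}")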

There are various methods to reduce the number of features used in a model. One approach involves the analysis of the individual contribution of each feature through techniques such as Shapley values, which originated from game theory. However, in cases where features are highly correlated, relying solely on Shapley values may not reveal the complete story. In such instances, genetic algorithms can be employed to explore interactions between features, leading to a more comprehensive analysis. These methods offer a rigorous and systematic approach to feature reduction in data science.

Algorithm description

Genetic algorithms are a class of optimisation algorithms inspired by the principles of natural selection and genetics. They are used to solve complex problems that are difficult to solve using traditional optimisation methods. Genetic algorithms imitate the process of natural selection, where the fittest individuals in a population survive and reproduce, passing on their genes to the next generation.

In a genetic algorithm, the problem is represented as a population of potential solutions, where each solution is called an individual. These individuals are evaluated using a fitness function that measures their ability to solve the problem. The individuals with higher fitness are selected to be parents of the next generation, and their genetic material is combined to create offspring that inherit traits from their parents. This process is repeated for several generations until a satisfactory solution is found.

One of the major advantages of genetic algorithms is their ability to find global optima, unlike other optimisation methods that may get stuck in local optima. However, genetic algorithms can be computationally expensive and require careful tuning of the parameters to obtain good results.

This article provides an easy-to-implement and replicable Python script that does not use real customer data, in order to demonstrate how genetic algorithms work and enable readers to create their own genetic algorithms quickly.

The Genetic Algorithm

Let’s dive into the technical components. Below you can see the flowchart that our genetic algorithm will follow:

Start

Here is where you should import the libraries to be used and get your list of features ready. To keep this simple, 100 features called feature_0, feature_1, etc., will be generated:

from typing import List
import pandas as pd
import numpy as np
import random

# Generate a list of 100 features
n_total_features = 100
list_total_features = [f"feature_{i}" for i in range(0, n_total_features)]

Initial population

The second step is to create an initial population. A population is a set of individuals. An individual, in this context, is a list of features to be evaluated. The initial population is of size num_individuals, and each individual is of size num_features.

def create_population(list_total_features: List[str], num_features: int,
                      num_individuals: int) -> List[List[str]]:

    initial_population = []
    for _ in range(num_individuals):

        # Sample num_features random feature indices (without repetition)
        # from the full feature list to build a random individual
        test_features_list_id = random.sample(range(len(list_total_features)), num_features)
        initial_population.append([list_total_features[feature_id] for feature_id in test_features_list_id])

    return initial_population
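For example, a call like the following produces two random individuals of three features each (the output shown is illustrative, since the sampling is random):

population = create_population(list_total_features, num_features=3, num_individuals=2)
print(population)
# e.g. [['feature_42', 'feature_7', 'feature_91'], ['feature_3', 'feature_58', 'feature_16']]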

Evaluation

After the initial population has been created, the next step is to evaluate the fitness of the individuals. This step is business-specific. Let’s assume that we are using the Area Under the Receiver Operating Characteristic curve (or AUC) score to evaluate our model.

The AUC is a value that goes from 0 to 1, where 0 means that all the predictions of the model are wrong and 1 means that all of them are right. An AUC score of 0.5 is what you would get if your model were making random predictions.

As the idea is to give you code that you can easily replicate, and since we are not using a specific database, the snippet below assigns a random fitness value between 0.4 and 0.9. An ID is also given to each individual to keep track of every individual that has ever existed. This should be replaced by your own machine learning model: for a given set of features, you would train and evaluate the model and obtain your desired metric (AUC in our case). For simplicity:

def population_evaluation(initial_population: List[List[str]],
                          ID_counter: int) -> pd.DataFrame:

    # General counter to keep track of every individual trained and evaluated
    c = ID_counter
    individuals_list = []

    for individual in initial_population:

        # Placeholder fitness; replace with your model's real evaluation metric
        random_performance_metric = random.uniform(0.4, 0.9)
        individuals_list.append([c, random_performance_metric, individual])
        c = c + 1

    column_names = ["ID", "Fitness", "Features"]

    return pd.DataFrame(individuals_list, columns=column_names)
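If you want to plug in a real model instead of the random placeholder, a minimal sketch of a fitness function could look like the one below. It assumes a hypothetical feature matrix X (a DataFrame whose columns include all candidate features) and a binary target y, and uses scikit-learn’s logistic regression; none of these names come from the original script:

from typing import List

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def evaluate_individual(individual: List[str], X: pd.DataFrame, y: pd.Series) -> float:

    # Train a simple model on the candidate feature subset and use the
    # held-out AUC as the individual's fitness
    X_train, X_test, y_train, y_test = train_test_split(
        X[individual], y, test_size=0.3, random_state=42)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])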

Selection

So far, you have created a random initial population and trained and evaluated it, keeping track of each individual, its set of features and its fitness. Now the fun part of the genetic algorithm starts: selecting a set of individuals for reproduction.

The underlying principle of genetic algorithms is the concept of reproduction. Individuals with high fitness will tend to produce offspring with high fitness. In this case, we consider the features we select to be the chromosomes. These features are randomly passed down to the children in the algorithm, either from parent A or parent B.

This principle of reproduction can be used to our advantage in machine learning models. In a model, individuals (i.e., sets of features) that score highly will likely pass on those high-scoring features to their offspring, who will inherit some of their fitness. However, it’s important to note that this is not always the case: sometimes, highly fit individuals may produce offspring with lower fitness, or vice versa. Despite this variability, genetic algorithms can still be a powerful tool in machine learning. By leveraging the principle of reproduction, we can iterate through many generations of the algorithm, with each generation getting better at solving the problem. Over time, the algorithm will converge on a solution that optimises the problem’s objective function.

Keep in mind that this is an optimisation problem and not an exact replica of evolution. We are still trying to get to the optimal solution while limiting the calculations as much as possible. This means that we will select the individuals that will reproduce and create children. There are different methods for this. Rank selection, for example, consists of sorting the individuals by their fitness and only allowing the top-rated individuals to reproduce. This, however, may lead to quick convergence towards a local optimum. In the example below, a roulette wheel selection method is used instead. Roulette wheel selection is a process that makes better-fitted individuals more likely to be selected, but it doesn’t guarantee it.

The total fitness of the population is divided according to the fitness of each individual, and a roulette wheel is created as in the image below. Each time the roulette wheel is turned, one individual is selected, and the weights of the roulette are recalculated. This gives all individuals a chance of reproduction, with higher probability for the fitter individuals and lower probability for the less fit ones, which contributes to the diversity of the population and a better chance of reaching the optimal solution.

Roulette wheel
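As a small worked example with made-up fitness values: three individuals with fitness 0.9, 0.6 and 0.5 would occupy 45%, 30% and 25% of the wheel, respectively:

# Made-up fitness values for three individuals
fitness = np.array([0.9, 0.6, 0.5])

# Each individual's slice of the roulette wheel
probabilities = fitness / fitness.sum()
print(probabilities)  # [0.45 0.3  0.25]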

Here is the code to do this:

def population_selection(evaluated_individuals_df: pd.DataFrame,
                         n_individuals_to_select: int) -> pd.DataFrame:

    # Roulette wheel: selection probability is proportional to each
    # individual's share of the total population fitness. Sampling without
    # replacement plays the role of re-turning the wheel after each pick.
    selected_ids = np.random.choice(
        evaluated_individuals_df['ID'], n_individuals_to_select, replace=False,
        p=evaluated_individuals_df['Fitness'] / evaluated_individuals_df['Fitness'].sum())

    # Return the selected rows so the reproduction step can access their features
    selected = evaluated_individuals_df[evaluated_individuals_df['ID'].isin(selected_ids)]

    return selected.reset_index(drop=True)

Reproduction

The next step is to create children. The reproduction is done through a crossover method. There are different ways to do it, but here we use a crossover in the middle of each individual. Assuming the following two parents, where each block is a chromosome (i.e., a feature):

Two children will be created as follows:
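As an illustration with made-up feature names, the midpoint crossover works as follows:

parent_a = ["feature_1", "feature_2", "feature_3", "feature_4"]
parent_b = ["feature_5", "feature_6", "feature_7", "feature_8"]

# Midpoint crossover: each child takes one half from each parent
half = len(parent_a) // 2
child_1 = parent_a[:half] + parent_b[half:]  # ['feature_1', 'feature_2', 'feature_7', 'feature_8']
child_2 = parent_b[:half] + parent_a[half:]  # ['feature_5', 'feature_6', 'feature_3', 'feature_4']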

There is a possibility that when two parents create children, some chromosomes are shared and passed to the children as duplicates. If this is the case, each duplicated chromosome is replaced by a new random chromosome (feature). This replacement is a form of mutation, and the process is repeated until there are no more duplicated chromosomes.

When developing genetic algorithms, mutation is often an explicit part of the process and it is intended to add diversity to the population. There are other techniques that may increase diversity, such as “doomsday”, which consists of randomly eliminating some of the individuals and adding new ones after a few generations. This hasn’t been considered here, but it can be interesting to explore this and other interesting ways to improve genetic algorithms.

def population_reproduction(selected_ind_list: pd.DataFrame,
                            list_total_features: List[str], num_features: int) -> List[List[str]]:

    def features_agg_fn(rows):
        # Midpoint crossover between a pair of parents; np.unique drops
        # features that were inherited from both parents
        half = len(rows.iloc[0]) // 2
        return np.unique(rows.iloc[0][:half] + rows.iloc[1][half:]), np.unique(
            rows.iloc[1][:half] + rows.iloc[0][half:])

    # Pair up consecutive parents and create two children per pair
    children = selected_ind_list.groupby(selected_ind_list.index // 2).agg({'Features': features_agg_fn}).Features.explode()

    children = children.reset_index(drop=True)
    updated_children = []

    for child in children:
        child = child.tolist()

        # Mutation: top up with random features until the child has
        # num_features unique features again
        while len(child) < num_features:
            child = child + [random.choice(list_total_features)]
            child = np.unique(child).tolist()

        updated_children.append(child)

    return updated_children

New random individuals

To increase diversity in our algorithm, alongside the previously created children, num_new_individuals new random individuals are created.

The create_population function can be reused for this step. The algorithm stops once the predefined total number of generations is reached; at the end, the fittest individual ever created is taken as the (sub)optimal solution.

Full process

Below you can find the full process and all the functions put together.

# GA parameters
num_features = 10
num_individuals = 100
n_individuals_to_select = 40
num_new_individuals = 10
id_start = 1
num_generations = 10

total_list_individuals = pd.DataFrame()

# Initial population
population = create_population(list_total_features, num_features, num_individuals)

for i in range(num_generations):

    # Evaluate the current population and log every individual ever created
    evaluated_individuals_df = population_evaluation(population, id_start)
    id_start = id_start + len(population)
    total_list_individuals = pd.concat([total_list_individuals, evaluated_individuals_df], ignore_index=True)

    # Roulette wheel selection of the parents
    selected_individuals_list = population_selection(evaluated_individuals_df, n_individuals_to_select)

    # Crossover (with mutation of duplicated features)
    children_list = population_reproduction(selected_individuals_list, list_total_features, num_features)

    # Add fresh random individuals to keep the population diverse
    new_random_individuals = create_population(list_total_features, num_features, num_new_individuals)
    population = new_random_individuals + children_list

print(f"The most fitted individual has an AUC score of: {max(total_list_individuals['Fitness']):.2f}.")

Conclusion

That’s it! You have implemented a full genetic algorithm using roulette wheel selection. Whether the final solution is optimal or suboptimal will depend on the complexity of your problem, your computation capabilities, the number of individuals and generations you use, and the total number of possible combinations. The success of genetic algorithms also depends on how well the problem can be represented as a genetic optimisation problem, as fitness inheritance is not guaranteed in all problems. For us at Just Eat Takeaway.com, it was possible to find a reduced set of features with similar fitness to the full feature set, even with 2.94e+25 possible combinations!

Special thanks to Max Knobbout for his contribution to this article!

Just Eat Takeaway.com is hiring! Want to come work with us? Apply today.
