Stories by Alex Molas on Medium

How to initialize your bias.

Alex Molas — Thu, 23 Feb 2023 23:00:42 GMT

tldr

Correctly initializing the bias of the last layer of your network can speed up the training process. In this post, I first show how to derive the best values for the biases analytically, and then I run an experiment to show the impact of using the correct bias.

In particular, the best biases are

Classification problem with M classes with frequencies F_i, such that F_1 + F_2 + … + F_M = 1, using softmax activation and categorical cross entropy loss: b_i = log(F_i)

Regression problem using L² penalization and linear activation: b = mean(y)

Regression problem using L¹ penalization and linear activation: b = median(y)

Photo by Alina Grubnyak on Unsplash

Motivation

These last weeks at work I’ve tuned a neural network that is used to predict arrival times. Basically, the network receives a representation of Stuart’s platform state (where are the drivers, where are the packages, etc.) and outputs the estimated time of arrival of some drivers. We decided to use a deep learning approach to avoid doing boring and unmaintainable feature engineering, but the problem then was to choose the model architecture. If we were solving an image classification problem it would have been trivial to design the architecture, in fact, we wouldn’t need to design anything, just take ResNet50 and fine-tune it. However, our problem is not standard in the deep learning world, so we couldn’t rely on pre-trained models or copy the architecture of previously successful models. We ended up defining an architecture based on convolutions, self-attention, and some dense layers here and there. The results were pretty good -it beat the previous model by +30%- and the model was deployed and everyone was happy.

However, not everything is always that easy, and at some point, we noticed that our model was overfitting. This wasn’t surprising since the model architecture and training process were never tuned. We just took our initial idea, ran some experiments, changed some hyper-params by hand and called it a day. But now that the model is deployed and the stakeholders are happy we are working on tuning the model and making it more competitive. To do so I started with the great post by the great Karpathy here. It’s not the first time I’ve read it, but this time one of the points especially caught my attention.

verify loss @ init. Verify that your loss starts at the correct loss value. E.g. if you initialize your final layer correctly you should measure -log(1/n_classes) on a softmax at initialization. The same default values can be derived for L2 regression, Huber losses, etc.

What does Karpathy mean by verifying that your loss starts at the correct value? How can we achieve the -log(1/n_classes) loss on a softmax? What are the respective initializations for L2 regression, Huber, etc.? In this post, I'll show how to initialize the network to fulfil these requirements, and what the implications are.

Problem statement

We want to solve the problem of

Which is the best initialization scheme for our network layers?

This is a broad question and has been addressed in a lot of works, such as Glorot and He. In these works, the authors initialize the weights of the layers by sampling from a distribution with some optimized parameters. For instance, Glorot proposes to sample from N(0, 2/(n_i+n_o)) and He proposes to sample from N(0, 2/n_i), where n_i and n_o are the number of inputs and outputs of the layer and the second argument is the variance. The common thing between these approaches is that the mean of the distribution is 0. However, these works focus on the initialization of the weights of all the matrices of our network, while Karpathy talks only about the initialization of the last layer. So, instead of solving the general question about how to initialize all the layers of the network, I will address the simplified problem of

Which is the best initialization scheme for the last layer of our network?

Solution

In this section, I will answer the above question for several deep learning architectures.

Classification

Let’s start with a classification problem. We can define a neural network of depth N as a set of weight matrices {W_1, W_2, …, W_N}, a set of biases {b_1, b_2, …, b_N}, a set of non-linear activations {f_1, f_2, …, f_{N-1}} and a final activation layer f_N = softmax. Then the output of the network is defined by the recurrence

x_i = f_i(W_i x_{i-1} + b_i), for i = 1, …, N

where x_0 is the input of our network. Now, we are interested in the input to the last layer, i.e. the softmax layer. As we saw before, the usual initialization uses a normal distribution with mean zero; therefore, if the input of the network is standardized (i.e. it has mean zero) we can expect the input to the last layer to satisfy, on average, W_N x_{N-1} = 0. Therefore, the output of the last layer has the form

x_N = softmax(b_N)

Cool, we have our first result. Let’s now see how we can use this to optimize the initial values of b_N. The standard approach to classification problems is to use the cross-entropy loss

L = -Σ_{i=1}^{M} y_i log(ypred_i)

where M is the number of classes of our problem. Therefore, if we want to minimize the expected loss at initialization our best guess is to set b_N such that the network output follows the same distribution as our data. That is, if our training dataset has M classes that appear with frequency F_i such that F_1 + F_2 + … + F_M = 1, we would like ypred_i = F_i, i.e. the prediction for class i has the same probability as in the training dataset. With such an output the expected loss is then

E[L] = -Σ_{i=1}^{M} E[y_i] log(E[ypred_i]) = -Σ_{i=1}^{M} F_i log(F_i)

where we have used that at initialization y_i and ypred_i are independent, so we can write E[y_i log ypred_i] = E[y_i]log(E[ypred_i]). Notice that if the problem is balanced then we have F_i = 1/M and the expected loss at initialization is L = -log(1/M), as Karpathy says.

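Nice, now we know which value to expect for the loss for a correctly initialized last layer, but we still need to know how to set b_N such that the output has the same distribution as the training dataset. To do so we can use the definition of softmax

softmax(z)_i = exp(z_i) / Σ_{j=1}^{M} exp(z_j)

Now, using that the output of the last layer is softmax(b_N), we can write our constraint as

exp(b_{N,i}) / Σ_{j=1}^{M} exp(b_{N,j}) = F_i

which has the solution

b_{N,i} = log(k F_i)

for any constant k > 0, since softmax doesn’t change when the same constant is added to all of its inputs. Therefore, setting k=1, the optimal initialization bias for our last layer has the form

b_{N,i} = log(F_i)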
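As a quick numerical check of the classification result, here is a minimal sketch, assuming the bias takes the form b_i = log(F_i) and that every class appears at least once in the training labels. The toy label counts and the helper names (`softmax`, `classification_bias`) are made up for illustration and are not taken from the original notebook.

```python
import numpy as np


def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged
    e = np.exp(z - np.max(z))
    return e / e.sum()


def classification_bias(labels, n_classes):
    # Class frequencies F_i estimated from the training labels
    # (assumes no class has zero examples, otherwise log(0) appears)
    counts = np.bincount(labels, minlength=n_classes)
    freqs = counts / counts.sum()
    return np.log(freqs), freqs


# Hypothetical unbalanced dataset with 3 classes (70% / 20% / 10%)
labels = np.array([0] * 70 + [1] * 20 + [2] * 10)
bias, freqs = classification_bias(labels, n_classes=3)

# With zero-mean weights, the initial prediction is softmax(bias)
probs = softmax(bias)

# Expected cross-entropy at initialization with the log-frequency bias...
expected_loss = -np.sum(freqs * np.log(probs))
# ...versus a zero bias, which predicts 1/M for every class
uniform_loss = -np.sum(freqs * np.log(np.full(3, 1 / 3)))

print(probs, expected_loss, uniform_loss)
```

On unbalanced data the log-frequency bias starts from a lower loss than the zero bias, whose loss is log(M) whatever the class frequencies are.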

Regression

In the last section, I’ve shown how to derive the optimal biases at initialization for a classification problem. In this section, I’ll show how to do the same for a regression problem. The main differences between these approaches are (1) the loss we are using, (2) the last layer activation, and (3) the dimension of the output. In regression, the output is usually 1-dim, i.e. we’re just predicting one value, and the last layer activation is just the identity. The most frequent losses for these problems are the mean squared error (L² penalization)

MSE = (1/n) Σ_{j=1}^{n} (y_j - ypred_j)²

and the mean absolute error (L¹ penalization)

MAE = (1/n) Σ_{j=1}^{n} |y_j - ypred_j|

Using the same rationale as before, we want to minimize these losses at initialization. It’s known that without any further information, the value that minimizes MSE is mean(y) and the value that minimizes MAE is median(y). Therefore, since the output of the last layer is just o = b_N (we’re using the identity activation) we have that the values that minimize loss at initialization are

b_N = mean(y) for the MSE

and

b_N = median(y) for the MAE

The expected loss at initialization for the MSE is then the variance since

E[(y - mean(y))²] = Var(y)

and for the MAE

E[|y - median(y)|]

which is the mean absolute deviation around the median.
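The regression case can be checked numerically too. This is a minimal sketch on synthetic targets; the log-normal distribution is only an assumption to get skewed data and is unrelated to the real arrival-time data. It compares a constant prediction equal to mean(y) and median(y) under both losses.

```python
import numpy as np

# Synthetic, skewed regression targets (an assumption for illustration only)
rng = np.random.default_rng(0)
y = rng.lognormal(mean=3.0, sigma=0.5, size=10_000)

# Candidate initial biases for the last (linear) layer
bias_mse = np.mean(y)     # minimizes the mean squared error
bias_mae = np.median(y)   # minimizes the mean absolute error


def mse(y, constant):
    # Loss of a network that always outputs `constant`
    return np.mean((y - constant) ** 2)


def mae(y, constant):
    return np.mean(np.abs(y - constant))


init_mse = mse(y, bias_mse)   # equals the variance of y
init_mae = mae(y, bias_mae)   # mean absolute deviation around the median

print(init_mse, np.var(y))
print(init_mae, mae(y, bias_mse))
```

A zero bias, by contrast, starts from a loss of roughly mean(y)² + Var(y) for the MSE, which can be orders of magnitude larger when the targets are far from zero.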

In the original post, Karpathy says that you can also find the optimal values for the Huber loss. However, unlike with MAE and MSE, there’s no closed form for the value that minimizes the Huber loss (explanation here). We could still obtain the value that minimizes the Huber loss for our dataset numerically and then use it as the bias of our layer.
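One way to obtain the Huber-optimal constant numerically is plain gradient descent on the mean Huber loss. The sketch below is an illustration under stated assumptions rather than the method used in the experiments: the function name `huber_bias`, the step size of 1 and the toy data are all made up here. It relies on the fact that the derivative of the mean Huber loss with respect to the constant b is -mean(clip(y - b, -delta, delta)).

```python
import numpy as np


def huber_bias(y, delta, max_iter=10_000, tol=1e-10):
    """Constant b that minimizes the mean Huber loss of (y - b)."""
    y = np.asarray(y, dtype=float)
    b = np.median(y)  # robust starting point
    for _ in range(max_iter):
        # Negative gradient of the mean Huber loss with respect to b.
        # The gradient is 1-Lipschitz, so a step size of 1 is safe.
        step = np.mean(np.clip(y - b, -delta, delta))
        b += step
        if abs(step) < tol:
            break
    return b


# Toy targets with one outlier (hypothetical data)
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

print(huber_bias(y, delta=1e6))  # behaves like MSE: close to mean(y)
print(huber_bias(y, delta=0.1))  # behaves like MAE: close to median(y)
print(huber_bias(y, delta=5.0))  # something in between
```

For a very large delta the Huber loss is quadratic everywhere and the result matches mean(y); for a very small delta it is essentially the absolute error and the result matches median(y).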

Results

In the previous sections, I explained how to determine the best initial bias through mathematical analysis. However, in the real world, things are not always precise, and data can show that our assumptions were incorrect. In this section, I will conduct some experiments to see the impact of initializing biases correctly.

To conduct these experiments, I will use the CIFAR-10 dataset. I have made the problem unbalanced by sampling each class. Then, I created two CNN networks: one with the optimal bias strategy defined above and another with the standard initialization. You can find the code used to generate the models and datasets in this notebook.

The results are summarized in the following plot. We can see that the network with the optimized initial bias learns faster than the one with the standard initialization. This effect disappears if we train the network for a sufficient number of epochs. However, training large models is often costly. Therefore, if we can save time and money by setting the correct bias, it is worthwhile.

Originally published at https://www.alexmolas.com on February 23, 2023.

Analyzing Chess960 Data

Alex Molas — Fri, 13 Jan 2023 17:38:23 GMT

Using more than 14M Chess960 games to find if there’s a variation that’s better than the others

In this post, I analyze all the available Chess960 games played in Lichess. With this information, and using Bayesian A/B testing, I show that there are no starting positions that favor any of the players more than other positions.

The original post was published here. All the images and plots, unless stated otherwise, are by the author.

Photo by Hassan Pasha on Unsplash

Introduction

The World Fischer Random Chess Championship recently took place in Reykjavik, with GMHikaru emerging victorious. Fischer Random Chess, also known as Chess960, is a unique variation of the classic game that randomizes the starting position of the pieces. The intention behind this change is to level the playing field by eliminating the advantage of memorized openings and forcing players to rely on their skill and creativity.

As I followed the event, one question came to mind: are there certain initial Chess960 variations that give one player an unfair advantage? As it stands, the standard chess initial position gives white a slight edge, with white usually winning around 55% of game points (ref) and Stockfish giving white a score of +0.3 (ref). However, this edge is relatively small, which is likely one of the reasons why this position has remained the standard.

There is some work already done on this topic. Ryan Wiley wrote this blog post where he analyzes some data from lichess and reaches the conclusion that some variations are better than others. In the post, he says that some positions have a higher winning probability for the white pieces, but he doesn’t show how significant this claim is. This made me think that maybe his findings need to be revisited. He also trains an ML model on the data in order to determine the winner of a game using as inputs the variation and the ELOs of the players. The resulting model has an accuracy of 65%.

On the other hand, there’s also this repo with the statistics for 4.5 million games (~4500 games per variation). In this repo the biggest differences between white and black are listed, but again no statistical significance is given.

Finally, there’s also some research on this topic focused on computer analysis. In this spreadsheet there’s the Stockfish evaluation at depth ~40 for all the starting positions. Interestingly there’s no position where Stockfish gives the black player an advantage. There’s also this database with Chess960 games between different computer engines. However, I’m currently only interested in analyzing human games, so I’ll not pay a lot of attention to this type of game. Maybe in a future post.

Since none of the previous work has addressed the problem of assigning statistical confidence to the winning chances of each variation of Chess960, I decided to give it a try.

tl;dr

In this post I analyze all the available Chess960 games played in Lichess. With this information I show that

according to Bayesian A/B testing, there are no starting positions that favor any of the players more than other positions
also, the past winning rate of a variation doesn’t predict the future winning rate of the same variation
and stockfish evaluations don’t predict actual winning rates for each variation
finally, knowing the variation being played doesn’t help to predict the winner

Data

Lichess, the greatest chess platform out there, maintains a database with all the games that have been played on their platform. To do the analysis, I downloaded ALL the available Chess960 data (up until 31-12-2022). For all the games played I extracted the variation, the players’ ELO and the final result. The data is available on Kaggle. The scripts and notebooks to download and process the data are available on this repo.

The data I used is released under the Creative Commons CC0 license, which means you can use them for research, commercial purpose, publication, anything you like. You can download, modify and redistribute them, without asking for permission.

Mathematical framework

Bayesian A/B testing

According to the prior work mentioned above some variations are better than others. But how can we be sure that these differences are statistically significant? To answer this question we can use the famous A/B testing strategy. That is, we start with the hypothesis that variation A has bigger winning chances than variation B. The null hypothesis is then that A and B have the same winning rate. To discard the null hypothesis we need to show that the observed data is so extreme under the assumption of the null hypothesis that it doesn’t make sense to still believe in it. To do that we’ll use Bayesian A/B testing.

In the bayesian framework, we assign to each variation a probability distribution for the winning rate. This is, instead of saying that variation A has a winning rate of X% we say that the winning rate for A has some probability distribution. The natural choice when modelling this kind of problem is to choose the beta distribution (ref).

The beta distribution is defined as

f(x; α, β) = x^(α-1) (1-x)^(β-1) / B(α, β)

where B(α, β) = Γ(α)Γ(β)/Γ(α+β), Γ(x) is the gamma function, and for positive integers Γ(n) = (n-1)!. For a given variation, the parameter α can be interpreted as the number of white wins plus one, and β as the number of white losses plus one.

Now, for two variations A and B we want to know how probable it is that the winning rate of B is bigger than the winning rate of A. Numerically, we can do this by sampling N values from A and B, namely w_A and w_B, and computing the fraction of times that w_B > w_A. However, we can compute this analytically, starting with

Pr(p_B > p_A) = Σ_{i=0}^{α_B-1} B(α_A + i, β_A + β_B) / [(β_B + i) B(1 + i, β_B) B(α_A, β_A)]

Notice that the beta function can give extremely small or large numbers, so to avoid numerical underflow or overflow we can transform it using log. Fortunately, many statistical packages have implementations for the log beta function. With this transformation, the addends are transformed to

exp[log B(α_A + i, β_A + β_B) - log(β_B + i) - log B(1 + i, β_B) - log B(α_A, β_A)]

This is implemented in python, using the scipy.special.betaln implementation of log B(a, b), as

import numpy as np
from scipy.special import betaln as logbeta

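def prob_b_beats_a(n_wins_a: int,
                   n_losses_a: int,
                   n_wins_b: int,
                   n_losses_b: int) -> float:
  """Probability that the winning rate of B is bigger than the one of A."""
  # Parameters of the beta distributions: wins + 1 and losses + 1
  alpha_a = n_wins_a + 1
  beta_a = n_losses_a + 1

  alpha_b = n_wins_b + 1
  beta_b = n_losses_b + 1

  # Each addend is computed in log space to avoid overflow
  probability = 0.0
  for i in range(alpha_b):
    probability += np.exp(
      logbeta(alpha_a + i, beta_b + beta_a)
      - np.log(beta_b + i)
      - logbeta(1 + i, beta_b)
      - logbeta(alpha_a, beta_a)
    )
  return probability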
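A simple way to gain confidence in the closed-form sum is to compare it against the sampling approach described above. The following is a minimal sketch under a few assumptions: the functions take the beta parameters α and β directly (not wins and losses), and `math.lgamma` is swapped in for `scipy.special.betaln` only to keep the example dependency-light. The function names are made up for this check.

```python
import math

import numpy as np


def log_beta(a, b):
    # log B(a, b) = log Γ(a) + log Γ(b) - log Γ(a + b)
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def prob_b_beats_a_analytic(alpha_a, beta_a, alpha_b, beta_b):
    # Closed-form sum; alpha_b must be a positive integer
    total = 0.0
    for i in range(alpha_b):
        total += math.exp(
            log_beta(alpha_a + i, beta_a + beta_b)
            - math.log(beta_b + i)
            - log_beta(1 + i, beta_b)
            - log_beta(alpha_a, beta_a)
        )
    return total


def prob_b_beats_a_sampled(alpha_a, beta_a, alpha_b, beta_b,
                           n_samples=200_000, seed=0):
    # Monte Carlo estimate: fraction of draws where w_B > w_A
    rng = np.random.default_rng(seed)
    w_a = rng.beta(alpha_a, beta_a, size=n_samples)
    w_b = rng.beta(alpha_b, beta_b, size=n_samples)
    return float(np.mean(w_b > w_a))


analytic = prob_b_beats_a_analytic(100, 80, 110, 70)
sampled = prob_b_beats_a_sampled(100, 80, 110, 70)
print(analytic, sampled)
```

The two numbers should agree up to Monte Carlo noise, and two identical distributions should give a probability of 0.5.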

With this method, we can compute how probable it is for a variation to be better than another, and with that, we can define a threshold α such that we say that variation B is significantly better than variation A if Pr(p_B>p_A)>1-α, or equivalently Pr(p_A>p_B)<α.

Below you can see the plot of some beta distributions. In the first plot, the parameters are α_A= 100, β_A=80, α_B=110 and β_B=70.

Beta distributions with parameters α_A= 100, β_A=80, α_B=110 and β_B=70

In this second plot, the parameters α_A= 10, β_A=8, α_B=11 and β_B=7.

Beta distributions with parameters α_A= 10, β_A=8, α_B=11 and β_B=7

Notice that in both cases the winning rates are the same, but the distributions look different. This is because in the first case we’re more certain about the actual rate, since we’ve observed more points than in the second case.

Family-wise error rate

Usually, in A/B testing one just compares two variations, e.g. conversions in a website with white background vs blue background. However, in this experiment, we’re not just comparing two variations, but all the possible pairs of variations (remember that we want to find if there is at least one variation that is better than another variation). Therefore, the number of comparisons we are doing is N = 960*959/2 ~ 5e5. This means that using the typical value of α=0.05 is an error because we need to take into consideration that we’re doing a lot of comparisons. For instance, assuming that the winning probability distributions are the same for all the initial positions, treating the comparisons as independent, and using the standard α=0.05, one would have a probability

1 - (1 - α)^N = 1 - 0.95^N ≈ 1

of observing at least one false positive! This means that even if there’s no statistically significant difference between any pair of variations we’ll almost surely observe at least one false positive. If we want to keep the same overall α but increase the number of comparisons to N, we then need to define an effective α such that

1 - (1 - α_eff)^N = α

and solving for α_eff

α_eff = 1 - (1 - α)^(1/N)

Plugging in our values we finally get α_eff ≈ 1e-7.
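The numbers above can be reproduced in a few lines. This is a minimal sketch assuming independent comparisons, i.e. the relation 1 - (1 - α_eff)^N = α (often called the Šidák correction); the function names are made up for illustration. `log1p` and `expm1` are used because α_eff is tiny and the naive formula would lose precision.

```python
import math


def family_wise_error_rate(alpha, n_comparisons):
    # 1 - (1 - alpha)^N, computed in a numerically stable way
    return -math.expm1(n_comparisons * math.log1p(-alpha))


def effective_alpha(alpha, n_comparisons):
    # 1 - (1 - alpha)^(1/N)
    return -math.expm1(math.log1p(-alpha) / n_comparisons)


n_variations = 960
n_pairs = n_variations * (n_variations - 1) // 2  # all pairs of variations

fwer = family_wise_error_rate(0.05, n_pairs)
alpha_eff = effective_alpha(0.05, n_pairs)

print(n_pairs)    # number of pairwise comparisons
print(fwer)       # probability of at least one false positive with alpha=0.05
print(alpha_eff)  # per-comparison threshold that keeps the overall alpha at 0.05
```

For small α the result is very close to the simpler Bonferroni value α/N.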

Train/Test split

In the previous sections, we developed the theory to determine if a variation is better than another variation according to the observed data. That is, after having seen some data we build a hypothesis of the form "variation B is better than variation A". However, we can’t test the truth of this hypothesis using the same data we used to generate the hypothesis. We need to test the hypothesis against a set of data that we haven’t used yet.

To make this possible we will split the full dataset into two disjoint train and test datasets. The train dataset will be used together with the bayesian A/B testing framework to generate hypotheses of the form B>A. And then, using the test dataset we’ll check if the hypotheses hold.

Notice that this approach makes sense only if the distribution of winning rates doesn’t change over time. This seems a reasonable assumption since, AFAIK, there haven’t been big theoretical advances that have changed the winning probability of certain variations during the last few years. In fact, minimizing the theory and preparation impact on game results is one of the goals of Chess960.

Data preparation

In the previous sections we have implicitly assumed that a game can be either won or lost, however, it can also be drawn. I’ve assigned 1 point for a victory, 1/2 point for a draw, and 0 points for a loss, which is the usual approach in chess games.

Results

In this section, we will apply all the techniques explained above to the lichess dataset. In the dataset, we have more than 13M games, which is ~14K games per variation. However, the dataset contains games for a huge variety of players and time controls (from ELO 900 to 2000, and from blitz to classical games). Therefore, doing the comparisons using all the games would mean ignoring confounding variables. To avoid this problem I’ve only used games for players with an ELO in the range (1800, 2100) and with a blitz time control. I’m aware that these filters do not resemble the reality of top-level contests such as the World Fischer Random Chess Championship, but in lichess data, there are not a lot of classical Chess960 games for high-rated players (>2600), so I will just use the group with the most games. After applying these filters we end up with a dataset with ~2.4M games, which is ~2.5K games per variation.

The train/test split has been done using a temporal split. All the games prior to 2022-06-01 are part of the training dataset, and all the games after that date are part of the testing dataset, which accounts for ~80% of the data for training and ~20% for testing.

Generating hypotheses

The first step is to generate a set of hypotheses using A/B testing. The number of variation pairs to compare is pretty big (~5e5) and testing all of them would take a long time, so we’ll just compare the 30 variations with the highest winning rates against the 30 variations with the lowest winning rates. This means we’ll have 900 pairs of variations to compare. Here we see the pair of variations with the biggest difference in the train dataset

Notice that the α for these variations is bigger than α_eff, which means that the difference is not significant. Since these are the variations with the largest difference, we know that there isn’t any variation pair with a statistically significant difference.

Anyway, even if the difference is not significant, with this table one can hypothesize that variation rnnqbkrb is worse than variation bbqrnkrn. If we check these variation values in the test dataset we get

Notice that the “bad” variation still has a winning rate lower than the “good” variation; however, it has increased from 0.473 to 0.52, which is quite a lot. This brings a new question: does past variation performance guarantee future performance?

Past vs Future performances

In the last section, we have seen how to generate and test hypotheses, but we have also noticed that the performance of some variations changes over time. In this section, we’ll analyze this question more in detail. To do so, I have computed the winning rate in the train and test datasets and plotted one against the other.

Train vs Test winning rates

As we can see, there’s no relation between past and future winning rates!

Evaluation vs Rates

We’ve seen that past performances do not guarantee future performances, but do Stockfish evaluations predict future performances? In the following plot, I show the evaluation of Stockfish for each variation and the corresponding winning rate in the dataset.

Stockfish Evaluation vs Winning rate for each variation

Machine learning model

Until now we’ve seen that there are no better variations in the Chess960 game and that previous performance is no guarantee of future performance. In this section, we’ll see if we can predict which side is going to win a match based on the variation and the ELO of the players. To do so I’ll train an ML model.

The features of the model are the ELO of the white and black player, the variation being played, and the time control being used. Since the cardinality of the variation feature is huge I’ll use CatBoost, which has been specifically designed to deal with categorical features. Also, as a baseline, I’ll use a model that predicts that white wins if White ELO > Black ELO, draws if White ELO == Black ELO, and loses if White ELO < Black ELO. With this experiment, I want to see what impact the variation has on the expected winning rate.

In the next tables, I show the classification reports for both models.

CatBoost model

Baseline model

From these tables, we can see that the CatBoost and the baseline model have almost the same results, which means that knowing the variation being played doesn’t help to predict the result of the game. Notice that the results are compatible with the ones obtained here (accuracy ~65%), but in the linked blog it’s assumed that knowing the variation helps to predict the winner, and we have seen that this is not true.

Conclusions & Comments

In this post, I’ve shown that

using the standard threshold to determine significant results is not valid when having more than one comparison, and it needs to be adjusted.
there are no statistically significant differences in the winning rates, ie: we can’t say that a variation is preferable for white than another.
past rates don’t imply future rates.
stockfish evaluations don’t predict winning rates.
knowing which variation is being played doesn’t help to predict the result of a match.

However, I’m aware that the data I’ve used is not representative of the problem I wanted to study in the first place. This is because the data accessible at Lichess is skewed towards non-professional players, and even though I’ve used data from players with a decent ELO (from 1800 to 2100) they are pretty far from the players participating in the Chess960 World Cup (>2600). The problem is that the number of players with an ELO >2600 is very low (209 according to chess.com), and not all of them regularly play Chess960 in Lichess, so the number of games with such characteristics is almost zero.

Analyzing Chess960 Data was originally published in TDS Archive on Medium.

Can Random Forests Overfit?

Alex Molas — Fri, 30 Dec 2022 09:15:37 GMT

Introduction

If you’re like me — a DS/MLE with practical experience — you may be surprised by the title of this post. At first, I assumed the answer was a clear “yes.” However, when someone with more experience asked me this question, I realized that the answer might not be as obvious as I thought. And after some googling I found these quotes by Leo Breiman, the creator of the Random Forest algorithm

Random forests does not overfit. You can run as many trees as you want. (website)

and

Random forests do not overfit as more trees are added, but produce a limiting value of the generalization error. (paper)

Can random forests really not overfit? It seems counterintuitive, but in this post I will explain with math and experiments why it is possible under certain conditions.

To write this post I’ve used this PhD dissertation by Gilles Louppe (one of the creators of sklearn), this blog post that studies the same question, and the original paper by Leo Breiman.

Maths

Model definition

Let’s start by defining mathematically what we mean by a random forest. Assume that we already know how to create single classification trees. Let’s call D the dataset, T a decision tree, and θ the hyperparameters of the tree. The model generated by these pieces is then ψ_{D, θ} = T(D, θ). For the sake of simplicity, let’s assume that all hyperparameters are fixed and that the only free hyperparameter is the random seed, i.e. the values of max_depth, min_samples_split, min_samples_leaf, etc. are fixed.

Random forests are a combination of trees such that each tree depends on the values of a random dataset sampled independently and with the same distribution for all trees in the forest. Therefore, for a set of M randomized trees {ψ_{D, θi} | i = 1,…,M} that learn from the same dataset D we define a random forest model as the average of the predictions of the individual trees

ψ_{D, θ1,…,θM}(x) = (1/M) Σ_{i=1}^{M} ψ_{D, θi}(x)

Random Forest model

Bias-Variance decomposition

Now that we know how to build a random forest, let’s study its bias and variance, which are the key ingredients of overfitting. For a pair x and y, where x is the vector of features and y is the target value, let’s define for a single tree ψ_{D, θi} the expected prediction

μ(x) = E_{D, θi}[ψ_{D, θi}(x)]

and the prediction variance

σ²(x) = V_{D, θi}[ψ_{D, θi}(x)]

Also, we know that the expected generalization error of a model decomposes as

E[Err(x)] = noise(x) + bias²(x) + var(x)

We’re interested in the generalization error that our random forest has depending on its hyperparameters. Let’s start with the bias, which is defined as the difference between the actual value and the model’s expected value. The expected value of the random forest is

E[ψ_{D, θ1,…,θM}(x)] = (1/M) Σ_{i=1}^{M} E[ψ_{D, θi}(x)] = μ(x)

because random variables θi are independent and follow the same distribution. Therefore

bias²(x) = (E[y | x] - μ(x))²

which means that the bias of the random forest is the same as the bias of a single tree, so combining decision trees has no effect on the bias.

Let’s look now at what happens with the variance. After some maths (page 66 of this paper) it can be shown that the variance of a random forest of M trees is

var(x) = ρ(x) σ²(x) + ((1 - ρ(x)) / M) σ²(x)

where ρ(x) is the Pearson correlation between the predictions of two decision trees.

Therefore, the generalization error is given by

E[Err(x)] = noise(x) + (E[y | x] - μ(x))² + ρ(x) σ²(x) + ((1 - ρ(x)) / M) σ²(x)

Generalization error of a Random Forest model

Overfitting

Now that we know how to compute the generalization error of our model we can study how it behaves when changing the hyperparameters. Since each tree is built using only a bootstrap of the full dataset we have ρ(x)<1, i.e. different trees in the random forest give different predictions for the same input. Therefore, if K>L we have

var_K(x) < var_L(x)

where var_M(x) denotes the variance of a random forest with M trees. Also, when M →∞ we have

var_M(x) → ρ(x) σ²(x)

where σ²(x) is the prediction variance of a single tree.

As a result, the expected generalization error of the random forest is smaller than the expected error of a single decision tree. This is, as we add more trees to the random forest the generalization error reduces.

Therefore, adding more trees to the model doesn’t make the model more prone to overfitting, in fact it reduces the generalization error! But it doesn’t mean that a random forest can’t overfit. For example, imagine the simplest case of a random forest with only one tree. If we allow the tree to have an infinite depth it can learn almost perfectly the training data (up to collisions), but then it’ll fail to generalize to other data.
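The variance formula ρσ² + (1 - ρ)σ²/M can be illustrated without training any tree. The sketch below is a toy simulation under explicit assumptions: the "tree predictions" are just Gaussian random variables with variance σ² and pairwise correlation ρ, built from one shared component and one independent component per tree. The function names and parameter values are made up for illustration.

```python
import numpy as np


def forest_variance_formula(rho, sigma2, n_trees):
    # Variance of the average of n_trees predictions with
    # individual variance sigma2 and pairwise correlation rho
    return rho * sigma2 + (1.0 - rho) / n_trees * sigma2


def simulated_forest_variance(rho, sigma2, n_trees, n_draws=100_000, seed=0):
    rng = np.random.default_rng(seed)
    shared = rng.normal(size=(n_draws, 1))     # common to all trees
    own = rng.normal(size=(n_draws, n_trees))  # specific to each tree
    # Each column has variance sigma2; any two columns have correlation rho
    tree_preds = np.sqrt(sigma2) * (np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * own)
    # The forest prediction is the average over trees
    return float(np.var(tree_preds.mean(axis=1)))


rho, sigma2 = 0.3, 1.0  # hypothetical values
for n_trees in (1, 5, 20):
    print(n_trees,
          forest_variance_formula(rho, sigma2, n_trees),
          simulated_forest_variance(rho, sigma2, n_trees))
```

The simulated variance decreases with the number of trees and flattens out near ρσ², which is the limiting value when M goes to infinity.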

Experiment

In the above section, we’ve seen mathematically that adding more trees to the model shouldn’t make the model overfit. In this section, we are going to take a more practical approach and show it directly with some experiments.

To study the effect of the number of trees on overfitting we’ll generate a synthetic dataset using the sklearn.datasets.make_regression method.

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=10_000, 
                       n_features=50, 
                       n_informative=30, 
                       noise=10)
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.3,
                                                    random_state=42)

Now we can train multiple random forests, each one with a different number of trees.

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rf = RandomForestRegressor()
n_estimators = []
train_mse = []
test_mse = []
for n in range(1, 50, 1):
  rf.n_estimators = n
  rf.fit(X_train, y_train)
  y_train_predicted = rf.predict(X_train)
  y_test_predicted = rf.predict(X_test)
  mse_train = mean_squared_error(y_train, y_train_predicted)
  mse_test = mean_squared_error(y_test, y_test_predicted)
  n_estimators.append(n)
  train_mse.append(mse_train)
  test_mse.append(mse_test)

And we can finally plot the MSE for both train and test datasets.

As you can see, there’s a gap between the train and test MSE which means that there’s some overfitting. However, as we add more trees the gap between the train and test MSE doesn’t increase, which means that adding more trees doesn’t make the model more prone to overfitting. In fact, as we add more trees to the model the gap between the curves reduces. In the following plot we see how Gap = Test MSE - Train MSE reduces as n_estimators increases

Notice also that the gap between the train and test curves — aka overfitting — can be changed by tuning other hyperparameters, such as max_depth . In the next plot I show how the gap changes depending on max_depth and n_estimators.

Conclusions

Yes, random forests can overfit since a single tree can overfit.
The bias of random forests is the same as the bias of a single tree, however, the variance decreases as we add more trees to the model, and this is where the power of random forests comes from.
Overfitting in a random forest model can be tuned using other hyperparameters such as max_depth , but increasing n_estimators doesn’t increase the gap between train and test performance.

Vectorizing difficult operations: a trick for filtering lists of lists.

Alex Molas — Fri, 15 Jul 2022 09:15:19 GMT

We all know that vectorizing our operations is the best practice, but what to do when there’s no trivial way of doing it?

In this post, I show how to use a boolean trick to vectorize complex operations involving lists.

The original story was posted here.

In this post, we’ll see how to accelerate a specific type of operation involving lists. Photo by Clay Banks on Unsplash

One of the most covered topics about pandas optimization is how to apply functions over columns. One option is to use apply, but it is not a good idea (maybe this is one of the topics with more posts ever). It’s known that the optimal solution is to use vectorization; however, some functions can’t be vectorized easily. What to do in these cases?

In this post, I’ll present a trick to vectorize operations that involve checking the intersection between lists and other list operations. In particular, I will show how using boolean algebra enables vectorization and can speed up our computations.

All the code and data used in this post is available in this repo.

Introduction

Imagine you are working at an ice cream start-up which sells ice creams of different flavors. There are ice creams of only one flavor, ice creams of two flavors, ice creams of three flavors, and so on. Every time an ice cream is purchased your company stores information about the order and the rating (from 1 to 10) that the client gave to the ice cream.

The resulting table looks like this

Now, your company wants you to answer some questions

When flavor X is included in the ice cream, what is the average rating?
When flavors X and Y are included in the ice cream, what is the average rating?

For example, using the table above the answers would be

and

In general, you want to be able to answer the question

When flavors {X_1, X_2, ..., X_N} are included in the ice cream, what is the average rating?

The question now is how to run this analysis efficiently. More specifically, how to filter a set of lists based on another list fast. Surprisingly, the answer to this question comes from boolean algebra and binary code. Stay with me and I’ll show you how to vectorize these queries!

Which is the best ice cream flavor? Photo by Lama Roscu on Unsplash

Problem Statement

In this section, I’ll generalize the above problem using some mathematical notation.

Given a vocabulary v, two lists of lists L=(l_1, ..., l_n) and M=(m_1, ..., m_n), where every l_i and m_i is a list of elements of v, and a predicate of the form P(l_i, m_i), the question is to find an efficient way to obtain the indices {i | P(l_i, m_i)}.

For example

Given the lists L=((1, 2), (2, 3, 4), (3)) and M=((1), (1, 2), (1, 2, 3))
Given the predicate P(l_i, m_i) = m_i ⊆ l_i, which checks if a list is contained in another list.
Then, the resulting index is 1, since (1) ⊆ (1, 2).
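
The example above can be checked with a few lines of plain Python. This is only a sketch: the names L, M and contained are illustrative, the predicate is read as "every element of m_i also appears in l_i", and indices are counted from 1 as in the text.

```python
# Lists from the example above.
L = [(1, 2), (2, 3, 4), (3,)]
M = [(1,), (1, 2), (1, 2, 3)]


def contained(l_i, m_i):
    """Predicate P(l_i, m_i): every element of m_i is also in l_i."""
    return set(m_i) <= set(l_i)


# 1-based indices i for which P(l_i, m_i) holds.
indices = [
    i for i, (l_i, m_i) in enumerate(zip(L, M), start=1) if contained(l_i, m_i)
]
print(indices)  # [1]
```

This is exactly the kind of row-by-row check that the rest of the post tries to avoid.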

Solutions

Even though there are multiple options to work with datasets in python (polars, dask, vaex, etc.) let’s assume that we’re using pandas. Let’s also assume that the dataframe looks like

and we want to find all the indices where the list l_i is contained in m_i.

We’ll explore two different solutions to the problem of filtering a dataframe using lists. The first one is going to be a brute force one (using apply), and the other one is going to apply some boolean algebra tricks to vectorize the process.

Brute force solution

The most immediate solution is just to iterate over all the rows of the dataframe and check if the intersection is the empty set or not.

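import pandas as pd

def brute_force_not_null_intersection(df: pd.DataFrame, c1: str, c2: str) -> pd.Series:
    def f(r):
        # Compare the two list-valued cells of the row as sets.
        e1 = r[c1]
        e2 = r[c2]
        return len(set(e1) & set(e2)) != 0
    # apply with axis=1 calls f once per row, in Python, which is the slow part.
    return df.apply(f, axis=1)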

However, this is not efficient: we know that vectorization is key and that using apply is not a good idea. So the natural next step would be to vectorize this function. But how can we do it? It isn’t obvious how to write a vectorizable method for checking if a list is in another list.

Binary representation

Here’s where the interesting things start. In this section, we’ll study how we can represent our lists as binary numbers and vectorize the operations.

The first step is to translate every list into an integer. To do that we can use the following algorithm:

Given sets L and M, construct the vocabulary v.
Express every element l_i in L and M as a binary number, where in position j there’s a 1 if v[j] ∈ l_i and 0 otherwise.

For example, the binary representation for l=(b,d) if v=(a,b,c,d) is 1010 (positions are read from right to left, so a corresponds to the rightmost bit).

In python, we can do this transformation with the following snippet

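from typing import Sequence

def list_to_int(vocabulary: Sequence[str], l: Sequence[str]) -> int:
    # Element vocabulary[i] is mapped to bit i of the resulting integer.
    word_to_exponent = {k: i for i, k in enumerate(vocabulary)}
    return sum(2 ** word_to_exponent[k] for k in l)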

With this approach, we can transform our lists into integers, and we know that doing vectorized operations in pandas with integer columns is easy-peasy, amazing!
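
As a quick usage sketch, here is the function applied to the l=(b,d) example from above. The function body is copied from the snippet so that the block runs on its own; the variable names v and code are just for illustration.

```python
from typing import Sequence


def list_to_int(vocabulary: Sequence[str], l: Sequence[str]) -> int:
    # Same function as above: element vocabulary[i] sets bit i of the result.
    word_to_exponent = {k: i for i, k in enumerate(vocabulary)}
    return sum(2 ** word_to_exponent[k] for k in l)


v = ("a", "b", "c", "d")
code = list_to_int(v, ("b", "d"))
print(code)                 # 10
print(format(code, "04b"))  # 1010
```

Note that the sum assumes there are no repeated elements in l: a repeated element would be added twice and corrupt the encoding.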

Binary operations

Cool, we know how to represent sets as binary numbers, but how can we use that to check if a list l contains list m? If we dust off the boolean algebra university books we’ll see it’s as easy as doing m == (l & m). Let’s work through an example

Given v = (a, b, c, d), l = (a, b, c) and m = (a, b).
The binary representation is l -> 0111 and m -> 0011
Now we want to check m == (l & m).
l & m = 0111 & 0011 = 0011
m == (l & m) is 0011 == 0011, which holds. Therefore, we can say that l contains m.
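
The same check can be written directly on Python integers. This is a minimal sketch with two hypothetical helpers, contains and intersects, that mirror the two predicates discussed in this post (containment and non-empty intersection).

```python
def contains(l: int, m: int) -> bool:
    """True if every bit set in m is also set in l, i.e. l contains m."""
    return (l & m) == m


def intersects(l: int, m: int) -> bool:
    """True if l and m share at least one set bit."""
    return (l & m) != 0


l = 0b0111  # (a, b, c)
m = 0b0011  # (a, b)
print(contains(l, m))         # True
print(contains(m, l))         # False: (a, b) does not contain c
print(intersects(l, 0b1000))  # False: (a, b, c) and (d) share nothing
```

Both helpers use only bitwise operators and comparisons, which is what makes the same expressions usable on whole integer columns at once.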

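Here’s the python implementation of the predicate used in the brute force solution (a non-empty intersection), which becomes a single bitwise AND. The containment check is analogous: (df[c1] & df[c2]) == df[c2].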

def vectorized_not_null_intersection(df: pd.DataFrame, c1: str, c2: str) -> pd.Series:
    return (df[c1] & df[c2]) != 0

The great thing about this trick is that it allows us to take advantage of vectorization. In the following section, we’ll compare the brute force and the binary trick approaches.

Results

To compare both approaches we’re going to use a synthetic dataset with N=10^6 examples, a vocabulary size of |v|=15, and a sequence maximum length of 5. To generate your own examples you can use this script.

The results for the brute force algorithm are

%%time

index_brute = brute_force_not_null_intersection(df, 'elements_1', 'elements_2')

>> CPU times: user 10.9 s, sys: 125 ms, total: 11 s
>> Wall time: 11.1 s

and using the binary trick

%%time

converter = Converter(vocabulary=vocabulary)
df['elements_1_bin'] = df['elements_1'].map(converter.convert)
df['elements_2_bin'] = df['elements_2'].map(converter.convert)
index_vec = vectorized_not_null_intersection(df, 'elements_1_bin', 'elements_2_bin')

>> CPU times: user 3.54 s, sys: 52.4 ms, total: 3.6 s
>> Wall time: 3.63 s

This is a speed-up of x3, which is amazing. However, most of the time was spent in the map operation when constructing the binary representations of the lists, but once these representations are created, the not_null_intersection is much faster!

%%time

index_vec = vectorized_not_null_intersection(df, 'elements_1_bin', 'elements_2_bin')

>> CPU times: user 3.61 ms, sys: 1.6 ms, total: 5.21 ms
>> Wall time: 4.05 ms

A speed-up of x2500, OMG! This means that, after an initial overhead of translating lists to integers, we can use vectorized operations to solve our problem and obtain an incredible reduction in computation time.

Conclusions

In this post, we have seen how to transform lists into binary numbers and then use boolean algebra to vectorize difficult operations. With this approach, we’ve seen a huge boost (x2500) in the performance.

Why do we minimize the mean squared error?

Alex Molas — Fri, 27 May 2022 15:55:13 GMT

Have you ever wondered why we minimize the squared error? In this post, I’ll show the mathematical reasons behind the most famous loss function.

Photo by Peter Secan on Unsplash. Minimizing the squared error is like trying to reach the bottom of a valley.

One of the first topics that one encounters when learning machine learning is linear regression. It’s usually presented as one of the simplest algorithms for regression. In my case, when I studied linear regression — back in the days during my physics degree — I was told that linear regression tries to find the line that minimizes the sum of squared distances to the data. Mathematically this is y_p=ax+b, where x is the independent variable that will be used to predict y_t, y_p is the corresponding prediction, and a and b are the slope and intercept. For the sake of simplicity let’s assume that x∈R and y∈R. The quantity that we want to minimize — aka the loss function — is

MSE Loss Function

The intuition behind this loss is that we want to penalize big errors more than small ones, and that’s why we’re squaring the error term. With this particular choice of loss function, it’s better to have 10 errors of 1 unit than 1 error of 10 units: in the first case we would increase the loss by 10 squared units, while in the second case we would increase it by 100 squared units.
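
The arithmetic in the previous paragraph can be reproduced in a couple of lines. This is only an illustrative sketch; sum_squared_errors is a hypothetical helper that adds up the squared errors without averaging them.

```python
def sum_squared_errors(errors):
    """Total squared error contributed by a list of individual errors."""
    return sum(e ** 2 for e in errors)


ten_small = [1] * 10  # 10 errors of 1 unit
one_big = [10]        # 1 error of 10 units

print(sum_squared_errors(ten_small))  # 10
print(sum_squared_errors(one_big))    # 100

# With an absolute-error loss both situations would cost the same.
print(sum(abs(e) for e in ten_small), sum(abs(e) for e in one_big))  # 10 10
```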

However, while this is intuitive, it also seems quite arbitrary. Why use the square function and not the exponential function or any other function with similar properties? The answer is that the choice of this loss function is not that arbitrary, and it can be derived from more fundamental principles. Let me introduce you to maximum likelihood estimation!

Maximum Likelihood Estimation

In this section, I’ll present maximum likelihood estimation, one of my favorite techniques in machine learning, and I’ll show how we can use this technique for statistical learning.

Basics

First of all some theory. Consider a dataset X={x1,…,xn} of n data points drawn independently from the distribution p_real(x). We have also the distribution p_model(θ,x), which is indexed by the parameter θ. This means that for each θ we have a different distribution. For example, one could have p_model(θ,x)=θ*exp(−θ*x), aka the exponential distribution.

The problem we want to solve is to find the θ* that maximizes the probability of X being generated by p_model(θ*,x). That is, among all the possible p_model distributions, which is the one most likely to have generated X? This can be formalized as

and since the observations from X are extracted independently we can rewrite the equation as

While this equation is completely fine from a mathematical point of view, there are some numerical problems with it. In particular, we’re multiplying probability densities, and densities can be very small, so the total product can underflow, i.e. it becomes too small to be represented as a floating-point number. The good news is that this problem can be overcome with a simple trick: just apply log to the product and convert the product to a sum.
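
To see the underflow in practice, here is a small sketch with made-up numbers: 1000 observations whose density under the model is 1e-5 each. The data is hypothetical and only serves to show the numerical effect.

```python
import math

# Hypothetical data: 1000 observations, each with density 1e-5 under the model.
densities = [1e-5] * 1000

product = 1.0
for p in densities:
    product *= p  # underflows to exactly 0.0 long before the loop ends

log_likelihood = sum(math.log(p) for p in densities)

print(product)         # 0.0
print(log_likelihood)  # about -11512.9
```

The product carries no information any more, while the sum of logs is a perfectly ordinary number that can still be compared across different values of θ.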

Since logarithm is a monotonic increasing function, this trick doesn’t change the argmax.
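
Collecting the three steps of this section in one place, and assuming the usual definition of the maximum likelihood estimator written in the p_model(θ,x) notation used above, the chain reads as follows. This is a reconstruction from the surrounding text, not a new result.

```latex
\begin{align}
\theta^* &= \arg\max_\theta \; p_{\text{model}}(\theta, X) \\
         &= \arg\max_\theta \; \prod_{i=1}^{n} p_{\text{model}}(\theta, x_i) \\
         &= \arg\max_\theta \; \sum_{i=1}^{n} \log p_{\text{model}}(\theta, x_i)
\end{align}
```

The second line uses the independence of the observations, and the third one uses the fact that the logarithm is monotonically increasing.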

Example

Let’s work out an example to see how one can use this technique in a real problem. Imagine you have two blobs of data that follow a gaussian distribution — or at least that is what you suspect — and you want to find the most probable center of these blobs.

Points were generated by two Gaussian distributions with centers at (3, 3) and (-3, -3) respectively.

Our hypothesis is that these blobs follow Gaussian distributions with unit covariance, i.e. Σ=[[1,0],[0,1]]. Therefore, we want to maximize

where θ=(θ_1,θ_2), and θ_1 and θ_2 are the centers of the distributions.

Using the following snippet one can get the most plausible centers according to MLE

from scipy.optimize import minimize
import numpy as np
class ExampleMLE:
    def __init__(self, x1, x2):
        self.x1 = x1
        self.x2 = x2
    def loss(self, x):
        mu1 = (x[0], x[1])
        mu2 = (x[2], x[3])
        log_likelihood = (- 1/2 * np.sum((self.x1 - mu1)**2) 
                          - 1/2 * np.sum((self.x2 - mu2)**2))
        return - log_likelihood # adding - to make function minimizable
# x1 and x2 are arrays with the coordinates of point for blob 1 and 2 respectively.
p = ExampleMLE(x1=x1, x2=x2) 
res = minimize(p.loss, x0=(0, 0, 0, 0))
print(res.x)

In the following figure, we can see the most plausible generating distributions according to MLE. In this specific case, the real centers are in (3, 3) and (-3, -3), and the optimal found values are (3.003, 3.004) and (-3.074, -2.999) respectively, so the method seems to work.

Original points and the contour plot of the optimal Gaussians found by MLE.
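
As a sanity check on the optimizer, recall that for a Gaussian with unit covariance the log-likelihood is maximized by the sample mean, so the numerical optimum should sit at the mean of each blob. The sketch below uses only the standard library and a synthetic blob centred at (3, 3); the data, the seed and the helper neg_log_likelihood are assumptions made for this illustration.

```python
import random

rng = random.Random(0)
# Synthetic blob centred at (3, 3) with unit variance (an assumption for this sketch).
blob = [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(500)]


def neg_log_likelihood(points, mu):
    # Same quantity as `loss` above for a single blob, up to an additive constant.
    return 0.5 * sum((x - mu[0]) ** 2 + (y - mu[1]) ** 2 for x, y in points)


mean = (
    sum(x for x, _ in blob) / len(blob),
    sum(y for _, y in blob) / len(blob),
)

# Moving away from the sample mean in any direction increases the loss.
for dx, dy in [(0.1, 0), (-0.1, 0), (0, 0.1), (0, -0.1)]:
    shifted = (mean[0] + dx, mean[1] + dy)
    assert neg_log_likelihood(blob, shifted) > neg_log_likelihood(blob, mean)

print(mean)
```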

Using MLE for predictions

In the last section, we have seen how to use MLE to estimate the parameters of a distribution. However, the same principle can be extended to predict y given x, using the conditional probability P(Y|X;θ). In this case, we want to estimate the parameters of our model that best predict y given x, that is

and using the same tricks as before

Now, the process that generates the real data can be written as y=f(x)+ϵ, where f(x) is the function we want to estimate, and ϵ is the intrinsic noise of the process. Here we are assuming that x has enough information to predict y, and no amount of extra information could help us in predicting noise ϵ. One common assumption is that this noise is normally distributed, ie: ϵ∼N(0,σ²). The intuition behind this choice is that even knowing all the variables that describe the system, you will always have some noise, and this noise usually follows a Gaussian distribution. For example, if you take the distribution of heights of all the women, of the same age, and in the same town, you are going to find a normal distribution.

Therefore, a good choice for the conditional probability for our case is

where f̂(x) is our model and is indexed by θ. This means that our model f̂ will predict the mean of the Gaussian.

Plugging now the conditional probability in the equation we want to maximize we get

where y_ip is the output of our regression model for input x_i. Notice that both n and σ are constant values, so we can drop them from the equation. Therefore the function that we want to solve is

which is the same as minimizing the squared error loss!
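
Written out step by step, and assuming the Gaussian conditional probability described above with mean f̂(x) and variance σ², the derivation looks like this. It is a reconstruction from the text of this section, with y_ip denoting the model output f̂(x_i).

```latex
\begin{align}
\theta^* &= \arg\max_\theta \sum_{i=1}^{n} \log P(y_i \mid x_i; \theta) \\
P(y \mid x; \theta) &= \mathcal{N}\!\left(y;\, \hat{f}(x),\, \sigma^2\right)
  = \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\!\left(-\frac{(y - \hat{f}(x))^2}{2\sigma^2}\right) \\
\sum_{i=1}^{n} \log P(y_i \mid x_i; \theta)
  &= -n \log \sigma - \frac{n}{2} \log(2\pi)
     - \sum_{i=1}^{n} \frac{(y_i - y_{ip})^2}{2\sigma^2} \\
\theta^* &= \arg\min_\theta \sum_{i=1}^{n} (y_i - y_{ip})^2
\end{align}
```

The first two terms of the third line do not depend on θ, and dividing by the constant 2σ² does not move the optimum, which is why only the sum of squared errors survives.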

But why? Our intuition behind the loss function was that it penalizes big errors more than small ones, but what does this have to do with conditional probabilities and normal distributions? The point is that extreme values are very unlikely in a normal distribution, so they will contribute negatively to the likelihood. For example, for p(x)=N(x;0,1), log p(1)≈−1.42, while log p(10)≈−50.92. Therefore, when maximizing the likelihood we’ll prefer values of θ that avoid extreme values of (y_t−y_p)². So the answer to the question Why should we minimize MSE? is Because we’re assuming the noise is normally distributed.
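
The two log-densities quoted above are easy to verify. A minimal sketch, where log_standard_normal is a hypothetical helper implementing the log of the N(0, 1) density:

```python
import math


def log_standard_normal(x: float) -> float:
    """Log-density of N(0, 1) evaluated at x."""
    return -0.5 * math.log(2 * math.pi) - 0.5 * x ** 2


print(round(log_standard_normal(1), 2))   # -1.42
print(round(log_standard_normal(10), 2))  # -50.92
```

The gap between the two values is (10² − 1²)/2 = 49.5, which is exactly the squared-error penalty showing up inside the log-likelihood.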

Conclusions

We just saw that minimizing the squared error is not an arbitrary choice but it has a theoretical foundation. We also saw that it comes from assuming that the noise is distributed normally. The same procedure we studied in this post can be used to derive multiple results, for example, the unbiased estimator of the variance, Viterbi Algorithm, logistic regression, machine learning classification, and a lot more.

This story was originally published here: amolas.dev/posts/mean-squared-error/

Why do we minimize the mean squared error? was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.

Continuous Blackjack (i): Introduction and First Results

Alex Molas — Mon, 16 May 2022 16:38:56 GMT

Analytical and numerical perspectives of the problem of Continuous Blackjack

Photo by Erik Mclean on Unsplash

Introduction

This is the first of several posts where we will analyze the problem of playing continuous blackjack. In particular, we want to extend the work of Henry Charlesworth and Mu Zhao to other distributions. In this first post, we will reproduce the existing results and propose a way to extend them to other distributions. Then, we will analyze the case of the uniform distribution and show some interesting plots.

Problem Statement

Consider the continuous version of the blackjack game, which has the following rules:

1. Each player i starts with an empty stack S_i=[]. The value of this stack is defined as V_i=∑_{j=1}^{|S_i|} S_ji, where S_ji is the j-th number in the stack.
2. A dealer draws random values from a distribution P(x).
3. The players play the game in turns. The turn of every player starts by receiving a number from the dealer and adding it to their stack S_i. Then, the player has two options:
   a. Decide to stop playing and finish the game with their current stack. Then the player’s turn ends and it is the turn of the next player.
   b. Get another number, add it to their stack, and go back to step 3. The player can do that as many times as they want while V<1. If V>1 the player automatically loses and the turn of the next player starts.
4. The player with the highest value V wins the game.

The question is, what’s the best strategy to win this game?

Preliminaries

To start let us define F(t;a,b) to be the probability of landing in the interval R=(a,b), given that we are currently at t and we are going to keep playing if t<a. This probability can be split into two parts: (1) the probability that we land in R in the next turn, and (2) the probability that we don’t land in R on the next turn but in one of the following ones.

In mathematical terms this is

Uniform distribution

In the particular case of P=U[0,1]

And differentiating with respect to t on both sides we get

The constant K can be obtained using F(a;a,b)=b−a, so K=(b−a)e^a.

Notice how the equation doesn’t depend on the particular value of a, b, and t but on the distances between them. Thus, one can define the value W=b−a as the width of the range, and D=t−a as the distance from the current location to the lower bound. Then, the equation can be written as
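
In terms of W and D the solution reads F = W e^{−D}. The whole chain of this section, reconstructed from the surrounding text under the assumption that 0 ≤ t ≤ a < b ≤ 1 and that P(x−t) denotes the density of drawing the increment x−t, would be:

```latex
\begin{align}
F(t; a, b) &= \int_a^b P(x - t)\,dx + \int_t^a P(x - t)\,F(x; a, b)\,dx \\
\text{uniform case:}\quad F(t; a, b) &= (b - a) + \int_t^a F(x; a, b)\,dx \\
\frac{\partial F}{\partial t}(t; a, b) &= -F(t; a, b)
  \quad\Longrightarrow\quad F(t; a, b) = K e^{-t} \\
F(a; a, b) = b - a &\quad\Longrightarrow\quad K = (b - a)\,e^{a} \\
F(t; a, b) &= (b - a)\,e^{a - t} = W e^{-D}
\end{align}
```

The first integral is part (1) of the definition (landing in R on the next draw) and the second one is part (2) (landing at some x<a first and reaching R later).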

Strategies

In this section, we will analyze how to use the results derived in the last section to find the optimal strategy in different scenarios.

1 vs 1

Let’s start with the simplest case: you’re playing against only one other person. The first player gets a point if the second player busts. If the first player’s score is s, then the second player has a probability of 1−F(0;s,1) to bust, this means that our probability to win if we stay at s is 1−F(0;s,1). Of course, if we could choose our s we would choose s=1, but since this is a random process we can’t choose s. The only thing we can choose is at which α we stop drawing numbers. This α is defined by the following point: where the probability of winning given that we stick at α is the same as the probability that we win given that we draw one more number.

Specific case: uniform distribution

For this section let’s assume that P=U[0,1] and that all the players start at x=0. In this case, the condition described by the above reasoning is written as

Notice that the left side is increasing, while the right side is decreasing.

Simulation results

Before moving to theoretical results, let’s try to find the optimal threshold via simulations. To do so, we will try each possible threshold and simulate 50000 games. Then we will check the probability of the first player winning as a function of the chosen threshold. The results are plotted in the following plot:

Probability of the first player winning vs stopping threshold

As you can see, the maximum winning probability is achieved around t∗=0.6, more specifically at t∗=(0.565±0.035). To compute the optimal threshold and its deviation we have run a Monte-Carlo simulation. That is, we have generated 1000 samples of the threshold-vs-probability curve using the expected winning probability and its 99% confidence interval. Then, we have computed the optimal threshold for each of these samples, and from this set of optimal thresholds, we have computed the average and standard deviation.

Knowing the range where the optimal value lies, let’s repeat the above simulations but only with thresholds between 0.5 and 0.65. Using this smaller range we get the optimal threshold t∗=(0.568±0.021). Finally, the expected winning probability at the optimal threshold is p_win(t∗)=(0.4278±0.0011). Let’s see now if the analytical results are compatible with the simulations.
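
For readers who want to reproduce the simulation, here is a minimal sketch of one way to estimate the winning probability for a given threshold. It is not the code used for the plots above: the function names, the seed and the assumption that the second player keeps drawing until they pass the first player’s score are mine.

```python
import random


def first_player_wins(threshold: float, rng: random.Random) -> bool:
    """Play one 1-vs-1 game with draws from U[0, 1]."""
    # Player 1 keeps drawing until the stack value exceeds the threshold.
    v1 = 0.0
    while v1 <= threshold:
        v1 += rng.random()
    if v1 > 1:
        return False  # player 1 busts

    # Player 2 has to beat v1, so they keep drawing until they pass it.
    v2 = 0.0
    while v2 <= v1:
        v2 += rng.random()
    return v2 > 1  # player 1 wins only if player 2 busts


def win_probability(threshold: float, n_games: int = 50_000, seed: int = 0) -> float:
    """Fraction of simulated games won by the first player."""
    rng = random.Random(seed)
    wins = sum(first_player_wins(threshold, rng) for _ in range(n_games))
    return wins / n_games


for t in (0.3, 0.57, 0.8):
    print(t, win_probability(t))
```

Scanning a grid of thresholds with win_probability gives a curve of the same kind as the one plotted above.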

Analytical results

Using the results from section Uniform Distribution and P=U[0,1], the equation for the optimal α is

Plot of the curves 1−(1−α)e^α and (1−α)−e^α(α−2)−e.

This is a non-linear equation without a closed form, but it can be solved approximately using numerical methods, and the solution is α∗≈0.5706. This means that when the accumulated value of our stack gets bigger than 0.5706 we should end our turn. This result is compatible with the range we found in our simulations.
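
One simple way to solve the equation numerically is bisection, using the two curves from the plot caption above as the left and right sides. This is a sketch under the assumption that the optimum is the crossing point of 1−(1−α)e^α and (1−α)−e^α(α−2)−e in [0, 1]; the function names are hypothetical.

```python
import math


def lhs(alpha: float) -> float:
    # Probability of winning if we stick at alpha: 1 - F(0; alpha, 1).
    return 1 - (1 - alpha) * math.exp(alpha)


def rhs(alpha: float) -> float:
    # Probability of winning if we draw one more number from alpha.
    return (1 - alpha) - math.exp(alpha) * (alpha - 2) - math.e


def solve(lo: float = 0.0, hi: float = 1.0, tol: float = 1e-10) -> float:
    """Bisection on lhs - rhs, which is negative at 0 and positive at 1."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if lhs(mid) - rhs(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


alpha_star = solve()
print(round(alpha_star, 4))  # 0.5706
```

Bisection is enough here because the left side is increasing and the right side is decreasing, so the difference changes sign exactly once.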

Conclusions

In this first post, we have derived the generic equation for the probability of falling in a range (a,b) given that we’re currently at t. We have also derived the condition that the optimal thresholds have to fulfill. Finally, we have studied the particular case of the uniform distribution — both numerically and analytically — and studied its properties.

In the following posts, we will study how all these results change when the distribution P changes. We will also study the case where we play against N players instead of only one.

This story was originally published here.

Continuous Blackjack (i): Introduction and First Results was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.

About data science and software engineering

Alex Molas — Thu, 12 May 2022 13:57:40 GMT

Why applying good software practices is important to become a better data scientist.

I still remember when I started working as an intern data scientist and I had to open my first pull request. I was very happy with my work, so I just committed all my changes together, opened a pull request, assigned it to my supervisor, and started waiting for the compliments to arrive — spoiler: the compliments never arrived.

Just to put things in context, let me explain that I’m a physicist and I taught myself to code. Also, up to that moment I had always worked alone, so I had never had to share my code with anyone.

Going back to my first PR, I remember my supervisor coming directly to my desk and having the following conversation

WTF is this? — said my supervisor while pointing at a method with more than 200 lines of code.

– I’m training a machine learning model. — I, with a smirk on my face, answered proudly.

– Yeah, but it looks like you’re also reading some datasets here and there – he replied without smiling at all.

– Yeah, of course, I’m reading the data, splitting it in train and test, preprocessing it, and then training a model — at this point, the smirk of pride in my face already had disappeared.

– These are not good practices.

– These are not what?

– OMG, it shows that you are a physicist…

After that, he patiently spent the following months explaining to me some basic software engineering best practices, how to use git properly, and how to work in collaborative software projects. In the beginning, I didn’t understand why all those things were important, after all, I was a data scientist doing some cool machine learning models and not a software developer, so all these things didn’t apply to me. But, after some years of experience as a data scientist, now I have a different opinion, and now I’m convinced that these things matter a lot, even if you are a data scientist.

Why best practices matter

It’s widely known why best practices matter in pure software engineering projects, however, it’s not that clear why these practices matter when doing data science. Here I will try to explain why best software practices are crucial when developing a data science project.

For speed

When starting a new data science project, we are all excited to quickly reach final results. However, this mentality makes you slower in the long term. One of my co-workers told me some years ago: “data scientists with physics background do great work during the first steps of a project. They have great ideas and usually obtain good results in a few days. However, after this first burst of creativity and results, they get stuck, because the code they have written is now unmanageable, and any tiny change requires a lot of work.” If you want to avoid that, following best practices and writing good code is going to help you to experiment better and faster.

Let me give an example to illustrate my point: imagine that you have a model that predicts the number of sales of your company. However, someone asks you to split the number of predicted sales into national and international sales. If your code was not well designed, you are going to spend a lot of hours changing a huge amount of code to make it work under these new requirements. However, if before developing your project you had spent some time designing its architecture, now you would be able to try all the different variations of your experiment very fast and without fear of introducing bugs.

On the other hand, I’ve met a lot of data scientists that believe that data engineers are going to do all this work for them, and from my experience, this never works like that. Data engineers are not data scientists’ minions, and they have their priorities and motivations, and usually, these motivations are not to clean your dirty code and make it scalable.

Also, you are going to share your projects with your colleagues, who will need to read and understand your code. Data science and machine learning are difficult on their own, so let’s try to not make them more difficult by writing bad software and making things obscure.

From my experience, if you want to get great results from your experiments you need to invest time in software best practices.

For science

As the name indicates, data scientists follow the scientific method, and by definition, the scientific method tests hypotheses with experiments. Therefore, as a data scientist, you should pay special attention to how you test your hypotheses. This implies several things, such as using the correct mathematical framework, using the correct data, and having a well-designed setup. Here I want to focus on the last one.

Imagine a biologist experimenting in a cluttered lab, without properly labeling the different samples they are using, or without following the tool’s instructions. Imagine an astrophysicist trying to discover new exoplanets without following a strategy, just by pointing the telescope into random locations. Imagine the physicists at CERN fixing the LHC using duct tape. It’s clear that in all these cases, the outcome of the experiments is going to be probably useless. And while serendipity is a huge driver of science, it’s always followed by a meticulous scientific process. However, I’ve seen a lot of data experts making claims based on experiments with really badly designed scientific setups.

Curiously, all data scientists are aware that selecting the correct statistical tests or selecting the correct model is of capital importance for their produced work, but when it comes to their project architecture they don’t care at all. It is as if doctors were going to perform the most advanced and difficult operation in history, and after years of preparation and study they carry out the operation in a room with an old bed and with unsharpened scalpels.

There’s, again, this widespread idea that data scientists should focus only on writing notebooks and scripts to prove hypotheses as fast as possible, and if the results they found are interesting then someone else is going to refactor and clean all their code. My point here is that it’s almost impossible to prove hypotheses without the proper experimental setup, so if you want to be a good data scientist you should take care of your experimental setup.

Let me finish with one of my favorite quotes about machine learning and software development

Do machine learning like the great engineer you are, not like the great machine learning expert you aren’t.
From “Machine Learning Rules” by Google

This story was originally posted here.

Feynman’s Restaurant Problem

Alex Molas — Mon, 09 May 2022 09:56:35 GMT

Introduction and solution to Feynman’s Restaurant Problem from a RecSys perspective

This story was originally posted here.

Photo by shawnanggg on Unsplash

You’re on holiday, and you’re going to spend the following days on a remote island in the Pacific. There are several restaurants and you would like to enjoy the local cuisine as much as possible. The problem you face is that a priori you don’t know which restaurants you are going to enjoy and which not, and you can’t find the restaurants on Yelp, so you can’t use others’ opinions to decide which restaurants to visit. Also, the number of restaurants is bigger than the number of days you’re going to stay on the island, so you can’t try all the restaurants and find the best one. However, since you love math you decide to find the best strategy to optimize your experience during your holidays. This is known as Feynman’s Restaurant Problem.

This same problem can be interpreted from the perspective of the restaurant, where cooks want to recommend dishes to clients but they don’t know which dishes the clients are going to enjoy. So this problem falls under the category of recommender systems. More generically, this is the problem of recommending M items (with repetition) from a set of N items with an unknown rating.

The content of this post is heavily inspired by this solution. I have tried to explain some details which were obscure to me, and I also added some plots and code to give more intuition about the problem. Here you can read more about the history of the problem.

Mathematical formulation

Let’s try to formalize the problem using maths. First of all, let’s define D as the number of days you’re going to spend in the city and N as the number of restaurants. Let’s assume that you can rank all the restaurants with respect to your taste, and let’s call r_i the ranking of restaurant i. Let’s assume also that you can go to the same restaurant every day without getting tired of it, which means that if you know the best restaurant in the city you are going to go there always.

Notice that since D<N you can’t try all the restaurants in the city, so you’ll never know if you have visited the best restaurant.

Notice that you will never know the actual rating r_i. You only know if a restaurant is the best up to that moment or not. You can rank the restaurants that you have tried up to a given moment, but this “partial” ranking may not be the same as the “absolute” ranking. For example, if you have only tried 4 out of 10 restaurants you could have the rank [3, 2, 1, 4, ?, ?, ?, ?, ?, ?], but the real rank could be [5, 4, 3, 6, 1, 2, 7, 8, 9, 10].

The function you want to optimize is

where r_i is the rating of the restaurant you visit on day i.

Solution

Analytical

Every day you stay in the city you have two options: (1) try a new restaurant, or (2) go to the best restaurant you visited until that moment. We can think about this problem as an exploration-exploitation problem, this is, you explore the city for the first M days, and after that, you always go to the best place up to that moment for the following D−M nights.

Therefore, we can split the function to optimize as

where b_M,N is the ranking of the best restaurant you have tried during the first M days.

The only free parameter in our equation is M, so you want to find the value of M where the expected profit is maximized. This is

For the first term, applying linearity and knowing that ⟨r_i⟩ = (N+1)/2 we get

Now we need to compute ⟨b_M,N⟩, which is the expected maximum value obtained after M draws from the range (1,N).

On one hand, we know that if you only try one restaurant the expected ranking is ⟨b_1,N⟩ = (N+1) / 2. On the other hand, if you try all the restaurants, the expected maximum ranking is of course ⟨b_N,N⟩ = N.

We can also compute ⟨b_2,N⟩. In this case, there only exists 1 pair of restaurants where 2 is the maximum, ie you choose the restaurants (1,2). There only exist 2 pairs of restaurants where 3 is maximum, ie (1,3) and (2,3). There only exist 3 pairs of restaurants where 4 is maximum, ie (1,4), (2,4), and (3,4). And so on. All together there are N(N−1)/2 possible pairs. Therefore

Now consider ⟨b_N−1,N⟩. This is, you try all the restaurants in the city except one. In this case, you’ll visit the best restaurant in N−1 cases, and in only one case you’ll skip the best restaurant. Therefore, the expected value is

From all these results, one can see the pattern and guess that
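
The closed form that fits all of these cases is ⟨b_M,N⟩ = M(N+1)/(M+1). A quick way to gain confidence in the guess is to enumerate every possible set of M distinct restaurants for a small N and compute the exact average of the maximum. The sketch below does that with the standard library only; expected_best is a hypothetical helper, and it assumes the M restaurants are chosen uniformly at random without repetition.

```python
from fractions import Fraction
from itertools import combinations


def expected_best(m: int, n: int) -> Fraction:
    """Exact expected maximum of m distinct ranks drawn from 1..n."""
    subsets = list(combinations(range(1, n + 1), m))
    return Fraction(sum(max(s) for s in subsets), len(subsets))


n = 8
for m in range(1, n + 1):
    # Compare the brute-force value with the guessed closed form.
    assert expected_best(m, n) == Fraction(m * (n + 1), m + 1)

print(expected_best(2, n))  # 6, i.e. 2 * (8 + 1) / 3
```

Using Fraction keeps the comparison exact, so the check doesn’t depend on floating-point tolerances.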

Putting it all together we finally have

which has a maximum at

Notice that the result doesn’t depend on N. This means that you don’t care about how many different restaurants are in the city, which sounds -at least for me- a little bit counterintuitive.
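
For reference, here is the whole derivation of this section in one place. It is reconstructed from the steps described in the text (the symbols are the ones defined above), so take it as a summary rather than as new material.

```latex
\begin{align}
F &= \sum_{i=1}^{D} r_i = \sum_{i=1}^{M} r_i + (D - M)\, b_{M,N} \\
\langle F \rangle &= M\,\frac{N+1}{2} + (D - M)\,\langle b_{M,N} \rangle \\
\langle b_{2,N} \rangle &= \frac{\sum_{k=2}^{N} k\,(k-1)}{N(N-1)/2} = \frac{2(N+1)}{3} \\
\langle b_{N-1,N} \rangle &= \frac{(N-1)\,N + 1\cdot(N-1)}{N} = \frac{(N-1)(N+1)}{N} \\
\langle b_{M,N} \rangle &= \frac{M\,(N+1)}{M+1} \\
\langle F \rangle &= (N+1)\left[\frac{M}{2} + \frac{(D - M)\,M}{M+1}\right] \\
\frac{\partial \langle F \rangle}{\partial M} = 0
  &\;\Longrightarrow\; M^2 + 2M - (2D + 1) = 0
  \;\Longrightarrow\; M^* = \sqrt{2(D+1)} - 1
\end{align}
```

The factor (N+1) multiplies the whole expression, which is why N drops out when looking for the maximum.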

Notice also that if you want to try all the restaurants in the city without decreasing the expected profit, the optimal M must be at least N, so you’ll need to stay in the city D ≥ (N+1)^2/2 − 1 days. So if the city has 10 restaurants you’ll need to stay in the city for at least 60 days: exploring the city for the first 10 days, and during the following 50 days going to the best restaurant. Please, don’t use these results to plan your next vacation.
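The arithmetic behind the 60-day figure can be sketched in a few lines. This assumes the optimal number of exploration days is √(2(D+1)) − 1, as in the analytical result. Both function names are hypothetical.

```python
import math


def optimal_exploration_days(n_days: int) -> float:
    """Continuous optimum for M, assuming M = sqrt(2 (D + 1)) - 1."""
    return math.sqrt(2 * (n_days + 1)) - 1


def min_days_to_try_all(n_restaurants: int) -> int:
    """Smallest whole number of days D with D >= (N + 1)^2 / 2 - 1."""
    return math.ceil((n_restaurants + 1) ** 2 / 2 - 1)


d = min_days_to_try_all(10)
print(d)
# With one day less, the optimal M falls below N = 10.
print(optimal_exploration_days(d), optimal_exploration_days(d - 1))
```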

Numerical

In the last section, we derived an analytical solution to our problem. Let’s now run some simulations to derive more intuition about this problem. In particular, it seems surprising that the solution doesn’t depend on N. So let’s see if the simulations support this claim.

With the following snippet, one can simulate the expected profit ⟨F⟩ for a set of parameters

import numpy as np

def expected_profit(n_days: int, n_restaurants: int, n_experiments: int = 10000) -> dict:
    """
    Computes the average profit for each possible number of
    exploration days m in range(1, n_days + 1) over n_experiments.

    :param n_days: number of days spent in the city.
    :param n_restaurants: number of restaurants in the city.
    :param n_experiments: number of experiments to perform.
    :return: dict mapping m to the average profit.
    """

    results = {}

    for m in range(1, n_days + 1):
        profits = []
        for _ in range(n_experiments):
            # Random visiting order; rankings run from 1 to N, as in the analytical section.
            ranks = np.random.permutation(n_restaurants) + 1
            explored = ranks[:m]
            # Explore for m days, then go to the best place found for the remaining days.
            profit = explored.sum() + (n_days - m) * explored.max()
            profits.append(profit)
        results[m] = np.mean(profits)
    return results

Using this snippet we have generated the following plot, using N=(100, 110, 120) and D=10. Notice how the maxima of the three curves fall at the same value of M, which supports the counterintuitive analytical result.

Expected profit vs the number of exploration days. Original work.
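As a cross-check, the simulated average can be compared with the closed form directly. The sketch below is a vectorized variant, not the code used for the plot. The names `simulate_profit` and `closed_form_profit` are hypothetical, the seed is fixed only for reproducibility, and the closed form assumes ⟨b_M,N⟩ = M(N+1)/(M+1).

```python
import numpy as np


def simulate_profit(m, n_days, n_restaurants, n_experiments=20000, seed=0):
    """Average profit over n_experiments random visiting orders."""
    rng = np.random.default_rng(seed)
    # Sorting uniform noise row by row yields one random permutation per experiment;
    # keep the first m restaurants and shift so rankings run from 1 to N.
    noise = rng.random((n_experiments, n_restaurants))
    explored = noise.argsort(axis=1)[:, :m] + 1
    profit = explored.sum(axis=1) + (n_days - m) * explored.max(axis=1)
    return float(profit.mean())


def closed_form_profit(m, n_days, n_restaurants):
    """Expected profit assuming <b_M,N> = M (N + 1) / (M + 1)."""
    return (n_restaurants + 1) * (m / 2 + (n_days - m) * m / (m + 1))


n_days = 10
for n in (100, 110, 120):
    for m in (2, 4, 8):
        simulated = simulate_profit(m, n_days, n)
        exact = closed_form_profit(m, n_days, n)
        print(n, m, round(simulated, 1), round(exact, 1))
```

The two columns should agree up to sampling noise, which shrinks as `n_experiments` grows.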

Conclusions

In this post, we have explored Feynman’s Restaurant Problem. First, we have derived an analytical solution for the optimal exploration strategy, and then we have checked the analytical results with some simulations. Although these results make sense from a mathematical point of view, no one in their right mind would follow the optimal strategy. This is probably caused by the unrealistic assumptions we made, for example that you can go to the same restaurant every day without getting tired of it. One possible solution is to make the rating of a restaurant r_i depend on the number of times you’ve visited it, ie r_i(n). However, this is outside the scope of this post, but maybe it can serve as inspiration for another one.

Feynman’s Restaurant Problem was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.