Purrr functionality -Case study #2: Iterating over multiple trials.

Julian Tagell
4 min readMay 7, 2019

--

The birthday problem

Put simply:

What is the chance that two people will have the same birthday… for any n-sized group of people?

Okay, I had heard about this question for many years… but I recently read about it in an article and thought, “hey, I just reckon I might have the R chops to figure out a solution to this!”

And by this I don’t mean the “proper way” by working it out via the combinatrix / algebra method. I mean running a series of trials. Is this what people call Monte Carlo experiments? I’ve always been confused by this.

Let’s get started by thinking about a group of 30 people…

library(tidyverse)
size_of_group <- 30

The first test:

set.seed(123) # for reproducibilitytibble(person_id = 1:size_of_group) %>%
mutate(birthday = sample(1:365, n(), replace = T)) %>%
mutate(other_dates = list(.)) %>%
unnest(other_dates) %>%
filter(person_id != person_id1) %>%
filter(birthday == birthday1) %>%
nrow() > 0

Line by line:

tibble(person_id = 1:size_of_group) %>%

each row is a person

mutate(birthday = sample(1:365, n(), replace = T)) %>%

each person gets a random birthday

mutate(other_dates = list(.)) %>%

take the whole dataframe and plonk it into a cell for each row

unnest(other_dates) %>%

expands every cell of this new nested data frame column (so we have a row for each person pair combination)

filter(person_id != person_id1) %>%

let’s not count when each person is matched against themself

filter(birthday == birthday1) %>% 
nrow() > 0

only keep the birthday matches… and “do we have any rows left?” If so, Hazaar! we had a match!.. how wonderful!? If not, oh well, what will be will be…

Now, let’s be smart cookies and turn this single trial process into a function:

test_if_sameness <- function(size_of_group) {tibble(person_id = 1:size_of_group) %>% 
mutate(birthday = sample(1:365, n(), replace = T)) %>%
mutate(other_dates = list(.)) %>%
unnest(other_dates) %>%
filter(person_id != person_id1) %>%
filter(birthday == birthday1) %>%
nrow() > 0
}

Okay let’s set up our experiment parameters.

We want to test all sizes of groups, from 2 up to 60 people

size_of_group <- 2:60

and instead of just running one trial, let’s run a whole heap

number_of_trials <- 100

Let’s put these parameters all together using the crossing() function.

iterations_table <- tibble(group_size = sizes_of_group) %>% 
crossing(trial_id = 1:number_of_trials)

Then we want to run our function for each trial of the different sized groups… and for that we can use purrr::map.

Because we know that the outcome of our function is either a TRUE or FALSE, we can use map_lgl… and then follow that up with a summarisation…

iterations_table <- iterations_table %>%
mutate(outcome = map_lgl(group_size, ~test_if_sameness(.x))) %>%
group_by(group_size) %>%
summarise(count_of_success = sum(outcome),
percent_success = count_of_success/number_of_trials)

Nice one. Let’s chart our results!

iterations_table %>% 
ggplot() +
aes(group_size, percent_success) +
geom_line(stat = "identity")

Groovy! We’ve got a bit of a picture of what happens…

but this curve is spiking and undulating because the number of trials for each group size isn’t sufficient to weed out those anomalous “lucky” and “unlucky” runs in our trials.

As an interview questions, I think the usual challenge is to determine at which group size you will have a “better than even” chance of the group having a matching pair.

But we have a problem. Wrapping the above trial process in tictoc::tic() and tictoc::toc() we find it takes ~22.5 seconds. We know our 100 trials per group size is not sufficient to give us a totally reliable percentage indicator.

If we boost our number of trials (per group size) to 1,000 we can probably expect the process to take 10 times as long (or ~3.75 minutes). It does. I checked. But this is too long (I hate to imagine what a for loop takes).

So instead we are going to try out the furrr package, which apparently does the mapping operations in parallel rather than in sequence.

library(furrr)
future::plan(multiprocess)
sample <- tibble(group_size = sizes_of_group) %>%
crossing(trial_id = 1:number_of_trials) %>%
mutate(outcome = furrr::future_map_lgl(group_size, ~test_if_sameness(.x))) %>%
group_by(group_size) %>%
summarise(count_of_success = sum(outcome),
percent_success = count_of_success/number_of_trials)

And it is faster (takes only 67 seconds)… and the line is much smoother. Great!

Well, almost great. I know (because I cheated) that the answer is actually 23. And for this latest 1000 trial process… that’s not the answer we got.

So maybe we could run an experiment only focusing on those group sizes that are close to 23… and to up the number_of_trials to something crazy… Or conversely, we could try and understand the algebraic solution… but this is a purrr::map blog post, not a probability blog post.

Hope this has been useful. Cheers

--

--