But now the samples are less random and not independent.
Ankur Choudhary

Thanks for the response! Could you clarify where you see the independence violations coming from?

Suppose we have a dataset D = {d_i | i ∈ I}, where I = {0, 1, …, N} is an index set. If we permute D by drawing uniform random samples without replacement from I, we get a new index sequence I_p = (i_{p_1}, i_{p_2}, …, i_{p_N}), where each p_k ~ Uniform over the indices of I.

Elements of D taken sequentially with indices from I_p:

  • Have the same distribution as D, since we have not changed the dataset in any way
  • Are independent, because each permuted index in I_p was drawn from a uniform distribution
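The two points above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the original discussion; the variable names (`D`, `perm`, `samples`) are mine. Reading the dataset through a permuted index sequence visits every element exactly once, so the sampled multiset is identical to the original:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 10
D = np.arange(N) * 10          # toy dataset: d_i = 10 * i

# Permute the index set I = {0, ..., N-1} once, without replacement.
perm = rng.permutation(N)

# Taking elements of D sequentially through the permuted indices
# visits each element exactly once, so the empirical distribution
# of the samples is identical to that of D; only the order changes.
samples = D[perm]

assert sorted(samples.tolist()) == sorted(D.tolist())
```
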

If we shuffle the array each time before sampling, this won't make the samples any more random or independent than they already are. Almost all of the speedup comes from the fact that we generate the pseudorandom sequence of indices beforehand and do not regenerate it when that is unnecessary.
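A rough way to see where the speedup comes from: generating the whole index sequence in one vectorized call versus making one RNG call per sample from a Python loop. This is a hedged sketch with arbitrary sizes of my choosing, just to make the cost structure visible:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Pre-generated sequence: one vectorized RNG call up front.
t0 = time.perf_counter()
perm = rng.permutation(N)
t_pre = time.perf_counter() - t0

# Regenerating per draw: one RNG call per sample inside a Python loop.
t0 = time.perf_counter()
idx = [rng.integers(0, N) for _ in range(N)]
t_loop = time.perf_counter() - t0

print(f"precomputed: {t_pre:.4f}s, per-draw loop: {t_loop:.4f}s")
```

On typical hardware the per-draw loop is orders of magnitude slower, even though both approaches produce uniformly distributed indices.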

Under the hood, NumPy calls the function rk_interval when shuffling an array; it simply returns a pseudorandom number in a given interval. This means you could also just draw the next element's index for sampling from D directly as i ~ Uniform(0, N), although this is less efficient than shuffling the whole array in memory in advance.
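In modern NumPy terms, drawing an index directly looks like the sketch below (using the public `Generator.integers` API rather than the internal rk_interval). One caveat worth noting: drawing each index independently samples *with* replacement, so indices can repeat, whereas a shuffle visits every index exactly once:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10
D = np.arange(N) * 10

# Draw the next index directly: i ~ Uniform(0, N).
# Unlike a shuffle, repeated draws like this sample WITH replacement,
# so the same index may come up more than once across draws.
i = rng.integers(0, N)
sample = D[i]

assert 0 <= i < N
```
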

Like what you read? Give Kirill Dubovikov a round of applause.
