Reservoir Sampling

Keeping fixed-size random samples of streams

Jesse Bridgewater
3AM Data Questions

--

This is a cute algorithm that you do not often need, but it comes in handy once in a while. Suppose you have a fixed-size sample budget (1000 pageviews, tweets, etc.) and a large stream of data that you want to sample (100,000,000 queries per day). You could wait until you have seen the whole stream, load it into memory, and take a simple random sample, but that becomes impractical as the data gets larger. With reservoir sampling you never need more memory than your sample requires, and you still have a statistically valid sample at every step.

Key ideas:

  1. Add the first K elements of your stream to the sample.
  2. For each later element i, replace a uniformly random element of the sample with the current element with probability K/i, the ratio of the sample size to the number of elements seen so far.

This is concisely summarized by the following R code. R is a crazy language to write a reservoir sampler in, but I'll leave that for another post.

K <- 500   # reservoir (sample) size
N <- 10000 # stream length

# sort the normal random draws:
# sampling does not care which observations come first
data <- sort(rnorm(N))

# step 1: the first K elements fill the reservoir
res <- data[1:K]

# step 2: each later element i replaces a random slot with probability K / i
for (i in (K + 1):N) {
  rn <- sample.int(i, 1)
  if (rn <= K)
    res[rn] <- data[i]
}
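The claim that this yields a statistically valid sample at every step is easy to check empirically. Here is a small sketch (my addition, not part of the original post) that wraps the same logic in a function and confirms that every stream position lands in the reservoir with probability roughly K/N:

# wrap the sampler so it can be run repeatedly
reservoir_sample <- function(data, K) {
  res <- data[1:K]
  for (i in (K + 1):length(data)) {
    rn <- sample.int(i, 1)
    if (rn <= K)
      res[rn] <- data[i]
  }
  res
}

# sample the stream 1:N many times and count how often each element is kept;
# every element should be kept in about K / N = 5% of the runs
K <- 50
N <- 1000
runs <- 2000
counts <- numeric(N)
for (r in 1:runs) {
  kept <- reservoir_sample(1:N, K)
  counts[kept] <- counts[kept] + 1
}
summary(counts / runs)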

The output of the reservoir sampler is normal, as seen in the Q-Q plot of the sample against the normal distribution.
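The plot itself is not reproduced here, but a Q-Q plot like that can be made with base R, assuming res still holds the sample from the code above:

qqnorm(res, main = "Reservoir sample vs. normal quantiles")
qqline(res)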

Depending on your needs, you might want to hash rather than sample. Oscar Boykin has a nice talk with that as a theme.
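To sketch what that looks like (my own illustration, not something from the talk): hash a stable event ID and keep the event whenever the hash falls below a threshold, so the keep/drop decision is deterministic and repeatable across machines and reruns. The digest package is assumed here:

library(digest)

# keep an event iff the hash of its ID lands in the lowest `rate` fraction
# of the hash space; the same ID always gets the same decision
keep_event <- function(id, rate = 0.01) {
  h <- digest(id)                                # md5 hex string
  bucket <- strtoi(substr(h, 1, 6), base = 16L)  # first 24 bits as an integer
  bucket / 16^6 < rate
}

keep_event("query-12345")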

The input data is normal, but I sort it before I feed it to the reservoir sampler just to show that the reservoir does not care about the order of the events and still produces a normal sample.
