Day 33: Reservoir sampling

Tomáš Bouda
100 days of algorithms
1 min readApr 26, 2017

Reservoir sampling is super useful when there is an endless stream of data and your goal is to grab a small sample with uniform probability.

The math behind is straightforward. Given a sample of size K with N items processed so far, the chance for any item to be selected is K/N. When the next item comes in, current sample has a chance to survive K/N*N/(N+1)=K/(N+1) while the new item has chance K/(N+1) to be selected.

https://github.com/coells/100days

algorithm

def reservoir_sampling(size):
i, sample = 0, []
while True:
item = yield i, sample

i += 1
k = np.random.randint(0, i)
if len(sample) < size:
sample.append(item)
elif k < size:
sample[k] = item

sampling

reservoir = reservoir_sampling(5)
next(reservoir)
for i in range(1000):
k, sample = reservoir.send(i)
if k % 100 == 0:
print(k, sample)
100 [15, 49, 54, 63, 60]
200 [180, 49, 54, 63, 197]
300 [218, 49, 54, 63, 197]
400 [218, 49, 54, 63, 197]
500 [218, 49, 54, 63, 197]
600 [519, 49, 54, 63, 197]
700 [691, 49, 54, 63, 197]
800 [691, 49, 54, 63, 197]
900 [691, 49, 54, 63, 197]
1000 [691, 49, 54, 63, 197]

--

--