Day 33: Reservoir sampling

Published in

100 days of algorithms

1 min readApr 26, 2017

Reservoir sampling is super useful when there is an endless stream of data and your goal is to grab a small sample with uniform probability.

The math behind is straightforward. Given a sample of size K with N items processed so far, the chance for any item to be selected is K/N. When the next item comes in, current sample has a chance to survive K/N*N/(N+1)=K/(N+1) while the new item has chance K/(N+1) to be selected.

https://github.com/coells/100days

algorithm

def reservoir_sampling(size):
    i, sample = 0, []    while True:
        item = yield i, sample
        
        i += 1
        k = np.random.randint(0, i)        if len(sample) < size:
            sample.append(item)
        elif k < size:
            sample[k] = item

sampling

reservoir = reservoir_sampling(5)
next(reservoir)for i in range(1000):
    k, sample = reservoir.send(i)
    if k % 100 == 0:
        print(k, sample)100 [15, 49, 54, 63, 60]
200 [180, 49, 54, 63, 197]
300 [218, 49, 54, 63, 197]
400 [218, 49, 54, 63, 197]
500 [218, 49, 54, 63, 197]
600 [519, 49, 54, 63, 197]
700 [691, 49, 54, 63, 197]
800 [691, 49, 54, 63, 197]
900 [691, 49, 54, 63, 197]
1000 [691, 49, 54, 63, 197]

Day 33: Reservoir sampling

algorithm

sampling

Written by Tomáš Bouda