Day 33: Reservoir sampling
Published in
1 min readApr 26, 2017
Reservoir sampling is super useful when there is an endless stream of data and your goal is to grab a small sample with uniform probability.
The math behind is straightforward. Given a sample of size K
with N
items processed so far, the chance for any item to be selected is K/N
. When the next item comes in, current sample has a chance to survive K/N*N/(N+1)=K/(N+1)
while the new item has chance K/(N+1)
to be selected.
https://github.com/coells/100days
algorithm
def reservoir_sampling(size):
i, sample = 0, [] while True:
item = yield i, sample
i += 1
k = np.random.randint(0, i) if len(sample) < size:
sample.append(item)
elif k < size:
sample[k] = item
sampling
reservoir = reservoir_sampling(5)
next(reservoir)for i in range(1000):
k, sample = reservoir.send(i)
if k % 100 == 0:
print(k, sample)100 [15, 49, 54, 63, 60]
200 [180, 49, 54, 63, 197]
300 [218, 49, 54, 63, 197]
400 [218, 49, 54, 63, 197]
500 [218, 49, 54, 63, 197]
600 [519, 49, 54, 63, 197]
700 [691, 49, 54, 63, 197]
800 [691, 49, 54, 63, 197]
900 [691, 49, 54, 63, 197]
1000 [691, 49, 54, 63, 197]