Approx Jaccard with Subset Sampling

Definition:

Jacc(A,B) f=|A intersect B| / |A union B|

Sk(x) = smalles k elems of X given hash h

|Y|/|X| =approx= 1/k * (Sk(x) intersect Sk(y) )

Given bottom k samples of A and B, construct bottom k of union as

Sk(A union B) = Sk( Sk(A) union Sk(B) )

Jaccc f =approx= 1/k * |Sk(A union B) intersect Sk(A) intersect Sk(B)|

Mikkel Thorup: Bottom-k and Priority Sampling, Set Similarity and Subset Sums with Minimal Independence

Show your support

Clapping shows how much you appreciated Laura Dietz’s story.