Redis, set, node.
Finding sets with the same members in Redis with Node.js
Usually, I can come up with a way to accomplish a given task in short order. The solution may not be perfect or efficient the first go-round, but I can get a workable solution. Except this week. This problem has been blocking me for days.
First, the problem. Let’s say you have a number of Redis sets.
SADD ap-one 1 2 3 4 5 6 7 8
SADD ap-two 1 2 3
SADD ap-three 1 2 3
SADD ap-four 1 2 3 4 5
SADD ap-five 8 9 10
The key’s for those sets are stored in yet another set:
SADD ap-keys ap-one ap-two ap-three ap-four ap-five
Using Redis, how do you find the duplicate sets from the keys listed in ap-keys? First, you need to break it down a bit — how do you even compare if two sets are equal?
Member equality on sets
Let’s address the second question before the first. My first thought, and likely yours, was SDIFF. So, let’s try it.
> SDIFF ap-one ap-two
Good, good. Now, what happens if you run the arguments the other way round?
> SDIFF ap-two ap-one
(empty list or set)
Okay. Let’s rewind. Here is a line from the official documentation for SDIFF:
Returns the members of the set resulting from the difference between the first set and all the successive sets.
It’s not really the members that the two sets lack in common, but the subtraction, so the order of the arguments matter. So, what you’re going to need to do is run both sides of the diff and ensure that they both are returning “(empty list or set)”. Fine / good. This is Redis, so we need to think about scale. Imagine a scenario where one set has 100,000 members and the other set has 999,999 members — you’re definitely not going to want to transport that wad of data if you don’t really need it. You could use SDIFFSTORE, then clean up your mess.
> SDIFFSTORE temp1 ap-two ap-three
> SDIFFSTORE temp2 ap-three ap-two
> DEL temp1 temp2
In your application you’ll just look at the result of the DEL and see if a result is “0” if the two sets have the same members. Don’t believe me? See how it looks when you try it with two different sets:
> SDIFFSTORE temp1 ap-one ap-two
> SDIFFSTORE temp2 ap-two ap-one
> DEL temp1 temp2
This is a bit counter-intuitive so let’s dive deeper. When you run a SDIFFSTORE you create a new set that is comprised of the subtraction of the second set from the first (see above) — if there is no difference then a new set will not be created. DEL returns the number of keys deleted, so in this case we can use this to infer if SDIFFSTORE either created anything. Got it? Good.
Actually, there is a little more that can be learned from that last DEL. We now know that a result of “0” means equal members. “1” means that one of the sets is a subset of the other. What does “2” mean? Well… nothing terribly useful. It just means that they have some members in common or none.
A moment to ponder performance
The above code is not terribly efficient. Let’s look at the time complexity of what is going on:
- SDIFF = O(n), with n being the number of members in both sets combined. You’ll be running this twice.
- DEL = O(n), n being the number of members in the deleted sets.
Let’s take a terrible example case:
- Two sets, bigSet1 and bigSet2. bigSet1 has 1,000,000 members and bigSet2 has 1,000,000 members. They have one member in common.
- The first SDIFF will take O(2,000,000) and the second SDIFF will take O(2,000,000).
- The final DEL will take O(999,999) as one set you’re deleting will have the difference and the other will be thankfully empty.
At the end of this you’re running O(4,999,999). 1.5 Million Op/Sec is something to brag about, so running this thing will take a while when you scale up to huge numbers. For my use, I’m looking to run it over sets with a few hundred members max, so the performance will still be acceptable.
Find the sets with duplicate members
Above we figured out how to find the sets that have the same members without pulling them into your app. Now, let’s take a look at how to find the duplicate sets for the keys listed in ap-keys. We’ll use Node.js in conjuction with Redis since there is no real way to do it all on the Redis server (sans some serious Lua).
Here is an overview of how we’ll approach the problem:
- Grab the members of ap-keys
- Find the combination pairs of the ap-keys members
- Run our above mentioned method to determine the equality of the sets
- Make a copy of the ap-keys and remove any duplicates if they are found
I’ll employ the usual suspects here — lodash, async, and node-redis. The pairwise function is a neat functional progamming implementation I found on Code Review. I’m sure there are more optimized implementations, but this one is nice an readable. The pairwise function will pair off the different combinations possible without repeating and without simple reversals. With our above data, it’ll look something like this JSON object:
After EXEC’ing the sequences, the results will come back with a series of numbers. This is a normal Redis “multi bulk” response. We don’t care so much about the 1n or 2n reply, the only important one is the 3n reply. See the bolded numbers below:
[ 0, 0, 0, 0, 5, 1, 0, 2, 1, 0, 5, 1, 0, 2, 1, 3, 0, 1 ]
So, we can use a for loop over the responses starting at the 2nd element (zero-based index) and incrementing by 3. If it is a zero response, we know we have member equality and we’ll push true into a new array, otherwise we’ll push false in.
To verify my own results, I built a simple tree view that shows the combinations and equality. It’s really just for debugging but the whole thing seems so opaque without it. Here is what the tree looks like:
(comments with values are mine)
Now that we have the member equality of all the combinations of sets, we can go to the array and find any already represented sets with different key names and eliminate them from the remaining keys. This will finally result in the keys of the unique member sets in our collection.
Here is the whole script: