Grouping API Searches by Similarity

This week I had to solve a rather interesting problem for one of our clients. For the client in question (who I will keep anonymous) we developed a single page app consisting of a map of Ireland and search box for postal addresses. The search tool was built as a technical preview of the clients data set for potential customers and end users to explore. In the spirit of the preview and to prevent scraping, a limit on the number of searches that can be preformed in a 24 hour period was proposed and accepted as a requirement for the project.

So the core problems to solve were:

  • Counting searches per IP address and storing this state.
  • Allowing multiple Search API instances to share quota information.
  • Determining which searches are related for fair accounting.

The Search API uses a combination of Node.js and Elasticsearch, so it made sense to develop a small module for the API to determine if a search could continue or not. Using the X-Forwarded field, getting the IP address was trivial and solved one half of the first problem. To solve the other half, I chose Redis to store the IP, count, time stamp and query information in a hash using a prefix and IP address as the key. The time stamp was stored as seconds since epoch to avoid parsing, adding and subtracting dates.

{
"ip": "127.0.0.1",
"query": "123 fake street",
"count": 0,
"timestamp": 1496422792
}

By using Redis, I also solved the second problem as multiple instances of the Search API could each connect, view and query the state of each IP’s quota at the same time. The final problem to solve would be the most interesting.

So how do you tell if one search string is related to another? As far as the Search API is concerned, each request is unrelated, but if we look closely we can see a pattern emerging:

742
742 E
742 Ev
742 Eve
742 Ever
742 Everg
742 Evergr
742 Evergre
742 Evergreen
742 Evergreen
742 Evergreen Te
742 Evergreen Terr
742 Evergreen Terrac
742 Evergreen Terrace

For related searches, each successive request could be viewed as a substring of the previous. Armed with this knowledge you could be tempted to run ahead and build some sort of .substring() comparison method, but a little maths can give us a far better solution, Damerau–Levenshtein distance.

The Damerau–Levenshtein distance algorithm takes two strings as input and returns a value from 0.0 to 1.0, indicating similarity (technically the number of changes to transform string A into string B but sufficient for our purpose). So to solve the third problem, we just need to compare the last search query from a given IP (saved in Redis) with the current one. If the similarity is above a certain threshold (0.5 is a nice starting value) we can assume the searches are related. If the similarity value drops, we count this as a new search and take away from the daily quota.

As the Search API was built with Node.js, a quick search of NPM gives us the brilliant damerau-levenshtein package. Combining this with the previous logic (and a bit of Redis magic) gives us the complete solution.

> dl = require("damerau-levenshtein");
[Function]
> dl("742 Evergre", "742 Evergreen Terrace");
{ steps: 10,
relative: 0.47619047619047616,
similarity: 0.5238095238095238 }
>