Working at Storefront, we’re always excited about new ways we can keep our users engaged. Recommendations or suggestions are a fantastic way for a platform to encourage users to stick around and keep browsing. The problem is, recommendations can be tricky.
How do we know what to recommend to our customers? Is it similar item descriptions? Colors? Locations? It could be anything, and a linear combination of parameters grows significantly in complexity with every additional variable. SQL queries can quickly become unmanageable, tangled messes of JOINs that take minutes or even hours to generate unique recommendations for each and every user across your entire user base.
With a very small team and limited resources, we wanted a robust recommendation engine that A) wouldn’t lead users astray (random recommendations were out of the question), B) was extremely performant and C) worked with our current stack (node.js). What was our best option to accomplish this?
Introducing the Graph
If you have a background in computer science or mathematics, you’re probably familiar with graph theory. If not, you’ve stumbled upon the right article to acquaint yourself with it! Many big companies use graph-theoretic data structures to keep you hooked. LinkedIn’s “how you’re connected.” Facebook’s “people you may know.” You don’t need a lot of resources to begin working with graphs, and they can be tremendously useful.
Graphs, quite simply, are a way to model complex relationships between many objects. A graph exists as a collection of nodes (also known as vertices) and edges. A node is merely an abstract data point — it could represent anything. A person, a computer, a building, an intersection… or whatever you’d like. An edge connects two nodes and can optionally be directional (information only flows one way).
Here’s an example of a simple graph:
Node[Joe] : Edge[likes] → Node[Minecraft]
In this case, we have two nodes (Joe and Minecraft) and one directional edge (likes) connecting the two. What does this look like in code?
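A minimal sketch, assuming plain object literals (the exact shape here is my own, chosen to match the references described below):

```javascript
// A node is an abstract data point with lists of connected edges;
// an edge holds references to its input and output nodes.
const joe = { entity: 'person', properties: { name: 'Joe' }, input: [], output: [] };
const minecraft = { entity: 'game', properties: { name: 'Minecraft' }, input: [], output: [] };
const likes = { entity: 'likes', properties: {}, input: null, output: null };

// Four separate instructions to connect two nodes via one edge:
likes.input = joe;
likes.output = minecraft;
joe.output.push(likes);
minecraft.input.push(likes);
```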
Now we can see that joe.output references likes, likes.input and likes.output reference joe and minecraft respectively (directionally), and minecraft.input references likes.
Creating Our Data Structures
There’s a problem here, though, and it’s glaringly obvious: if we want our graph to be assembled effectively, we have to write at least four separate instructions to connect two nodes via an edge. Not only that, but our node and edge constructors (via object literals) are pretty heavy. We can simplify this with object orientation. Let’s introduce some prototypes via the ES6 Class semantics.
First, we’ll define a base prototype for shared behavior between Nodes and Edges. Let’s call it a Unit, because that’s easy. We want our Unit to be able to specify an entity (what classification of object is it?) and have properties and methods related to setting and retrieving these properties.
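A sketch of that base class (the method names are my own, modeled on the description above):

```javascript
// Base class shared by Node and Edge: stores an entity label and a
// properties object, with helpers for setting and retrieving properties.
class Unit {

  constructor(entity, properties) {
    this.entity = entity + '';
    this.properties = {};
    this.load(properties || {});
  }

  // Bulk-assign a set of properties
  load(properties) {
    Object.keys(properties).forEach(key => this.set(key, properties[key]));
    return this;
  }

  set(property, value) {
    this.properties[property] = value;
    return this;
  }

  unset(property) {
    delete this.properties[property];
    return this;
  }

  has(property) {
    return Object.prototype.hasOwnProperty.call(this.properties, property);
  }

  get(property) {
    return this.properties[property];
  }

  toString() {
    return `${this.entity} ${JSON.stringify(this.properties)}`;
  }

}
```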
Great! That’s a good start. Now we need to define our Node where we keep track of all edges, input edges, and output edges (for directionality). We also specify an unlink method to remove all connected edges.
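One possible Node along those lines (with a condensed Unit repeated so the sketch runs on its own):

```javascript
// Condensed Unit base from the previous sketch, repeated so this runs standalone
class Unit {
  constructor(entity, properties = {}) {
    this.entity = entity + '';
    this.properties = properties;
  }
  get(property) { return this.properties[property]; }
  set(property, value) { this.properties[property] = value; return this; }
}

class Node extends Unit {

  constructor(entity, properties) {
    super(entity, properties);
    this.edges = [];        // every edge touching this node
    this.inputEdges = [];   // edges pointing into this node
    this.outputEdges = [];  // edges leaving this node
  }

  // Detach every connected edge (Edge#unlink removes the edge
  // from both of its nodes' lists)
  unlink() {
    this.edges.slice().forEach(edge => edge.unlink());
    return true;
  }

}
```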
Pretty lightweight, but there’s a bunch of functionality we’ll need to include in our Edge that takes care of connecting nodes to one another. In our edge we hold an inputNode, an outputNode and whether or not the edge is a duplex link (bi-directional).
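An Edge in that spirit might look like this (again with condensed Unit and Node repeated so it runs standalone; setWeight storing the inverse distance reflects the weight/distance relationship discussed below):

```javascript
// Condensed Unit and Node from the previous sketches
class Unit {
  constructor(entity, properties = {}) { this.entity = entity + ''; this.properties = properties; }
}
class Node extends Unit {
  constructor(entity, properties) {
    super(entity, properties);
    this.edges = []; this.inputEdges = []; this.outputEdges = [];
  }
}

class Edge extends Unit {

  constructor(entity, properties) {
    super(entity, properties);
    this.inputNode = null;
    this.outputNode = null;
    this.duplex = false;  // bi-directional?
    this.distance = 1;
  }

  // Connect two nodes; a duplex edge is traversable in both directions
  link(inputNode, outputNode, duplex = false) {
    this.inputNode = inputNode;
    this.outputNode = outputNode;
    this.duplex = duplex;
    inputNode.edges.push(this);
    outputNode.edges.push(this);
    inputNode.outputEdges.push(this);
    outputNode.inputEdges.push(this);
    if (duplex) {
      inputNode.inputEdges.push(this);
      outputNode.outputEdges.push(this);
    }
    return this;
  }

  setDistance(v) { this.distance = Math.abs(parseFloat(v) || 0); return this; }
  setWeight(v) { return this.setDistance(1 / v); }  // weight = 1 / distance

  oppositeNode(node) {
    return node === this.inputNode ? this.outputNode
         : node === this.outputNode ? this.inputNode
         : null;
  }

  // Remove this edge from both nodes' edge lists
  unlink() {
    const remove = arr => {
      const i = arr.indexOf(this);
      if (i !== -1) arr.splice(i, 1);
    };
    [this.inputNode, this.outputNode].forEach(node => {
      if (!node) return;
      remove(node.edges); remove(node.inputEdges); remove(node.outputEdges);
    });
    this.inputNode = null;
    this.outputNode = null;
    return true;
  }

}
```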
Re-implementing our Graph
We can now re-implement our “Joe likes Minecraft” graph as follows:
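With classes like the ones sketched above, the assembly might read (condensed versions repeated here so it runs standalone):

```javascript
// Condensed classes from the sketches above
class Unit {
  constructor(entity, properties = {}) { this.entity = entity; this.properties = properties; }
}
class Node extends Unit {
  constructor(entity, properties) {
    super(entity, properties);
    this.edges = []; this.inputEdges = []; this.outputEdges = [];
  }
}
class Edge extends Unit {
  link(inputNode, outputNode) {
    this.inputNode = inputNode; this.outputNode = outputNode;
    inputNode.edges.push(this); outputNode.edges.push(this);
    inputNode.outputEdges.push(this); outputNode.inputEdges.push(this);
    return this;
  }
}

// N[Joe] : E[likes] → N[Minecraft] ← E[created] : N[Notch]
const joe = new Node('person', { name: 'Joe' });
const notch = new Node('person', { name: 'Notch' });
const minecraft = new Node('game', { name: 'Minecraft' });

new Edge('likes').link(joe, minecraft);
new Edge('created').link(notch, minecraft);
```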
We added a bit, so the graph now looks something like:
N[Joe] : E[likes] → N[Minecraft] ← E[created] : N[Notch]
We can increase the complexity of our graph (so it isn’t just linear!) as much as we want.
Awesome. But how would we use a structure like this to drive recommendations for our users? Let’s switch gears to apply this to our business and our product.
Mapping User Behavior
Now that we have the basic data structures of our graph set up, let’s map our users’ behavior onto a graph. At Storefront, we have at least three basic ways a user can interact with a listing: they can A) view a listing, B) favorite a listing or C) request to book a listing. We’ll use these three actions as our edges, and we’ll define two node types, user and listing.
We can assemble our graph like so:
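A sketch of that assembly, with condensed classes repeated and hypothetical rows standing in for our real database data:

```javascript
// Condensed Node/Edge from the earlier sketches
class Node {
  constructor(entity, properties = {}) {
    this.entity = entity; this.properties = properties;
    this.inputEdges = []; this.outputEdges = [];
  }
}
class Edge {
  constructor(entity) { this.entity = entity; this.distance = 1; }
  link(input, output) {
    this.inputNode = input; this.outputNode = output;
    input.outputEdges.push(this);
    output.inputEdges.push(this);
    return this;
  }
}

// Hypothetical rows as they might come out of our database
const users = [{ id: 1 }, { id: 2 }];
const listings = [{ id: 'A' }, { id: 'B' }];
const actions = [
  { userId: 1, listingId: 'A', type: 'request' },
  { userId: 2, listingId: 'A', type: 'request' },
  { userId: 2, listingId: 'B', type: 'favorite' },
  { userId: 1, listingId: 'B', type: 'view' }
];

// One node per user and listing, one edge per action
const userNodes = new Map(users.map(u => [u.id, new Node('user', u)]));
const listingNodes = new Map(listings.map(l => [l.id, new Node('listing', l)]));
actions.forEach(a => {
  new Edge(a.type).link(userNodes.get(a.userId), listingNodes.get(a.listingId));
});
```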
Voila! Keep in mind, this graph is represented entirely in memory (RAM). For very large user bases (and large numbers of actions) this can be quite a memory hog. You can shrink the memory footprint by grouping all actions of a given type between a user and a listing into a single edge, and by ignoring users who haven’t performed any actions on your website.
Before we’re ready to start giving out recommendations, we need to assign weights (or distances) to each of our user behaviors. We want more important actions to be weighted more favorably. The reason why will become clear when we put everything together. “Weight” and “distance” are inverses of each other, with a higher weight equating to a lower distance.
We’ll deal with setting distances because they’re easier to reason about when traversing graphs. Since requests are the most important, we’ll set them to a distance of 1, favorites to 2, and views to 4.
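Applied to the edges of our graph, that assignment might look like this (the lookup table and its name are my own):

```javascript
// Distance per action type: more important actions sit closer
const DISTANCE = { request: 1, favorite: 2, view: 4 };

// Edges sketched as plain objects with an entity (the action type)
const edges = [
  { entity: 'request', distance: 1 },
  { entity: 'favorite', distance: 1 },
  { entity: 'view', distance: 1 }
];

edges.forEach(edge => { edge.distance = DISTANCE[edge.entity]; });
```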
Finally, we’re ready to start driving recommendations!
Putting it All Together
In order to gain meaningful insights from our graph we’ll need to start exploring it. Why? How does this provide us with recommendations? Well, what we’re going to do is use the emergent behavior of our users to tell us which listings are related to one another. Instead of pulling a bunch of levers and adjusting SQL query parameters, we’re simply going to assume that two listings are probably similar to each other if users have interacted with both of them in a meaningful way. The more meaningful that interaction, the more likely the listings are similar.
This explains our distances (weights) for user interactions above. If User 1 requests Listing A, and User 2 requests both Listing A and Listing B, then User 1 is a total distance of 3 away from Listing B on our graph. Let’s visualize.
User 1 : Req (1) → Listing A ← Req (1) : User 2 : Req (1) → Listing B
The distance (3) is calculated as the total length of edges between the two nodes (request + request + request). What we’re going to want to do to generate recommendations is start traversing our graph outwards from our user, and find all of the closest listings in the order that they appear.
Using this algorithm we can keep track of every node we find along the way, ordered by distance. As we continue, nodes are progressively further away and (hypothetically) less relevant. For example, if another user has requested the same space as you in the past, we’re very interested in the other listings they’ve requested. We’re not nearly as interested in the listings they’ve merely viewed (which is why views sit at a distance of 4). Because viewed listings (as opposed to requested or favorited ones) are further out, they’ll be traversed far later than requested listings and won’t rank as highly in our list of recommended spaces.
I’ll save you the hassle of mastering Dijkstra’s Algorithm in this article, but I encourage you to investigate more on your own if you’re not familiar with graph traversal or need a refresher.
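For the curious, the heart of that traversal can be sketched as a minimal Dijkstra-style walk over a plain adjacency list (not the classes above; this is an illustration, not our production code):

```javascript
// Returns every reachable node ordered by total distance from `start`.
// Uses a naive "scan for the nearest unvisited node" in place of a
// real priority queue, to keep the sketch short.
function closest(adjacency, start) {
  const distances = { [start]: 0 };
  const visited = new Set();
  const result = [];
  while (true) {
    let current = null;
    for (const node of Object.keys(distances)) {
      if (!visited.has(node) && (current === null || distances[node] < distances[current])) {
        current = node;
      }
    }
    if (current === null) break;
    visited.add(current);
    result.push({ node: current, distance: distances[current] });
    // Relax every edge leaving the current node
    for (const [neighbor, dist] of adjacency[current] || []) {
      const candidate = distances[current] + dist;
      if (!(neighbor in distances) || candidate < distances[neighbor]) {
        distances[neighbor] = candidate;
      }
    }
  }
  return result;
}

// The "User 1 is a distance of 3 from Listing B" example, with every
// request edge traversable in both directions at distance 1:
const adjacency = {
  'User 1':    [['Listing A', 1]],
  'Listing A': [['User 1', 1], ['User 2', 1]],
  'User 2':    [['Listing A', 1], ['Listing B', 1]],
  'Listing B': [['User 2', 1]]
};

closest(adjacency, 'User 1');
// Listing B comes back last, at distance 3 (request + request + request)
```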
If you were worried you’d have to figure all of this out on your own, we have you covered. UnitGraph is a lightweight io.js (node.js) graph library that implements the data structures outlined here. It provides a Graph wrapper that lets you find the closest nodes on a graph or trace the shortest route between two nodes, and it’s what we’re currently using in production.
Generating our graph now looks like the following:
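I can’t reproduce our exact production job here, but a self-contained sketch with a toy Graph wrapper conveys the shape of it (the method names and rows below are hypothetical, and the real UnitGraph API may differ):

```javascript
// Toy stand-in for a Graph wrapper: nodes keyed by entity + id,
// edges carrying a distance per action type.
class Graph {
  constructor() { this.nodes = []; this.edges = []; this.lookup = new Map(); }
  createNode(entity, properties) {
    const node = { entity, properties, edges: [] };
    this.nodes.push(node);
    this.lookup.set(entity + '/' + properties.id, node);
    return node;
  }
  createEdge(entity, distance, from, to) {
    const edge = { entity, distance, from, to };
    this.edges.push(edge);
    from.edges.push(edge);
    to.edges.push(edge);
    return edge;
  }
  find(entity, id) { return this.lookup.get(entity + '/' + id); }
}

// Nightly job: hypothetical rows loaded from our database
const users = [{ id: 1 }, { id: 2 }];
const listings = [{ id: 'A' }, { id: 'B' }];
const actions = [
  { userId: 1, listingId: 'A', type: 'request' },
  { userId: 2, listingId: 'A', type: 'request' },
  { userId: 2, listingId: 'B', type: 'view' }
];
const DISTANCE = { request: 1, favorite: 2, view: 4 };

const graph = new Graph();
users.forEach(u => graph.createNode('user', u));
listings.forEach(l => graph.createNode('listing', l));
actions.forEach(a => graph.createEdge(
  a.type, DISTANCE[a.type],
  graph.find('user', a.userId), graph.find('listing', a.listingId)
));
```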
And here’s an example of querying for the 100 closest listings to a user on the graph:
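A self-contained stand-in for that query (the real graph.closest() signature may differ; here nodes are plain string ids, count caps the results, and minDepth filters out listings closer than the given distance):

```javascript
// Dijkstra-style walk returning up to `count` listing nodes at
// distance >= minDepth from the start node, nearest first.
function closest(edges, startId, { count, minDepth = 0, entity = 'listing' } = {}) {
  const dist = new Map([[startId, 0]]);
  const visited = new Set();
  const out = [];
  while (out.length < count) {
    let current = null;
    for (const [id, d] of dist) {
      if (!visited.has(id) && (current === null || d < dist.get(current))) current = id;
    }
    if (current === null) break;
    visited.add(current);
    if (current !== startId && current.startsWith(entity) && dist.get(current) >= minDepth) {
      out.push({ id: current, distance: dist.get(current) });
    }
    // Edges are traversable in both directions for recommendations
    for (const e of edges) {
      const neighbor = e.from === current ? e.to : e.to === current ? e.from : null;
      if (neighbor === null) continue;
      const candidate = dist.get(current) + e.distance;
      if (!dist.has(neighbor) || candidate < dist.get(neighbor)) dist.set(neighbor, candidate);
    }
  }
  return out;
}

// user/1 requested listing/A; user/2 requested both A and B
const edges = [
  { from: 'user/1', to: 'listing/A', distance: 1 },
  { from: 'user/2', to: 'listing/A', distance: 1 },
  { from: 'user/2', to: 'listing/B', distance: 1 }
];

closest(edges, 'user/1', { count: 100, minDepth: 3 });
// listing/A (distance 1) is filtered out; listing/B (distance 3) remains
```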
You’ll notice minDepth in the graph.closest() call is set to 3. This is to prevent recommending listings to a user that the user has already favorited or requested (distances of 1 and 2, respectively). A distance of at least 3 guarantees we’ll only be recommending listings to users that they’ve at least already viewed.
With UnitGraph we’re able to effectively generate recommendations for our entire user base in <1ms per user on average. It’s fast. It gets the job done. The heaviest operations are generating the graph from our database (scheduled nightly) and saving / loading the graph to / from hard storage. In the week we’ve been running recommended and related listings, we’ve seen a 5–10% boost in listing views attributed to the recommendation engine alone.
Using a graph to generate our recommendations has saved us a considerable amount of engineering resources. Building a small library in-house that integrates with our current stack took less than two weeks from conception to production integration. The wisdom of the crowd (our users) is incorporated into the graph by design, so our recommendations adjust themselves as we onboard new listings and users. The engine is self-tuning, which is the best thing we can ask for with a small team: it saves us from having to tweak parameters manually.
The solution isn’t perfect. It won’t always get recommendations right (especially for users whose behavior diverges from the average), but it’s a great V1 that we can sit on for a while (tackling other low-hanging fruit) before we worry about optimization.
“But,” a concerned developer may ask, “will it scale?” The answer is that we’ll cross that bridge when we come to it. We certainly have room to keep it operational and scale vertically for the foreseeable future. There’s a good chance we’ll need more in-depth, mature tooling as we grow (perhaps Neo4j), but the goal of this article is to share our experience managing resources and building a tool with a good ROI at an earlier-stage startup.
Thanks for taking the time to read, I hope our work is helpful. If you have any questions or comments, you can reach me on Twitter (@keithwhor), check out my personal website, or shout “hey” if you ever see me walking the streets of the beautiful San Francisco.
Finally, I would like to extend a special thanks to our engineering team at Storefront: Scott Gamble, Kelly Mothershaw, Trevor Strieber, Arthur Thornton and Jules Walter for their awesome work. This article wouldn’t be possible without them.