Road to Redis — Chapter 2

One to many relationships

Kyle
12 min read · Mar 31, 2017

Note: This series was originally titled “From SQL to Redis.” You can read about why it’s been changed to “Road to Redis” here.

This is the second part in the Road to Redis sub-series of the Node / Redis Series. You should read the previous part first, From Table to Hash.

Relationships between data are an important concept. The key issue is that data gets repeated throughout a dataset, and repeating it is both wasteful and risky. Take our e-commerce example. In an e-commerce setting you are likely to sell items from a number of different manufacturers. The information about the manufacturer (name, address, phone number, etc.) will be the same for every item. Even if that data is only a small number of bytes, as you scale up the number of items, this could become a lot of wasted memory/storage. Perhaps more dubious is maintaining all this extra data if something changes. Let’s say that the fax number of a company changes — you would have to touch each item made by that manufacturer and update those values, which is a pain. You can make this easier with a one-to-many relationship.

First, looking at our e-commerce example, let’s examine our companies table in a SQL environment.

CREATE TABLE `companies` (
  `slug` varchar(100) NOT NULL,
  `name` varchar(100) NOT NULL,
  `address` varchar(100) NOT NULL,
  `city` varchar(100) NOT NULL,
  `state` varchar(2) NOT NULL,
  `zip` varchar(5) NOT NULL,
  `account` varchar(10) NOT NULL,
  `contact` varchar(100) NOT NULL
);
ALTER TABLE `companies` ADD UNIQUE KEY `slug` (`slug`);

Straightforward. Everything here is a varchar and is required. There is a single unique key, slug. The slug is also used by the items table to identify the manufacturer of the item.

To get the manufacturer data for each item in the items table with SQL you’ll need a join. Without getting deep into specific syntaxes, you can do an implicit join query:

SELECT
  `items`.`artist`,
  `items`.`price`,
  `items`.`manufacturer`,
  `companies`.`name`,
  `companies`.`address`,
  `companies`.`city`,
  `companies`.`state`,
  `companies`.`zip`
FROM `items`, `companies`
WHERE `items`.`manufacturer` = `companies`.`slug`;

Running this query will get all the items with their associated manufacturer’s data.

On to Redis. Let’s assume we have our company data stored in hashes, much like we did in chapter 1 with the items (redishop:companies:[company-slug]). In addition, we’ll have all the company slugs stored in a Redis set (redishop:all-manufacturer).

Now that we’ve got the setup out of the way, let’s talk about what’s great about Redis. With a database leveraging a SQL syntax, you’re abstracted from much of what goes on below the surface of your query. With Redis, you have access to more primitive storage commands that you can bend in many different ways. That being said, let’s explore a few paths of how to accomplish our one-to-many relationship.

Path 1: SORT’ing out the details

In chapter 1, we looked at the many powerful ways of using the SORT command. Now, let’s take it a step further. The SORT command can not only return results but can also store them, via the aptly named STORE sub-command. The STORE sub-command comes at the end of the SORT command and takes the output of the command and stores it as a list type. We’ve yet to explore the Redis list data type, but lists are basically linked lists. Another fun fact about the list data type is that a list can itself be sorted. Let’s see what we can do with this.

> SORT redishop:all-items BY redishop:items:*->price GET # GET redishop:items:*->price GET redishop:items:*->name

This is the same as in chapter 1. Now, a similar SORT call, but this time grabbing only the manufacturer slug and storing the result:

> SORT redishop:all-items BY redishop:items:*->price GET redishop:items:*->manufacturer STORE temp-companies

Can you see what’s happening here? I’m still sorting the redishop:all-items set by price, but I’m only getting the manufacturer slug and storing it at the key temp-companies. So now temp-companies is a list of manufacturer slugs, ordered by the price of each item.

With this information, we can then SORT again based on the newly created temp-companies list. There is a fascinating loophole in SORT — you don’t need to sort anything with SORT. Stay with me — you can use SORT just to get the values of keys with the GET sub-command. All you need to do is pass BY a pattern that contains no asterisk, which tells Redis to skip sorting. By convention, we usually use nosort for this (clear, no?). It looks like this:

> SORT temp-companies BY nosort GET redishop:companies:*->name GET redishop:companies:*->address

This will return the company name and address sorted by the price of the item. So, in essence, the first SORT command and this last SORT command are producing coordinated arrays — the 4th “item” result will correspond with the 4th “company” result.

Since this depends on coordinating two results, it’s best to contain all of this in a MULTI/EXEC block. It would look something like this:

> MULTI
> SORT redishop:all-items BY redishop:items:*->price GET # GET redishop:items:*->price GET redishop:items:*->name
> SORT redishop:all-items BY redishop:items:*->price GET redishop:items:*->manufacturer STORE temp-companies
> SORT temp-companies BY nosort GET redishop:companies:*->name GET redishop:companies:*->address
> DEL temp-companies
> EXEC

This will return 4 results, of which we’re really only interested in the 1st and 3rd. You’ll also need to untangle your results — the 1st will need to be chunked by 3 (slug, price, name) and the 3rd result by 2 (company name and company address).

All together in a script it looks like this:
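As a sketch, the untangling on the app side might look like this in plain Javascript; the chunk helper and the sample replies are illustrative, not from the actual dataset:

```javascript
// Split a flat Redis reply into rows of `size` elements.
function chunk(flat, size) {
  var rows = [];
  for (var i = 0; i < flat.length; i += size) {
    rows.push(flat.slice(i, i + size));
  }
  return rows;
}

// 1st reply of the MULTI/EXEC block: [slug, price, name, ...]
var itemsReply = ['kitty-mug', '9.99', 'Kitty Mug', 'dog-bowl', '14.50', 'Dog Bowl'];
// 3rd reply: [companyName, companyAddress, ...] in the same price order
var companiesReply = ['Konopelski Group', '123 Main St', 'Bayer and Sons', '77 Oak Ave'];

var items = chunk(itemsReply, 3).map(function (row) {
  return { slug: row[0], price: row[1], name: row[2] };
});
var companies = chunk(companiesReply, 2).map(function (row) {
  return { name: row[0], address: row[1] };
});

// Coordinated arrays: the nth item corresponds to the nth company.
var joined = items.map(function (item, i) {
  item.company = companies[i];
  return item;
});

console.log(joined[0].company.name); // Konopelski Group
```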

Path 2: A trip to the STORE

This all happens fairly quickly, but can we do anything else to make the process more efficient? Sure! In an e-commerce situation, this will be a very common request — imagine a shopper looking at your items sorted by price. In a busy store you could easily be running this 1,000x a day (or an hour!). You’re likely doing all this sorting over and over for data that hasn’t changed. What a waste!

Let’s say that any place we make a change to the items available (an item is added, updated, or removed), we pre-SORT and STORE these indexes. Then we just have to grab the values we need. It would add a small amount of time to the processing required for adding/removing items, but it would reduce the time for the much more frequent viewing action. I like to think of this as “Write Hard, Read Easy.”

Let’s see how we would do it. First, we need to change how we are adding items. If you’re following along, now would be a good time to FLUSHDB and clear everything out. This script is largely the same as before, aside from adding in the SORT command. The SORT command takes the zset and converts the item slug list into a list of company slugs (still sorted by the price of the item).
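As a sketch, the write path might build its MULTI commands like this; the key names follow the article, while buildAddItemCommands and the sample item are illustrative:

```javascript
// Build the commands run whenever an item is added ("Write Hard").
// buildAddItemCommands is an illustrative helper that returns the
// command arrays you would hand to client.multi().
function buildAddItemCommands(slug, item) {
  return [
    // Store the item's hash.
    ['hmset', 'redishop:items:' + slug,
      'name', item.name,
      'price', String(item.price),
      'manufacturer', item.manufacturer],
    // Index the slug by price in the zset.
    ['zadd', 'redishop:priceIndex', String(item.price), slug],
    // Re-materialize the correlated company-slug list, still in price order.
    ['sort', 'redishop:priceIndex', 'BY', 'nosort',
      'GET', 'redishop:items:*->manufacturer',
      'STORE', 'redishop:sorted:items-manufacturer']
  ];
}

var cmds = buildAddItemCommands('kitty-mug', {
  name: 'Kitty Mug', price: 9.99, manufacturer: 'Konopelski-Group'
});
console.log(cmds.length); // 3
```

With node_redis, `client.multi(cmds).exec(callback)` would run these atomically.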

Since all of this is happening inside of a MULTI/EXEC block, we know that nothing can interrupt the sorting and everything will stay correlated. How does this affect the way we grab our items? We can dispense with the sorting at the time of viewing and just do a couple of nosort SORTs to get the foreign key values.

> MULTI
> SORT redishop:priceIndex BY nosort GET # GET redishop:items:*->price GET redishop:items:*->name
> SORT redishop:sorted:items-manufacturer BY nosort GET redishop:companies:*->name GET redishop:companies:*->address
> EXEC

Compared to the first path, this is computationally and syntactically simpler, and it’s optimized for the more frequent activity. However, we are creating and maintaining an index, so there is a constant storage overhead.

Getting the items this way can be accomplished like this:
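A minimal sketch of the read side, expressed as command arrays in node_redis’s array-of-arrays multi form (the key names are from above; everything else is illustrative):

```javascript
// The read side of Path 2: two nosort SORTs over pre-sorted structures.
var readCommands = [
  // 1st reply: a flat [slug, price, name, ...] array (chunk by 3)
  ['sort', 'redishop:priceIndex', 'BY', 'nosort',
    'GET', '#',
    'GET', 'redishop:items:*->price',
    'GET', 'redishop:items:*->name'],
  // 2nd reply: a flat [companyName, companyAddress, ...] array (chunk by 2)
  ['sort', 'redishop:sorted:items-manufacturer', 'BY', 'nosort',
    'GET', 'redishop:companies:*->name',
    'GET', 'redishop:companies:*->address']
];

// client.multi(readCommands).exec(function (err, replies) { ... });
console.log(readCommands.length); // 2
```

The two replies stay in the same price order, so they pair up by index just like in Path 1.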

Path 3: Spider Osgood

So far, we’ve taken a look at using SORT to get the items. This is a fine method of arriving at your data, however it actually might not be the way I would go about it.

A while back, I was responsible for code quality on a particular platform. This platform had quite a few coders working on it at one time, each in their own mini-projects. The coders, who came from a range of backgrounds and skill levels, would finish a project in about 8–12 hours of solo Javascript programming. After they finished up, they’d submit it to me for a code review. It was jarring to see what was coming out of these programmers — since they were not working in teams, the code was all over the place, each project with its own quirks, strategies, and “smells.” When going over the code reviews, I started asking people how they learned to program. Some were self-taught, others had classical CS backgrounds, and most of them came to Javascript from some other language (PHP, C#, Java, etc.). I started realizing that some of the strangest code came from people who were really good at one language and had come to Javascript recently. They were, in effect, trying to write PHP (or whatever) in Javascript and yielding some hair-raising solutions. Given the topic of this series, maybe it’s best to take a step back and try not to solve the problem in precisely the same way as you would in SQL.

It’s important to keep in mind that Redis operates, with a few exceptions, as a single-threaded process. This means it can only do one command at a time — which simplifies a great number of things, but it also means that long-running commands are not ideal, as they block other clients from accessing the data. Making many short requests keeps things running smoothly. The amazing cook in this video, Spider Osgood, is the embodiment of the right way to treat Redis. I encourage you to watch the video all the way to the end; it’s actually amazing:

The 2 egg crack at 2:40 is clearly in a MULTI/EXEC block

Notice how he never stops moving and does everything at lightning speed. If he had to sit and complete a single plate of food, his total plates finished (throughput) would tank. How can our application be more like Spider?

To be more like Mr. Osgood, we should re-think our approach. We really have three steps in getting a list of products from an e-commerce store:

  1. Find the items in a particular way (sorted, in a range, etc),
  2. Get the details about those items,
  3. Get the details about the manufacturer related to the item.

Since an item’s manufacturer is located as a field/value in a hash, this would be a synchronous waterfall: step 3 depends on step 2, which depends on step 1. At first sniff, you might turn your nose up at this — working at the app level, you need to wait for transmission of your data, which adds to the overall response time. However, this actually gives us two advantages. First, it’s possible to save quite a bit of computation and data transmission, as there could be far fewer manufacturers than items — it’s possible that every item in a given list of items has the same manufacturer. Second, the transmit time between consecutive calls can actually result in a more responsive server than fewer, larger calls, as no single call monopolizes the Redis server for very long. The disadvantage is that we no longer have an atomic result, so we need to account for that on the app side in some way.

From a modularity standpoint you can also gain quite a bit. Let’s say we expose each one of these parts as an HTTP endpoint. Step 1 gets the item slugs, step 2 gets details about item(s), and step 3 gets details about manufacturer(s). Steps 2 and 3 are fairly simple operations — just HMGET(s) at the provided keys. Actually, this brings up an awesome difference between Redis and SQL databases: injection attacks are not possible in Redis, since commands and their arguments are sent separately rather than parsed out of a query string. So we can write very compact, simple interfaces that can be exposed almost directly to untrusted sources (aka users). Steps 2 and 3 are not tied to a particular view, so they’re very reusable.

Let’s build out a Node.js server that would work this way — we’ll use Express.js.
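The handler logic at the heart of that server can be sketched over stubbed data; the function names, sample slugs, and hash contents below are all illustrative, and in the real server they’d be backed by ZRANGE and HGETALL through node_redis:

```javascript
var PAGE_SIZE = 100;

// Stand-in for ZRANGE redishop:priceIndex start end (already price-ordered)
var priceIndex = ['kitty-mug', 'dog-bowl', 'cat-tree'];

// Stand-in for the redishop:items:* and redishop:companies:* hashes
var hashes = {
  'Konopelski-Group': { name: 'Konopelski Group', city: 'Reno', state: 'NV' }
};

// GET /api/items-by-price/:start — a page of slugs plus the bounds used
function itemsByPrice(start) {
  var end = start + PAGE_SIZE - 1;
  return { start: start, end: end, slugs: priceIndex.slice(start, end + 1) };
}

// POST /api/items and /api/companies — an HGETALL for each slug in the body
function getHashes(slugs) {
  return slugs.map(function (slug) { return hashes[slug]; });
}

console.log(itemsByPrice(0).end); // 99
```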

To start the server, run:

$ node app.node.js --credentials ~/path/to/your/credentials.json

This server has two “routes”: one that can get a list of slugs and another that can get all the values of a hash. We can utilize redishop:priceIndex and ZRANGE along with start/end for pagination.

http://localhost:5599/api/items-by-price/0

The only part of the URL that varies is after the last slash. The example above will get slugs for items 0 to 99 (ordered by price, low to high). The response is a JSON object that has the start and end as well as the array of slugs:

The other route gets all the fields and values from one or more hash keys by their slugs using the HGETALL command. A really cool feature of node_redis is that it will automatically convert HGETALL and HMGET replies from the alternating field/value array format into a Javascript object. You specify the keys you want in the body of the HTTP request. You can’t easily make this request from a standard browser, but with cURL you can run:

$ curl -H "Content-Type: application/json" -X POST -d '["Konopelski-Group","Bayer-and-Sons"]' http://localhost:5599/api/companies

Note about HTTP verbs: to make this a true REST API, you wouldn’t use a POST request; it would be a GET request. I’m using a POST here because I want to retrieve more than one item and need to specify the slugs in the body, thus the use of POST instead of GET.

Let’s bring this all together into something usable. You could use any client-side MVC framework for this type of site, but I’m using Angular.js (1.x) because it’s readable, illustrates the point, and doesn’t require much tooling. In this working example, we have three main functions residing in our controller: getItemsByPrice, getDetailOfItems, and getDetailOfCompanies. The names are pretty self-explanatory, but let’s take each one:

  • getItemsByPrice gets the slugs of the items for a given offset. It does an HTTP request to /api/items-by-price/[offset], which returns a list of 100 slugs in JSON form as well as the start and end bounds. The results are immediately sent to getDetailOfItems.
  • getDetailOfItems accepts the slugs of 1 or more items and does an HTTP POST to /api/items with the required slugs as a JSON array in the body. The server returns all the data about each item in an array. It doesn’t actually include the slugs, so the slugs and item details are combined by index using map. This still doesn’t have information about the companies, though, so the company slugs are extracted into an Object whose keys are then returned (client-side Javascript has poor support for set constructs, so this is often the easiest way to get an array of unique values). The company slugs are then sent to the getDetailOfCompanies function. The items array is assigned to the Angular $scope to make it available to the template.
  • getDetailOfCompanies accepts the slugs of 1 or more companies and does an HTTP POST to /api/companies, which is actually the same Express route as /api/items since the structure is the same. The response is in the same sequence as the slugs, so they are matched up and converted into an object with the properties being the slugs and the values being the details of each company. The object created here is assigned to $scope to make it available to the template.
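The combining steps inside getDetailOfItems can be sketched on their own; the sample slugs and details here are illustrative:

```javascript
// Zip the slugs and details by index, then collect unique company
// slugs via an Object's keys. Sample data is illustrative.
var slugs = ['kitty-mug', 'dog-bowl'];
var details = [
  { name: 'Kitty Mug', price: '9.99', manufacturer: 'Konopelski-Group' },
  { name: 'Dog Bowl', price: '14.50', manufacturer: 'Konopelski-Group' }
];

// The server's reply doesn't include the slugs, so combine by index.
var items = slugs.map(function (slug, i) {
  return Object.assign({ slug: slug }, details[i]);
});

// An Object as a poor man's set: its keys are unique by definition.
var seen = {};
items.forEach(function (item) { seen[item.manufacturer] = true; });
var companySlugs = Object.keys(seen);

console.log(companySlugs); // [ 'Konopelski-Group' ]
```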

So far, we’ve not yet “joined” our company and item details. This is actually accomplished in the template, which might seem very odd. Consider how the template is being rendered:

  • The items are iterated over. At the time of rendering, the company data may or may not be in, but it knows the slug of the company and where the company data is/will be.
  • The company data is no larger than the number of items displayed per page and can be easily accessed, as it’s just a property of an Object. No sorting or duplicate data exists.

What’s really compelling about setting up a site like this is the flexibility and reusability of the code. In our example, we’re getting the items sorted by price from low to high. Let’s say you wanted to sort by your own relevance algorithm; all you would need to do is replace the getItemsByPrice function (and the related Express route) with something that returns the same format. The rest of your code would stay the same. Similarly, if you wanted to include another “related” bit of data per item, the pattern used for the company data could be extended in many different ways.

To play with the rudimentary e-commerce platform make sure the app.node.js file is running and go to http://localhost:5599/

Evaluating your best path

We’ve seen three different paths in relating data in our e-commerce example. Certainly, each one has advantages and pitfalls:

Path 1: Sort without pre-sorted items
Advantages: Straight-forward data insertion, atomic, low storage overhead.
Pitfalls: Slower to get results, CPU intensive to get results, could tie up Redis server while getting results.
Suited for: Infrequent reports where atomic operations are required.

Path 2: Sort with pre-sorted items
Advantages: Relatively quick to get results, atomic.
Pitfalls: Slower to insert data, CPU intensive to insert data, storage overhead for pre-sorted lists, could tie up Redis server while adding items.
Suited for: Frequent reports where atomic operations are required, situations where memory overhead is not an issue.

Path 3: Client/server optimized
Advantages: Flexible, reusable code, unlikely to tie up Redis server, minimal server-side app-level manipulation, low storage overhead, can shift load off servers on to web clients, potential for manual partitioning.
Pitfalls: Higher latency, non-atomic, unsuitable for many situations, larger transfer requirements.
Suited for: Websites with client-side MVC frameworks.

Moving from the SQL world to Redis can be a daunting task when you start thinking about relating your data. However, evaluating the various techniques and using them as-needed allows for some very precise tuning. Indeed, breaking out of the SQL box puts you closer to your data and enables novel patterns that would be impractical or impossible otherwise.

The full source code for this piece and series is available on Github.
