Your Next Database-Lifehack: Efficient Pseudo-Random Sorting with Hashing

Fetch “random” data without compromising on query speed

Published in

ambient-digital

3 min readNov 20, 2023

Motivation

There are quite a few use-cases where you want to get your data in a non-determinisitc, meaning random, order. Think about a quiz where you’ll present the player with a next question independent of the creation date or the content.

Getting the database to shuffle your results is very time-consuming and might bring your system to its knees with only a few thousand records in your table.

Standard approach

Let’s say we have a simple table for storing our quiz questions. We have a primary key, a text containing the actual question and a timestamp storing the creation date(time).

CREATE TABLE question (
    `id` INT(11) NOT NULL AUTO_INCREMENT,
    `text` varchar(255) NOT NULL,
    `created_at` DATETIME NOT NULL,
     PRIMARY KEY (`id`) USING BTREE,
);

Using any MySQL derivate, you can tell the database to order randomly.

SELECT `text` FROM question ORDER BY RAND();

As stated above, this might be killing your database in no time, so we have to find an alternative.

Pseudo-randomness for the win

In most cases, having something that feels random totally suffices and going the extra mile for “real” randomness (if there is such a thing) isn’t necessary.

We’ll now add a new field where we’ll store a hash of the data. Make sure that you set an index for this field. You won’t find a more suitable use-case for indexing a field.

CREATE TABLE question (
    `id` INT(11) NOT NULL AUTO_INCREMENT,
    `text` varchar(255) NOT NULL,
    `created_at` DATETIME NOT NULL,
    `sorting_hash` varchar(32) NOT NULL,
    PRIMARY KEY (`id`) USING BTREE,
    INDEX `question_sorting_hash` (`sorting_hash`) USING BTREE,
);

Next, we’ll have to take care that we hash all relevant values when creating a record. Note that we use a “fast” hashing algorithm like md5 because security is not an issue and the faster we get it done, the better.

question_text = "What is the Ultimate Question of Life, the Universe, and Everything?"
hash_object = hashlib.md5(question_text.encode())
hashed_text = hash_object.hexdigest()

Now you can sort by this field and your user will have the feeling that they get random data — without waiting for the database for ages.

SELECT `text` FROM question ORDER BY `sorting_hash`;

Improvements

To add a little more salt (randomness), you can add the timestamp of the record to the hash input. This way, you ensure that your hashes are even further disconnected from the original input.
Ensure that you can’t miss creating the hash in your application. Working with Django, a pre-save signal might be a wise choice.

Critical review

As shown above, using a hashing field for sorting your data is a fast and simple approach. It’s fast (if you set the index) but you have to remember to generate the hash during create and update queries.

You pay for speed and convenience with extra data in your database. Since storage is usually cheap, this shouldn’t be an issue in most scenarios. Nevertheless, you might bloat your tables so — as always — think before implementing this pattern.

The result is actually deterministic. Getting the same order every time you query your data is another drawback of this approach, though in the cases I worked on, it was never an issue.

Thanks for reading and I’d be happy about feedback on my approach.