Improving Site Performance and Reliability by Removing SQL Joins

Dan Applegate
Skillshare Writings
6 min readMay 13, 2016

For young companies trying to build a product towards market fit, iterative speed is key. You need to be able to build, test, and refine features quickly in order to outrace the competition (and your burn rate). In these situations, with limited personnel and short timelines, frameworks like Ruby on Rails, Django and Symfony can give you a huge leg up by automating much of the grunt work involved in building a dynamic website.

One of the key features of many popular frameworks is an Object Relational Mapping system, which provides a streamlined way to load and save data models to your persistence layer. Rather than having to write and maintain complex SQL statements by hand, developers can use an ORM to interact with their data layer using familiar code structures in their language of choice. Of course, as with any layer of abstraction, the same code that reduces the complexity of common tasks can also hide insidious problems that manifest as mysterious bugs and sudden performance degradations.

For teams using an ORM and attempting to scale, SQL statements that contain JOINs present a number of problems. They allow for writing queries that can only be executed using temporary tables, which can become massively disk-intensive. They complicate or even preclude the use of a sharded database architecture. They introduce headaches into ORM and caching code because the code must juggle table and column aliases in an increasing number of possible permutations. While ORMs can rapidly accelerate the early stages of building a product, the bulky JOINs they may introduce present a significant roadblock once the company is ready to scale.

In an effort to improve reliability, we undertook a series of projects to try to reduce the complexity of our queries. By breaking apart large, multi-table SELECTs and replacing them with primary-key lookups and simple statements, we were able to dramatically increase site uptime. Our database load is more consistent and our code is more maintainable. Collectively, we’ve begun to refer to these conventions and tools as “Model Hydration.”

The Problem With Joins

In a typical ORM that follows some variation of the ActiveRecord pattern, model classes are conceptually connected to each other by a set of relation definitions. A Post class would be defined as “belonging to” an Author, and might “have many” Comment models. Once these relationships are laid out in a configuration, the ORM can then combine and translate them into relevant SQL queries automatically. For example, requesting that an ORM retrieve a Post, its Author, and any associated Comments might result in the following query:

SELECT * 
FROM `post`
JOIN `author` ON `post`.`author_id` = `author`.`id`
LEFT OUTER JOIN `comment` ON `comment`.`post_id` = `post`.`id`

The ORM would then take the resulting data and separate them out into new model instances that each contain references to the other models it is related to. After the ORM does its thing, you would be able to access PostObject.author and get back the correct Author object.

This works well in this basic case, but things get complicated very quickly. What happens if you allow Authors to have many Comments and want to get them as well, together with the Post? This would involve JOINing the `comment` table again, and our ORM must now manage table and column aliases. Selecting multiple Posts and trying to sort them by the latest Comment is doable, but sets a deadly performance trap in the form of temporary tables written to disk.

Additionally, a common scaling tactic for tech companies that outgrow simpler database replication strategies is to shard their primary write database. When reads to your database are a bottleneck, it is fairly easy to set up read-only replicas and distribute the load of read requests across them. However, when writes become the problem, splitting your primary apart is a little trickier. For a traditional RDBMS like MySQL, you generally have to put some of your tables on one host and the rest on another. Naturally, this disallows for joining two tables that live on separate hosts. You then need to ensure that all tables which might be joined together live on the same host. Or you could give up on joins entirely.

Replacing Joins With Simpler Statements

To eliminate this complexity and ensure that our SQL statements are as straightforward as possible, we developed an internal alternative to these traditional ORMs. We call it the RelationHydrator. Although the hydrator still makes use of the ActiveRecord relation definitions, the underlying SQL it executes is broken up into single-table queries. Using our RelationHydrator, the request above would be made like this:

$post = Post::findByPk(100); // Post 100 written by Author 5
RelationHydrator::hydrate($post, [‘author’, ‘comments’]);
$post->author; // Returns the associated Author object with id 5
$post->comments; // Returns all associated Comment objects

Under the hood, the RelationHydrator uses the configuration attached to the Post class to plan out its execution. Here’s an approximation of what this configuration data looks like, in YAML:

Post:
author:
type: belongs_to
class: Author
foreignKey: author_id
comments:
type: has_many
class: Comment
foreignKey: post_id

By inspecting the input model and finding the ActiveRecord configurations for the requested related models, the RelationHydrator can execute simple SQL queries to “hydrate” the models with data from the database:

class RelationHydrator {
private $relation;
private $relationName;
private $models;
private $modelKeys;
public function __construct($_relationName, $_models) {
$this->relationName = $_relationName;
$this->models = $_models;
$this->relation = $this->getRelationObject($_relationName);
// Retrieve the primary keys of all of the base models
// for later use.
// These may be the foreign keys of the relation we’re
// trying to load.
$this->modelKeys = __::pluck($_models, “primaryKey”);
}
public function hydrate() {
// HasOne is just a special case of HasMany
$indexed = [];
if ($this->relation instanceof HasOneRelation
|| $this->relation instanceof HasManyRelation) {
$indexed = $this->hydrateHasSomeRelation();
} elseif ($this->relation instanceof BelongsToRelation) {
$indexed = $this->hydrateBelongsToRelation();
}
// $indexed now contains related models, indexed by PK // Sets the related models as properties on the base models
$this->setRelatedModels($indexed);
}
private function hydrateHasSomeRelation() {
// For these relations, related models will contain a
// foreign key that is the PK of the base model.
$relatedClass = $this->relation->getClass();
$foreignKey = $this->relation->getForeignKey();
$relatedModels = $relatedClass::findAllByAttribute(
$foreignKey,
$this->modelKeys
);
return $this->indexByPk($relatedModels);
}
private function hydrateBelongsToRelation() {
// For these relations, the "foreign key" is a property
// of the base model and the PK of the related model
$relatedClass = $this->relation->getClass();
$foreignKey = $this->relation->getForeignKey();
$relatedPks = __::pluck($this->models, $foreignKey);
$relatedModels = $relatedClass::findAllByPks($relatedPks);
return $this->indexByPk($relatedModels);
}
}

For the above hydration request, the RelationHydrator would then make these two separate queries.

// Called from RelationHydrator::hydrateBelongsToRelation
SELECT * FROM `author` WHERE `id` = 5;
// Called from RelationHydrator::hydrateHasSomeRelation
SELECT * FROM `comment` WHERE `post_id` = 100;

Assuming that the `author`.`id` and `comment`.`post_id` columns have appropriate indices, these queries will be extremely fast, even for very large data sets.

What’s next?

After implementing the RelationHydrator and using hydration throughout the site, we’ve found a dramatic improvement in the way our database handles load, along with a reduction in its volatility. Additionally, by removing all joins throughout our site, we’ve prepared ourselves for the time when we’d like to shard our database in order to handle increasing numbers of writes. Should we decide to place some of our tables on one primary host and others on a separate host, we can simply modify our database abstraction object to route these basic queries to whichever host contains the table in question.

We want to eventually open source our RelationHydrator code, as we feel that it could be a possible solution to a common problem that many tech startups face when starting to scale.

If you have any questions or suggestions about the RelationHydrator, or want to know more, let us know in the comments below. We’re also hiring, so if this sounds like the type of problem you’d enjoy tackling, check out our careers page!

--

--