Overthinking it and the value of simple solutions (2019)

Łukasz Korecki
6 min read · Jul 17, 2023


Note: this article was originally published on October 8, 2019. My original blog was decommissioned after some time, so I’m reposting it here. There’s also a Hacker News discussion about it.

Overthinking it and the value of simple solutions

J-Zone is one of hip-hop’s and funk’s best kept secrets. I really enjoyed a video interview he did a few years ago, which has unfortunately since been made private.

There was one part that really stood out for me, not only as a bedroom musician but also as a software engineer: the idea of using the simplest tools possible and not getting overwhelmed by the technology.

It’s advice that keeps getting repeated, even though we’re very much in the age of resume-driven development.

To add to that, I’m guilty of making bad technical decisions in the past. They weren’t small either: picking RethinkDB as one of the data stores at EnjoyHQ, for example, took a considerable amount of time to undo and migrate away from.

When our team was faced with an architectural challenge that had been impacting us for a long time, we chose a different approach than just building stuff: step back, evaluate potential solutions, and pick the simplest possible one.

Repeating yourself in the era of distributed systems

Here’s the problem: our backend is composed of 10+ Clojure services. Around 70% of them have to run work on a fixed schedule (send account digest emails, fetch new data from a 3rd-party service, do data cleanups, etc.). Think built-in Cron. Given that not everything can be made idempotent, and that we’re strong believers in having more than one of everything, we had to come up with a way of ensuring that only one scheduler runs across all instances of a given service, but without having a single point of failure.

For a long time we worked around this problem with a configuration setting which would instruct one instance of a service that it was the designated scheduler. This is not ideal because:

  • it ties service operation to how it’s deployed
  • it’s a single point of failure
  • if a deploy happens at the same time a scheduler is supposed to run, we might miss the “tick”

We put up with it despite these pitfalls, because the setup just worked. But after one too many issues we had to go back to the drawing board.


RFCs

We use an async process for introducing and discussing approaches to solving a particular engineering problem.

As a fully distributed team, we cannot organize ad-hoc meetings any time we want, so we have adopted an RFC (Request For Comments/Change) process in which the author writes a document stating the problem, possible solutions, an evaluation of attempts, pitfalls/cons, and a conclusion. The RFC process has been quite popular, and many companies and open source projects have used it to great success.

For the scheduler problem it took 3 RFC documents and testing the following approaches:

  • using Redis as a lock service
  • using Zookeeper, Consul and jGroups as the basis of the leader election mechanism
  • introducing a new service, which would be a centralized, distributed scheduler orchestrating jobs over RabbitMQ, similar to Google Cloud’s Scheduler service

The last idea was rejected because it introduced too much complexity and made the rest of the system depend on a single point of failure.

The other approaches were evaluated in the form of an internal Clojure library, which we called Leader: it provides a uniform interface for picking a cluster leader and is backed by Consul, Zookeeper, or jGroups (not all at once, of course; backends are pluggable).
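To make the comparison concrete, here’s a rough sketch of what such a uniform interface could look like as a Clojure protocol. The names are hypothetical; the real Leader library is internal, so this is illustrative only:

(defprotocol LeaderElection
  ;; Illustrative only: not the real Leader library's API.
  (start! [this]
    "Joins the election using the configured backend (Consul, Zookeeper, jGroups, ...).")
  (leader? [this]
    "Returns true if this instance currently holds leadership.")
  (stop! [this]
    "Leaves the election and releases leadership if currently held."))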

In the end, all approaches were documented, built, tested, and… we have rejected all of them.

Run less stuff

Anything that introduces a new piece of infrastructure should be evaluated in the strictest way possible. It’s not only about solving the problem at hand, but also about answering the following questions (and possibly more):

  • who’s going to manage the new piece?
  • what are the failure modes?
  • how are we going to manage upgrades?
  • is this suitable for our scale?
  • what is the associated infrastructure cost?
  • what are the security and compliance implications?
  • is this a single purpose tool or can we use it in other scenarios?

Once all of these questions were taken into account, it basically ruled out solutions based on Consul or Zookeeper (and similar systems). Our deployment setup is incredibly simple: we deploy containerized applications onto dedicated VMs, fronted by internal load balancers (all managed with Terraform). Adding another stateful layer to our deployment would have introduced yet another layer of complexity.

The last candidate was the Redis-based solution, but that also didn’t feel right. Our Redis usage is very light, so we could potentially add one more workload to it without affecting the rest of the system. However, the most popular approach, Redlock, seems to have a lot of issues, and we weren’t comfortable with having to deal with potential problems caused by clock skew and the like.

Back to the drawing board.

Lockjaw

This is where Lockjaw comes in: it’s a small library that exposes Postgres’ advisory locks as a pluggable Component.

That’s all we needed: ensuring that at any given time only one instance of a service does the work. No magic, no service discovery, no extra infrastructure to run.

Usage is as simple as the library documentation states: upon connecting to Postgres, each service also starts its scheduler and lock components. Whoever connects first (usually) obtains the lock and carries on working, while every other instance keeps checking if it’s their turn. If the current lock holder exits (restart, OOM, etc.), the remaining instances will try to acquire the lock, and so on.
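For the curious, the underlying mechanism is simple enough to sketch directly against Postgres’ session-level advisory locks, here using next.jdbc (the lock id, db-spec, and function names below are illustrative, not Lockjaw’s actual API):

(ns example.leader-lock
  (:require [next.jdbc :as jdbc]))

;; Session-level advisory locks are tied to the connection that acquired them,
;; so the process keeps one dedicated connection open for its lifetime.
(defn try-acquire-leadership!
  "Attempts to grab the advisory lock identified by lock-id.
   Returns true if this instance is now the leader."
  [conn lock-id]
  (:acquired
   (jdbc/execute-one! conn ["select pg_try_advisory_lock(?) as acquired" lock-id])))

(comment
  (def conn (jdbc/get-connection {:dbtype "postgresql" :dbname "app"}))
  ;; Only one connected process gets true for a given lock id; the others retry on a timer.
  (try-acquire-leadership! conn 42)
  ;; The lock is released when the session ends (crash, restart, etc.) or explicitly:
  (jdbc/execute-one! conn ["select pg_advisory_unlock(?)" 42]))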

I’m sure this approach has problems, like what if Postgres is down? If that’s the case, we have a bigger issue!

Eternity

With a simple solution to the locking problem, plugging it into the rest of the system was pretty straightforward. In fact, our existing scheduler component, Eternity, did not require any changes.
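For context, here’s a minimal sketch of what a component that periodically executes a function could look like, built on Java’s ScheduledExecutorService. This is illustrative only, not Eternity’s actual implementation:

(ns example.scheduler
  (:import [java.util.concurrent Executors ScheduledExecutorService TimeUnit]))

;; Illustrative only: a tiny periodic runner, not Eternity's actual implementation.
(defn start-scheduler
  "Runs (scheduled-fn component) every interval-seconds.
   Returns the executor so it can be shut down later."
  [scheduled-fn component interval-seconds]
  (doto (Executors/newSingleThreadScheduledExecutor)
    (.scheduleAtFixedRate #(scheduled-fn component) 0 interval-seconds TimeUnit/SECONDS)))

(defn stop-scheduler [^ScheduledExecutorService executor]
  (.shutdown executor))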

It’s all functions and data, man

The middleware pattern in Clojure is very simple to implement: it’s a function which wraps another function and returns a new function 💡

Here’s a very simple logging middleware and how to use it:

;; Assumes [clojure.tools.logging :as log] is required and insert-data is defined elsewhere.
(defn with-logging [scheduled-fn]
  (fn [component]
    (log/info "about to do work")
    (scheduled-fn component)
    (log/info "done!")))

(defn do-work [component]
  (insert-data (:db component) {:id "abc" :name "Bananas"}))

;; and then
((with-logging do-work) component) ;; will log, do work, log again

Since Eternity’s core function (no pun intended) is to periodically execute functions, adding a locking middleware was very simple. In fact, it was so simple that we didn’t have to change any of the existing code, just the scheduler component setup. Furthermore, we were able to remove the code and configuration responsible for activating (or not) the scheduler during application boot.
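As an illustration, here’s roughly what such a locking middleware could look like, assuming a hypothetical holding-lock? helper that reports whether this instance currently owns the advisory lock (a sketch, not Lockjaw’s actual API):

;; Sketch only: holding-lock? is a hypothetical helper, not Lockjaw's actual API.
(defn with-lock [scheduled-fn]
  (fn [component]
    (if (holding-lock? (:lock component))
      (scheduled-fn component)
      (log/debug "not the leader, skipping scheduled work"))))

;; Middleware composes, so it combines with the logging example above:
;; ((with-lock (with-logging do-work)) component)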

Conclusion

This setup has been live for 3 months now and we have not run into any issues; we were also able to reduce the size of our deployment and simplify service configuration.

Note: it’s been nearly 4 years since I originally wrote this post, and Lockjaw use at EnjoyHQ only increased over the years. I’m also using it in Collie.

If we ever adopt more dynamic ways of deploying our services (Kubernetes, etc.), we’re ready, as we can guarantee that even during scale-up/down events only one instance is going to run the scheduled operation.

Both Eternity and Lockjaw are open sourced, and you can check them out (and use :-)) on our GitHub page.

I’d buy a couple 😉


Łukasz Korecki

CTO & Co-founder of https://collieapp.com Also living room musician, kitchen scientist. Street Fighter 6 enjoyer