Lessons learned developing the Storex database abstraction library

Vincent den Boer
9 min read · May 28, 2022


Originally published on my personal blog.

Back in 2018 we migrated Memex to IndexedDB to enable fast client-side full-text search for everything you read on the web. As with every database without a large ecosystem, we were missing basic tools to describe our data model, track how it changed over time, and perform schema migrations. So we implemented a way to describe our data model, plus a single point through which all database read and write operations flowed.

Those simple ideas evolved into Storex, the database abstraction that helped us evolve from a single-device product with client-side full-text search, to our own peer-to-peer encrypted multi-device sync, to our eventual transition to the cloud, sharing code across IndexedDB, SQLite and Firestore.

The core idea

While many database abstraction layers create complex systems to make using the database look like object-oriented programming, I wanted to keep things simple. Instead, schema descriptions, storage operations and their results are plain data structures, and all storage operations are executed by a single function. This simplifies a lot of the functionality you can build on top.

One such thing is middleware. You can make operations pass through other functions before they arrive at the function that actually interacts with the database. This allows you to do things such as logging operations and their results for debugging, or writing executed operations to a log that you can send to other devices to sync their databases.
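As a minimal sketch of what such a middleware could look like (the type and function names here are illustrative, not Storex’s exact API):

// Illustrative types; Storex's real middleware interface differs in detail.
interface StorageOperation {
  args: any[] // e.g. ['findObjects', 'test', { key: 'value' }]
}
type ExecuteOperation = (operation: StorageOperation) => Promise<any>

interface StorageMiddleware {
  process(operation: StorageOperation, next: ExecuteOperation): Promise<any>
}

// Logs every operation and its result before passing it down the chain.
const loggingMiddleware: StorageMiddleware = {
  async process(operation, next) {
    console.log('executing:', operation.args)
    const result = await next(operation)
    console.log('result:', result)
    return result
  },
}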

Let’s take the example of debugging storage operations. If you were using something like Firestore or MongoDB directly, the application would be sprinkled with code like this:

firestore().collection("test").where("key", "==", "value")

If we’re lucky, the client library provides some kind of built-in logging. But if not, even if I write some kind of wrapper, I still have to think hard about how to represent this query in the log, and I have to do some state tracking to capture the end result of the chained methods.

The situation is better when you use raw SQL. There you at least know there’s a string containing the SQL query you can dump:

SELECT * FROM test WHERE `key` = 'value';

But what if you want to dump only the queries involving the table “test”? You could parse the SQL string, but why should you have to do that in the first place? In Storex, operations look like this:

storageManager.operation("findObjects", "test", { key: "value" })

Everything you need can be found in the arguments of that call. You know it’s a findObjects operation on the collection test. No SQL parsing or state tracking necessary.

But here you can also spot something I’d like to have done differently. Right now, an operation is represented by the list of arguments to that function. Instead, I’d like operations to be self-descriptive objects:

storageManager.operation({
  type: "findObjects", collection: "test", where: { key: "value" }
})
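In TypeScript, such self-descriptive operations could be modeled as a discriminated union, so any consumer can switch on the operation type without positional-argument bookkeeping. A hypothetical sketch, not the current API:

// Hypothetical self-descriptive operation types.
interface FindObjectsOperation {
  type: 'findObjects'
  collection: string
  where: { [field: string]: any }
}
interface CreateObjectOperation {
  type: 'createObject'
  collection: string
  object: { [field: string]: any }
}
type StorexOperation = FindObjectsOperation | CreateObjectOperation

// Any consumer (logger, sync log, debugger) can inspect operations directly.
function describeOperation(operation: StorexOperation): string {
  return `${operation.type} on collection '${operation.collection}'`
}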

Keeping complexity to a minimum

Being a team of only two programmers, we needed to minimize the overhead of developing a new library so we could focus on building the product. This forced us not to build a general-purpose library that could do everything for everyone. Instead, we kept Storex simple enough that we could quickly add functionality as we actually needed it. This worked out really well for us! Had we tried to implement a storage layer that could do everything IndexedDB, SQLite and Firestore can do, we’d probably have gone bankrupt. Instead, we implemented the basics, like createObject, findObject(s), updateObject(s) and deleteObject(s), plus just a few query filters ($eq, $gt, $lt, $in, etc.), and we found these easy to implement consistently across three very different databases!
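In day-to-day use, that small surface looks roughly like this; the exact call signatures shown here are illustrative:

// Illustrative usage of the basic operations and filters.
async function example(storageManager: { operation: (...args: any[]) => Promise<any> }) {
  await storageManager.operation('createObject', 'user', { displayName: 'test' })
  await storageManager.operation('findObjects', 'user', { age: { $gt: 18 } })
  await storageManager.operation('updateObjects', 'user',
    { id: { $in: [1, 2, 3] } }, // which objects to update
    { isActive: true },         // the updates to apply
  )
  await storageManager.operation('deleteObjects', 'user', { isActive: { $eq: false } })
}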

As a bonus, this allowed us to implement testing backends that run completely in memory, in both the browser and Node.js, even for Firestore, which doesn’t officially support such functionality. We could write all business logic, confirm tests pass, and only as a last step run them against the Firestore emulator. The entire backend could even run in-memory in the browser! Together, this meant that during daily development we only needed to run the front-end, while the final tests requiring everything to be set up were just a tiny last step.
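The pattern that enabled this is that tests depend only on an abstract backend, injected via a factory. Sketched below with hypothetical class and helper names, assuming a Jest-style test runner:

// Hypothetical test setup; backend factories are illustrative.
interface TestBackend {
  operation(...args: any[]): Promise<any>
  cleanup(): Promise<void>
}
declare const createInMemoryBackend: () => Promise<TestBackend>
declare const createEmulatorBackend: () => Promise<TestBackend>

function integrationTests(createBackend: () => Promise<TestBackend>) {
  it('stores and retrieves users', async () => {
    const backend = await createBackend()
    await backend.operation('createObject', 'user', { displayName: 'test' })
    const users = await backend.operation('findObjects', 'user', { displayName: 'test' })
    expect(users).toHaveLength(1)
    await backend.cleanup()
  })
}

// Daily development runs only the in-memory suite; the emulator suite
// runs as the final step.
describe('in-memory backend', () => integrationTests(createInMemoryBackend))
describe('Firestore emulator backend', () => integrationTests(createEmulatorBackend))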

But, some things could’ve been done better to reduce complexity even further:

  • I thought it’d be useful to have collection names automatically available in both singular and plural form. I pulled in an external package without giving any thought to the costs. This was pretty useless and just added unnecessary code to the final application.
  • I thought it’d be useful to be able to create objects with all their relations in one operation, like operation('createObject', 'user', { displayName: 'test', emails: [{ address: 'test@test.com' }] }), letting the storage backend automatically create the user first and put its ID in the automatically created e-mail. But this was never actually used in our application and just added useless complexity to each storage backend.
  • I created a tight coupling between the storage backends and the storage registry describing the data schema. But each storage backend actually only needs to know different, specific things. Dexie needs to know the entire history of the data schema, whereas other databases don’t. Some don’t care about relations, whereas SQL table creation does. Firestore needs to know which fields are date fields. Instead of tightly coupling to one specific way of describing the data schema, it would’ve been enough to pass each backend functions answering the specific questions it needs to ask, such as isTimestamp(collectionName: string, fieldName: string): boolean (see the sketch after this list).
  • From this tight coupling to the storage registry came one additional mistake. Versioning of the schema was baked in, and I thought it’d be a good idea (for whatever reason) to use dates as version numbers. Aside from being confusing, this doesn’t really help if you factor out different parts of the schema into separate libraries, where the library schema version doesn’t cleanly map to your application schema version.
  • A tiny nag that also creates some complexity in storage backends and middleware is the divergence of the equality filter from the rest of the operators. Every operator works like { $lt: 5 }, { $in: [1, 2] }, etc., but you can (optionally!) write equality as { key: 'value' } instead of { key: { $eq: 'value' } }. It’s small, but it occurs often enough to be annoying.
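To make the decoupling point above concrete: instead of handing every backend the full storage registry, each backend could receive only the narrow question-answering functions it needs. A hypothetical sketch:

// Hypothetical dependency interfaces, one per backend's actual needs.
interface FirestoreBackendDependencies {
  // Firestore needs to convert timestamp fields to/from its Timestamp type.
  isTimestamp(collectionName: string, fieldName: string): boolean
}
interface SqlBackendDependencies {
  // SQL table creation needs to know relations; other backends may not.
  getRelations(collectionName: string): Array<{ field: string; targetCollection: string }>
}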

See the common theme in that list? “I thought it’d bla bla bla.” Instead of thinking about what’s theoretically useful, it’s often better to write only the code you actually need, keeping things reasonably clean along the way, until you’ve discovered enough to know what you actually need.

One thing that could’ve used more thought, though, is how to handle optional fields consistently across backends. Some of them simply omit a field when it’s not present in the database, while others return undefined or null for it. This makes integration tests fail when you switch backends. Not the end of the world, but an example where a bit of extra thought could’ve saved time.
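One way to retrofit consistency here would be a small normalization step, for example in a middleware, that settles on one convention, say omitting absent fields. A sketch:

// Hypothetical helper: strips null/undefined fields from a result object so
// every backend behaves as if absent fields were simply omitted.
function normalizeOptionalFields<T extends { [key: string]: any }>(object: T): Partial<T> {
  const normalized: Partial<T> = {}
  for (const [key, value] of Object.entries(object)) {
    if (value !== null && value !== undefined) {
      (normalized as any)[key] = value
    }
  }
  return normalized
}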

Unfinished experiments

With the basic building blocks in place to describe data schemas and execute storage operations, we started to organize related storage logic into storage modules. These were classes combining schema descriptions, pre-defined storage operations, access rules and higher-level storage operations in one place. It was a partial success. By pre-registering storage operations, we could easily see, programmatically, all the storage operations the program ever performs. When we were using Firebase and decided we wanted to further restrict which queries users could run, we could disallow operation execution from the front-end and move those operations to the back-end with minimal changes. But this pre-registration did introduce more indirection, and thus onboarding costs, along with a lack of type safety that led to frequent, but easy-to-detect-and-fix, bugs. I’d like to develop a better version of storage modules that cuts out some of the complexity and indirection while being more type-safe.
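To give an impression, a storage module bundles a slice of the schema with pre-registered, named operations. The shape below is a simplified sketch rather than the exact Storex interface:

// Simplified sketch of a storage module; the real interface differs.
declare abstract class StorageModuleBase {
  protected operation(name: string, args: {}): Promise<any>
}

class UserStorage extends StorageModuleBase {
  getConfig() {
    return {
      collections: {
        user: { fields: { displayName: { type: 'string' } } },
      },
      // Pre-registered operations: enumerable and analyzable as plain data.
      operations: {
        createUser: { operation: 'createObject', collection: 'user' },
        findUserByName: {
          operation: 'findObject',
          collection: 'user',
          args: { displayName: '$displayName:string' },
        },
      },
    }
  }

  // Higher-level methods refer to operations by their registered name.
  async createUser(displayName: string) {
    return this.operation('createUser', { displayName })
  }
}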

When we started to develop part of the product using Firestore, we had to work with its form of access control. Users can execute any queries they want, but a deployed security rule file decides what each user can read and write. This is very convenient (part of the reason people like GraphQL), but also comes with limitations we were aware of in advance. We also wanted to be able to migrate away from Firestore. So I analyzed the use cases that were about to come up and, in three days of work, implemented a higher-level description (again, in plain data objects) of the access rules: which fields represent ownership, some data-driven logic, etc. Combined with the data-driven description of the schema, it was easy to generate Firestore rules, avoiding lots of common yet easily avoidable mistakes while making it clearer what the rules were trying to express.

The idea was to then implement a JavaScript library enforcing these rules on operations coming from clients, so that with another kind of database you could still execute any query from the client, filtered through the same security rules. Another idea was to bake filtering of listing queries into these rules, which Firestore itself lacks. Neither was needed by us yet, so neither was implemented. The base idea of making access control data-driven worked wonderfully though, and holds a lot of unexplored potential. That said, the reason it works for us is that we’re small and agile: we understand the entire codebase that works with these rules and can expand and modify it when needed (even though only bug fixes have been made since the initial implementation 2.5 years ago).
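To give a flavor of what such data-driven rules might look like (the shape here is a rough sketch, not the exact rule format we shipped):

// Rough sketch of data-driven access rules, expressed as plain data.
const accessRules = {
  ownership: {
    // Objects in 'pageList' are owned through their 'creator' field;
    // only the owner may create, update or delete them.
    pageList: { field: 'creator', access: ['create', 'update', 'delete'] },
  },
  permissions: {
    // Anyone may read a list, but only if it's marked public.
    pageList: { read: { rule: { eq: ['$object.isPublic', true] } } },
  },
}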

From the plain object describing the application data schema, it was easy to generate TypeScript types representing those database objects. No methods to read and write them, just the plain fields, like:

interface Email {
  address: string;
  primary: boolean;
}

This is very nice to have, because many places in the program will process this data in some way, and you don’t want all of them potentially reading and writing the database, or their types coupled to something overly complex (like the meta-programming magic a lot of ORMs tend to do). However, the combinatorial explosion of variants of those types isn’t elegantly solved yet. Do I want the object with its primary key because I just read it, without it, or with it optional? Which relations do I want included: a user with an (optional?) profile, an array of e-mails, any of the many other types of data attached to an object? The pragmatic solution was to generate types with only the data fields, like the one above, excluding primary keys and relations. But this has led to hacky, type-unsafe workarounds to make up for it.
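One direction that might tame this would be deriving the variants from the generated base types with TypeScript utility types; a sketch of the idea, reusing the Email interface above:

// Hypothetical variant derivation from the generated base type.
type WithPk<T> = T & { id: number } // as read from the database
type CreateInput<T> = T             // before insertion, no pk yet
type UpdateInput<T> = Partial<T>    // partial update payload

// Relations multiply the variants further:
interface User { displayName: string }
interface Profile { bio: string }
type UserWithEmails = WithPk<User> & { emails: Array<WithPk<Email>> }
type UserWithProfile = WithPk<User> & { profile?: WithPk<Profile> }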

Data schemas evolve over the lifetime of an application. Usually each database library has its own schema migration solution (if it has one at all), which involves running some code. But since the schema is already a plain data object, combining it with some data-driven migration logic gives you a lot more flexibility: you can transform migrations into SQL scripts, use them to transform data on the fly from outdated sources, or apply them to your test data sets to keep those up to date. While some tools were developed to calculate differences between schema versions, combine them with data-driven logic and execute them, they never reached production use. It was always faster in the short term to hack together a solution to a specific problem than to make things easier and more reliable for the future.
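The diff between two schema versions can itself be a plain data object that different consumers interpret; a hypothetical example:

// Hypothetical schema diff, expressed as plain data. A SQL backend could
// render it as ALTER TABLE statements; a test tool could apply it to
// fixtures; a reader could upgrade outdated data on the fly.
const schemaDiff = {
  fromVersion: new Date('2019-01-01'),
  toVersion: new Date('2019-06-01'),
  collections: {
    added: { emailVerification: { fields: { token: { type: 'string' } } } },
    removed: [] as string[],
    changed: {
      user: {
        addedFields: { isActive: { type: 'boolean' } },
        removedFields: ['legacyFlag'],
      },
    },
  },
}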

What’s next?

Everything taken together, I consider Storex a successful experiment that gave us outsized returns on investment. There are many concepts I’d like to explore further, like the balance between making applications data-driven versus code-based. There are many things that could be built on top, like Storex Hub (a local data store tying together data from many different sources) and easy-to-use mechanisms for decentralizing applications. But those are grander visions. First, I’d like to get a stable version of Storex out, useful and easily accessible to a larger public. In that version I’d iron out the kinks described above: simplifying, decoupling and making operations self-descriptive.

If you enjoyed reading this, my newsletter can keep you up to date on new articles like this one about building software differently, so you can build faster, collaborate better across teams and, in the end, deliver more useful products.


Vincent den Boer

I help people rethink how they collaborate building digital products and deconstruct complex problems. CTO at https://memex.garden/