TL;DR: we switched from RethinkDB and ElasticSearch to Postgres, leaning heavily on Haskell in order to fill in some of the gaps quickly. The project was a success, and we’re very happy with the switch. Haskell has been invaluable for refactoring safely and confidently.
Motivation
At Lumi, our backend stack has historically used RethinkDB for the primary database and ElasticSearch for indexing and search. This has worked well for the most part, but it has recently started to show its limitations.
In this post, I’ll talk about some of the more interesting problems we faced while migrating from RethinkDB and ElasticSearch to Postgres.
In order to understand why we decided to switch, I need to explain a bit about our data architecture. RethinkDB was used as the primary source of data, but in order to implement searches on computed fields (for example, statistics), we would previously precompute these fields and store them in ElasticSearch, updating them as necessary. This gave good search performance, but obviously came with its own set of issues. Specifically, it was tricky to ensure that the computed fields were always up to date, and it was difficult to verify from the code alone that this property would hold.
Now, this is obviously an issue with how we used RethinkDB and ElasticSearch, not an issue with the tools themselves. But in the absence of some more sophisticated approach, we decided to go with a low-tech solution and simply compute these fields on the fly, as functions in the database. This was something that we could implement easily in Postgres, and our experiments suggested that the performance would be on par with our RethinkDB/ElasticSearch setup.
Here are some other concerns we had:
RethinkDB:
- Performance — Joins can be slow or even crash the database. Computing certain statistics on the fly in Rethink didn’t seem to be an option.
- Correctness — The query language is untyped, which can make it difficult to compose large queries if you’re used to working with types.
- Support — Rethink is no longer commercially maintained.
ElasticSearch:
- Correctness — There are many types of indexes, and the same operations can give different results based on the exact type of index you picked. That said, the `bloodhound` library does a great job of presenting a sensible API to ElasticSearch.
Postgres also had its own appealing characteristics:
- Features — Full SQL with many useful extensions. ACID guarantees. Schemas make sure we don’t invalidate existing data.
- Performance — Postgres can handle complex queries easily. If we have performance issues, we can always `EXPLAIN` a query and debug.
- Correctness — SQL isn't a perfect language when it comes to correctness (`NULL` handling jumps to mind), but it's pretty good, and relational algebra provides a good foundation for reasoning about correctness. There are also excellent Haskell libraries which wrap Postgres in order to provide additional type-safety.
- Support — Postgres is basically going to be around forever, and it has an excellent, active community.
Getting Started
First of all, we decided that all access to the database would go through our Haskell web services instead of being spread between various Node and Haskell services. This should have several advantages, but primarily, it means that every read and write gets type-checked. We can’t put any data into the database which is incorrectly formatted.
Next, we decided that we'd use `persistent` to describe the database schema. `persistent` has some support for building queries, but we decided to use `esqueleto` to build our queries since its query language is much richer. Anything we can't write with `esqueleto` is written in a SQL function and imported by name on the Haskell side. This gives us a nice split between type-checked Haskell and the full power of SQL with Postgres extensions.
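For example, a SQL function created in one of our migrations can be exposed to `esqueleto` by name using `unsafeSqlFunction` from esqueleto's internal modules (the exact module has moved between versions). The function and types below are made up for illustration:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Int (Int64)
import qualified Database.Esqueleto as E
-- In newer esqueleto versions this import comes from
-- Database.Esqueleto.Internal.Internal instead.
import Database.Esqueleto.Internal.Sql (unsafeSqlFunction)

-- Assume a SQL function "order_total(order_id bigint) returns double precision"
-- defined in one of our migrations; we expose it by name as a typed expression.
orderTotal :: E.SqlExpr (E.Value Int64) -> E.SqlExpr (E.Value Double)
orderTotal orderId = unsafeSqlFunction "order_total" orderId
```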
`persistent` has the ability to generate migrations for any schema changes, so we use `persistent` to generate all of our schema update scripts.
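As a concrete illustration (with made-up entities), the schema is declared with `persistent`'s quasiquoter, and the generated `migrateAll` value is what we use to produce the schema update scripts, for example via `printMigration` or `runMigration`:

```haskell
{-# LANGUAGE EmptyDataDecls, FlexibleContexts, GADTs, GeneralizedNewtypeDeriving #-}
{-# LANGUAGE MultiParamTypeClasses, OverloadedStrings, QuasiQuotes #-}
{-# LANGUAGE TemplateHaskell, TypeFamilies #-}

import Data.Text (Text)
import Database.Persist.TH (mkMigrate, mkPersist, persistLowerCase, share, sqlSettings)

-- A made-up slice of the schema; 'migrateAll' is generated for us.
share [mkPersist sqlSettings, mkMigrate "migrateAll"] [persistLowerCase|
Customer
  name  Text
  email Text
  UniqueEmail email
  deriving Show
Order
  customerId CustomerId
  total      Double
  deriving Show
|]
```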
Migrating Data
Migrating all data from our existing production database into a shiny new Postgres RDS instance was an interesting challenge.
We approached it like this:
- Run the `persistent` migration script and drop all constraints temporarily.
- Use `servant` to generate a simple web service with one method for each of our `persistent` entities to insert data into the database (a sketch of the API type follows this list).
- Run a Python script to pull data from Rethink and push it to our new API, reporting any errors (which usually amounted to schema discrepancies, and which we were able to fix over the course of a few days).
- Fix up any bad data in place in the database using a collection of ad hoc scripts. This step was necessary since data in RethinkDB had become bad over time due to the lack of an enforced schema and presumably bugs in our code.
- Add back any constraints we dropped and create indexes.
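The ingestion API type looked roughly like the following sketch, reusing the made-up `Customer` and `Order` entities from the schema sketch above; each handler simply performs a `persistent` insert:

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE TypeOperators #-}

import Servant

-- Illustrative only: one POST endpoint per persistent entity, returning
-- the key of the inserted row.
type IngestAPI =
       "customers" :> ReqBody '[JSON] Customer :> Post '[JSON] CustomerId
  :<|> "orders"    :> ReqBody '[JSON] Order    :> Post '[JSON] OrderId
```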
Replacing RethinkDB and ElasticSearch
We were using ElasticSearch in three ways:
- To store computed statistics, and to make those searchable.
- To perform simple searches against string and numeric values, and to perform the occasional full-text search.
- To parse simple query strings to support more interesting searches.
Computing Statistics
Instead of pre-computing statistics and storing them in ElasticSearch, we decided to compute all statistics on demand. This was something we previously tried in RethinkDB, but the results were not good. When joins were involved, queries were often way too slow and in some cases caused the database to crash.
When we tried implementing statistics as SQL queries in Postgres, we were amazed by the performance. We could implement complex statistics involving data from many tables, and with some help from the query analyzer, we were even able to beat our previous query performance in most cases. Note that previously we were simply retrieving precomputed results, whereas now we compute everything on the fly. This is a great improvement for correctness, but it's also a testament to the fantastic performance of Postgres.
There is the possibility that this approach will not scale to larger datasets. However, there are many options if that becomes a problem. Aside from simply throwing more hardware at the problem, we could consider things like materialized views, or a separate data warehouse entirely, trading liveness of our data for improved average query speed. Another option might be to compute statistics incrementally, which is an area where I think Haskell would really shine.
For now, in order to implement our stats queries in `esqueleto`, we designed a type class:
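(What follows is a sketch rather than our exact code: the `E.SqlExpr (E.Entity entity)` argument type is an assumption, and `SqlSelect` comes from esqueleto's internal modules.)

```haskell
{-# LANGUAGE FunctionalDependencies #-}
{-# LANGUAGE GADTs #-}
{-# LANGUAGE MultiParamTypeClasses #-}

import qualified Database.Esqueleto as E
import Database.Esqueleto.Internal.Sql (SqlSelect)

-- 'SomeStats' packages up how to select the computed fields for an entity,
-- and how to turn esqueleto's representation of the selected values into a
-- result for our API.
data SomeStats entity fields result where
  SomeStats
    :: SqlSelect fields rep
    => (E.SqlExpr (E.Entity entity) -> fields)  -- select the computed fields
    -> (rep -> result)                          -- interpret the selected values
    -> SomeStats entity fields result

-- Map each entity type to the type of its selected computed fields and to
-- its public result type.
class HasStats entity fields result | entity -> fields result where
  stats :: SomeStats entity fields result
```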
Here, the functional dependency is implementing a function from entity types to the types of selected computed fields (internal to `esqueleto`) and the `result` type (public in our exposed API).
The `HasStats` class is a wrapper, providing a value of the `SomeStats` helper data type for the appropriate entity. `SomeStats` packages up the functions we need in order to compute statistics.
First, it packages a function which turns our entity type into a collection of selected fields. The `SqlSelect` constraint ensures that we can actually turn these queries into `SELECT` statements using `esqueleto`.
`SomeStats` also packages up a function `rep -> result` which can turn `esqueleto`'s representation of this data into an actual result.
These two functions give us everything we need in order to select, filter, and sort by our selected statistics in our API.
Simple Searches
Simple searches were by far our biggest use case, so we decided to focus on those. For now, we are not going to worry about full-text search since we didn't use it very much or in any crucial ways (even though, arguably, this is the real reason to use ElasticSearch in the first place). Thankfully, Postgres provides full-text search as well (what can't it do?), which we will probably look into if we find we need to restore this functionality.
Our simple search functionality varied from entity to entity, so we created a new type class to capture the mapping from entity types to search functions, which are implemented as `esqueleto` queries:
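(A sketch; the `searchData` member and its type are assumptions on my part, while `SearchData` and `createTextQuery` are described below.)

```haskell
{-# LANGUAGE TypeFamilies #-}

import Data.Text (Text)
import qualified Database.Esqueleto as E

class HasTextSearch entity where
  -- All of the information, possibly joined in from other tables, that we
  -- need in order to run a text search against a single row.
  type SearchData entity

  -- Perform any joins and produce the 'SearchData' for each row.
  searchData :: E.SqlExpr (E.Entity entity) -> E.SqlQuery (SearchData entity)

  -- Apply a search term to that structure to form a predicate.
  createTextQuery :: Text -> SearchData entity -> E.SqlExpr (E.Value Bool)
```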
This looks complicated, but the complication just arises from the need to support searches which involve joins between multiple tables. The `SearchData` type family captures all of the entity information we need to pull in order to perform a query against a single row. The `createTextQuery` function applies a search term to that structure to form a predicate which we can pass to the database.
Replacing ElasticSearch Query Strings
In order to replace ElasticSearch's simple query string parser, we needed to implement our own small search DSL and parser. With Haskell's `parsec` library, this was very simple. We designed a small language of search terms:
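(Another sketch: only `SpecificMatch` and the derived `Traversable` instance are taken from the discussion below; the other constructors are illustrative.)

```haskell
{-# LANGUAGE DeriveFoldable, DeriveFunctor, DeriveTraversable #-}

import Data.Text (Text)

-- The search term language, parameterised by the type used to identify
-- fields (untyped 'Text' names straight out of the parser, typed columns
-- after type checking).
data Query field
  = MatchAny Text               -- a bare term, matched against any searchable field
  | SpecificMatch field Text    -- e.g. firstName:John, matched against one field
  | And (Query field) (Query field)
  | Or  (Query field) (Query field)
  deriving (Show, Functor, Foldable, Traversable)
```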
and a parser for these terms using `parsec`. For the most part, this expression language is self-explanatory, but the `SpecificMatch` constructor requires some explanation.
We would like to support query strings like `firstName:John` which match a specific field. In order to turn this into a typed query which can be passed to `esqueleto`, we need to actually check that the `firstName` field has the right type — a string, in this case. Note that this is a type-changing operation — we go from untyped queries to typed queries.
That operation can fail, so we implement it using an error-tracking monad. But we need to perform that operation against every `SpecificMatch` appearing in a `Query`. By deriving `Traversable`, we can simply use `traverse` to perform this operation globally, for free!
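Concretely, the whole pass boils down to something like this (with `Either Text` standing in for whatever error-tracking monad is used, and the per-field resolver passed in as an argument):

```haskell
import Data.Text (Text)

-- Check every field name mentioned in a query, turning an untyped query
-- into a typed one or failing with an error message.
typeCheckQuery
  :: (Text -> Either Text typedField)  -- resolve one untyped field name
  -> Query Text
  -> Either Text (Query typedField)
typeCheckQuery resolve = traverse resolve
```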
Notes on Dynamic Types
When we were using RethinkDB and ElasticSearch, everything was untyped, and we could pass names of fields directly to the database without type checking. If the types didn’t match, we would either not find out, or find out by catching an exception from the database driver. This was obviously not ideal from a debugging point of view, but it certainly made implementation easier.
In order to keep the same search functionality, we had to find a way to use untyped field names with the typed fields exposed by `persistent` and `esqueleto`. This problem also shows up when performing other operations such as filtering and sorting rows.
To solve this problem, we implemented a new type class in order to map our entity types to lists of acceptable field names:
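(A sketch; the class name here is an assumption.)

```haskell
import Data.Text (Text)

-- Map an entity type to the field names a user is allowed to refer to.
class HasColumns entity where
  -- An association list from user-facing field names to column data.
  selectRows :: [(Text, SomeColumn entity)]
```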
Here, `selectRows` is simply an association list of field names to column data.
What is `SomeColumn`?
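Roughly, it is an existential wrapper along these lines (the exact constraint set on `typ` is an assumption):

```haskell
{-# LANGUAGE ExistentialQuantification #-}

import qualified Database.Esqueleto as E
import Database.Persist (PersistField)

-- A column of some entity, hiding the column's type behind an existential.
data SomeColumn entity = forall typ. (PersistField typ, KnownPersistField typ) =>
  SomeColumn (entity -> E.SqlExpr (E.Value typ))
```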
In order to produce a value of type `SomeColumn entity`, we must have a type `typ` of SQL expressions in mind (and instances for the required constraints on `typ`), along with a function of type `entity -> E.SqlExpr (E.Value typ)` which can pull the appropriate field out of our entity type. In this way, `SomeColumn` captures both the type of the column, and how it's read from the database.
The `KnownPersistField` constraint here is key. It provides a way to check the inferred type of a field against the type of filter provided by the user at runtime:
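(Sketched below; the constructor names and the exact set of column types are illustrative.)

```haskell
{-# LANGUAGE GADTs #-}

import Data.Text (Text)
import Data.Time (Day, UTCTime)

-- A value-level representative for each column type we care about.
data PersistFieldProxy typ where
  TextField   :: PersistFieldProxy Text
  IntField    :: PersistFieldProxy Int
  DoubleField :: PersistFieldProxy Double
  BoolField   :: PersistFieldProxy Bool
  DayField    :: PersistFieldProxy Day
  TimeField   :: PersistFieldProxy UTCTime

-- "We know the column type, and can provide a 'PersistFieldProxy'."
class KnownPersistField typ where
  persistFieldProxy :: PersistFieldProxy typ

instance KnownPersistField Text where persistFieldProxy = TextField
instance KnownPersistField Int  where persistFieldProxy = IntField
-- ...and so on for the remaining column types.
```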
`PersistFieldProxy` is a value-level representative for a column type. It reflects all of the possible types of database columns we care about, at the value level. The `KnownPersistField` class simply says that we know the column type, and can provide a `PersistFieldProxy`.
Unlike the type-level representation of column types, we can `case` on a `PersistFieldProxy` to implement different behavior for different column types.
Pattern matching on a `PersistFieldProxy` can also bring type information into scope. For example, we can check if a column type is a string, in which case we can bring a `SqlString` instance into scope. This allows us to use string queries in `esqueleto`:
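(A sketch built on the definitions above; `matchText` is a hypothetical helper, and `E.ilike` is esqueleto's PostgreSQL-only case-insensitive `LIKE`.)

```haskell
{-# LANGUAGE GADTs #-}
{-# LANGUAGE ScopedTypeVariables #-}

import Data.Text (Text)
import qualified Data.Text as T
import qualified Database.Esqueleto as E

-- If the column turns out to hold text, matching on 'TextField' refines
-- 'typ' to 'Text', so esqueleto's string operators (which need 'SqlString')
-- become available; otherwise we report no match.
matchText
  :: forall entity. Text
  -> SomeColumn entity
  -> Maybe (entity -> E.SqlExpr (E.Value Bool))
matchText needle (SomeColumn (field :: entity -> E.SqlExpr (E.Value typ))) =
  case (persistFieldProxy :: PersistFieldProxy typ) of
    TextField ->
      Just (\e -> field e `E.ilike` E.val (T.concat ["%", needle, "%"]))
    _ -> Nothing
```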
Combining the `SomeColumn` data type and the `KnownPersistField` class, we have enough information to turn runtime strings into typed expressions, and to verify the type of any associated data provided by the user.
This was quite a lot of work in order to recover functionality which we got “for free” from RethinkDB using dynamic types, but I think it is worth it. Now we are able to turn any untyped query from our API into a typed query, and we are able to inspect type information at runtime and act on it in various ways. Since we are able to catch type errors before they hit the database, our error messages are better, and we even catch some errors that the database driver would not report.
JSON Support
RethinkDB allowed us to store arbitrary JSON documents in columns, and perform quite expressive queries against fields in those documents. We were initially concerned that we might lose that functionality in moving to Postgres, and that we might have to store more data in columns.
However, we were pleasantly surprised to discover that the JSON support in Postgres was sufficiently advanced in order to implement all of the functionality we were using in RethinkDB.
We can pull nested data out of JSON documents and filter and sort based on those values. We can even join with an array of data stored in JSON format, and then aggregate those values in various ways.
We don't store many JSON documents in the database, but for some cases, it is definitely preferable to storing flattened data over many columns. To support these cases, we write custom SQL functions which are then imported into Haskell as typed functions on `esqueleto` expressions.
One downside is that Postgres does not support JSON schemas, so it is possible to get into invalid states over time. We rely on Haskell to type-check our data on the way in to the database, but we have to be careful that migrations don't invalidate existing data. We could look into using something like `postgres-json-schema` to solve this problem eventually.
Conclusion
The migration to Postgres was a success and it improved the performance, correctness, and maintainability of our code. There are still plenty of interesting ways we could improve things, and new Postgres features to explore.
It’s been an interesting challenge to recover all of the functionality and flexibility that we had with RethinkDB and ElasticSearch, but I hope I’ve shown that Postgres handled the job well, and that Haskell turned out to be crucial for such a large refactoring.
If you’re interested in using Haskell to solve real-world problems like these, then get in touch — we’re hiring at Lumi!