Replacing SQLite with Cell, part 1: Meet the programmable database
Despite the availability of a vast selection of different NoSQL data stores, relational databases remain one of the most popular options for persisting data, yet the impedance mismatch between SQL databases and application code, on top of the extra effort required to interface with any type of database, remains in many cases a problem, and that’s especially true when one has no need for all the features provided by a standalone DBMS, like server-side storage or multi-user support, but only needs a way to save data locally.
This series of articles will explore a different approach, one that makes use of an embeddable programming language, Cell, that has native support for the relational model. You can write your database schemas in Cell, along with all the associated logic (both queries and updates), and from that the Cell compiler will generate a corresponding C# or Java class that you can include in an existing project. In addition to providing all the functionalities you would expect from any embedded SQL database, like persistence, transactions and data integrity checks, Cell offers the following advantages:
- Usually there’s no need to do any object/relational mapping at all. The compiler will take care of all the necessary data conversions. With a bit of careful design, the classes generated by the Cell compiler can be almost as easy to use as hand-written ones.
- When you implement your database logic, you’re not limited in what you can do by the lack of expressiveness of a weird data sub-language like SQL, but you’ve a general-purpose programming language at your disposal that allows you to implement any sort of behavior.
- The generated code is very fast, offering performance similar to that of equivalent, hand-written, object-oriented C# or Java code.
- You’ve a much better data model at your disposal than the one provided by SQL: among other things, the columns of your relations/tables can be of any user-defined type, inheritance hierarchies can be modeled directly, and schemas can be defined in a modular way.
- The combination of general-purpose programming capabilities, speed and a very flexible type system allows Cell to naturally model datasets that you would not normally store in a relational database. It can also be an alternative to, for example, a graph database like Neo4j, or document stores, or just writing and reading directly from/to a JSON file.
Moreover, Cell is not just a toolking for creating embedded databases, but a fully-featured embeddable programming language that can be used to model and implement even the parts of an application that don’t need to be persisted, but which can benefit from all the features Cell provides that are not available in conventional languages, like its very high-level data model, transactions and the ability to replay the execution of a program and to easily inspect its state. Parts of an application that are not themselves persistent but which are tightly coupled to persistent ones are especially likely to benefit from being implemented in Cell, since that means you can use a common language and notation for all of them.
One intrinsic limitation of Cell though is that it can only be used to create in-memory databases: the entire data set is loaded in RAM at startup, and from that point on the disk is touched only to save changes to the data. For data sets that are too large to fit in memory, Cell is not an option at the moment.
The Northwind database in Cell
Throughout this series we’ll make use of the Northwind database, a sample database used by Microsoft to demonstrate the features of SQL Server and Microsoft Access. It contains sales data for Northwind Traders, a fictitious specialty foods import/export company. Here’s the database structure:
This is what the definition of the database schema looks like in Cell:
We won’t describe the syntax in detail here (that’s a topic for a future post in this series), but we will highlight the most important differences with SQL. As an example we’ll use the following subset of the Products table:
The first major difference is that while that’s not immediately clear from the syntax, the data that in SQL is stored in a single table is split into several different relations in Cell. The first one is the unary relation (that is, a table with a single column, which is just a set) product, whose only purpose is to store the identifiers of all products in the database:
As you can see, every product in a database is identified not by an integer or a string as in SQL, but by a value of a user-defined type, ProductId in this specific case:
Values of type ProductId are just tagged integers. If you’re not familiar with functional programming languages like Haskell, F# or Elm, think of it as more or less the equivalent of a struct/class with a single anonymous field of type long in C# or Java:
There’s a whole bunch of types like this one at the top of the above schema definition (lines 1–9). The use of user-defined types to identify entities in your domain is one of the things that make it possible to model inheritance hierarchies. Those types don’t carry any information about the corresponding entities, whose attributes are stored in separate relations:
What are the advantages of using per-attribute relations? For one thing, “atomic” relations are much easier to manipulate with a general purpose language. We’ll see how that works shortly. For another, “wide” SQL-style tables can only be used to store single-valued attributes: multi-valued ones (imagine, for example, a customer having several phone numbers or addresses) have to be stored in a separate table anyway. You also need a problematic feature like NULL for optional attributes. Atomic relations on the other hand can uniformly encode any combination of optional or mandatory, single- or multi-valued attributes, by just defining the appropriate keys and foreign keys.
When you write a schema like the above one you define what in Cell’s jargon is known as a relational automaton. For now, think of a relational automaton as a sort of large class that contains an entire relational database inside.
Compiling the database
Now that we’ve defined the database schema, we can try to compile it. That will generate the following C# class:
All there is at the moment is a couple of methods that save or load the state of the database to or from an object of type System.IO.Stream. Cell saves data in a custom, human readable text format. Think of it as a sort of JSON but with relations and algebraic data types. A future version will also add the ability to save actual JSON, for better interoperability, and a binary version of the custom text format, for efficiency.
You can see what the Northwind database looks like when saved to disk here. If you want to know more about Cell’s data model, you can find the official documentation here.
You can also generate a Java class if you wish (and support for more target languages will be added in the future), but here we’ll stick with the C# code generator, because at the moment the classes it produces have a nicer interface, thanks also to a number of C# features (like tuples, named tuples, partial classes and named and optional arguments) that have no equivalent in Java.
Querying the database
It’s now time to write some actual code. We’ll start with this simple SQL query:
In Cell, you need to write a method for the Northwind automaton defined above:
The first line in the body of the method (line 4) is a set comprehension expression: it iterates through all the employees identifiers in the employee relation and for each of them creates a 3-tuple containing the id, first name and last name of the employee. The expressions first_name(e) and last_name(e), where e is an employee id, retrieve the value of the corresponding attribute. The second line sorts the set by last name, the third field in the tuple.
We can now recompile the database. The generated C# class has now a new method, SortedEmployeesNames():
The signature of the new method is almost identical to that of the Cell method it derives from. The only difference is the type of the first field of the returned tuples. In order to move data between the Cell and C# parts of the code base the compiler needs to map each Cell type to a corresponding C# type. In some cases the mapping is trivial: that’s what happens in the above code with strings, tuples and arrays. For a tagged value like EmployeeId the compiler simply discards the tag (which is know at compile time and therefore carries no information) and returns the value wrapped inside.
We’ll now try something slightly more complicated: for each order we want to calculate the total amount we need to charge. This is our SQL:
and this is the Cell version:
The Cell version is longer, but it’s written in a more modular fashion. That will come in handy later. These are the corresponding methods in the generated C# class:
Note that orders_totals returns records instead of a tuples, which are by default mapped to named tuples in C#.
The following SQL query also calculates totals for each order, but it does so only for orders that were shipped between two given dates, and sorts the results by date of shipping:
This is the Cell implementation:
and this is the signature of the corresponding method in the generated C# class:
You can see how the Date type in Cell has been automatically mapped to System.DateTime in C#.
Next steps
That’s all for part 1. In part 2, we’ll see how to implement more complex queries that are difficult to express in SQL but easy to write in a general-purpose language, and how we can customize the interfaces of the generated classes.
All the code and examples for part 1 and 2 are available on github, for Windows or Linux/macOS. You can run all the queries shown in these posts, and you can also easily implement your own.