CouchDB and the data warehouse

Pairing CouchDB, as a primary data store, and a data warehouse for ad-hoc queries gives you the best of both worlds.

One of Apache CouchDB’s best use-cases is as a rock-solid, always-on operational datastore. It is built for fault-tolerance, storing multiple copies of your data on separate servers so that a CouchDB cluster can withstand the loss of multiple nodes without loss of service.

Let’s take the example of an online shop that uses CouchDB to store its transactions. At the birth of the business its sales database is empty, but as time progresses and as the business becomes more successful, the database grows to a healthy size — perhaps hundreds of thousands or millions or documents. The management of the company will be asking questions of the database:

  • How many sales did we make this week?
  • What are the top ten products sold?
  • What are the peak times for selling a particular type of product?
  • How successful was our “Halloween” marketing campaign?

Although CouchDB has built-in MapReduce which can provide simple aggregations of data against pre-defined keys (e.g. sales by year/month/day), it will eventually fall short of a business analyst’s expectations when faced with complex, ever changing queries or questions that relate the sales data to other data sets (such as marketing click-through data).

If you need to ask ad-hoc questions of your data without affecting the performance of your operational dataset, then you need a Data Warehouse.

CouchDB is your high-uptime operational data store, and a Data Warehouse is a query engine, which organises its data in a way that optimises for querying rather than uptime or data resilience. If you need to ensure that critical data is stored in multiple locations with a high availability (and a handful of fixed queries) you need CouchDB. If you need to run an ever-changing set of complex queries you also need a Data Warehouse.

The most common scenario is you need both, with CouchDB data being fed to a data warehouse periodically. In this post we’ll look at how we would write some code to copy and transform CouchDB data before writing it to relational database, allowing us to query using Structured Query Language (SQL).

What do the CouchDB documents look like?

In our example, we are storing one document per completed sale:

How an order might be modelled in CouchDB JSON.

Things to note:

  • This document contains everything we need to know about a sale. There may be further supplemental information about the user/product/payment in other databases, but fetching this document gives us enough information to render an “order summary” web page or email. This is good practice in a NoSQL database — in a database without joins, we don’t want to have to make several round-trips to the database to piece together all of the data we need.
  • A document is created when the payment is confirmed.
  • The document maybe updated later to indicate that the order has been dispatched and to back-fill the dispatchCourierRef.
  • The document contains an array of products in the basket field which store one or more line items purchased.

How can I generate some sample data?

Create a template.json file containing the outline of the document to create with placeholder tags where the data will be placed:

A template for creating sample data in CouchDB

Then using the datamaker command-line tool, create thousands of sample documents and pipe them into the couchimport tool to write the documents to CouchDB:

$ datamaker -t ./template.json -f json -i 10000 |  \
couchimport --db mydatabase --type jsonl

We can use a command of the above form to generate thousands, or millions of documents. Simply change the -i parameter to the number of documents you need.

How can I get CouchDB data into a Data Warehouse?

A quick way of getting a queryable view of a CouchDB database is to use the couchwarehouse command-line tool. Once installed, creating a warehouse is simple from the command-line:

$ couchwarehouse --url https://U:P@host.cloudant.com --db mystore

The utility will attempt to discover the schema of your data, create a local SQLite database with a database table that matches your documents’ schema and begin populating the table from the CouchDB changes feed.

couchwarehouse reads the CouchDB changes feed and writes the data to a SQLite database

In another terminal, you can then run sqlite3 and begin querying your data with SQL:

Querying a copy of your CouchDB data using SQL in SQLite

Having the data in a SQL database database allows you to have complete flexibility in your queries. You can:

  • Query sub-sets of data on any field.
  • Create aggregations, grouped by any combination of fields.
  • Join data to other data sets e.g. secondary “users”, “products”, or “postcodes” tables created in a similar way.
Aggregations are easy in SQL

As long as the couchwarehouse utility is running, new and modified data will continue to spool into your local copy of the database. If you restart couchwarehouse again later, it will continue from where it left off and soon have your data warehouse up-to-date.

The couchwarehouse tool has a few other tricks up its sleeve.

Transforming data before writing to the data warehouse

You can optionally supply a JavaScript function using the --transform parameter which is used to transform each JSON document as it is consumed from the CouchDB changes feed. This is useful for type coercion, patching missing fields or enforcing default values.

Let’s say we want to simplify the basket array and turn it into a single string containing the product names. We would write a transformation function and write it to a file transform.js:

A filter function that modifies each document between CouchDB and the data warehouse

Running couchwarehouse with a transform function is simple:

$ couchwarehouse --db mystore --transform ./transform.js

Splitting multiple document types into their own databases

Some CouchDB users store different document types in the same database e.g. customers, orders and products all together. Usually such documents have a “type” field which distinguishes each document’s type.

Supplying the name of the “type” field to the --split parameter instructs couchwarehouse to create multiple tables - one for each document type e.g.

$ couchwarehouse --db mystore --split type

This will create multiple tables in SQLite, one per document type:

  • mystore_user
  • mystore_product
  • mystore_order
In split mode, couchwarehouse writes each document type to their own table

What’s the catch?

  1. This tool is only suitable if all of the documents in a database are uniform. CouchDB allows you to vary the schema over time, so you can have some documents with fields that others don’t. The couchwarehouse tool assumes the first document it sees is representative of the rest.
  2. CouchDB databases can get “big”. You need enough space on your local machine to store all of your data.
  3. The couchwarehouse tool doesn’t do anything special with embedded arrays. It stores the arrays as JSON text in the SQLite database, although the JSON text can be accessed at query-time using the SQLite JSON extension e.g. SELECT json_extract(basket,'$[0].productId') FROM mystore.
  4. The bodies of conflicted documents are ignored. Only “winning” revisions make it to the warehouse.

The couchwarehouse makes a fair guess at moving CouchDB data to a SQL database without any custom work. For anything more complicated, or to store data in a different database, you’re going to need some custom code.

How could I write my own data warehouse?

Data Warehouses, such as IBM Db2 Warehouse on Cloud are based on traditional, table-based database engines. The data is arranged in tables with a fixed schema. To transfer data from CouchDB to a data warehouse we’re going to need to:

  • normalise the data — store data in different tables with references to data in other tables via keys.
  • create a schema which models all or part of the JSON object.

In this case I can see two tables being created:

  • sales - one row per completed sale
  • sales_basket - one row per item in the basket

The first table, sales, will have the following schema with one row per sale:

+------------------------------------------------------------------+
| field | type | couchDBPath |
|--------------------|---------|-----------------------------------|
| id | string | doc._id |
| customerEmail | string | doc.customerEmail |
| datetime | Date | doc.saleDate + ' ' + doc.SaleTime |
| paymentRef | string | doc.paymentRef |
| currency | string | doc.currency |
| total | float | doc.total |
| status | string | doc.status |
| dispatched | boolean | doc.dispatched |
| dispatchStreet | string | doc.address.street |
| dispatchTown | string | doc.address.town |
| dispatchZip | string | doc.address.zip |
| dispatchDate | date | doc.dispatchDate |
| dispatchCourierRef | string | doc.dispatchCourierRef |
+------------------------------------------------------------------+

The second table, sales_basket, with one row per line-item:

+--------------------------------------------------+
| field | type | couchDBPath |
|-----------|--------|-----------------------------|
| id | int | |
| salesId | string | doc._id |
| productId | string | doc.basket[].productId |
| name | string | doc.basket[].productName |
| variant | string | doc.basket[].productVariant |
+--------------------------------------------------+

The sales_basket.salesId field is a foreign key corresponding to the respective entry in the sales.id field. There is said to be a "one to many" relationship between sales and sales_basket. For each CouchDB document we'd have to make one write to the sales table and one or more writes to the sales_basket table (one per basket item).

After creating the target database tables manually to match the schema of your source data, we would need to write some software, similar to couchwarehouse, that spools data from CouchDB to the SQL database tables. The software would have to:

  • Listen to the CouchDB changes feed from the source database.
  • Transform the incoming documents into “INSERT INTO…” or “REPLACE INTO…” SQL statements that match the target tables.
  • Transform date/time fields in CouchDB into native date/time fields in the database.
  • Deal with changes that signify deleted documents ( deleted: true in the changes feed) would correspond to "DELETE FROM ..." SQL queries in the table or joined tables.
  • Ignore design documents — documents whose _id field starts with _design/.
  • Keep track of the latest sequence token that CouchDB supplies with every change. When resuming the changes feed, CouchDB can be instructed to resume “since” your last known sequence token.
  • Commit the INSERT/REPLACE/DELETE SQL statements to your database. Multiple writes that pertain to the same CouchDB document should be grouped into the same SQL “transaction”. For performance reasons, it is likely that you’d want to write multiple documents’ SQL statements together in bulk.
  • Manage schema changes — although the documents’ schemas in CouchDB can evolve over time, your data warehouse schema (and custom code) would have to be modified to cope with any changes.

It’s pretty tricky to get this right. What happens with null or missing fields? Can your code handle receiving data of the wrong data type? Can your code handle processing the same change again or in a different order (the CouchDB changes feed may provide changes before your supplied sequence token and makes no guarantee on the order you receive the changes)?. Can your code keep a reliable replica of the CouchDB data including foreign key relationships across multiple tables?

If you can overcome this complexity and manage to get a copy of your CouchDB data in a data warehouse then you can use it to explore the data for analytics, data science and reporting purposes without affecting the load on your primary CouchDB database.

I don’t have a magic wand that can perform these actions — it’s going to require custom code written with your application in mind. By all means, use the couchwarehouse source code as inspiration, but it’s going to require some custom work nonetheless.