CouchDB Writes: Piecemeal, Bulk, or Batch?

Which write API endpoint is the right write call for you?


There are many different deployment models for CouchDB-style databases, but thankfully CRUD operations work the same across all of them. Apache CouchDB™ is a database, specifically a JSON document store, with an HTTP API. IBM Cloudant is Apache CouchDB with a few extra bells and whistles run as-a-service in pay-as-you-go, dedicated, and local configurations.

In this article, I’m going to explain the various options for writing data using the CouchDB API, and I’ll look at the different endpoints and tradeoffs along the way. First, however, it will help to understand the basics of CouchDB as a distributed system, and what the database means when it says your writes are written.

CouchDB clusters 101

A CouchDB cluster is a distributed system that exposes a single API — you treat your CouchDB cluster as a single data store, but behind the scenes your database is divided into shards and multiple copies of your documents are stored on separate machines.

A 6-node CouchDB cluster.

The larger the number of nodes in the cluster, the greater the volume of data and the number of concurrent requests it can handle.

When you write data to a CouchDB cluster (or a Cloudant service), by default the write is acknowledged only after two or more copies of your document are persisted on disk (for example, on two or more machines in a 3-node cluster). Other database systems may give the thumbs up to your write requests before the data is written to disk as a speed optimisation, behaviour that risks data loss in the event of a node failure.

Now that you know the basics of what’s happening behind the scenes, I’ll cover the API calls that allow you to write data to CouchDB and the options you have that trade off storage guarantees and performance.

Piecemeal writes

Writing data to CouchDB is simply an HTTP POST request:

curl -v -X POST \
  -H 'Content-type: application/json' \
  -d '{"name": "Mittens", "type": "cat"}' \
  "$COUCH_URL/animals"

HTTP/1.1 201 Created
Cache-Control: must-revalidate
Content-Length: 95
Content-Type: application/json
Date: Fri, 02 Jun 2017 06:33:08 GMT
Location: http://localhost:5984/animals/

{"ok":true,"id":"7bff55e2a7f9fa3a999c1f76bd00044b","rev":"1-76558a77771fb4c1f81d4d91144dc83f"}

Here, I POST a JSON document to my database, and the reply indicates the auto-generated id of the document that was created. The "201" response code indicates success and guarantees that the document was stored on a quorum of servers in the cluster (at least two of the three shard copies).
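
For readers who call the API from code rather than curl, the same piecemeal write can be sketched in Python with only the standard library. This is an illustration, not an official client: the helper names `build_create_request` and `create_document` are made up for this example, and it assumes `couch_url` carries no embedded credentials (authentication is left out).

```python
import json
import urllib.request


def build_create_request(couch_url: str, db: str, doc: dict) -> urllib.request.Request:
    """Build (but do not send) a POST that creates one document in `db`."""
    body = json.dumps(doc).encode("utf-8")
    return urllib.request.Request(
        url=f"{couch_url.rstrip('/')}/{db}",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def create_document(couch_url: str, db: str, doc: dict) -> dict:
    """Send the request and return the parsed JSON reply.

    For a successful piecemeal write the reply has the shape shown above:
    {"ok": true, "id": "...", "rev": "..."}.
    urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    request = build_create_request(couch_url, db, doc)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

Calling `create_document("http://localhost:5984", "animals", {"name": "Mittens", "type": "cat"})` against a running server should return the id and rev of the new document.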

Here, I have a 6-node cluster, which means a database is sharded across all six nodes. The system also maintains three copies of each shard and ensures that each copy resides on a different physical machine. The cluster below shows what this write looks like when it has fully propagated throughout the cluster.

A basic write request for the new “Mittens” document. Note that not every node in the cluster receives the write: only the three copies of the shard that owns the document do.

This process is important for mission-critical data. It means that if the servers were abruptly powered off, your data would be safe on disk on multiple machines.
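
To make the arithmetic concrete, here is a toy model of the layout described above. It is a sketch under stated assumptions, not CouchDB's real placement algorithm: the round-robin rule, the choice of six shards, and the function names are all hypothetical, and the quorum is assumed to be a simple majority of the copies (two of three, as in this article).

```python
def place_shard_copies(nodes: list[str], shards: int, copies: int) -> dict[int, list[str]]:
    """Toy round-robin placement: every copy of a shard lands on a different node."""
    if copies > len(nodes):
        raise ValueError("need at least as many nodes as copies")
    return {
        shard: [nodes[(shard + offset) % len(nodes)] for offset in range(copies)]
        for shard in range(shards)
    }


def majority(copies: int) -> int:
    """Smallest number of copies that forms a majority (assumed quorum rule)."""
    return copies // 2 + 1


def write_acknowledged(copies_on_disk: int, copies: int = 3) -> bool:
    """True once enough shard copies have persisted the document."""
    return copies_on_disk >= majority(copies)


nodes = [f"node{i}" for i in range(1, 7)]
placement = place_shard_copies(nodes, shards=6, copies=3)
# The document's shard lives on three distinct machines;
# two of them must persist the write before it is acknowledged.
print(placement[0], majority(3))
```

In this model, losing any single machine still leaves two copies of every shard, which is why a quorum write survives an abrupt power-off of one node.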

Bulk writes

If you have lots of data to write to the database, then a single bulk API request is more efficient than making several individual API calls. It needs fewer HTTP round trips, and it is less work for the database cluster too:

curl -v -X POST \
  -H 'Content-type: application/json' \
  -d '{"docs": [{"name": "Snowy", "type": "cat"},{"name": "Patch", "type": "dog"}]}' \
  "$COUCH_URL/animals/_bulk_docs"

HTTP/1.1 201 Created
Cache-Control: must-revalidate
Content-Length: 192
Content-Type: application/json
Date: Fri, 02 Jun 2017 06:44:45 GMT

[{"ok":true,"id":"7bff55e2a7f9fa3a999c1f76bd001d39","rev":"1-263fbfee100b3417c513b14f4dacd776"},{"ok":true,"id":"7bff55e2a7f9fa3a999c1f76bd00202b","rev":"1-591fadc21c08df0ba8efa5c5912c1cfb"}]

A basic bulk write request. Two more new documents are added, this time together as an array.

In this case I supply an object containing an array of documents and, in reply, I receive an array of result objects, one per document. The body can contain inserts, updates, and deletes:

{
  "docs": [
    { "name": "Paws", "type": "cat" },
    { "_id": "7bff55e2a7f9fa3a999c1f76bd001d39", "_rev": "1-263fbfee100b3417c513b14f4dacd776", "name": "Snowie", "type": "cat" },
    { "_id": "7bff55e2a7f9fa3a999c1f76bd00202b", "_rev": "1-591fadc21c08df0ba8efa5c5912c1cfb", "_deleted": true }
  ]
}

Bulk write requests can contain a mixture of inserts, updates, and deletes. The bulk request adds Paws, updates “Snowy” to “Snowie”, and deletes Patch.
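
The three kinds of entry in a bulk body differ only in which fields they carry, which a short Python sketch can make explicit. The helper names are hypothetical. `split_results` assumes that a failed item in the reply lacks `"ok": true` and carries `error` and `reason` fields instead (for example, a conflict when the `_rev` is stale).

```python
import json


def new_doc(fields: dict) -> dict:
    """Insert: no _id or _rev, so the server generates an id."""
    return dict(fields)


def updated_doc(doc_id: str, rev: str, fields: dict) -> dict:
    """Update: the current _id and _rev plus the full new document body."""
    return {"_id": doc_id, "_rev": rev, **fields}


def deleted_doc(doc_id: str, rev: str) -> dict:
    """Delete: the current _id and _rev with _deleted set to true."""
    return {"_id": doc_id, "_rev": rev, "_deleted": True}


def bulk_body(docs: list[dict]) -> str:
    """Serialise the documents into the body expected by _bulk_docs."""
    return json.dumps({"docs": docs})


def split_results(results: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate per-document results into (succeeded, failed)."""
    succeeded = [r for r in results if r.get("ok")]
    failed = [r for r in results if not r.get("ok")]
    return succeeded, failed
```

Because the reply reports on each document separately, it is worth checking the `failed` list rather than relying on the HTTP status code alone.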

Is there a limit to how many documents should be posted in a single bulk request? There isn’t a limit per se, but 500 small documents is a reasonable rule of thumb.

Note: The pay-as-you-go Cloudant plans limit the size of POST requests to 1MB.
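
One way to apply both rules of thumb is to split a large set of documents into chunks before posting each chunk to _bulk_docs. The sketch below is illustrative only: `chunk_docs` is a made-up helper, the 1MB limit is assumed to mean 1,000,000 bytes of request body, and a single document larger than the limit is still emitted in a chunk of its own.

```python
import json
from typing import Iterable, Iterator


def chunk_docs(
    docs: Iterable[dict],
    max_docs: int = 500,
    max_bytes: int = 1_000_000,
) -> Iterator[list[dict]]:
    """Yield lists of documents whose {"docs": [...]} body respects both limits."""
    # Size of the empty wrapper as serialised by json.dumps.
    overhead = len(json.dumps({"docs": []}).encode("utf-8"))
    batch: list[dict] = []
    size = overhead
    for doc in docs:
        # +2 for the ", " separator; slightly overestimates, which is the safe side.
        doc_bytes = len(json.dumps(doc).encode("utf-8")) + 2
        if batch and (len(batch) >= max_docs or size + doc_bytes > max_bytes):
            yield batch
            batch, size = [], overhead
        batch.append(doc)
        size += doc_bytes
    if batch:
        yield batch
```

Each yielded list can then be wrapped as `{"docs": chunk}` and sent as one bulk request.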

Batch writes

In some circumstances, it is not possible to combine writes into fewer bulk requests. For example, if your application is running on a serverless platform such as OpenWhisk, then your code has no visibility into the other serverless actions that are performing similar requests concurrently. This is where batch mode may be of use. By supplying ?batch=ok to a single write request, you tell the server that it may buffer the document in memory before writing it to disk in batches:

curl -v -X POST \
  -H 'Content-type: application/json' \
  -d '{"name": "Tiddles", "type": "cat"}' \
  "$COUCH_URL/animals?batch=ok"

HTTP/1.1 202 Accepted
Cache-Control: must-revalidate
Content-Length: 52
Content-Type: application/json
Date: Fri, 02 Jun 2017 07:01:50 GMT

{"ok":true,"id":"7bff55e2a7f9fa3a999c1f76bd002cec"}

In this case, I get a “202” response, indicating that the document is accepted but not written to disk (yet). Note that the reply contains an id but no rev. This behaviour is faster and more efficient than a piecemeal write, but doesn’t provide any persistence guarantees. Batch mode should not be used for writing critical data to the database but may be useful for some applications.
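
In application code, the difference between the two modes comes down to one query parameter and one status code. The following sketch uses hypothetical helper names and maps only the two status codes discussed in this article. Anything else is reported as not stored, which is a simplification.

```python
from urllib.parse import urlencode


def write_url(couch_url: str, db: str, batch: bool = False) -> str:
    """URL for a single-document POST, optionally in batch mode."""
    url = f"{couch_url.rstrip('/')}/{db}"
    if batch:
        return f"{url}?{urlencode({'batch': 'ok'})}"
    return url


def persistence(status: int) -> str:
    """Describe what a write response code means, per this article."""
    if status == 201:
        return "stored on a quorum of shard copies"
    if status == 202:
        return "accepted, not yet written to disk"
    return "not stored"


print(write_url("http://localhost:5984", "animals", batch=True))
```

A caller that needs durability can treat anything other than 201 as a reason to retry or raise an alert, while a caller that only logs non-critical events may be content with 202.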

References

I hope this article helps you take better advantage of CouchDB and Cloudant. I learned the finer points of write behaviour as a user. If you want to read a more scholarly article on how CouchDB handles writes—from an engineer on the Cloudant team who is closer to database internals—then this blog by Mike Rhodes is a great place to start.
