Detaching Cloudant Attachments to Object Storage with Serverless Functions

How to build a pipeline to move Cloudant attachments to object storage using IBM Cloud Functions

Imagine we have an app that collects geo-coded photographs. We give the app to thousands of students and ask them to collect pictures of storefronts, together with the company name and data from the phone’s GPS. Using the latitude and longitude, the company name, and the photograph, we intend to crowd-source a business directory.

Our app is based on Offline First design principles. It stores its data locally on the mobile device, syncing to the cloud when there is a decent connection. This means our intrepid data collectors need not worry about mobile data charges or visiting areas without cellular coverage — they can still store their data on their smartphones and upload it later.

We can use PouchDB for web apps or the Cloudant Sync library for native mobile apps on the client side, and use the IBM Cloudant database service on the server side. This gives us the mobile-to-cloud replication of the business data, including storing the photographs as binary attachments.

Why detach attachments?

Storing binary attachments in a NoSQL document database is handy, but it’s not best practice in the long term. It’s a useful means of storing binary data, especially on the client side when there’s no network connectivity. Once the data reaches the cloud, however, object storage is the more natural choice: it provides an effectively limitless store of files and is much cheaper than a database per GB of binary data.

We can implement a best-of-both-worlds approach by storing attachments in the database initially, but detaching them later by moving them to object storage as data reaches the cloud.

How does it work?

We need to monitor our Cloudant database’s changes feed, and as each document is updated we need to move any attachments from the document to object storage. Initially, our database documents might look like this:

Here, attachments are referenced in the _attachments object. In this case we have a single attachment called storefront.jpg. After moving the attachment to object storage, our document looks like this:

There are several points to note:

  • The _rev field has changed because we have written a new version of the document.
  • The _attachments object is gone. Cloudant is no longer storing the attachment.
  • There is a new attachments object with almost identical data, except that it contains a reference to a Location (the URL describing where the attachment is stored in Object Storage) and a Key (a concatenation of the document id and the original filename).
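The before and after shapes described above can be sketched in a few lines of Node.js. This is only an illustration, not the detacher source: the helper name detachDocument, the sample field values, the example URL, and the hyphen used to join the document id and filename are all assumptions. The _rev is left alone here because Cloudant assigns the new revision when the document is saved.

```javascript
// Hypothetical helper: builds the "detached" shape of a document.
// `locations` maps each attachment filename to the URL reported by the object store.
function detachDocument(doc, locations) {
  // Separate _attachments from the rest of the document without mutating the input
  const { _attachments, ...rest } = doc;
  if (!_attachments) {
    return doc; // nothing to detach
  }
  const attachments = {};
  for (const [filename, meta] of Object.entries(_attachments)) {
    attachments[filename] = {
      ...meta,                            // keep the original metadata
      Location: locations[filename],      // where the file now lives
      Key: `${doc._id}-${filename}`       // document id + filename (separator is an assumption)
    };
  }
  return { ...rest, attachments };
}

// Made-up sample document with a single attachment
const before = {
  _id: 'shop-0001',
  _rev: '1-abc',
  name: 'Example Bakery',
  latitude: 51.5,
  longitude: -0.12,
  _attachments: {
    'storefront.jpg': { content_type: 'image/jpeg', length: 24601 }
  }
};

const after = detachDocument(before, {
  'storefront.jpg': 'https://mybucket.example.com/shop-0001-storefront.jpg'
});
console.log(JSON.stringify(after, null, 2));
```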

The code to do this runs on IBM Cloud Functions, IBM’s serverless platform which is based on Apache OpenWhisk. A tiny piece of Node.js code is deployed to IBM Cloud Functions and configured to run against every change that occurs in the Cloudant database.

Architecture diagram for the detacher service, centered around a serverless function based on OpenWhisk, running on the IBM Cloud Functions service.

In pseudocode, this is what it does:

Load the document by its _id
IF the document contains an _attachments key THEN
  FOR each attachment in _attachments
    write the attachment to object storage
  END FOR
  remove the document’s _attachments key
  replace it with a new attachments object
  save a new version of the document back to Cloudant
END IF
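The pseudocode above might translate into Node.js roughly as follows. This is a sketch rather than the actual detacher code: the four I/O helpers (getDocument, getAttachment, putObject, saveDocument) are hypothetical and passed in as an object, so the control flow can be read and exercised without a real Cloudant account or bucket. The key format is also an assumption.

```javascript
// Sketch of the detach step. `io` supplies four hypothetical async helpers:
//   getDocument(id)                   -> the Cloudant document
//   getAttachment(id, filename)       -> the attachment body as a Buffer
//   putObject(key, body, contentType) -> { Location } of the stored object
//   saveDocument(doc)                 -> writes the new revision
async function detach(id, io) {
  const doc = await io.getDocument(id);
  if (!doc._attachments) {
    return null; // no attachments, so nothing to do
  }

  const attachments = {};
  for (const [filename, meta] of Object.entries(doc._attachments)) {
    const key = `${id}-${filename}`; // assumed key format
    const body = await io.getAttachment(id, filename);
    const { Location } = await io.putObject(key, body, meta.content_type);
    attachments[filename] = { ...meta, Location, Key: key };
  }

  // Swap the Cloudant-managed _attachments for our own attachments object
  delete doc._attachments;
  doc.attachments = attachments;
  await io.saveDocument(doc);
  return doc;
}

// Usage with in-memory stand-ins for Cloudant and object storage
const objectStore = {};
const savedDocs = [];
const io = {
  getDocument: async (id) => ({
    _id: id,
    _rev: '1-abc',
    _attachments: { 'storefront.jpg': { content_type: 'image/jpeg', length: 3 } }
  }),
  getAttachment: async () => Buffer.from('jpg'),
  putObject: async (key, body) => {
    objectStore[key] = body;
    return { Location: `https://objects.example.com/${key}` };
  },
  saveDocument: async (doc) => { savedDocs.push(doc); }
};

const result = detach('shop-0001', io);
result.then((doc) => console.log(doc.attachments));
```

The IF guard does more than skip plain documents. Saving the detached document is itself a change, so the function is invoked again for its own write; on that second pass there is no _attachments key and the function exits without doing anything.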

How do I deploy this myself?

You’ll need a Cloudant account with a database in it, the IBM Cloud Functions command-line tool installed, and an object storage bucket on Amazon S3 or IBM Cloud Object Storage.

Simply clone my detacher source code from GitHub, set your service credentials as environment variables, and run my deploy script:

export CLOUDANT_HOST="myhost.cloudant.com"
export CLOUDANT_USERNAME="myusername"
export CLOUDANT_PASSWORD="mypassword"
export CLOUDANT_DATABASE="mydatabase"
export AWS_ACCESS_KEY_ID="ABC123"
export AWS_SECRET_ACCESS_KEY="XYZ987"
export AWS_BUCKET="mybucket"
export AWS_REGION="eu-west-2"
export AWS_ENDPOINT="https://s3.eu-west-2.amazonaws.com"
./deploy.sh
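A missing variable is an easy mistake to make with this many settings. As a sketch only (this check is not part of the detacher repository, and loadConfig is a hypothetical name), a small Node.js script could confirm that every variable listed above is set before you run the deploy script:

```javascript
// The environment variables listed above
const REQUIRED = [
  'CLOUDANT_HOST', 'CLOUDANT_USERNAME', 'CLOUDANT_PASSWORD', 'CLOUDANT_DATABASE',
  'AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY', 'AWS_BUCKET', 'AWS_REGION', 'AWS_ENDPOINT'
];

// Hypothetical helper: returns the settings, or throws naming every missing variable
function loadConfig(env) {
  const missing = REQUIRED.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }
  return Object.fromEntries(REQUIRED.map((name) => [name, env[name]]));
}

try {
  const config = loadConfig(process.env);
  console.log(`Configuration looks complete for database ${config.CLOUDANT_DATABASE}`);
} catch (err) {
  console.error(err.message);
}
```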

IBM’s object storage service supports a subset of the S3 API for easy migration. With a couple of simple steps, I can use the same code on IBM Cloud Object Storage that I developed for Amazon S3.

First, upon creation of your bucket in IBM’s object storage, pay special attention to the resiliency scope you select. Your credentials will eventually point you to a list of all possible API endpoints ("endpoints": "https://cos-service.bluemix.net/endpoints"), so it will be handy to know your endpoint straight away without digging through the UI for it. Second, when generating credentials for your bucket, you’ll need to pass in a flag to enable S3-style authentication: {"HMAC":true}. Here’s what that looks like:

Passing in the HMAC flag when generating your bucket’s credentials will create the “cos_hmac_keys” field containing the S3-style access key ID and secret access key you’ll need in order to run your S3 code against the IBM Cloud.
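To connect those credentials to the environment variables used earlier, here is a minimal sketch of the mapping. It assumes the “cos_hmac_keys” object holds access_key_id and secret_access_key fields (check your own credentials JSON), and the helper name toS3Env, the placeholder values, and the example endpoint are made up for illustration. The endpoint is something you pick yourself from the endpoints list, based on your bucket’s resiliency scope.

```javascript
// Hypothetical helper: maps IBM Cloud Object Storage credentials
// onto the S3-style settings used by the deploy script.
function toS3Env(credentials, bucket, endpoint) {
  const hmac = credentials.cos_hmac_keys;
  if (!hmac) {
    throw new Error('No cos_hmac_keys found: generate credentials with {"HMAC":true}');
  }
  return {
    AWS_ACCESS_KEY_ID: hmac.access_key_id,
    AWS_SECRET_ACCESS_KEY: hmac.secret_access_key,
    AWS_BUCKET: bucket,
    AWS_ENDPOINT: endpoint // chosen by you from the endpoints list
  };
}

// Placeholder credentials in the assumed shape
const credentials = {
  apikey: 'placeholder-api-key',
  cos_hmac_keys: {
    access_key_id: 'placeholder-access-key-id',
    secret_access_key: 'placeholder-secret-access-key'
  },
  endpoints: 'https://cos-service.bluemix.net/endpoints'
};

const env = toS3Env(credentials, 'mybucket', 'https://cos.example.com');
for (const [name, value] of Object.entries(env)) {
  console.log(`export ${name}="${value}"`);
}
```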

Detacher in action

Now every time you create a document with an attachment, the attached files are automatically moved to your object storage bucket in the blink of an eye.

An attachment is saved to a JSON document in Cloudant. Upon refresh, it has been moved to object storage, thanks to detacher listening to the Cloudant _changes feed.

Check out the source code for yourself, and start giving those document attachments the treatment they deserve!