Adding Cloudant search to your static Jekyll blog

Making a static HTML website have dynamic search

I’m a big fan of Jekyll for building static websites. If you’re not familiar with Jekyll, it takes a collection of configuration, templates and source files (I write my posts in Markdown) and transforms them into static HTML files that can be delivered to the world by any web server. Jekyll is built into GitHub Pages, so you can host the source files for your website or blog in a Git repository and have the resulting static website served by GitHub Pages without having to manage any server infrastructure yourself. As of May 2018, GitHub Pages also supports HTTPS on your custom domains.

Static sites are fast and easy to manage, but without any dynamic server-side components they may leave your users without features they expect, such as search. In a WordPress-style blog, the content is served from a MySQL database and site search is powered by querying that data set.

On a static site, how can you allow your users to search the titles, tags and content of your blog if the data doesn’t reside in a database and there’s no server-side layer that can render dynamic pages? Here’s how it can be done:

It’s as simple as 1,2,3 (4)
  1. Create a static website with Jekyll and serve it out on GitHub Pages, or another static site hosting service.
  2. Write some code to poll the site’s Atom feed. A serverless platform like IBM Cloud Functions can be used to run the code periodically.
  3. Write the Atom feed meta data into an IBM Cloudant database that has a free-text search index configured.
  4. Query the Cloudant database directly from the web page whenever a search is to be performed.

Let’s dive into the detail.

Building a blog with Jekyll

There are plenty of guides that show you how to build a Jekyll-powered blog on GitHub Pages, or you can follow the Jekyll documentation’s Quick Start Guide.

Once your blog is set up, make sure it has an Atom feed published at the /feed.xml endpoint. This is powered by the jekyll-feed plugin.

Schema design

In order to add a search tool to your static blog we’re first going to need a database of blog post meta data. Using Cloudant as the database, we can store one JSON document per blog post like this:

Blog post meta data as a JSON document

All of this data can be gleaned from the blog’s feed.xml Atom feed with two exceptions:

  • the _id field needs to be unique — we can use a hash of the URL of the blog post.
  • the _rev field is generated by the database and indicates the revision of the document.
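
As a rough illustration of such a document, here is a minimal Node.js sketch that builds one from a post's meta data. The field names (title, description, tags, url, date), the makeDoc helper and the choice of SHA-1 are assumptions made for this example; the text above only says that the _id is a hash of the post's URL.

```javascript
const crypto = require('crypto');

// Build one Cloudant document from a blog post's meta data.
// The field names are illustrative assumptions, not a confirmed schema.
function makeDoc(post) {
  // _rev is deliberately absent: Cloudant generates it when the document is saved.
  return {
    // A SHA-1 hex digest of the URL gives a stable, unique _id per post.
    _id: crypto.createHash('sha1').update(post.url).digest('hex'),
    title: post.title,
    description: post.description,
    tags: post.tags,
    url: post.url,
    date: post.date
  };
}

const doc = makeDoc({
  title: 'Red apples',
  description: 'Why red apples are the best apples.',
  tags: ['fruit', 'apples'],
  url: 'https://example.com/2018/05/01/red-apples.html',
  date: '2018-05-01T00:00:00Z'
});
console.log(JSON.stringify(doc, null, 2));
```

Because the _id is derived only from the URL, running this twice for the same post always yields the same document id.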

First sign up for a Cloudant service and log into the dashboard. Create a new database called blog.

Add a database

In that database we need to define a Cloudant Search index to answer free-text queries. Choose New Search Index from the menu next to “Design Documents”:

Create a search index

Then we can define an index by creating a JavaScript function that is executed for every document in the database, calling index for every value that is to be searchable:

Configure the index
JavaScript is used to define the fields that are to be indexed and which are to be stored in the index.

The index function takes three parameters:

  1. The name of the field to be stored in the index e.g. "title".
  2. The value to be indexed e.g. doc.title.
  3. An options object. When store is set to true, a copy of the value is stored unaltered in the index for retrieval at query-time. When index is set to false the value is not indexed for search, but is reproduced in the search results.
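
To make the three parameters concrete, here is a sketch of what an index function for our blog documents might look like. The field names are assumptions carried over from the document shape, and the small index() stub exists only so the sketch can run outside Cloudant; inside a design document Cloudant supplies index() itself and the function is usually written anonymously as function (doc) { ... }.

```javascript
// Stand-in for Cloudant's built-in index() so this sketch runs locally.
// It simply records each call; Cloudant's real index() feeds the search index.
const indexed = [];
function index(name, value, options) {
  indexed.push({ name, value, options });
}

// The index function: Cloudant calls it once for every document in the database.
function indexPost(doc) {
  if (doc.title) {
    index('title', doc.title, { store: true });
  }
  if (doc.description) {
    index('description', doc.description, { store: true });
  }
  if (Array.isArray(doc.tags)) {
    // Join the tags into one string so each tag becomes a searchable term.
    index('tags', doc.tags.join(' '), { store: true });
  }
  if (doc.url) {
    // Stored so it comes back with the results, but not searchable itself.
    index('url', doc.url, { store: true, index: false });
  }
}

indexPost({ title: 'Red apples', tags: ['fruit', 'apples'], url: 'https://example.com/red-apples' });
console.log(indexed.map((entry) => entry.name)); // the fields that were indexed
```

The guards around each call matter: design documents are also documents, and indexing an undefined value would make the function fail for them.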

CORS and effect

If we want to be able to query our Cloudant database directly from a web page, we need to make two further tweaks to the Cloudant configuration.

Firstly we must enable CORS (Cross-Origin Resource Sharing) in the Cloudant dashboard:

Enable CORS

Enabling CORS instructs Cloudant to output the HTTP headers that will allow an in-page web request (sometimes called an AJAX request) to proceed without an error. By default, the rules-of-the-road for the web wouldn’t allow a web page to fetch JSON from a different domain name, and CORS is the work-around.

Secondly, we need to make the database readable. You can either make the database world readable (grant _reader access to everyone) or create an API Key that grants _reader access to our database of blog post meta data. Both options are accessible from the "Permissions" panel in the Cloudant dashboard:

Make the database readable, or Generate an API Key that has read access.

Now that our database is created and set up, we need a script to poll the blog’s Atom feed, convert it to JSON and write it to the Cloudant database.

Atom feed poller

We can write a simple Node.js script to fetch the Atom feed using a handful of npm modules:

The code itself then becomes pretty simple:

Poller source code

The main function is passed an object with the following attributes:

  • BLOGURL - the URL of the blog's Atom feed.
  • ACCOUNT - the admin username of the Cloudant service.
  • PASSWORD - the admin password of the Cloudant service.
  • DBNAME - the name of the Cloudant database to write to.
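
The write half of such a poller can be sketched as follows. This is not the original poller source: it assumes the feed has already been parsed into plain objects (parsing the Atom XML is left to an npm module), that the entry field names match the document shape used earlier, and that the service is reachable at the classic https://ACCOUNT.cloudant.com host name. The buildBulkRequest helper is hypothetical.

```javascript
const crypto = require('crypto');

// Turn already-parsed feed entries into the HTTP request for Cloudant's
// _bulk_docs endpoint. Fetching and parsing the feed happen elsewhere.
function buildBulkRequest(params, entries) {
  const docs = entries.map((entry) => ({
    // Hash of the post URL, so the same post always maps to the same _id.
    _id: crypto.createHash('sha1').update(entry.url).digest('hex'),
    title: entry.title,
    description: entry.description,
    tags: entry.tags,
    url: entry.url,
    date: entry.date
  }));
  // HTTP Basic authentication built from the ACCOUNT and PASSWORD parameters.
  const credentials = Buffer.from(`${params.ACCOUNT}:${params.PASSWORD}`).toString('base64');
  return {
    // Assumption: the classic ACCOUNT.cloudant.com host name form.
    url: `https://${params.ACCOUNT}.cloudant.com/${encodeURIComponent(params.DBNAME)}/_bulk_docs`,
    options: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Basic ${credentials}`
      },
      body: JSON.stringify({ docs })
    }
  };
}
```

The returned url and options can be handed to any HTTP client, for example fetch(url, options), which keeps the request-building logic easy to test without a network connection.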

We can deploy this code to IBM Cloud Functions using the bx wsk tool (substituting your Cloudant account, password and blog URL):

Deployment from the command line

IBM Cloud Functions now has your polling code and is invoking it every 15 minutes. The script fetches your blog’s Atom feed, turns it into JSON ready to be inserted into the Cloudant database and then writes all the records in a single bulk request. It manages to deduplicate the listings because it uses a hash of the document’s URL as the document id: Cloudant won’t accept a second new document with an _id that already exists, so duplicates are rejected.
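
To see the deduplication at work, it helps to look at what a bulk write returns: one result per submitted document, where a rejected duplicate carries an error of "conflict" instead of ok. The sketch below tallies such a response. The summariseBulkResponse helper and the sample values are made up for illustration.

```javascript
// Count how many documents in a _bulk_docs response were newly written,
// how many were rejected as duplicates, and how many failed for other reasons.
function summariseBulkResponse(results) {
  const summary = { inserted: 0, duplicates: 0, failed: 0 };
  for (const result of results) {
    if (result.ok) {
      summary.inserted++;
    } else if (result.error === 'conflict') {
      // Same _id as an existing document: this post is already in the database.
      summary.duplicates++;
    } else {
      summary.failed++;
    }
  }
  return summary;
}

// Sample response with invented ids and revisions.
const sample = [
  { ok: true, id: 'aaa', rev: '1-abc' },
  { id: 'bbb', error: 'conflict', reason: 'Document update conflict.' }
];
console.log(summariseBulkResponse(sample));
```

On most polls every post is already stored, so a response made up entirely of conflicts is the normal case rather than a problem.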

Performing searches

Our database should contain some documents. Let’s see them by querying our database’s _all_docs endpoint:

Check that we have data.

You should see a handful of documents.

Now we can query the search index we created earlier:

Perform a search

The q=*:* query matches every indexed record. Each element of the returned rows array contains a fields object holding each item indexed with store: true during the indexing process.
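
For reference, a search request is just a GET to the index's _search endpoint with the query in the q parameter. The sketch below assembles such a URL. The account host name and the design document and index names (both "search" here) are placeholders, since they depend on what you chose when creating the index, and searchUrl is a hypothetical helper.

```javascript
// Build the URL for a Cloudant Search request.
// designDoc and indexName must match the names chosen in the dashboard.
function searchUrl(baseUrl, dbName, designDoc, indexName, query, limit) {
  // URLSearchParams takes care of URL-encoding the query string.
  const params = new URLSearchParams({ q: query, limit: String(limit) });
  return `${baseUrl}/${dbName}/_design/${designDoc}/_search/${indexName}?${params}`;
}

// Placeholder account, design document and index names.
const url = searchUrl('https://myaccount.cloudant.com', 'blog', 'search', 'search', '*:*', 5);
console.log(url);
```

The limit parameter caps the number of rows returned, which keeps responses small when the query matches everything.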

Imagine we wish to answer a user query for documents matching the search phrase “red apples”, then we can construct a Cloudant Search query to look for “red apples” in the description field:

A search for ‘red apples’ in the description field

A better search for this use-case is this:

A more nuanced search for ‘red apples’

The above query matches the title, tags or description fields against the query string, but attaches greater weight to title matches than to tags or description matches. This use of the ^ operator weights the search results to bring more relevant documents to the top of the results.
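
A weighted query of this kind can be generated from whatever the user typed. The following is a sketch under assumptions: the boost values (100 and 10) are picked for illustration, and the user's text is treated as a quoted phrase, which means the words must appear together rather than individually. The helper names are hypothetical.

```javascript
// Quote the user's text as a Lucene phrase, escaping backslashes and
// double quotes so they cannot break out of the phrase.
function quotePhrase(text) {
  return '"' + text.replace(/[\\"]/g, '\\$&') + '"';
}

// Search title, tags and description, boosting title and tag matches
// with the ^ operator. The boost values are illustrative assumptions.
function buildQuery(text) {
  const phrase = quotePhrase(text);
  return `title:${phrase}^100 OR tags:${phrase}^10 OR description:${phrase}`;
}

console.log(buildQuery('red apples'));
// title:"red apples"^100 OR tags:"red apples"^10 OR description:"red apples"
```

Escaping the input is worth the extra few lines: search boxes receive all sorts of punctuation, and an unbalanced quote would otherwise produce a query syntax error.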

We can send this query to Cloudant using curl:

Performing the second query from the command line

Querying from the front end

The final piece of the puzzle is making the search request from inside a web page. Here there are myriad options:

Let’s use fetch because it's new and shiny and it couldn't be easier:

Fetching search results from in-page JavaScript.
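
A minimal sketch of such a request might look like this. It assumes the database is world-readable, so no credentials are sent, and that searchEndpoint is the full URL of your index's _search endpoint. The search function and its optional fetchImpl parameter, which exists only to allow testing without a network, are hypothetical.

```javascript
// Run a search against a Cloudant Search endpoint and return the rows.
// fetchImpl defaults to the browser's fetch and can be swapped out in tests.
async function search(searchEndpoint, query, fetchImpl = fetch) {
  const url = `${searchEndpoint}?q=${encodeURIComponent(query)}&limit=10`;
  const response = await fetchImpl(url);
  if (!response.ok) {
    // Surface HTTP errors (bad query syntax, missing permissions and so on).
    throw new Error(`Search failed with HTTP status ${response.status}`);
  }
  const data = await response.json();
  return data.rows;
}
```

In a page it could be called as search(endpoint, 'title:jekyll').then((rows) => { ... }), with CORS enabled on the Cloudant side as described earlier.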

All that remains is to loop over the returned JSON’s rows attribute to pick out and render the data. How you do that depends on your front-end stack. I like the way Vue.js manages the plumbing between your JavaScript "model" and your HTML "view". There are a thousand other ways of building a dynamic page using other frameworks (React, Angular, Ember et al) or none:

Dead simple parsing of the return data.
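
For the "or none" route, here is a framework-free sketch that turns the rows into an HTML list. It is not the Vue.js approach mentioned above, and it assumes each row's fields object carries the url and title values stored by the index. Both helper names are hypothetical.

```javascript
// Escape text before placing it into HTML, so post titles containing
// characters like < or & cannot break the page.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Turn the rows array of a search response into an HTML list of links.
function renderResults(rows) {
  if (rows.length === 0) {
    return '<p>No results found.</p>';
  }
  const items = rows.map((row) => {
    const fields = row.fields;
    return `<li><a href="${escapeHtml(fields.url)}">${escapeHtml(fields.title)}</a></li>`;
  });
  return `<ul>${items.join('')}</ul>`;
}
```

The resulting string can be assigned to the innerHTML of a results container on the page.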

Example

The Cloudant blog is an example of this technology in action. It is a Jekyll-powered static website whose Atom feed is polled by an IBM Cloud Functions action and whose search facility is powered by a Cloudant database containing the posts’ meta data.