No Monkey Business Static Progressive Web Apps (part 2)

or; How and why Interactive Investor uses decoupled Drupal, Gatsby.js, ReactJS and AWS to deliver rich content without making Google cry

Elliot Ward
Investing in Tech

--

Introduction

In the previous article in this series, my colleague Dominic Fallows explained the drivers for adopting a Gatsby-driven architecture for our public site. In this article we will go into detail about our Drupal 8 configuration, and how Gatsby pulls the content written by our staff from our CMS.

Note that some of the code examples have been simplified to highlight the principles they illustrate.

{json:api} — the well-trodden path

Gatsby can extract content from other systems in many different formats. Source plugins for specifications such as GraphQL, JSON, Atom and many more are listed on www.gatsbyjs.org, as well as sources for specific CMSs such as Contenta and WordPress.

Indeed, a source plugin for Drupal itself is already available on Gatsby's own site. To use this plugin today, you must download and install the contributed JSON:API module in Drupal, although one of the goals of Drupal's API-First Strategic Initiative is to include the functionality of this module in Drupal core with the release of minor version 8.7.0 on May 1st 2019. This plugin seems like the easiest and safest way to begin interfacing Drupal with Gatsby.

We don’t use it.

Why we don’t use it

The quick answer is that JSON:API wasn’t stable when we began.

However, we’ve looked since and we’re unlikely to change to it. JSON:API is a specification, and an opinionated one at that. No configuration is required to output JSON:API from your site once you have turned the module on, but what you output is verbose and comprehensive. Far more data than we want is exposed, and JS developers using it to generate Gatsby pages would have to start learning Drupalisms to understand it. The JSON:API Extras module is available to rename fields and remove unwanted ones, but this would require a lot of additional configuration. Most importantly, we need to implement some logic to convert what we have in Drupal into a format in which it can be displayed. If we exported via JSON:API, that logic would have to be reimplemented everywhere we send the content.

Remember that we aren’t only sending our content to Gatsby; we also export to a content API service responsible for serving content too old to store in Gatsby.

As an example, consider our Article content type. It has a thumbnail field, for use when the article is displayed on a card in our news hub or on a curated list of articles. However, this field was only recently introduced, and thousands of older articles don't have it set. When this is the case, we want to fall back to the first image present in the article itself. If the article has no image at all, we could optionally fall back to a default based on tags, or to a site-wide default.
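
A minimal sketch of that fallback chain, as it might sit inside a node normalizer (the field_images field, the helper name and the default image path are illustrative assumptions, not our exact implementation):

/**
 * Work out which image URI to use as the article thumbnail.
 */
protected function getThumbnailUri(NodeInterface $node) {
  // Preferred: the dedicated thumbnail field, when an editor has set it.
  if (!$node->get('field_news_thumbnail')->isEmpty()) {
    return $node->get('field_news_thumbnail')->entity->getFileUri();
  }
  // Fallback: the first image attached to the article itself.
  if (!$node->get('field_images')->isEmpty()) {
    return $node->get('field_images')->entity->getFileUri();
  }
  // A tag-based default could slot in here before the final fallback.
  // Final fallback: a site-wide default image.
  return 'public://default-article-thumbnail.jpg';
}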

Drupal can export content without using any custom or contributed code through its core RESTful Web Services and Serialization modules, but the contributed REST UI module is needed to configure it through the GUI. Once you have configured your site to export content, the output might look a little like this:

{
  "nid": [
    {
      "value": 507910
    }
  ],
  "uuid": [
    {
      "value": "2235585d-a180-4fd6-9f6c-57b78743df1a"
    }
  ],
  "vid": [
    {
      "value": 796938
    }
  ],
  "langcode": [
    {
      "value": "en"
    }
  ],
  ...

Drupal’s REST capabilities let us customize this representation by writing normalizers: custom services that rewrite how Drupal represents content for REST export. By implementing such services, we can pass the Gatsby developers the minimum information required to render a page, implementing any fallback logic (like the thumbnail handling mentioned above), generating any image styles not yet present, and adding any custom caching required.

Normalizers are declared as services in your module.services.yml file, tagged with the name ‘normalizer’ and given a numeric priority.

services:
  serializer.normalizer.articlenormalizer.json:
    class: Drupal\onesite_rest_normalizer\Normalizer\ArticleNormalizer
    tags:
      - { name: normalizer, priority: 21 }

Normalizer classes specify which data they are capable of normalizing via their supportsNormalization() method. For example, for our JSON normalizer for our article content type, we might implement it as follows:

public function supportsNormalization($data, $format = NULL) {
  if ($format != 'json') {
    return FALSE;
  }
  if (!is_object($data)) {
    return FALSE;
  }
  if (!($data instanceof NodeInterface)) {
    return FALSE;
  }
  if ($data->bundle() != 'article') {
    return FALSE;
  }
  // Data is a node of type article, this class can normalize it.
  return TRUE;
}

Drupal will use the class with the highest priority whose supportsNormalization() method returns TRUE. So we can ask Drupal to normalize any data without having to know which class will actually implement the normalization.
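
In practice, calling code only ever talks to the serializer service. A minimal sketch of what that looks like (assuming you have a node loaded and are happy to fetch the service from the container):

// Drupal picks the winning normalizer by priority behind the scenes.
$serializer = \Drupal::service('serializer');

// The PHP array produced by whichever normalizer supports this node.
$data = $serializer->normalize($node, 'json');

// Or go straight to the encoded JSON string.
$json = $serializer->serialize($node, 'json');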

Normalization itself is done by a normalize() method, which in our case returns a PHP array that Drupal will convert to JSON for us.

public function normalize($object, $format = NULL, array $context = []) {
  $attributes = [];
  $attributes['author'] = $this->serializer->normalize($object->get('field_author_term'), $format, $context);
  $attributes['summary'] = $object->get('field_summary')->getString();
  $attributes['thumbnailUrl'] = $this->serializer->normalize($object->get('field_news_thumbnail'), $format, $context);
  return $attributes;
}

Note that for the author and thumbnailUrl fields, we call normalize again, passing in the fields. This allows us to compartmentalise the normalization for these elements into their own classes.
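
As an illustration, a field-level normalizer for the thumbnail might look roughly like this (a sketch: the field and image style names are taken from the examples in this article, but the checks and class structure are assumptions):

public function supportsNormalization($data, $format = NULL) {
  return $format == 'json'
    && $data instanceof FieldItemListInterface
    && $data->getName() == 'field_news_thumbnail';
}

public function normalize($object, $format = NULL, array $context = []) {
  if ($object->isEmpty()) {
    // The thumbnail fallback logic described earlier could be applied here.
    return '';
  }
  /** @var \Drupal\file\FileInterface $file */
  $file = $object->entity;
  // Return the URL of the image style derivative used on article cards.
  return ImageStyle::load('article_thumbnail')->buildUrl($file->getFileUri());
}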

The article normalizer, plus the relevant classes for the author and thumbnailUrl fields, would give us JSON output as follows:

{
  "author": {
    "id": "20",
    "name": "Alistair Strang",
    "path": {
      "alias": "/authors/alistair-strang",
      "source": "/taxonomy/term/20"
    },
    "bio": "<p>Alistair has led high-profile and top-secret software projects since the late 1970s, and won the original John Logie Baird Award for inventors and innovators. After the financial crash, he wanted to know “how it worked” with a view to mimicking existing trading formula and predict what was coming next. His results speak for themselves as he continually refines the methodology.</p>\n",
    "slug": "alistair-strang",
    "email": "alistair.strang@trendsandtargets.com",
    "twitter_handle": "TrendsTargets",
    "profile_image": ""
  },
  "summary": "It could have become a victim of Brexit, but our chartist sees reason for optimism at this lender.",
  "thumbnailUrl": "https://example.s3.eu-west-1.amazonaws.com/s3fs-public/styles/article_thumbnail/public/2019-03/rbs%20logo.jpg"
}

This is visible by adding ?_format=json to the URL of a node. Structuring our output in this way makes it obvious to a Gatsby developer how to turn it into a static HTML representation of the page without worrying about Drupalisms or the conventions of {json:api}.

Once Drupal can output a single entity via REST, you can also build a view to extract entities in bulk.

Make sure to allow JSON export in the View’s Format Settings:

Gatsby will then be able to call the REST export path to extract the data it needs to build the static HTML.

As we are extracting thousands of nodes, we also set up a mini-pager on this configuration so that Drupal can respond to each request within a reasonable amount of time. We initially defaulted to asking for 150 items per request but, after performance monitoring, tuned that down to 50.

Make sure you also expose the items per page and offset values — these allow Gatsby to iterate through the result pages and choose the number of items returned for each request by adding the items_per_page and offset numeric GET parameters to the REST export path. For example:

/rest/views/articles?items_per_page=50&offset=299

Note that the items_per_page value cannot be arbitrary. It must be one of the values specified in the “Exposed items per page options” configuration in the mini pager options.

The preview problem

Remember, Gatsby does not currently provide incremental builds. Every time content is created or changed, content editors will not see that change reflected until Gatsby has rebuilt all of the content at least once, and possibly twice if a prior build is already running. As we mentioned in our previous article, that comprises approximately 25,000 pages, so even with the improvements in Gatsby 2, our build time is measured in minutes.

This is quite a frustrating situation for content editors, who may only see once their edits are rendered by Gatsby that further changes are needed. Indeed, this is such a widespread issue for any decoupled site deploying to Gatsby that, following the incorporation of Gatsby Inc, their first hosted product was a preview service, currently with a waitlist.

Using a third-party hosted service for this is prohibited by our infosec requirements, so we needed another solution. We were already building a Gatsby application that could render Drupal content from its JSON representation. We extended that build process so it also outputs a Javascript application that renders our content, which we wrap in a Drupal module. This is then loaded in our node templates in the Drupal theme, with the JSON representation generated in a theme preprocess function and passed to the template as a theme variable for handing to the Javascript library. This gave editors an immediate preview of content, with support for previewing within Drupal before saving, as well as the ability to go back and view rendered versions of past revisions of content.
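
A rough sketch of the preprocess side of that wiring (the module name and template variable are hypothetical, and in reality attaching the Javascript library involves more than shown here):

/**
 * Implements hook_preprocess_node().
 */
function onesite_preview_preprocess_node(&$variables) {
  $node = $variables['node'];
  // Reuse the same normalizers that feed Gatsby, so the preview is built
  // from exactly the JSON representation Gatsby would receive.
  $data = \Drupal::service('serializer')->normalize($node, 'json');
  // Handed to the node template, which passes it to the bundled
  // Gatsby-derived Javascript renderer.
  $variables['preview_json'] = json_encode($data);
}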

A true preview of Gatsby rendered content within Drupal

To help our editors further, we added controls above the preview to resize to our responsive breakpoints.

This approach has been highly successful for our editorial team. However, from talking to people in the Drupal community who are using Gatsby, it doesn’t seem to be a widely used pattern. I very much recommend this solution to a very common problem!

Caching, cache tag collection, and the cache tag map

Our pages are made up of structured content spread across different data types within Drupal; a page could refer to files, media items, fielded taxonomy terms, other nodes, etc. These items can be saved independently and Drupal’s caching and rendering system will make sure cached HTML representations of pages dependent on these other data elements are invalidated at that time.

A relatively simple example of how pages on ii.co.uk are made up of structured independent elements within Drupal

We want to use Drupal’s powerful caching abilities with our JSON representations too. If we are building a paginated view response 150 items at a time, we don’t want to rebuild them all on every request, but it is important that when something used on multiple pages is changed, all pages on which it is used are updated without each having to be saved independently. Out of the box, views of REST-exported JSON data use time-based or URL-based cache strategies, neither of which is appropriate for this need.

A related problem is that as well as exporting content to Gatsby via REST, we also send content to a separate API when it is published or updated, so that it can be indexed for search and displayed even when it falls outside the range of Drupal content for which Gatsby generates static pages (for example, we only statically generate the most recent 8000 news articles). We need to make sure that when something used on multiple pages is changed, all pages on which it is used are updated to the content API as well.

In order to solve the first problem, we built a caching solution into our JSON normalizers at node level. The first thing such a normalizer does is check for a cached representation of the data it’s looking for:

public function normalize($object, $format = NULL, array $context = []) {
  $cid = $this->getCid($object);
  if ($cached = $this->cache->get($cid)) {
    return $cached->data;
  }
  // No cached version, proceed to normalize from scratch.
  ...

If we had no cached version, then we cache at the end of the normalization:

$this->cache->set($cid, $attributes, $this->getCacheExpireTime(), $cache_tags);

return $attributes;

This immediately gives us primitive node-level caching of our REST output.
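
The cache ID just needs to uniquely identify the JSON representation being built; a minimal sketch of getCid(), with the exact key structure being an assumption:

protected function getCid(NodeInterface $object) {
  // One cache entry per bundle, node and language of the JSON output.
  return 'onesite_rest_normalizer:' . $object->bundle() . ':'
    . $object->id() . ':' . $object->language()->getId();
}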

In order to extend that to correctly invalidate cache entries for nodes when independent data elements are saved, we take advantage of Drupal’s use of Symfony’s dependency injection to provide a custom service to our normalizers, the CacheTagCollector. This class is simple and provides two methods:

interface CacheTagCollectorInterface {

  /**
   * Merge the cache tags for the given cache ID to the collection.
   */
  public function mergeCacheTags($cid, $cache_tags);

  /**
   * Get the cache tags for a given node.
   */
  public function getCacheTags($cid);

}

This gives us the ability to pass this service and a cache ID through our chain of normalizers, merging in the cache tags representing dependent data elements as we go. At the end of the normalization process, we then ask the collection service for all the cache tags used when building the JSON representation so that we can add them when caching it.

$cache_tags = array_merge($object->getCacheTags(), $this->cacheTagCollector->getCacheTags($cid));
$this->cache->set($cid, $attributes, $this->getCacheExpireTime(), $cache_tags);

return $attributes;

This ensures that our cached JSON responses are properly invalidated when independent objects are updated. So should a legally important compliance statement be updated, we know that all cached pages containing that compliance statement will be invalidated, and the updated version included on all of them when Gatsby next extracts data from Drupal. Epic win!

Throughout the normalization process, normalizers pass cache tags to the collection service, so they can be set when the JSON representation is cached
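
For illustration, the merging inside a field-level normalizer might look something like this (a sketch: the injected collector service and the $context key used to carry the cache ID are assumptions):

public function normalize($object, $format = NULL, array $context = []) {
  /** @var \Drupal\taxonomy\TermInterface $term */
  $term = $object->entity;
  // Record that the JSON being built depends on this term, so that saving
  // the term later invalidates the cached representation of the node.
  $this->cacheTagCollector->mergeCacheTags($context['cid'], $term->getCacheTags());
  return [
    'id' => $term->id(),
    'name' => $term->getName(),
  ];
}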

Now that we have the cache tag collection service, it can help us solve our content API problem too. But for this, it is not enough just to collate the cache tags for all the elements included — we need to remember this information persistently, so we can query it at any point in the future.

Enter another simple service, the cache tag map:

interface CacheTagMapInterface {

  /**
   * Set the tags for a node.
   */
  public function setTags($nid, array $tags);

  /**
   * Get the node IDs for a tag.
   */
  public function getNids($tag);

}

Once we have generated a JSON representation, we can use this class to save the tags the cache tag collector built to persistent (non-volatile database) storage.
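
In the node normalizer that amounts to one extra call alongside the caching step shown earlier (a sketch, assuming the map service has been injected as $this->cacheTagMap):

$cache_tags = array_merge($object->getCacheTags(), $this->cacheTagCollector->getCacheTags($cid));
$this->cache->set($cid, $attributes, $this->getCacheExpireTime(), $cache_tags);
// Remember the dependencies persistently, so they can still be queried
// long after the volatile cache entry has been evicted.
$this->cacheTagMap->setTags($object->id(), $cache_tags);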

How the Cache Tag Collector’s output is passed on to the Cache Tag Map so dependencies can be looked up

Then whenever a piece of content is updated, we can pass its cache tag to the cache tag map service to look up the content that depends on it, and resend that content to the content API too. Epic win number 2!
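
A minimal sketch of that lookup, wired into an entity update hook (the hook, service and queue names are all hypothetical; as noted below, our real implementation sends the updates via a custom queue worker plugin):

/**
 * Implements hook_entity_update().
 */
function onesite_content_api_entity_update(EntityInterface $entity) {
  $map = \Drupal::service('onesite_content_api.cache_tag_map');
  $queue = \Drupal::queue('onesite_content_api_resend');
  foreach ($entity->getCacheTags() as $tag) {
    // Every node whose JSON representation used this entity needs to be
    // resent to the content API.
    foreach ($map->getNids($tag) as $nid) {
      $queue->createItem($nid);
    }
  }
}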

Note that because this could potentially require thousands of updates (maybe we’re updating the name of a particularly prolific author), we use a custom queue worker plugin to send the updates, but that’s beyond the scope of this blog post.

This is the second post in our series: No Monkey Business Static Progressive Web Apps or; How and why Interactive Investor uses decoupled Drupal, Gatsby.js, ReactJS and AWS to deliver rich content without making Google cry

You can find our public site (free research, news and analysis, discussion and product marketing site) at https://www.ii.co.uk
