Unspaghettiing FT.com’s Content Pipeline

Kara Brightwell
FT Product & Technology
13 min read · Dec 2, 2022
a bowl filled with spaghetti and newspaper. the image is AI-generated, and therefore slightly surreal-looking.
a Stable Diffusion-generated image for the prompt “bowl filled with spaghetti and newspaper”

Disclaimer: this blog post describes how we’ve rationalised and improved our systems for article content. These systems were built by very skilled engineers making the best choices they could at the time with the resources available to them. This post is not a criticism of those engineers.

In June this year, we started a project to rearchitect how article content works for article pages on FT.com and the Financial Times mobile apps. In this post, I’ll go into the details of what the problems were, how we decided what to do, and where we’ve ended up.

Would you like pesto, or ragù with that?

In the years since the “Next FT” site rebuild, the content pipeline had grown organically, and was starting to look more like spaghetti. Developers tasked with building new article content features had no way of making informed decisions about how, or where, to implement them. Meanwhile, our product landscape had grown more complex, with FT.com engineers working much more closely with the apps team, and this year’s launch of the Edit app.

The pipeline was fragmented across several different repos, systems, parts of systems, and libraries; all too often, the only way to build new features was to shoehorn new code into somewhere it sort of made sense. HTML transforms and data requests were happening in several different places for one piece of content. On top of that, very little of the original decision-making that went into the pipeline had been written down for posterity.

This problem was highlighted as one of the two main priorities in the FT Customer Products tech strategy for 2022, important enough to “form a small team who between them have deep understanding of article pages, stream pages, the home page, and how FT.com and the mobile apps fit together generally”.

We started out with a goal of learning how our current pipeline was structured, proposing a new architecture, and building an MVP of our new solution, and we had six months to do it.

Figuring out the mess we’re currently in

We started by drawing architecture diagrams of the systems and libraries that made up the current content pipeline. Using the C4 Model, we drew nearly a dozen diagrams at various levels of detail, helping us understand the big picture of how data flowed through these systems.

an architecture diagram, with a highlighted box in the centre labelled “next-article”, and arrows leading to and from other boxes for systems involved in this section of the content pipeline. there are 7 systems in total, connected in complex ways.
One of the many diagrams we drew for the content pipeline, this one showing how article content for article pages on FT.com is retrieved and transformed.

To put the rest of this post in context, here’s a brief technical overview of the existing system (and yes, this is the brief version):

  • An editor would create and publish an article in Spark, the CMS.
  • Spark would post the article content to the Content API, with the body formatted as abstract XML.
  • A service watched for new and updated content from the Content API, transformed the article body XML into HTML (along with other kinds of transformations to the article metadata), and stored the result in an ElasticSearch database.
  • When a user requested an article page on FT.com, the article service would fetch the content from ElasticSearch, perform further transformations on the body HTML (some of which lived in the article microservice codebase, and some in a separate library), and return it to the user.
  • Alongside this, the mobile app API fetched article content from ElasticSearch and performed its own transforms on the HTML (with several older versions of the transforms still supported in the codebase, for older versions of the mobile app). This API is written in PHP, unlike the FT.com microservices and libraries, which are written in Node.js.

At every stage of this, body content was stored as strings of markup that were parsed and re-serialised, upstream data sources were responsible for generating the HTML that would eventually be rendered by a user-facing service, and other services were called along the way to fetch additional data required for rendering some components.
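
The shape of the problem, as a deliberately simplified sketch (using jsdom as a stand-in for the real parsing; the actual services differ in detail, and the app API isn’t even Node):

import { JSDOM } from 'jsdom';

// Ingestion: parse the upstream body, transform it, and serialise it
// straight back to a string for storage in ElasticSearch
function ingestStage(bodyXml: string): string {
  const dom = new JSDOM(bodyXml);
  // ...transform elements into FT.com-flavoured HTML here...
  return dom.serialize();
}

// Rendering: re-parse the stored string, transform it again, and
// serialise it again, for downstream consumers to re-parse yet again
function renderStage(storedHtml: string): string {
  const dom = new JSDOM(storedHtml);
  // ...replace placeholder markup with freshly fetched data here...
  return dom.serialize();
}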

Overcoming blank page syndrome

Once we’d come to a good understanding of how everything fit together, we had some idea of how we might simplify the architecture. We knew we’d want a single codebase encapsulating everything that happened to article content within FT.com and the apps, potentially a monorepo grouping several related services and packages. And we knew we needed a single API flexible enough to support the differing use cases of our multiple downstream consumers.

But that still left us with the problem of getting started: making the architectural decisions we’d need to make before even writing any code.

Until this point, we’d been working almost entirely remotely and asynchronously. This had worked extremely well for the kind of discovery and documentation work we’d been doing so far, but we needed to get moving. So we planned a mobbing day, with the goal of making some initial decisions and getting the whole team on the same page and moving in a direction, not even necessarily the right one.

a tall white woman with long brown hair wearing a sleeveless top and jeans, standing in front of several whiteboards containing architecture diagrams and pseudocode. a white man sitting down hides from the camera.
They call me Kara “Three Whiteboards” Brightwell (they don’t)

We started by going through the architecture diagrams and documentation we’d produced so far, making sure everybody understood the problem space and had the right context. This led us onto what I’d labelled “whiteboarding (unstructured)” on the agenda: writing down anything that came to mind, with a big red label in the top corner saying “NO STUPID QUESTIONS. THROWAWAY DIAGRAMS ONLY”. For these kinds of sessions, getting everybody to contribute is often difficult, and perfectionism can get in the way of people speaking up. We didn’t need perfect; we needed literally anything.

NO STUPID QUESTIONS. THROWAWAY DIAGRAMS ONLY

By lunchtime we’d coalesced around the idea of a single GraphQL API. This neatly fulfilled our requirement of flexibility for different clients: because everything returned by the API had to be specifically requested in the query, different consumers could query for different content, and we’d return only what each of them needed.
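
As an illustration (with field names invented for this example, not our real schema), the article page and the app API can send the same endpoint different queries, and each gets back only the fields it asked for:

query ArticlePage {
  content(id: "146da558-4dee-11e3-8fa5-00144feabdc0") {
    title
    byline # only FT.com asks for the byline
  }
}

query AppArticle {
  content(id: "146da558-4dee-11e3-8fa5-00144feabdc0") {
    title
    topper # only the apps ask for the topper
  }
}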

We also took the decision to write in TypeScript. Although most of FT.com is written in plain JavaScript (and the API that sits behind the mobile app is PHP), we knew that for a project like this, with large and complex data structures, having confidence that we were handling the data correctly would be very important.

In the afternoon, we created a “playground” repository where we could start writing code without the mental overhead of production-readiness, or of being anything close to real-life use cases. This let us quickly prototype an Apollo GraphQL API using the FT Internal Content API as a data source, and play around with various ideas for structuring the data and the article body.
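
A minimal sketch of that first prototype, assuming Apollo Server’s standalone API, a placeholder schema, and a hypothetical Internal Content API URL (the real wiring was more involved):

import { ApolloServer } from '@apollo/server';
import { startStandaloneServer } from '@apollo/server/standalone';

// Placeholder schema: the real one models full article content
const typeDefs = `#graphql
  type Content {
    id: ID!
    title: String
  }

  type Query {
    content(id: ID!): Content
  }
`;

const resolvers = {
  Query: {
    // Hypothetical resolver: fetch the article from the Internal Content API
    content: async (_parent: unknown, args: { id: string }) => {
      const response = await fetch(`${process.env.CONTENT_API_URL}/content/${args.id}`);
      return response.json();
    },
  },
};

const server = new ApolloServer({ typeDefs, resolvers });
const { url } = await startStandaloneServer(server, { listen: { port: 4000 } });
console.log(`Playground running at ${url}`);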

Having these foundational decisions out of the way, even with temporary answers, meant we could move very quickly into writing “real” code, and our playground repository graduated to become a monorepo containing the API and a barebones Node.js app for testing rendering in the context of something that looked more or less like an FT.com article page. We were ready to start building things in earnest and validating our choices.

More like lasagna, actually

Because every stage of the existing system was responsible for both fetching data and generating and transforming markup, it was still very difficult, even once we’d mapped out the systems involved and how they connected together, to understand what was responsible for building up the final rendering of any particular article feature.

For an example of the convoluted ways we were building article features, let’s have a look at the “recommended article” component:

a screenshot of a “recommended article” component on FT.com, linking to a “News in-depth: Pensions crisis” article titled “UK regulators call for action on hidden leverage threat to pension funds”
This is just a screenshot, I’m not recommending you this article
  1. In the upstream Internal Content API, this is modelled as an <ft-content url="https://..."> XML element that references the API URL of the article, contained in a <recommended> element.
  2. Our service that ingests content into ElasticSearch (next-es-interface) transforms these elements into an <aside class="n-content-recommended">, containing a link inside an unordered list (this is the markup that used to be rendered on FT.com, before 2017). This is what’s stored in ElasticSearch.
  3. Our service that renders article pages (next-article) then finds this n-content-recommended markup, fetches the article it references from ElasticSearch, and uses that data to render a “teaser” component, completely replacing the markup that was present in ElasticSearch.
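
To make those layers concrete, here’s roughly what each stage produced for this one component (markup reconstructed for illustration; the exact attributes differed):

<!-- 1. Internal Content API: an XML reference to the recommended article -->
<recommended>
  <ft-content url="https://api.ft.com/content/146da558-4dee-11e3-8fa5-00144feabdc0"></ft-content>
</recommended>

<!-- 2. next-es-interface: pre-2017 FT.com markup, stored in ElasticSearch -->
<aside class="n-content-recommended">
  <ul>
    <li><a href="https://www.ft.com/content/146da558-4dee-11e3-8fa5-00144feabdc0">UK regulators call for action on hidden leverage threat to pension funds</a></li>
  </ul>
</aside>

<!-- 3. next-article: discards the markup above, re-fetches the referenced
     article from ElasticSearch, and renders a teaser component instead -->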

For just this one component, the data source was generating old markup only for the user-facing service to throw it away, fetch some data, and render something else. And many other components were doing similar things; there were different versions of markup for the same components fossilised across the entire pipeline.

We’d been trying to simplify things by understanding how components were put together, but that proved impossible. So we took a step back, and realised: most components had these intermingled phases of fetching data and rendering markup, spread across multiple systems and codebases. That was what made things complex and difficult to maintain. The big picture looked like spaghetti, but up close, the same steps repeated in layer after layer throughout the pipeline were starting to resemble lasagna.

Separation of concerns

So, let’s make these phases explicit, and split them out. With a clean separation between these concerns, we can get rid of the legacy of fossilised older versions of the markup and data, and rebuild components from scratch in a much more maintainable way.

For the recommended article example, our API would be responsible for fetching the data about the linked article and returning it in the same response, so we needed some way of including additional data for a component that was compatible with GraphQL.

To implement this, one hurdle we needed to overcome was keeping data in a structured form as long as possible before rendering it to HTML, so we could easily fetch new data and associate that with components in the body. So, early on, we made the decision to represent an article body in our API not as markup, but as structured data: as an Abstract Syntax Tree (AST).

This meant stepping away from the paradigm of transforming the underlying XML into HTML and then performing further transforms on that HTML; instead, we’d parse the XML into an AST, keep that represented as structured data while we fetch additional data needed for rendering components, and return it as structured data from the API.
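
As a sketch (with node and field names assumed for illustration; the real schema is richer), the body becomes a tree of typed nodes rather than a string:

// Illustrative AST node types, not the real pipeline’s schema
type BodyNode =
  | { type: 'body'; children: BodyNode[] }
  | { type: 'paragraph'; children: BodyNode[] }
  | { type: 'text'; value: string }
  | { type: 'recommended'; id: string };

// The <recommended> XML from the example above parses into a node we can
// attach freshly fetched teaser data to, before any HTML exists at all:
const recommendedNode: BodyNode = {
  type: 'recommended',
  id: '146da558-4dee-11e3-8fa5-00144feabdc0',
};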

Then, we could build a library of UI components for rendering this AST into HTML to be presented to users. We built the components in JSX, and the library provides a mechanism for replacing any of the components with a custom implementation. This lets downstream services with different rendering requirements customise the rendering without having to transform HTML (or requiring our API to transform HTML for them, and therefore making it aware of every downstream product requirement from now until forever).
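
A sketch of what that replacement mechanism might look like (component and function names assumed, and error handling omitted):

import React, { type ComponentType } from 'react';

type Components = Record<string, ComponentType<any>>;

// Default JSX components for each AST node type
const defaultComponents: Components = {
  paragraph: ({ children }) => <p>{children}</p>,
  text: ({ value }) => <>{value}</>,
  recommended: ({ teaser }) => <aside>{teaser.title}</aside>,
};

// Render an AST node, letting consumers replace any default component
export function renderNode(node: any, overrides: Components = {}): React.ReactElement {
  const components = { ...defaultComponents, ...overrides };
  const Component = components[node.type];
  const children = (node.children ?? []).map((child: any, i: number) => (
    <React.Fragment key={i}>{renderNode(child, overrides)}</React.Fragment>
  ));
  return <Component {...node}>{children}</Component>;
}

// e.g. the apps’ API could swap in its own recommended-article component:
// renderNode(body, { recommended: AppRecommended });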

Trees and references

Since our API uses GraphQL, fetching the additional data for components is driven by clients querying for certain fields, which the GraphQL server maps to “resolver” functions to call. To fetch the data for something like the teaser for a recommended article, a client would have to query that field within our new structured body:

query {
  content(id: "146da558-4dee-11e3-8fa5-00144feabdc0") {
    body {
      recommended {
        teaser {
          title
        }
      }
    }
  }
}

Something like that query structure would work if the body were a flat array or object. But, it’s not! It’s an Abstract Syntax Tree. It’s an arbitrarily nested object structure, and you can’t query within that kind of structure with GraphQL.

At this point we were starting to reconsider our early decision to go with GraphQL; although it fit many of our other requirements, if our data wasn’t easily representable in that paradigm, we’d need to rethink. We’d known we’d have to revisit some decisions made in the early phases, when we were experimenting just to get ourselves moving. But with GraphQL, our early productivity and our conversations with other teams had lulled us into a false sense of security. Being able to represent an article body was make-or-break.

We experimented with representing the AST as a flat list of top-level article components, which made the component data queriable in GraphQL, since an array of objects can be queried with the same syntax as a single object. This approach felt like it was going to work, but as we started building the more complex components supported in FT articles, such as multi-column layouts, we ran into the same issue: content we might need to query additional data for, such as links or images, could be nested arbitrarily deep within other components.

After some research into how other, similar projects used GraphQL for queriable content within a nested tree, we hit upon the solution of splitting apart the content into a nested, opaque, unqueriable tree, and a flat array of queriable “references”. For any content in the AST that we knew could be queried for additional data, we could add an attribute to that node pointing to a position in the references array, making our queries look something like:

query {
  content(id: "146da558-4dee-11e3-8fa5-00144feabdc0") {
    body {
      tree
      references {
        ... on Recommended {
          teaser {
            title
          }
        }
      }
    }
  }
}
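
The response then splits the body into an opaque tree, whose nodes point at positions in the queried references array, something like this (shape and attribute names illustrative):

{
  "body": {
    "tree": {
      "type": "body",
      "children": [
        { "type": "paragraph", "children": [{ "type": "text", "value": "…" }] },
        { "type": "recommended", "reference": 0 }
      ]
    },
    "references": [
      {
        "__typename": "Recommended",
        "teaser": { "title": "UK regulators call for action on hidden leverage threat to pension funds" }
      }
    ]
  }
}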

This data structure worked very well, and fit nicely with our UI component library, which could grab the data from the array when rendering a component. Having this last piece of the puzzle in place gave us a lot more confidence about the decision to use GraphQL.

Fixing things further upstream

The Financial Times in-house CMS (Spark) and the Content and Metadata APIs that sit upstream of the FT.com content pipeline are maintained by two separate teams outside of our parent team (Customer Products), spread between the London and Sofia offices. So, thanks to Conway’s Law, these systems have historically treated each other as black boxes, and in many cases had different terminology and data structures for what should have been the same data comprising an article.

While we could have done the work within Customer Products to clean up our own APIs and call it a day, we knew we had a unique opportunity to work together with the upstream teams and get the data fixed all the way to the source. We’d put together a team of fantastic engineers and spent months working to understand the current state of the data, and it would be a waste to keep that knowledge and context within our very temporary team.

Historically, if the data in the upstream systems wasn’t quite right for our needs in Customer Products, we’d just transform the data within our own systems, leading to further complexity in our data and making it much harder to understand the provenance of the data between systems. So, we’ve been working closely with those teams to unwind some of that historical data munging and move it upstream.

Work is also underway to move the generation of the body content AST into Spark, which already represents the content as structured data (it uses ProseMirror). Its AST is then serialised to XML when it’s published to the Content API. So rather than throwing away information about the article when publishing it and having to recreate that when rendering it to users, let’s keep the body represented the same all the way through. This will also have the nice side effect of making the article preview in Spark much closer to how an article actually appears to users, since it’ll be able to use our UI components to render the same data.

But won’t we run into the same problem in another six years?

The main problem we faced before, and at the start of, the project was a lack of point-in-time documentation of the decisions that went into how the original pipeline was built. That meant we (and any developer over the last six years trying to build a new article feature) had to recreate that context from scratch.

To make sure this (hopefully!) doesn’t happen again, we’ve been thoroughly documenting our decisions in Technical Decision documents.

a truncated screenshot of a technical decision document for the decision to use GraphQL

Over the last six months, we’ve written almost 30 documents like this, for decisions ranging from ones as large as adopting GraphQL to ones as small as how we pass a CLI flag to make Jest work correctly with ECMAScript modules.
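
(For the curious: Jest’s ESM support currently hangs off Node’s experimental VM modules flag, so that decision presumably amounts to something along the lines of:)

NODE_OPTIONS=--experimental-vm-modules npx jest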

These documents set out the scope, goals, and non-goals of a decision, and explore the alternatives. Their initial purpose is to reach consensus within the team, but they also serve a secondary one: being a long-term record of the context and decision-making that went into this project.

Decision documents aren’t everything; they’re only meant as a snapshot to understand the background of our new system. Now that the whole pipeline is in a single repository, it’s much easier for us to document it holistically. We’ve created new architecture diagrams, thorough READMEs for every new system and library, and documentation of the high-level concepts involved and how to build new features in a consistent way.

Architecture diagram for “cp-content-pipeline”
Look I know this looks more complex than the other diagram from the old system but trust me that was one of eight diagrams and they were all as bad or worse

What’s next

This was a temporary team, and we’ve more than met the goals we originally set ourselves. And of course, there’s still more to do. The new pipeline is available to test on FT.com behind a feature flag, and we’ve got long-running branches for testing it on the mobile apps. We’re still a long way from the new pipeline rendering articles in production for every user.

Some of the team will be continuing this work next year, as part of the Customer Products Platforms team, which owns shared services and developer tooling for teams across Customer Products. We’ll be trying to keep the momentum and the knowledge from the team going. And like all the best projects, once we’re done and everything’s switched over, if we’ve done it well, nobody will notice.

Thanks to Anna Shipman, Arjun Gadhia, Glynn Philips, and Nick Ramsbottom for reviewing this post.
