The Problem with HTML in React and GraphQL
In case you did not know, the New York Times is re-platforming.
Behind most dynamic websites on the internet is a data store. It doesn’t matter what kind of data store. If that data store contains the content of an article or a blog post, chances are that data contains a field or several whose value is a string and/or HTML. Not always. There are text-only posts, frameworks that insert dynamic content into text, frameworks that insert placeholders that become HTML; but more than likely, when rendering content by an author who has entered their storytelling into a beautiful <textarea>, you have to be ready to output HTML.
This used to be easy. Using PHP, it’s echo $content and ship it. When using JavaScript and React/GraphQL, we are presented with new challenges.
GraphQL resolves fields specified by a query. Without much thought, one of those fields might be called content: String. This content can be just plain text, but what if it’s not? To be “safe”, we often do this:
<div dangerouslySetInnerHTML={{ __html: content }} />
Believe it or not, this is the *proper* way to set innerHTML of a DOM node in React. There’s nothing safe about it, and let’s think a little deeper about the implications of what we just did.
Script Tags
What happens when our HTML contains <script>s? This is a judgement-free zone, and there are many historical reasons our content might contain HTML. If we support HTML anywhere, we are opening ourselves up to dealing with arbitrary JavaScript. An organization’s Content Management System may contain years of posts, millions of entries. Without the concept of content as structured data, we probably have endless blobs of HTML in there, and some probably contain <script> tags.
When using React, these scripts will not load unless you parse them out of the content and load them yourself: browsers do not execute <script> elements inserted through innerHTML. If you server-side-rendered (SSR) your content, it is likely they do run, but then what happens when you call ReactDOM.render(<App />) on the whole tree? It is probable that whatever your script was doing has now been blown away by the client re-render.
Turning HTML into Structured Data
This whole scenario is a mess for React, but what about React Native? What happens when my content is <p>This is a cool article: <a href="http://tacos.com/best">Best Tacos</a></p>? React Native has a WebView component, but do I really want to inject HTML with no style info? Or if I do, how do I retrieve the proper styles to inject? I would rather use <Text> and the Linking API when necessary. Even so, do I then need to parse the HTML in content fields in every app that I build? That sounds like a lot of work to include in every client, and disparate clients will require different HTML parsers depending on the language being used and the target rendering platform.
When using GraphQL, we can transform the data in one place and then hopefully all apps can reuse the same parsed content. If we cannot expose structured data through REST APIs or the data platforms themselves, GraphQL resolvers are our next best option.
The New York Times is implementing structured data as a pair of types, BlockUnion and InlineUnion, that are heterogeneous and represent n other types. The HTML parsing is done before the data gets to GraphQL. The heterogeneous types can produce extremely complex GraphQL fragments, which in turn produce very large queries when composed.
I implemented content as structured data in my graphql-wordpress monorepo. The repo contains a GraphQL server, graphql-wordpress, that builds its schema from the WordPress REST API. I wrote a WordPress plugin, graphql-wordpress-middleware, that adds a field to the REST API schema for Posts. Out of the box, the WordPress REST API gives you 2 formats for your content: raw and rendered. The rendered version is HTML, complete with <p> tags and embedded media (YouTube, Twitter embeds, whatnot). The raw version is meant for editing, which is more agnostic to rendering platform, but does not contain the canonical representation of the author’s final intent. WordPress does not expose structured data, so we need to create our own. That is what my WordPress plugin does. It adds a 3rd representation, data, that contains the HTML as a tree structure. Parsing logic here, REST response here. Here is the example structured content:
"content": {
  "rendered": "<p><strong>The New York Times</strong>:</p>\n<blockquote><p>The album David Longstreth made after the end of his relationship with his former band mate Amber Coffman uses R&B tech but speaks from the heart.</p></blockquote>\n<p><a target=\"_blank\" href=\"https://www.nytimes.com/2017/02/22/arts/music/dirty-projectors-self-titled-review.html\">Read More »</a></p>\n",
  "raw": "The New York Times:\nThe album David Longstreth made after the end of his relationship with his former band mate Amber Coffman uses R&B tech but speaks from the heart.\nRead More »",
  "data": [
    {
      "tagName": "p",
      "attributes": [],
      "children": [
        {"tagName": "strong", "attributes": [], "children": [{"type": "text", "text": "The New York Times"}], "type": "element"},
        {"type": "text", "text": ":"}
      ],
      "type": "element"
    },
    {
      "tagName": "blockquote",
      "attributes": [],
      "children": [
        {"tagName": "p", "attributes": [], "children": [{"type": "text", "text": "The album David Longstreth made after the end of his relationship with his former band mate Amber Coffman uses R&B tech but speaks from the heart."}], "type": "element"}
      ],
      "type": "element"
    },
    {
      "tagName": "p",
      "attributes": [],
      "children": [
        {"tagName": "a", "attributes": [{"name": "target", "value": "_blank"}, {"name": "href", "value": "https://www.nytimes.com/2017/02/22/arts/music/dirty-projectors-self-titled-review.html"}], "children": [{"type": "text", "text": "Read More »"}], "type": "element"}
      ],
      "type": "element"
    }
  ]
}
Now we are in business. If I want to render the raw HTML using dangerous React, I can use rendered. If I want to iterate over the HTML and use React components to create my view, I can use data and only intervene where necessary to exclude nodes or transform attributes. I can use the same process for React Native and map my HTML tag names to native components like View and Text. You can see how I parse nodes here for web and here for native.
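To make the iteration concrete, here is a minimal sketch of walking a data tree. It is not the code from the repo: the names renderNodes, toProps, EXCLUDED_TAGS and NATIVE_COMPONENTS are made up for illustration, and h stands in for something shaped like React.createElement so the sketch runs without React.

```javascript
// Hypothetical map from HTML tag names to React Native-style component names.
// A web renderer can skip the map and use the tag name directly.
const NATIVE_COMPONENTS = { p: 'Text', blockquote: 'View', strong: 'Text', a: 'Text' };

// Elements we never want to render, whatever the platform (an assumption).
const EXCLUDED_TAGS = new Set(['script', 'style']);

// Turn [{ name, value }] attribute pairs into a plain props object.
const toProps = attributes =>
  attributes.reduce((props, { name, value }) => ({ ...props, [name]: value }), {});

// Walk the tree and call h(type, props, children) for every element node.
// Text nodes become plain strings; excluded elements are dropped.
const renderNodes = (nodes, h, components = {}) =>
  nodes
    .filter(node => !EXCLUDED_TAGS.has(node.tagName))
    .map((node, key) => {
      if (node.type === 'text') {
        return node.text;
      }
      const type = components[node.tagName] || node.tagName;
      const props = { ...toProps(node.attributes), key };
      return h(type, props, renderNodes(node.children, h, components));
    });
```

On the web, something like renderNodes(content.data, React.createElement) is the intended use. For React Native the map would have to hold real component references rather than strings, and links would need an onPress handler that calls the Linking API instead of an href prop.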
GraphQL nodes for Content
I have developed the following schema for freeform content:
# The content for the object.
type Content {
  # HTML for the object, transformed for display.
  rendered: String
  # Content for the object, as it exists in the database.
  raw: String
  data: [ContentNode]
}

union ContentNode = Element | Text | Embed

# An element node.
type Element {
  tagName: String
  attributes: [Meta]
  children: [ContentNode]
}

# A text node.
type Text {
  text: String
}

# A metadata field for an object.
type Meta {
  # Name for the metadata field.
  name: String
  # Value for the metadata field.
  value: String
}

# An embed node.
type Embed {
  version: String
  title: String
  html: String
  providerUrl: String
  providerName: String
  authorName: String
  authorUrl: String
  thumbnailUrl: String
  thumbnailWidth: Int
  thumbnailHeight: Int
  width: Int
  height: Int
}
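A union needs a way to tell its members apart at runtime. The sketch below shows one way that could look, assuming a resolver map in the graphql-tools style, where a union entry provides __resolveType. The function name resolveContentNodeType, the 'embed' type value, and the fallback checks on html and tagName are assumptions; only the 'text' and 'element' values appear in the JSON above.

```javascript
// Decide which member of the ContentNode union a parsed node belongs to.
// Returning null signals that the node shape is not recognized.
const resolveContentNodeType = node => {
  if (node.type === 'text') return 'Text';
  if (node.type === 'embed' || typeof node.html === 'string') return 'Embed';
  if (node.type === 'element' || typeof node.tagName === 'string') return 'Element';
  return null;
};

// Resolver map fragment (assumed setup); the rest of the schema wiring is omitted.
const resolvers = {
  ContentNode: {
    __resolveType: resolveContentNodeType,
  },
};
```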
Depending on how you are retrieving your fragments, it will look something like this (recursion of unknown/infinite depth is a problem in GraphQL):
#import "./Embed_node.graphql"
#import "./Element_node.graphql"
fragment ContentNode_content on ContentNode {
  __typename
  ... on Embed {
    ...Embed_node
  }
  ... on Text {
    text
  }
  ... on Element {
    ...Element_node
    children {
      ... on Embed {
        ...Embed_node
      }
      ... on Text {
        text
      }
      ... on Element {
        ...Element_node
        children {
          ... on Embed {
            ...Embed_node
          }
          ... on Text {
            text
          }
          ... on Element {
            ...Element_node
            children {
              ... on Embed {
                ...Embed_node
              }
              ... on Text {
                text
              }
              ... on Element {
                ...Element_node
                children {
                  ... on Embed {
                    ...Embed_node
                  }
                  ... on Text {
                    text
                  }
                  ... on Element {
                    ...Element_node
                    children {
                      ... on Embed {
                        ...Embed_node
                      }
                      ... on Text {
                        text
                      }
                      ... on Element {
                        ...Element_node
                        children {
                          ... on Embed {
                            ...Embed_node
                          }
                          ... on Text {
                            text
                          }
                          ... on Element {
                            ...Element_node
                            children {
                              ... on Embed {
                                ...Embed_node
                              }
                              ... on Text {
                                text
                              }
                              ... on Element {
                                ...Element_node
                                children {
                                  ... on Embed {
                                    ...Embed_node
                                  }
                                  ... on Text {
                                    text
                                  }
                                  ... on Element {
                                    ...Element_node
                                  }
                                }
                              }
                            }
                          }
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
This looks more cryptic in Relay.
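Because a GraphQL fragment cannot spread itself, the nesting has to be written out to a fixed depth. If hand-writing it gets tedious, the text can be generated. This is a sketch only: contentNodeSelection and contentNodeFragment are made-up helper names, and the #import lines are left out because they depend on the loader you use.

```javascript
// Build the selection lines for one level of ContentNode, nesting the same
// selection under `children` until `depth` reaches zero.
const contentNodeSelection = depth => {
  const lines = [
    '... on Embed {',
    '  ...Embed_node',
    '}',
    '... on Text {',
    '  text',
    '}',
    '... on Element {',
    '  ...Element_node',
  ];
  if (depth > 0) {
    lines.push('  children {');
    contentNodeSelection(depth - 1).forEach(line => lines.push(`    ${line}`));
    lines.push('  }');
  }
  lines.push('}');
  return lines;
};

// Wrap the selection in the fragment definition used above.
const contentNodeFragment = depth =>
  [
    'fragment ContentNode_content on ContentNode {',
    '  __typename',
    ...contentNodeSelection(depth).map(line => `  ${line}`),
    '}',
  ].join('\n');
```

Calling contentNodeFragment(8) produces the same eight levels of children as the hand-written fragment. The depth is still a guess about how deeply your content nests, so pick it from real data.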
HTML with JavaScript
This brings us back to the issue of freeform HTML that contains script tags. What to do? Most of the time, on the client (in componentDidMount()), you will need to asynchronously load the scripts using a mechanism that remains to be seen. We do it a lot at the New York Times, and the script loader looks something like:
const loadScriptOnce = (url, opts = {}) =>
  new Promise((resolve, reject) => {
    const addListeners = script => {
      script.addEventListener('load', () => {
        script.dataset.loaded = true;
        resolve(script);
      });
      script.addEventListener('error', e => {
        script.dataset.loaded = true;
        reject(e);
      });
      return script;
    };

    const script = document.querySelector(`script[src="${url}"]`);
    if (script) {
      // if there's already a script on the page,
      // and it's loaded, resolve immediately
      if (script.dataset.loaded) {
        return resolve(script);
      }
      // script is on the page but hasn't yet fired onload
      return addListeners(script);
    }

    const { asyncProp, deferProp } = Object.assign(
      {
        asyncProp: true,
        deferProp: true,
      },
      opts
    );

    // otherwise, it hasn't been added to the page yet,
    // so add it and wait for it to load
    const s = document.createElement('script');
    s.src = url;
    s.async = asyncProp;
    s.defer = deferProp;

    document.body.appendChild(s);
    return addListeners(s);
  });

export default loadScriptOnce;
For any custom scripts you might have in your content, you probably need to load them this way: loadScriptOnce(scriptUrl).then(pray). How you parse the scripts out of your content is your business. Most embeds don’t bring along JavaScript these days, but old posts in your CMS might contain all kinds of goodies.
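If your content is already available as a tree like the data field described earlier, one possible way to find the scripts is to walk the tree and collect every src. This is a sketch under that assumption; collectScriptUrls is a made-up name, and inline scripts without a src are ignored here.

```javascript
// Collect the src of every <script> element in a ContentNode tree,
// in document order and without duplicates.
const collectScriptUrls = (nodes, found = new Set()) => {
  nodes.forEach(node => {
    if (node.type !== 'element') {
      return;
    }
    if (node.tagName === 'script') {
      const src = node.attributes.find(attr => attr.name === 'src');
      if (src && src.value) {
        found.add(src.value);
      }
    }
    collectScriptUrls(node.children, found);
  });
  return [...found];
};
```

In componentDidMount() the result could then be fed to the loader, for example Promise.all(collectScriptUrls(content.data).map(url => loadScriptOnce(url))).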
Even if you do initialize your scripts, you also need to be aware of what happens in a single page app (SPA): when you navigate between routes, componentDidMount() is not called (but componentDidUpdate() is), and your content needs to be reinitialized. Depending on how many types of embeds you have, or how much technical debt exists in your CMS, adding code to handle the content can go from “not a big deal” to “an unfunny nightmare.”
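One way to keep the two lifecycle methods from drifting apart is to route both through a single function that remembers what it last initialized. A minimal sketch, assuming each piece of content carries a stable id (that field, and the name createEmbedInitializer, are hypothetical):

```javascript
// Wrap an init function so it runs once per distinct piece of content.
// The returned function reports whether init actually ran.
const createEmbedInitializer = init => {
  let lastId = null;
  return content => {
    if (!content || content.id === lastId) {
      return false;
    }
    lastId = content.id;
    init(content);
    return true;
  };
};

// Intended use inside a component (not executed here):
//   this.initEmbeds = createEmbedInitializer(content => { /* load scripts */ });
//   componentDidMount()  { this.initEmbeds(this.props.content); }
//   componentDidUpdate() { this.initEmbeds(this.props.content); }
```

This only covers the "run again when the route changes" half of the problem; tearing down whatever the previous scripts created is still up to each embed.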
Platform-specific JavaScript
A problem we have at the New York Times is content that was written with a certain render target in mind. We have thousands of embedded modules that assume they are loading in a PHP framework that loads the necessary dependencies in the <head>. Or, they assume a RequireJS map is available, and they want to initialize their code by first calling require(['foundation/main'], () => ...), which only exists in a certain platform context. There is not a lot we can do to “fix” this content. We could rewrite each module by hand, but that might take months or years. We could load the extra scripts they need in the <head>, but then we are just moving tech debt from one platform to another.
We have opted to load these “legacy” modules in our React environment in <iframe>s that have the proper JavaScript context that the modules need, which is essentially reverting our UI to the days of Flash and Java applets. They work, for now…
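For illustration only, here is one way an iframe could be given its own JavaScript context: build a small HTML document that loads the dependencies in its <head> and hand it to the iframe's srcdoc attribute. Whether the frames are populated through srcdoc or a src URL is not something this post specifies, so treat the approach, the function names, and the example URLs as assumptions.

```javascript
// Escape a value for use inside a double-quoted HTML attribute.
const escapeAttribute = value =>
  String(value).replace(/&/g, '&amp;').replace(/"/g, '&quot;').replace(/</g, '&lt;');

// Build a standalone document: dependency scripts in <head>, module markup in <body>.
// moduleHtml is trusted legacy markup and is inserted as-is.
const buildLegacyFrameDocument = ({ headScripts = [], moduleHtml = '' }) =>
  [
    '<!DOCTYPE html>',
    '<html>',
    '<head>',
    ...headScripts.map(src => `<script src="${escapeAttribute(src)}"></script>`),
    '</head>',
    `<body>${moduleHtml}</body>`,
    '</html>',
  ].join('\n');
```

In React the result would be passed as <iframe srcDoc={buildLegacyFrameDocument({ headScripts, moduleHtml })} />, which keeps the legacy dependencies out of the host page.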
Platform-agnostic JavaScript
A big portion of our freeform HTML is actually some of our most popular content. The New York Times is known for its compelling journalism, and is often lauded for its interactive and visual storytelling. Many of our assets, like charts, graphs, and embedded modules, are built to render on new platforms, old platforms, WebViews, the works.
In our move to React, one of the most challenging tasks we’ve had is to figure out how to take these generic modules and load them using React, on the server and client, without completely destroying them in the process. Since we do SSR, our page should render as fast as possible in the browser, without any unnecessary redraws and hiccups. Things like D3 charts and auto-play video should be uninterrupted in this process. These things do not always play well with React and isomorphic rendering. We’ve even raised an issue on the React project to discuss this dilemma.
Media-heavy Sites
One of my favorite parts of WordPress is its magical parsing of “embeds”: paste a YouTube video URL in the editor, and voila, you have a YouTube player on the front end. This works because WordPress is pinging oEmbed services on Save and storing the data for that embed. When your post renders, a filter runs on the content that replaces the URLs with the embed’s HTML code, usually an <iframe>. This is great in a lot of ways, but chances are, for media-heavy sites, your users are only clicking on a fraction of those embeds, if any at all.
I also noticed, while building Relay and Apollo apps that use my graphql-wordpress project, that the loading of said embeds was really slowing down the rendering of my routes, especially when navigating as part of a single page app. I thought it would be nice to load a placeholder instead of the media’s HTML embed code, to address the weight of the page and allow the site to load quickly.
If you remember the Embed type above:
type Embed {
  version: String
  title: String
  html: String
  providerUrl: String
  providerName: String
  authorName: String
  authorUrl: String
  thumbnailUrl: String
  thumbnailWidth: Int
  thumbnailHeight: Int
  width: Int
  height: Int
}
These values are what services that support oEmbed return. So, when you are iterating through content.data and rendering React components, you can load thumbnailUrl as a placeholder. Most of the time, the image is big enough to support your site’s intended design.
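The placeholder decision can be kept as a small pure function that a component calls on each render. A sketch, with embedView and the shape of its return value invented for illustration; the field names come from the Embed type.

```javascript
// Decide what to render for an Embed node: the thumbnail until the reader
// activates it, then the provider's HTML. Falls back to the HTML when the
// provider returned no thumbnail.
const embedView = (embed, activated = false) => {
  if (activated || !embed.thumbnailUrl) {
    return { kind: 'html', html: embed.html };
  }
  return {
    kind: 'placeholder',
    src: embed.thumbnailUrl,
    width: embed.thumbnailWidth,
    height: embed.thumbnailHeight,
    alt: embed.title || embed.providerName || '',
  };
};
```

A component could hold activated in its state, render an image for the placeholder kind, and switch to the embed HTML on click, so the heavy <iframe> only loads for embeds a reader actually wants.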
I wrote more about this here: https://github.com/staylor/graphql-wordpress/tree/master/packages/graphql-wordpress-middleware#oembed-extensions
Epilogue
We are still in the midst of these challenges. I am curious if others have solved these problems, or experienced them in the same way. I’ll continue to share our findings as they emerge.