A Standard for Rich-Text Data

Published in Content Uneditable, Dec 2, 2016

I have noticed far too many attempts to solve a classic problem faced by every system that produces rich-text: which raw format to use to save such data. This article proposes a standard approach that can be effectively adopted for that purpose.

What Is Rich-Text?

Rich-text is any kind of textual content that can have media, semantic or formatting features embedded.

There are many different uses for rich-text. It is common to see it when people attempt to freely write or communicate a message. The article you’re reading right now is rich-text, just like comments or chat inputs that can accept bold for styling words, for example.

The problem with rich-text is that all non-textual information present in it must be somehow described so it can be saved as an integral part of the text. The most common approach for this is using a markup language where such additional attributes are mixed inside the text.

Here comes the root of our problem. There are dozens of “common” formats out there, like HTML, Markdown, BBCode, Textile, XML, TeX, Wiki Markup, etc. People also tend to bring their own extensions to such markups when implementing their systems, making the problem even bigger. After all, we’re developers… introducing new standards triggers the natural “why not?” reaction.

Who cares?! What’s the problem with it?

Leaving aside the limitations imposed by some markup languages, the most important problem with this is interoperability.

The fact is that content produced by end-users of our systems can be used in different contexts, for many different purposes. For example:

In a web page, which nowadays must be responsive and adapt to different devices.
In a native application.
In e-mails, either as content or notifications.
In printed material.
In server-side search systems.

Added to the above, we must also take into consideration that data produced in one application can be used by another one. Finally, machines, like search engines, may “read” the text. They must be able to understand it and to retrieve the additional value that rich-text provides.

Summing up, deciding on the right data format is a non-trivial and critical task.

Rich-Text Editors (RTEs)

To start off, let’s assume that the old notion of RTEs or WYSIWYG editors being “enemies” is not a concern anymore, and that a well-configured, modern editor like CKEditor is able to keep authors under control, providing them with all the necessary features to stay focused and write quality content.

RTEs are known for using the browser infrastructure for editing as the base of their architecture. The madness of contenteditable has already been discussed extensively out there. That’s the reason why many quality RTEs avoid using, or override, most of the native browser features.

Still, many people promote the idea that writing an RTE is a piece of cake, just because of such native features. As a result, the data produced by such editors is of very low quality. Stay away from them!

HTML — A No Go!

HTML is the native “language” of web pages. It’s the way we tell browsers what to show to their users. Therefore, considering that we produce text inside browsers, usually targeting browsers as our channels of distribution of such text, it is natural to have HTML as our first option for the data format used to save content.

There are many issues with HTML though:

It is hard for an author to write it… I mean pure HTML. It requires technical skills.
It may be harder for a person to read and understand it.
It is easy to make mistakes, like missing a closing tag, and have the text destroyed.

Additionally, if low-quality or misconfigured RTEs are used, the HTML produced by them can be extremely ugly.

Conclusion: HTML is for programmers, not for “normal people”.

Markdown and Similar Formats

As a way to mitigate the problems with HTML, lightweight markup languages have been created. The best-known example is Markdown, but many others are available, like BBCode, Textile and all the variations of Wiki Markup.

Markdown may be an option when you want to avoid RTEs, giving access to the “source code” of the text to end-users.

There are problems with Markdown as well:

Requires technical skills as one needs to know its syntax.
It’s easy for users to make mistakes, like not using empty lines to separate paragraphs.
Many of the problems of HTML are still present, like having to use complex syntax for non-trivial content like links, images, tables.
Falls back to HTML in any case, if richer content needs to be created.

Conclusion: Markdown is for programmers, but “normal people” can usually handle it in simple scenarios.

Pitstop 1: “Normal People” and “Source Code”

It’s 2016. It is clear already that we’re past the early ages when programmers were creating applications for programmers. Especially when we’re talking about CMSs.

Nowadays, we talk a lot about “semantic web” — about accessibility. We raised new experts called “Content Architects”, who are meticulously designing all details for optimized content creation.

Considering this “evolution”, is it a sane decision to give non-programmers, or “normal people”, access to the “source code” of the text they produce? Shouldn’t they simply focus on writing, with simple tools to intuitively add extra value to their content?

The answer may be obvious and here’s where technology comes to the rescue — it’s the re-birth of RTEs.

JSON and Custom “Data Models”

The proliferation of JavaScript and the maturity of its developer community have started enabling the development of much more ambitious in-browser solutions. That’s the case for RTEs, where we can see a few innovative solutions popping up on the market.

Figure: The graphical representation of the CKEditor 5 Data Model

CKEditor 5 is a good example. It doesn’t use the DOM as its data model anymore. Instead, it uses a custom data representation that is fully defined and controlled with pure JavaScript. This new approach moves the “source code” of the text totally away from the browser. HTML disappears from the scene.

One proposal that we’ve seen out there recently is this: since we no longer have to think about users having access to source code, why not use this new, perfect data model to save content? It’s just a matter of taking the data model from memory and stringifying it. The result: JSON.

It means removing all the drawbacks of HTML or Markdown and talking “natively” to JavaScript. It should be limitless, so that anything could be saved.

The biggest problem with this, other than the total unreadability of this data format, is that it guarantees interoperability issues. You’ll hardly find systems or conversion tools that will allow you to use such data in other places.
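
To make the idea concrete, here is a minimal sketch of what “stringifying the data model” could look like. The node structure and key names below are invented for illustration; they are not the actual CKEditor 5 model or the format of any other editor, which is precisely the interoperability concern.

```python
import json

# Hypothetical in-memory model for the text: Hello world! (with "world" in bold).
# The node types and key names are assumptions made up for this sketch.
model = {
    "type": "root",
    "children": [
        {
            "type": "paragraph",
            "children": [
                {"type": "text", "data": "Hello "},
                {"type": "text", "data": "world", "attributes": {"bold": True}},
                {"type": "text", "data": "!"},
            ],
        }
    ],
}

# "Saving" the content is just a matter of stringifying the model.
saved = json.dumps(model)
print(saved)

# The same content expressed as HTML, for comparison.
html = "<p>Hello <strong>world</strong>!</p>"
print(html)
```

Any other system that wants to consume the saved string has to know this exact, editor-specific structure, while the HTML version can be read by practically anything.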

Conclusion: JSON is for machines. Not even programmers can handle it.

Pitstop 2: Conclusions So Far

We’ve settled on some important design decisions:

NEVER give access to “source code” to authors.
DO USE a high quality, well configured RTE. Let authors focus on writing.
DO PRODUCE rich-text. Really rich, with the additional semantic value, media references and anything that makes content much more than just words.

The above should help us with the final conclusion.

HTML —The Way to Go!

One of the big plusses of HTML is that it is a well-defined standard with well-known syntax. Out of the box, it covers many semantics. It includes accessibility features. No need to reinvent the wheel.

HTML is limitless, in the same way that it is extensible. Anything which is not defined in its extensive list of elements can be easily appended by using data attributes or custom elements.

HTML is Interoperable

Note that I’m proposing here HTML as the data format to be used to save rich-text. I’m not in any way saying that such HTML is the way to then render the data. This means that it can really contain anything.

To clarify this, let me bring in an example. Let’s suppose that we want to include tweets as part of our content, just like Medium does. For example this tweet:

https://twitter.com/Interior/status/463440424141459456

If we think about HTML as simply a rendering language, Twitter would let us know how to “embed” the above tweet. You would end up with something like this:

<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Sunsets don&#39;t get much better than this one over <a href="https://twitter.com/GrandTetonNPS">@GrandTetonNPS</a>. <a href="https://twitter.com/hashtag/nature?src=hash">#nature</a> <a href="https://twitter.com/hashtag/sunset?src=hash">#sunset</a> <a href="http://t.co/YuKy2rcjyU">pic.twitter.com/YuKy2rcjyU</a></p>&mdash; US Dept of Interior (@Interior) <a href="https://twitter.com/Interior/status/463440424141459456">May 5, 2014</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>

Well, the above is a mess. It doesn’t clearly tell me that it is a tweet (ok, we got a class name, uh). It is static as it dictates how I want to represent a tweet (by injecting a JavaScript file that will do that job for me).

Ok, but now let’s think about HTML as the way we want to save data. For that, we certainly don’t need all of the above. We just want to record the user’s intention, leaving the rendering problem for later. In other words, what about having the following instead?

<tweet url="https://twitter.com/Interior/status/463440424141459456"></tweet>

Or even a much more generic way to embed media, so we could use it for YouTube, Instagram, etc.:

<media type="tweet" url="https://twitter.com/Interior/status/463440424141459456"></media>

Much better, and HTML allows for it.

With the above, we’re clearly saving, as part of our data, everything we need to know about the author’s intention. This is purely semantic and doesn’t dictate the way we’ll then present this data to our readers.

Then, whenever we want to use that data, it must be converted to the format that best fits the target medium. For example:

If we’re outputting to the browser, custom elements (like <tweet> or <media>) must be converted to more complex HTML representations (which may vary, per device).
If we’re including it into a plain-text e-mail, it must be converted accordingly.
If we want the so-called “power-users” to access source code, we may transform it to Markdown, if we really want to. The benefit of it over an RTE is dubious, though.
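
The conversions above can be sketched with nothing more than a standard HTML parser. The following is a minimal illustration in Python, not a production converter: the class name, the "browser" and "text" target names, and the markup chosen for the browser output are all assumptions made for this example.

```python
from html import escape
from html.parser import HTMLParser


class TargetRenderer(HTMLParser):
    """Converts stored HTML for a target medium (hypothetical sketch)."""

    def __init__(self, target):
        super().__init__()  # entities in text arrive already decoded
        self.target = target  # "browser" or "text" (assumed target names)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "media":
            values = dict(attrs)
            kind, url = values.get("type", "media"), values.get("url", "")
            if self.target == "browser":
                # Expand the semantic element into renderable markup.
                link = escape(url)
                self.out.append(
                    f'<blockquote class="embed embed-{escape(kind)}">'
                    f'<a href="{link}">{link}</a></blockquote>'
                )
            else:
                # Plain text: keep only a readable reference.
                self.out.append(f"[{kind}: {url}]\n")
        elif self.target == "browser":
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form, such as <br/>: emit the start tag only.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag == "media":
            return  # already fully handled at the start tag
        if self.target == "browser":
            self.out.append(f"</{tag}>")
        elif tag == "p":
            self.out.append("\n")

    def handle_data(self, data):
        if self.target == "browser":
            self.out.append(escape(data, quote=False))
        else:
            self.out.append(data)


def render(stored_html, target):
    renderer = TargetRenderer(target)
    renderer.feed(stored_html)
    renderer.close()
    return "".join(renderer.out)


stored = (
    "<p>Look at this &amp; enjoy:</p>"
    '<media type="tweet" '
    'url="https://twitter.com/Interior/status/463440424141459456"></media>'
)
print(render(stored, "browser"))
print(render(stored, "text"))
```

The stored data never changes. Only the renderer decides what a tweet looks like in each medium, so a new target (a native application, a print layout) means a new renderer, not a data migration.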

Of course, the nice thing is that, if I don’t extend the HTML dictionary with custom stuff, I can just use the data as is to render it inside browsers, which fits the great majority of cases. Actually, with the support for web components and custom elements, web browsers will be able to render even my custom stuff!

The good news is that HTML has been a standard for a long time. Almost every language and environment provides ready-to-use libraries to parse HTML or convert from/to it. Additionally, many systems “understand” HTML natively.
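
As a small illustration of that point, the sketch below uses only Python’s standard library to pull out what a server-side search system might need from saved content: the plain text and the list of embedded media. The class and function names are hypothetical, and the <media> element is the custom one proposed earlier in this article.

```python
from html.parser import HTMLParser


class IndexExtractor(HTMLParser):
    """Collects plain text and <media> references (hypothetical sketch)."""

    def __init__(self):
        super().__init__()
        self.words = []
        self.media = []

    def handle_starttag(self, tag, attrs):
        if tag == "media":
            values = dict(attrs)
            self.media.append((values.get("type"), values.get("url")))

    def handle_data(self, data):
        # Splitting on whitespace also normalizes spacing between nodes.
        self.words.extend(data.split())


def extract(stored_html):
    parser = IndexExtractor()
    parser.feed(stored_html)
    parser.close()
    return " ".join(parser.words), parser.media


text, media = extract(
    "<p>Sunsets over <strong>Grand Teton</strong> are great.</p>"
    '<media type="tweet" '
    'url="https://twitter.com/Interior/status/463440424141459456"></media>'
)
print(text)
print(media)
```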

RTEs Produce Quality HTML

Many RTEs are still using the browser DOM as the data model that holds the text and its rich attributes. Their features are constantly making manipulations to the DOM. When data is saved, the DOM is simply output as an HTML string.

Quality RTEs, like CKEditor 4, are based on such an architecture too. The difference is that they have total control over the output, not simply taking the DOM as is, but instead transforming it into quality, semantic HTML.
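
To give a rough idea of what “transforming” means here, the sketch below applies two clean-up rules to editor output: presentational tags are renamed to their semantic counterparts and inline styles are dropped. The rules and names are assumptions picked for this example; this is not how CKEditor implements its output processing, and real editors apply far more thorough filtering.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical clean-up rules chosen for this sketch.
RENAME = {"b": "strong", "i": "em"}
DROP_ATTRS = {"style"}


class OutputFilter(HTMLParser):
    """Rewrites raw markup into cleaner, more semantic HTML."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        kept = ""
        for name, value in attrs:
            if name not in DROP_ATTRS:
                kept += ' {}="{}"'.format(name, escape(value or ""))
        self.out.append("<{}{}>".format(RENAME.get(tag, tag), kept))

    def handle_startendtag(self, tag, attrs):
        # Self-closing form, such as <br/>: emit a start tag only.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        self.out.append("</{}>".format(RENAME.get(tag, tag)))

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def clean(raw_html):
    output_filter = OutputFilter()
    output_filter.feed(raw_html)
    output_filter.close()
    return "".join(output_filter.out)


raw = '<p style="color:red">A <b>bold</b> and <i>subtle</i> point.</p>'
print(clean(raw))
```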

RTEs that use a custom JavaScript data model instead, like CKEditor 5, have even more control over how the data is output. They can, in fact, produce any kind of data format. Here HTML is definitely a good option.

Therefore, stating that RTEs produce bad quality HTML is pretty outdated. Everyone should promote the opposite, for the benefit of content authors (and the whole humanity! — because why not?! :D)

Disclaimer

My company, CKSource, proudly created CKEditor, maintaining it as the best RTE out there for more than 13 years already. We (try hard to) never talk about competitors, as a common principle, so I didn’t mention them in my text. In any case, all the above doesn’t exclusively fit the reality of CKEditor. Hopefully your editor of choice will fit as well.