Markup: Getting Meta with HTML

My cousin, The Cagle (actually Dr. Lauren Cagle … long story), is a professor of communications at the University of Kentucky. Actually, come to that, a significant percentage of my family is involved in the field of communications, including my wife who writes Internet copy, a brother who writes role playing adventures (and teaches RPG design) for game companies, my father who runs a communications guild, my mother who was a trainer for a large children’s organization, several teachers, training nurses, a couple of lawyers … yeah, there’s a theme going on there: we tend to speak our mind.

Anyway, the professor: she contacted me the other day about a discussion she’d had with one of her classes, talking about hypertext and paratext. As this is a topic that is near and dear to my heart, my response was, naturally, to start writing an article about it. You’re reading it (self-referentialism for the win!).

It used to be, in the not too distant past, that HTML was actually taught at the high school (or even grade school) level. The language originally emerged as a way to mark up science abstracts, but was also simple enough that it could be used for other purposes, like cataloging one’s cassette tape bootlegs online (CDs were only just beginning to come into their own in the early 1990s, and CD writers were big, expensive, bulky boxes).

HTML took off as a form of what has been around in electronic form since the 1960s, and in analog form since the invention of the margin: markup. As publishing became a thing in the roaring Internet of the 15th century, there was a realization that authors occasionally were guilty of making mistakes. With printing came a revolution in the art of making paper cheaply. This meant that a scrivener could take an author’s manuscripts and write on them in preparation for the laborious process of typesetting. By the 17th century, this scrivener was often making minor (and in many cases major) changes to the manuscripts, and increasingly had a say in what actually got to press in the first place. This is where editors came from.

It was rumored that such editors would leech the blood of the author’s children if said authors didn’t get the works promised in good form and use said fluids as the basis for a kind of red ink, which is one reason that to this day editors use red pens. This may be apocryphal, but it’s never been denied, either.

By the 20th century, there was a lexicon of symbols and notation that had evolved to allow an editor to quickly get their intention across (along with the occasional skull and cross-bones). Because such pages were outside of the normal purview of the words that would get printed, the pages were said to have been marked up, and the notation itself became known as markup.

As typesetting moved into the electronic realm, such markup symbols were encoded into the text itself, and the processor was instructed to change format or layout based upon these codes. This in turn required reserving certain “escape” characters, typically brackets or braces, to set the codes apart from the text: the processor acted on whatever they enclosed rather than printing it. Such markup might look something like:

[fs12][fnChanceryItal]
[para][drp]T^apos;[/drp][ucase]was[/ucase] brillig, and y[sup]e[/sup] slithy toves[br]
did gyre and gimble in y[sup]e[/sup] wabe.[br]
All mimsy were y[sup]e[/sup] borogoves,[br]
and y[sup]e[/sup] mome raths outgrabe.
[para]...

This in turn would get printed by the typesetter as:

[Image: first stanza of Jabberwocky, as typeset]

Notice that none of the square bracketed content appears.
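To make that concrete, here is a minimal sketch of the "ignore the codes" half of a typesetting processor. It assumes the made-up bracket syntax used in the example above (a real typesetting system would also act on each code rather than simply dropping it), and the name strip_codes is purely illustrative.

```python
import re

# Assumption: a code is a short alphanumeric name in square brackets,
# optionally prefixed with "/" to close it, as in the example above.
CODE = re.compile(r"\[/?[A-Za-z0-9]+\]")

def strip_codes(marked_up: str) -> str:
    """Return only the text a reader would see, dropping bracketed codes."""
    return CODE.sub("", marked_up)

line = "[para]All mimsy were y[sup]e[/sup] borogoves,[br]"
visible = strip_codes(line)
print(visible)
```

The point is only that the codes and the words travel in the same stream, and something downstream has to tell them apart.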

HTML as originally envisioned was not designed for deep, precise typography. It was instead a way of describing scientific abstracts: bodies, definitions, citations, paragraphs. There was even concern about adding highlighting, with the compromise being <em> for emphasis and <strong> for additional emphasis. It was only as HTML began to be adopted that people started replacing <em> with <i> for italics and <strong> with <b> for bold. The image (<img>) tag came in fairly late, not really appearing until Mosaic, the precursor to the original Netscape browser.

This distinction occurs because there were in fact two distinct requirements that people had. For Tim Berners-Lee’s original use case, what was needed was a way of describing abstracts, and the idea that it would eventually be used for creating newspapers was not even on his radar back in 1990. He was creating a semantic description of content, not a page layout language.

Yet as people began using the new HTTP protocol, they found that their use cases were different. Universities wanted to showcase their presence on the emerging web. College students wanted to put together lists of their music. Somewhere along the line, someone came up with the first porn site, likely from scanned Playboy magazines. These were not scientific abstracts. Because the intent of the language was never made clear, the early web was given form largely by bending what should have been a way to talk about scientific papers into something altogether different.

It is this distinction between semantics and presentation that has created a tension about the languages of the web. The divergences in approach become even more complex once you start treating the web as an application platform. The semantic approach states that you are describing a type of document, and in many respects it is this approach that proponents of XML have pushed since the inception of that language. Presentation then becomes a secondary language, one that is bound to the semantic bones through a selection mechanism: if a piece of markup is a paragraph, then a selector language such as Cascading Style Sheets (CSS) can say that all paragraphs (or just certain ones) need to be rendered in 12 pt Helvetica with 14 pt line spacing and a 10 pt bottom margin.

The markup of HTML (and by extension XML) is similar to the bracketed approach above, save that it uses <angleBrackets> instead.

<p><span class="dropcap">T</span>&apos;was brillig, and y<sup>e</sup> slithy toves<br/>
did gyre and gimble in y<sup>e</sup> wabe.<br/>
All mimsy were y<sup>e</sup> borogoves,<br/>
and y<sup>e</sup> mome raths outgrabe.</p>

The <span> element identifies a contiguous sequence of “inline” text, and usually uses either a class or an id attribute to establish yet another layer of semantics distinct from the text content itself. The formatting for this example then exists as a separate document in the Cascading Style Sheets format:

@import url('https://fonts.googleapis.com/css?family=Tangerine');
p {
    font-size: 30pt;
    font-family: Tangerine;
}
.dropcap {
    color: #903;
    font-size: 70pt;
    float: left;
    padding-right: 14pt;
}

Here, there are three distinct sections: an @import “directive” for importing a font family, a p rule that matches the paragraph element, and a .dropcap class rule that performs the layout for the drop cap itself. It’s worth noting that this changes presentation but not content: with the CSS disabled, the same words still appear, only in the browser’s default styling.

Hypertext is information that relates two documents. HTML’s biggest innovation was the introduction of both the <link> tag, which generally creates a direct link between the whole document and another document, and the <a> tag, which creates a one-way link from a section of the current document to another document. Normally, the name (or, more recently, the id) attribute, when used with this tag, creates a specialized “anchor” so that another document can point to a document subsection. Otherwise, the <a> tag acts as a link, using the HTTP protocol to retrieve the content specified in the href attribute.

<h1>Poems of Lewis Carroll</h1>
<ul>
<li><a href="https://www.poetryfoundation.org/poems/42916/jabberwocky">Jabberwocky</a></li>
<li><a href="https://www.poetryfoundation.org/poems/43914/the-walrus-and-the-carpenter-56d222cbc80a9">The Walrus and the Carpenter</a></li>
</ul>
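Because the link lives in the markup rather than in the visible text, a program can pull it back out. The following is a rough sketch using Python's standard html.parser module; the names LinkCollector and extract_links are my own inventions for illustration, and it assumes reasonably well-formed markup.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from <a> elements that carry an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

def extract_links(html_text):
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links

page = """
<ul>
<li><a href="https://www.poetryfoundation.org/poems/42916/jabberwocky">Jabberwocky</a></li>
</ul>
"""
links = extract_links(page)
print(links)
```

An <a> element that only has a name or id (a pure anchor) is skipped here, since it is a target rather than a link.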

Other tags embed content, usually using the src attribute. These include the <img> and <script> elements, while stylesheets are either written inline in a <style> element or pulled in through a <link> element with an href attribute. These serve to embed files of a certain kind within the document model, and provide hints for how to process those resources. These are still metadata, though in this case the metadata is computational rather than semantic.

In 2000, a proposal was made to extend the hyperlink capabilities of HTML through the use of the XLink standard, though ultimately it was rejected by the browser vendors of the time. XLink was one of the last efforts towards trying to create semantic metadata that not only built associational links, but also provided ways of using that metadata to create an idea about what existed on the other end of the link. Tim Berners-Lee spearheaded a new effort called the Semantic Web, which moved beyond a fairly limited community with the publication in 2001 of the article “The Semantic Web” in Scientific American.

While it hasn’t been heavily adopted, the Semantic Web is making its way onto the web through the use of a language called RDFa, short for Resource Description Framework in Attributes, established by programmer (and now novelist) Micah Dubinko and entrepreneur Mark Birbeck in 2008. It makes use of HTML attributes to identify resources that have specific global identifiers. A specialized tool, such as a GRDDL (Gleaning Resource Descriptions from Dialects of Languages) processor, can read the HTML and from that retrieve summary information.

This layer is distinct from the HTML. For instance, the following (from the RDFa primer) showcases a typical RDFa fragment describing a person’s social network:

<div vocab="http://xmlns.com/foaf/0.1/" typeof="Person">
<p>
<span property="name">Alice Liddell</span>,
Email: <a property="mbox" href="mailto:alice@example.com">alice@example.com</a>,
Phone: <a property="phone" href="tel:+1-617-555-7332">+1 617.555.7332</a>
</p>
<ul>
<li property="knows" typeof="Person">
<a property="homepage" href="http://example.com/dormouse/">
<span property="name">Dormouse</span></a>
</li>
<li property="knows" typeof="Person">
<a property="homepage" href="http://example.com/mad_hatter/"><span property="name">Mad Hatter</span></a>
</li>
<li property="knows" typeof="Person">
<a property="homepage" href="http://example.com/cheshire_cat/"><span property="name">Cheshire Cat</span></a>
</li>
</ul>
</div>

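This block of HTML with embedded RDFa would appear as a simple paragraph and list in a browser, but a tool such as a GRDDL processor can read it and use it to create a graph: Alice Liddell, a Person, is connected by “knows” links to three other Persons (the Dormouse, the Mad Hatter, and the Cheshire Cat), each with a name and a homepage.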

This uses the Friend of a Friend (FOAF) specification for identifying relationships within social graphs. This is hypertext content, because it describes the relationship of items within a web page to items within other web content, and moves beyond the use of simple tagging (which typically requires some form of consolidating database) into the realm of giving awareness to other documents.
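For the curious, the way a processor turns those attributes into a graph can be approximated in a few dozen lines. This is a deliberately simplified sketch of my own, not a conforming RDFa processor: it understands only typeof, property, and href, it does not expand the vocab prefix, it assumes well-formed nesting, and the blank-node names (_:b1, _:b2) and the class name RDFaSketch are illustrative.

```python
from html.parser import HTMLParser

VOID = {"br", "hr", "img", "input", "link", "meta"}  # elements with no end tag

class RDFaSketch(HTMLParser):
    """Simplified RDFa Lite reader: typeof, property, and href only."""

    def __init__(self):
        super().__init__()
        self.triples = []
        self.subjects = ["_:page"]  # stack of current subjects
        self.frames = []            # one frame per open (non-void) element
        self.blank_count = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return  # this sketch ignores void elements entirely
        a = dict(attrs)
        prop, kind = a.get("property"), a.get("typeof")
        frame = {"new_subject": False, "literal": None}
        if kind:
            # typeof starts a new typed node; with property, it is also
            # the object of that property on the enclosing subject.
            self.blank_count += 1
            node = f"_:b{self.blank_count}"
            if prop:
                self.triples.append((self.subjects[-1], prop, node))
            self.triples.append((node, "rdf:type", kind))
            self.subjects.append(node)
            frame["new_subject"] = True
        elif prop and "href" in a:
            self.triples.append((self.subjects[-1], prop, a["href"]))
        elif prop:
            # Literal value: collect the element's text until it closes.
            frame["literal"] = (self.subjects[-1], prop, [])
        self.frames.append(frame)

    def handle_data(self, data):
        for frame in self.frames:
            if frame["literal"]:
                frame["literal"][2].append(data)

    def handle_endtag(self, tag):
        if tag in VOID or not self.frames:
            return
        frame = self.frames.pop()
        if frame["literal"]:
            subject, prop, parts = frame["literal"]
            self.triples.append((subject, prop, "".join(parts).strip()))
        if frame["new_subject"]:
            self.subjects.pop()

sample = """
<div vocab="http://xmlns.com/foaf/0.1/" typeof="Person">
<span property="name">Alice Liddell</span>
<ul>
<li property="knows" typeof="Person">
<a property="homepage" href="http://example.com/dormouse/"><span property="name">Dormouse</span></a>
</li>
</ul>
</div>
"""
reader = RDFaSketch()
reader.feed(sample)
reader.close()
triples = reader.triples
for triple in triples:
    print(triple)
```

Each triple is one edge of the graph: a subject, a relationship, and either another node, a web address, or a plain text value.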

What’s of even greater interest in such semantic documents is that computers can read this RDFa and establish relationships. From this metadata, a program knows that the thing being referenced is a person, that this person has a name, and even how to contact this person. If you were to crawl over this site, it would be possible to create a graph that shows the world from the perspective of the Mad Hatter, getting not only detail about him, but also information about who knows him.

In other words, in the aggregate, reading through RDFa, it becomes possible to create a knowledge graph, a description about a particular domain of information that can be navigated based upon web addresses. It becomes possible to show how Alice Liddell is related to Kevin Bacon. It provides ways of showing what media the character of Alice Liddell has appeared in, and can even be used to gauge when the eponymous Alice is trending.
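Once the relationships are in graph form, the Kevin Bacon question is just a shortest-path search. Here is a small sketch using breadth-first search; the "knows" edges beyond Alice's own are entirely invented for illustration (there is no real data behind "Actor X"), and find_path is a hypothetical helper name.

```python
from collections import deque

def find_path(edges, start, goal):
    """Breadth-first search over undirected edges; returns a shortest path or None."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        # Sorted only so that the traversal order is repeatable.
        for nxt in sorted(neighbors.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

knows = [
    ("Alice Liddell", "Mad Hatter"),
    ("Alice Liddell", "Dormouse"),
    ("Mad Hatter", "Actor X"),     # invented link, for illustration only
    ("Actor X", "Kevin Bacon"),    # invented link, for illustration only
]
path = find_path(knows, "Alice Liddell", "Kevin Bacon")
print(" -> ".join(path))
```

In a real knowledge graph the nodes would be web addresses rather than names, which is what lets separate sites contribute edges to the same graph.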

Just as most HTML today is no longer handwritten, RDFa is very seldom handwritten. It can be added via certain editors, though these are not widely distributed. More often the RDFa is added via filters that identify names, places, events, diseases and other similar concepts and then map them to specific global identifiers. This process, called entity enrichment, is increasingly making it possible for pages to describe themselves to machines as well as to people, to create better search experiences, and to control user interfaces in web applications. In effect, the role of RDFa and enrichment is to make pages aware of their own content, and to provide a cohesive descriptive process for applications to navigate across information spaces without necessarily relying on explicit links.
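A toy version of such an enrichment filter might look like the following. It is a sketch under heavy assumptions: real enrichment services use far more sophisticated entity recognition than exact string matching, and the lookup table, the example.org identifiers, and the function name enrich are all made up for illustration.

```python
import html
import re

# Hypothetical lookup table: surface names mapped to made-up global identifiers.
GAZETTEER = {
    "Alice Liddell": "http://example.org/id/alice-liddell",
    "Lewis Carroll": "http://example.org/id/lewis-carroll",
}

def enrich(text, gazetteer=GAZETTEER):
    """Wrap each known name in a <span> carrying its identifier as an RDFa resource."""
    # Longest names first, so a longer name wins over a shorter one it contains.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))
    out, last = [], 0
    for match in pattern.finditer(text):
        out.append(html.escape(text[last:match.start()]))
        out.append('<span resource="%s">%s</span>' % (
            html.escape(gazetteer[match.group(0)]),
            html.escape(match.group(0)),
        ))
        last = match.end()
    out.append(html.escape(text[last:]))
    return "".join(out)

enriched = enrich("Lewis Carroll wrote for Alice Liddell.")
print(enriched)
```

The visible sentence is unchanged; what has been added is a layer that tells a machine which Lewis Carroll and which Alice are meant.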

Services such as DBpedia and schema.org can be used to help analyse text content and build relational identifiers for the enrichment. At a minimum, the HTML <link> and <meta> tags can also be used to describe such semantic relationships about the whole page:

<div id="watch7-container" itemscope itemtype="http://schema.org/VideoObject">
<link itemprop="url" href="https://www.youtube.com/watch?v=0gZLxFzd9SA">
<meta itemprop="name" content="150 Years Later - Wonderland">
<meta itemprop="duration" content="PT1M10S">
<meta itemprop="unlisted" content="False">
<link itemprop="embedURL" href="https://www.youtube.com/watch?v=0gZLxFzd9SA&amp;autohide=1&amp;version=3">
<meta itemprop="playerType" content="Flash">
<meta itemprop="width" content="640">
<meta itemprop="height" content="480">
<iframe width="854" height="480" src="https://www.youtube.com/embed/0gZLxFzd9SA" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>
</div>

In this case, the <link> tag identifies a URL (or URI in the semantic vernacular) while the <meta> tag identifies an atomic property. Note that this information doesn’t necessarily correspond to a physical UI representation (though it could); it exists as metadata about an item. As with RDFa, this representation, known as microdata (and used by Google for its page ranking algorithm), provides a way to incorporate machine-readable semantic data into web pages.
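Reading that metadata back out is straightforward. The sketch below, again leaning on Python's standard html.parser, gathers the itemprop values from <link> and <meta> tags into a dictionary; it is an illustration rather than a full microdata parser (it ignores nested items and itemprop on other elements), and ItemPropCollector is a name I made up.

```python
from html.parser import HTMLParser

class ItemPropCollector(HTMLParser):
    """Gather itemprop values from <meta content> and <link href> tags."""

    def __init__(self):
        super().__init__()
        self.item_type = None
        self.props = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        # itemscope is a valueless attribute; html.parser reports it with value None.
        if "itemscope" in a and a.get("itemtype"):
            self.item_type = a["itemtype"]
        name = a.get("itemprop")
        if name and tag == "meta":
            self.props[name] = a.get("content", "")
        elif name and tag == "link":
            # Strip stray whitespace, such as an href broken across lines.
            self.props[name] = (a.get("href") or "").strip()

snippet = """
<div itemscope itemtype="http://schema.org/VideoObject">
<link itemprop="url" href="https://www.youtube.com/watch?v=0gZLxFzd9SA">
<meta itemprop="name" content="150 Years Later - Wonderland">
<meta itemprop="duration" content="PT1M10S">
</div>
"""
collector = ItemPropCollector()
collector.feed(snippet)
collector.close()
print(collector.item_type)
print(collector.props)
```

Nothing in that dictionary is ever shown to a human reader of the page, which is rather the point.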

The takeaway from this is simple: web content is multi-layered, with text and meta-text building a rich tapestry of presentation and semantics at both the human and the machine level. Typically there are multiple filters that get applied as well, from JavaScript transformations affecting overall layout to XSLT that makes it possible to convert from other semantic markup languages such as DITA or TEI into dynamic HTML representations. The combination of these semantics with specialized tools for identifying concepts also makes it possible for other filters (typically browser based) to identify associations using shared linked data.

Curiouser and curiouser ….

Kurt Cagle is a writer, blogger, and software engineer specializing in data semantics, data science and governance. He lives in Issaquah, WA with his family and cat, who has a most peculiar grin.