Connecting with real-world entities:
is structured content missing a trick?

(I posted this two years ago on Google+. Since then, interest in semantic Web technologies in general has increased, but still not much is happening around inline references to real-world entities.)

  1. One of the most promising and least used aspects of structured content is the ability to associate inline elements unambiguously with the real-world people, companies, and things they describe (think Facebook mentions but much more powerful and outside that walled garden).
  2. When using inline mentions in this way, it’s better if the markup format is not tied to specific applications or actions that should be taken by the system.
  3. The best place to unambiguously reference an entity is in the source content.

Associate inline elements unambiguously with real-world entities

Here are some of the many possible uses for this kind of semantic tagging:

  • For names which are trademarked in some countries but not others, you can get the appropriate trademark symbol (or none) showing up for each language or country output, on the first mention, on all mentions or according to any other requirement.
  • For documentation containing strings such as UI text or programming language keywords, you can keep them in sync with the canonical source.
  • You can ensure coverage of all relevant product features, or compare feature mentions with customer searches related to those features.
  • You can enable more effective search for relevant content (including allowances for synonyms and misspellings). This works for content creators and consumers alike.
  • For terms where links to definitions or other relevant content would be useful, you can enable such links. For one vision of this, see Mark Baker’s article Re-Thinking In-Line Linking, including the discussion in the comments (though note that my point 3 diverges from this on the best place to disambiguate entity mentions).
(Image: A customized Web application based on Mekon DITAweb)
  • You could perhaps base conditional filtering on the features mentioned in a block element. If a feature isn’t in the current product configuration, the whole block could be filtered out.
  • You could expose this kind of metadata in web content, allowing external applications to use it (and preparing for the increasing capabilities of search engines).
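
As a rough illustration of the trademark use case in the list above, here is a minimal Python sketch of a publishing step that adds a trademark symbol only on the first mention of an entity, and only for locales where the name is trademarked. The `term` element, its `entity` attribute, the `acme:widget` identifier, and the `TRADEMARKS` table are all hypothetical, invented for this example; a real system would use its own markup and entity registry.

```python
import re

# Hypothetical registry: entity ID -> locales in which the name is trademarked.
TRADEMARKS = {"acme:widget": {"en-US", "de-DE"}}

# Matches a hypothetical inline element such as
# <term entity="acme:widget">Widget</term>
MENTION = re.compile(r'<term entity="([^"]+)">([^<]*)</term>')


def render(source, locale):
    """Replace each mention with its text, adding a TM sign on the first mention only."""
    seen = set()

    def substitute(match):
        entity_id, text = match.group(1), match.group(2)
        first_mention = entity_id not in seen
        seen.add(entity_id)
        if first_mention and locale in TRADEMARKS.get(entity_id, set()):
            return text + "\u2122"
        return text

    return MENTION.sub(substitute, source)


SAMPLE = (
    'Install the <term entity="acme:widget">Widget</term>, '
    'then start the <term entity="acme:widget">Widget</term>.'
)
```

Calling `render(SAMPLE, "en-US")` marks only the first "Widget", while `render(SAMPLE, "fr-FR")` leaves both mentions plain. The source markup is the same in both cases; only the output rule differs.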

It surprises me a little that inline markup for these purposes isn’t more common. Inline markup is very common for formatting purposes — bold and italics being the classic examples! This is obviously rather limited. Inline markup’s also commonly used to indicate types of information or types of entity (much of the “out of the box” DITA vocabulary is like this). But not every structured content implementation attempts to refer to specific real-world entities at the inline term or “mention” level.

Don’t tie markup to specific applications or actions that should be taken by the system

There are of course plenty of examples where inline markup is associated with real-world entities, especially in medium or large organizations. Conditions represent features; custom implementations link keywords to their canonical sources; and tricks with attribute values get trademarks displayed correctly. But the markup is often different depending on the specific application for which it’s used, limiting later extension. For example, there may be a conditional attribute to mark information about a feature that is present in some product configurations and absent in others. Yet if the feature name is trademarked, there may be a different attribute to ensure that only the first mention of the name gets the TM symbol. And if the feature name is used as a UI string, this may be different again. Once the content is being localized, still different markup may be used for terminology management. This seems to be an unfortunate intermingling of concerns. In an ideal world, it seems that the source content should describe what it’s about, without locking this description to a particular action to be taken by the system. That leaves the system free to take any action (or multiple actions) based on the rules for the particular context.
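
One way to picture this separation is the Python sketch below. It is only an illustration under assumed names: the `Mention` class, the `feature:turbo` identifier, and the two rules are hypothetical. The mention records what is being referred to, and the rules that act on it live elsewhere and can be added to without touching the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """An inline mention: the visible text plus the entity it refers to."""
    text: str
    entity_id: str  # hypothetical identifier scheme


def needs_trademark(mention, context):
    """True if this entity is trademarked in the current output context."""
    return mention.entity_id in context["trademarked"]


def is_available(mention, context):
    """True if this entity is part of the current product configuration."""
    return mention.entity_id in context["enabled_features"]


# The source says only *what* is mentioned; each output context decides
# which rules to apply to that same mention.
RULES = {"trademark": needs_trademark, "include": is_available}


def evaluate(mention, context):
    """Apply every rule to one mention and report the outcomes."""
    return {name: rule(mention, context) for name, rule in RULES.items()}
```

Adding a terminology or UI-string rule later would mean adding an entry to `RULES`, not introducing another attribute in the content.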

Resolve ambiguity in the source content

This third point is about where to disambiguate terms / mentions. There’s a valid argument that specific attribute values are not needed to indicate real-world entities; the content of an element is enough. For example,

<organizationname>Acme Corporation</organizationname> 

refers to the organization named “Acme Corporation”, and can be processed by a system accordingly. If a term is misspelled or an alternate term is used, the system can prompt authors to use the approved version; in the case of a synonym, perhaps the system can silently do the appropriate matching, leaving the synonym visible in the output. During localization, terminology management systems can match the strings from the source with their approved, localized counterparts (or skip translation where the term is global). This certainly works for many organizations and I wouldn’t say it’s wrong. But here’s why I think it’s preferable to keep the full, unambiguous meaning of the content in its source:

  • It avoids messy ambiguity, for example where two companies have the same name (there have been a number of Acme Corporations), or where a name is very close and an author accidentally gets the wrong one. It also avoids the need to create specific elements solely for disambiguation, for example if there was an “Exit” UI string on a software tool’s File menu, and another “Exit” string on a particular dialog, and in the canonical string resource these were maintained separately and liable to be updated independently.
  • It means the content is not quite so dependent on a specific system or implementation. It’s easier to make use of the content in other contexts: whether in different tools used by the same team or teams, in other groups’ tools, or in external teams entirely. This isn’t the same as saying that we should create content in an interchange format. I think most people have found that as soon as they start doing anything interesting with their content, they pretty much lose the ability to import that content easily and blindly into a different system without missing something. What I mean, though, is that attaching the full semantics to the source creates the possibility to interpret the source in another system without having to recreate large parts of the original system.
  • It gives authors confidence. Authors like unambiguity. I believe some of the affection for WYSIWYG is because of the desire to “get things right”. The appearance of the content in a WYSIWYG tool gives a sense of confidence (albeit one that isn’t always justified). A better way to get it right is to be confident that your source can be used appropriately. It seems that a true WYSIWYM (what you see is what you mean) tool would give authors confidence that the semantics of their work were correct and would be preserved.
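
The ambiguity problem in the first bullet can be shown with a small Python sketch. The registry contents and the identifiers (`org:acme-1`, `ui:file-exit`, and so on) are invented for illustration only. Looking an entity up by its visible name can return several candidates, whereas a lookup by an identifier stored in the source returns exactly one.

```python
# Hypothetical entity registry; the IDs and notes are invented.
REGISTRY = {
    "org:acme-1": {"name": "Acme Corporation", "note": "anvil manufacturer"},
    "org:acme-2": {"name": "Acme Corporation", "note": "software vendor"},
    "ui:file-exit": {"name": "Exit", "note": "File menu command"},
    "ui:dialog-exit": {"name": "Exit", "note": "dialog button"},
}


def find_by_name(name):
    """Return the IDs of every entity whose name matches exactly."""
    return sorted(eid for eid, e in REGISTRY.items() if e["name"] == name)


def find_by_id(entity_id):
    """Return the single entity stored under this ID."""
    return REGISTRY[entity_id]
```

Here `find_by_name("Acme Corporation")` yields two IDs, so something (or someone) still has to choose between them, while `find_by_id("ui:dialog-exit")` needs no further decision.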

The questions then arise as to what format this markup should take, and how authors should work with it. For the format question, consistency and extensibility seem very important. RDFa comes to mind. It’s machine-readable, extensible, and has the advantage of already working with the major search engines, possibly simplifying the system somewhat. But raw RDFa is not something that most authors really want to get into inserting manually, or that would be productive if they did. Here is where authoring tools can really come into their own. Where there’s a canonical list of external entities or another searchable resource, designs such as the following can be used: After working on a piece of content, you could press a button and get suggested matches for the terms you’d mentioned. During writing, a keyboard shortcut could bring up a dynamic list of items, filtered as you typed.
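
To make the authoring idea a little more concrete, here is a minimal Python sketch of the two pieces: filtering a canonical list as the author types, and writing the chosen entity into the source as RDFa. The entity list and the `example.com` IRIs are hypothetical, and the choice of the schema.org vocabulary is an assumption for the example rather than a recommendation. The RDFa attributes used (`vocab`, `typeof`, `resource`, `property`) are standard RDFa 1.1 Lite attributes.

```python
from html import escape

# Hypothetical canonical list of entities: (IRI, schema.org type, name).
ENTITIES = [
    ("https://example.com/id/acme-anvils", "Organization", "Acme Corporation"),
    ("https://example.com/id/acme-software", "Organization", "Acme Software"),
    ("https://example.com/id/apex", "Organization", "Apex Ltd"),
]


def suggest(typed):
    """Return entities whose name starts with the typed text, ignoring case."""
    prefix = typed.casefold()
    return [e for e in ENTITIES if e[2].casefold().startswith(prefix)]


def to_rdfa(entity):
    """Render the chosen entity as an inline RDFa span."""
    iri, rdf_type, name = entity
    return (
        f'<span vocab="https://schema.org/" typeof="{escape(rdf_type)}" '
        f'resource="{escape(iri)}"><span property="name">{escape(name)}</span></span>'
    )
```

In an editor, `suggest("ac")` would populate the dynamic list, and the author would never need to see the output of `to_rdfa` directly; the tool could show it as a highlighted term instead.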

It’s worth mentioning that sharing a canonical list of entities doesn’t have to imply anything about power and authority in a team. With the right design, new entities could be added by anyone. And the technical architecture need not even be centralized — it could work with a distributed repository as with Git or Mercurial.

At the beginning of the post, I wondered whether I was indulging in architecture astronautics. Based on what I’ve written so far, there is certainly plenty of room to get carried away. But experience tells us that we shouldn’t get so hung up on a particular concept or way of doing things that we lose sight of the practical costs and benefits. I really like Joe Gollner’s slides on these kinds of concerns. As Joe puts it, we should use just enough semantic technology, and no more.

Of course, just enough for now may not always be enough for the future. And one of the great things about RDF’s use of a graph model (including RDFa of course) is that it is really extensible. Extending a semantic structure or modifying existing semantics is a lot easier than if everything is buried in a relational or hierarchical form. (So it’s ironic that the term RDF has misconceived connotations of idealistic, over-ambitious implementations that try to do everything and rarely realize practical value.)
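
The extensibility point can be sketched in a few lines of Python, treating a graph as nothing more than a set of subject, predicate, object triples. This is a toy model, not an RDF library, and the `ex:` identifiers and the `ex:trademarkedIn` predicate are hypothetical.

```python
# A graph as a set of (subject, predicate, object) triples.
# All identifiers here are hypothetical.
graph = {
    ("ex:acme-1", "rdf:type", "schema:Organization"),
    ("ex:acme-1", "schema:name", "Acme Corporation"),
}


def objects(graph, subject, predicate):
    """Return every object stated for a subject and predicate."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


# Extending the model later is just a matter of adding triples with a new
# predicate; existing triples and queries are unaffected.
graph.add(("ex:acme-1", "ex:trademarkedIn", "en-US"))
graph.add(("ex:acme-1", "ex:trademarkedIn", "de-DE"))
```

Nothing equivalent to a schema migration was needed to start recording trademark locales, and the earlier query for `schema:name` still returns the same answer.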

(Image: Open Graph representation by Dan Brickley on Flickr)

So, as I see it, inline semantics connected to specific real-world entities have huge and somewhat untapped potential; they should be separated from specific applications where possible; and they are probably best kept in the source. I have no dire predictions for people who don’t do these things, but I think structured content could perhaps be enhanced with them!

Have you used inline markup in this way? Is it something you’re thinking of? Or wouldn’t it be worth it for your situation? It would be great to hear opinions and experiences related to these ideas.

Selected comments, copied from the original post

Mark Baker commented:

Yes, structured content is missing this trick. Glad I’m not the only one saying so. One more use for this kind of markup: it can be used to fully automate linking across your entire content set, eliminate broken links, eliminate all issues with links in reused content, and allow old content to discover links to new content.
The question of whether you should just mark up the running text or use an attribute to specify the object named more specifically is one that has to be settled on a case by case basis. My preference is to use straight markup of the running text whenever possible, to make the markup as inexpensive as possible for writers.
I would restate Joe Gollner’s principle very slightly: we should use all the semantic technology we can afford, and no more. In other words, we should use semantic technology only insofar as it pays us back. The best way to improve the payoff from semantic technology is to reduce the cost of acquiring semantics. Since semantics come from authors, we want to make it as easy as possible for them to supply them. Using natural language processing and other content intelligence techniques on the back end can reduce the cost of acquiring semantics on the front end.
The way you do that is to start your system design with the authoring interface. Most systems today start with the content management problem or the publishing problem, create markup that expresses content management or publishing concerns, and then try to figure out how to get authors to write in those formats. This is the wrong approach if you want to get inexpensive reliable semantics. Instead, you need to start by designing an authoring interface that can get the most semantic information from the broadest population of authors at the least cost. Then you figure out how to use that markup to drive content management and publishing processes. If you get good semantics up front, though, you will never have a problem driving content management and publishing on the back end.

To which I replied:

Mark, thanks for the comments. Good to hear this pretty much tallies with your experience. Indeed, I had all those benefits of soft linking in mind in my bullet on linking, and the reference to your article was supposed to be a kind of shorthand for those, though I realize that other pieces you’ve written go into more detail. (Also, your writing on other areas such as separating content from behavior has really helped me clarify my thoughts.)
We definitely agree that a system should do the heavy lifting for authors, not the other way round. And the specifics of how to capture the semantics should indeed be decided case by case. But I wanted to expand a little on why I think it can be less expensive overall and even more pleasant for authors if real-world entities are unambiguously denoted in the source.
First, an example. When I mentioned your name at the beginning of this comment, I typed a plus, then “M”, then clicked on your name and photo from the popup list (your name was at the top because you’re in my circles and because you were a previous commenter on this post). Your name was then highlighted in my comment in a little blue capsule, so even if I came back to the comment later I would know this mention was entered in such a way that it would link to your profile, and it might also notify you of the comment. If, when I came back to my draft comment, I wasn’t sure that I’d picked the right Mark Baker, I could simply mouse over the blue capsule to see your brief profile pop up. If, on the other hand, I hadn’t had the benefit of the UI, and had to enter something like <person>Mark Baker</person>, Google would have to figure out which of the many people with that name I meant. And if I’d accidentally typed “Laker” instead of “Baker”, there would be plenty of other wrong alternatives. In fact, to have any chance of getting a meaningful mention, the decision would still have to come back to me at some point.
While of course a single mention of a person’s name is not a typical example for structured content, the examples I gave in the post are fairly typical: still similar or identical names for different instances of the same type of thing. The Google+ mention UI shows that entering an unambiguous term can be easy, and the way it’s presented afterwards gives confidence that it will work. In my experience, this kind of confidence is incredibly important to writers. When confidence is shaken — for example when the system can’t disambiguate or can’t identify an entity (perhaps if a writer entered a variant term or misspelling that wasn’t yet recognized or stored by the system), it has an impact on writers’ flow. As well as worrying about writing well, they start to wonder whether their intentions or meanings will be correctly discerned by the system. That’s the main reason I’d prefer to get the unambiguous entity mention (in the form of an attribute) at source. There are a few other reasons:
  • It seems to place more power in authors’ hands. There’s less magic going on behind the scenes. We agree that for many structured content implementations, it doesn’t make much sense for writers to have control over how something’s presented. But to me, unambiguously expressing meaning seems to be something that more appropriately belongs to authors.
  • If authors need to choose an element when entering a mention (as opposed to the system trying to automatically pick entities out of unmarked text, which is a big challenge), it’s hardly more effort to pick the correct instance of the item.
  • Using RDFa in the source arguably makes the system simpler (no need for complex NLP at least; no need to re-match entities multiple times as a piece is revised; and if you want to output your semantics to the web, they may already be in a suitable format). Simpler means fewer bugs, and everyone’s life is easier, including authors.
This kind of solution does imply some serious tool customization (at least for now, before such solutions are common). That’s why I’m interested in experimental tools such as the one I linked to. It’s a bit rough around the edges, but at least has the two key features of being able to pick an entity from a context-generated list, and subsequently displaying the marked-up term in such a way that the existence of a linked entity and the specific details of the entity are clearly visible. There are definitely cases for which this would be overkill!

And Mark responded:

Hi Joe. I don’t disagree with you at all on this. Essentially this is still the resolution of ambiguity on the back end using NLP techniques, it’s just doing it in real time. The advantages of doing it in real time are that it allows the author to confirm the identification of the subject, and it allows the system to suggest subjects to the author, which can lead to more comprehensive semantic capture (unless it lets the author get lazy and not identify semantics the system does not catch). The downside is the cost of creating such interfaces, as you note, and that has to weigh into the total cost of semantics calculation. Once in place, such a system will lower the cost of each piece of semantics captured. It is the fixed cost to establish it that is the issue.
But in general, I believe very strongly in the judicious combination of intelligent content and content intelligence techniques. Indeed, I would suggest as a general principle that the role of structured markup is to contextualize content sufficiently for content intelligence techniques to be used effectively. Don’t ask an author to disambiguate what a machine can effectively disambiguate for itself. All part and parcel of lowering the barriers to effective semantic capture.