Erm, well, a potato

You say potato, I say 706f7461746f

…coping with the complexities of data exchange

Rob Phippen
SOA and Integration
7 min read · Apr 28, 2013

Like humans, computers exchange information in a bewildering variety of ways, making direct communication tough. DFDL holds the promise of solving this problem once and for all.

The difference between information and data on a wire

Say I have a burning desire to tell you about a vegetable that happens to be a potato. Let’s further assume that I want to convey that concept as a message between two computers. I have to find a way of formatting and encoding that message into a specific representation that can be sent on a wire. Computers being computers, that’s going to eventually reduce to a bunch of numbers.

Of course the idea of a vegetable can exist entirely independently of any encoding! We can talk about an information element called vegetable, with the value of potato, and then discuss all kinds of different encodings that can be used to send it on a wire. You’ll see a fair number of those in this article…

The message format Tower of Babel

I’m not certain when information was first sent along a wire from one computer to another (I'll assume that it happened somewhere in or between the top-secret huts at Bletchley Park, in the 1940s), but I rather doubt it was about potatoes. One thing I do know for sure is that, however that information was encoded, it isn't the only encoding in use these days. Not by a very long way. The human race has proven remarkably inventive in taking a piece of information, and then devising totally new ways of sending that information along a wire. Depending on how you count it, there are at least three fundamentally different ways of ‘encoding’ information, and a branching myriad of detailed ways in which that information actually gets transmitted. All of which means that, picking any pair of computer systems at random, it’s not at all unlikely that they will disagree about how information is sent and received.

The ‘diplomatic corps’ message formats

As with human ‘diplomatic’ languages - like French, Spanish, English and Mandarin Chinese - it’s not too surprising to find that some formats have become contenders for a common format of exchange. Established as a formal standard several internet aeons ago in 1998, XML is undoubtedly the big kid on the block, with the most astonishing hierarchy of communication standards built on its base, including the entire Web Services standards menagerie.

<vegetable>potato</vegetable>

Example XML rendering of “potato”: XML uses <vegetable> as the ‘start token’ and </vegetable> as the ‘end token’ for an element called vegetable. Putting potato between the start and end tokens sets the value of the element.

Other honorable contenders include the somewhat newer JSON, the JavaScript Object Notation.

{"vegetable": "potato"}

Example JSON Rendering of “potato”
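
To make the point that both renderings carry the same information, here is a minimal Python sketch using only the standard library. The variable names are my own, and the JSON is written with the quoted key and straight quotes that strict JSON parsers require.

```python
import json
import xml.etree.ElementTree as ET

xml_message = "<vegetable>potato</vegetable>"
json_message = '{"vegetable": "potato"}'

# Parse each wire format into a plain (name, value) pair.
element = ET.fromstring(xml_message)
from_xml = (element.tag, element.text)

parsed = json.loads(json_message)
from_json = next(iter(parsed.items()))

# Different characters on the wire, same information element.
print(from_xml == from_json)
```

Once each message has been parsed, nothing about the resulting pair reveals which format it arrived in.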

“Here’s what I’m about to tell you” - a sidebar on XML Schema

A vast amount of human ingenuity has been lavished on standards around XML. One of them is XML schema, which is rather important to the topic of this article, as you’ll see later. It is not ‘yet another format’. Instead, in the world of XML, it provides an exact description of what you should expect to see in your XML message. So, for my little XML example above, the XML Schema that says ‘expect to see some XML text that represents an element called vegetable’ is…

<xsd:element name="vegetable" type="xsd:string" />

XML Schema snippet: this says ‘expect to see an element called vegetable, whose type is a simple string of text’

Notice that the XML Schema does not say anything at all about potatoes! The same XML schema can be used for all of the following, and indeed for any XML that sets the value of vegetable to some string of text…

<vegetable>carrot</vegetable>
<vegetable>parsnip</vegetable>
<vegetable>aubergine</vegetable>

… and XML Schema is quite capable of describing complex nested structures.

The continued popularity of ‘local dialects’

Despite the raging success of XML and JSON, other formats simply refuse to die out. So much so that, if this article consisted of nothing but a list of the different possible (and regularly used) ways of encoding potato, it would still be pretty long…
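
As a small taste of that list, here is a Python sketch that renders the same element in three non-XML styles. All three layouts (a comma-separated record, a fixed-width field, and a length-prefixed binary field) are illustrative choices of mine, not formats taken from any particular system.

```python
import struct

name, value = "vegetable", "potato"

# Comma-separated text record.
csv_record = f"{name},{value}\n".encode("ascii")

# Fixed-width text field: the value padded with spaces to 10 characters.
fixed_width = value.ljust(10).encode("ascii")

# Binary field: a 2-byte big-endian length, then the ASCII bytes.
payload = value.encode("ascii")
length_prefixed = struct.pack(">H", len(payload)) + payload

for label, data in [("csv", csv_record),
                    ("fixed", fixed_width),
                    ("binary", length_prefixed)]:
    print(label, data.hex())
```

Each rendering needs different knowledge to read back: a separator, a field width, or a length prefix.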

DFDL - a universal translator for message formats?

Universal translators are popular in science fiction - nobody wants to watch a movie in which nobody can understand anything. In Star Trek, the computer did it; in Doctor Who, it’s a universal translation field emanated by the TARDIS; and the Babel Fish, from The Hitchhiker’s Guide to the Galaxy, has the misfortune of being able to translate from any language to any other. I say misfortune, because the fish has to be ‘stuck in your ear’ to work.

DFDL is the best approach I know of to the creation of a universal translator for coping with the proliferation of message formats. You don’t even have to stick it in your ear. So - what is DFDL?

DFDL - the Data Format Description Language

DFDL is a way of exactly describing a message format - in principle any format. This is profound, because that exact description can be exploited to instruct a computer on exactly how to turn the message into information.

DFDL is sneaky - in a good way - because it exploits the pre-existence of XML Schema as a base.

XML Schema revisited - as a way of describing pure structure

The team that created DFDL noticed that, although XML Schema originated as a way of describing the format of an XML-encoded message, it is in fact a rather decent way of describing information structures - independent of their format. In a sense, this view invites us to look at this piece of XML Schema again…

<xsd:element name="vegetable" type="xsd:string" />

… understand that it says ‘here comes a message containing an information element called vegetable’ and simply ignore that it’s meant to be sent along a wire in XML format. For example: the specifics of how XML does things are what leads to <vegetable> as a start token, and </vegetable> as an end token. Other formats are under no obligation whatsoever to use that convention.

DFDL fills in the gaps…

What XML Schema is missing is a way of describing information encodings other than XML (not too surprisingly: the team that devised XML Schema were able to simply assume that XML was being used). This encoding description is exactly the gap that DFDL closes.

To continue with my footling example: let’s say I have decided that I’m going to use DFDL to flag up the fact that, instead of using XML to send a message, I have devised my very own format. Instead of using all these angle brackets that XML is so fond of (like <vegetable>), I'm simply going to prefix each vegetable with a special string…

vegetable:

…and terminate it with a semicolon. So my message about a potato is going to read

vegetable:potato;

…and my other examples will look like this:

vegetable:carrot;
vegetable:parsnip;
vegetable:aubergine;

The relevant DFDL snippet that can be used with all of these is rather simple (though, inevitably, not as simple as the original XML schema):

<xsd:element
dfdl:lengthKind="delimited"
dfdl:initiator="vegetable:" dfdl:terminator=";"
dfdl:representation="text"
dfdl:encoding="ASCII"
name="vegetable"
type="xsd:string"/>

Take a moment to compare this with the ‘pure’ XML Schema snippet: the attributes prefixed with dfdl: tell the whole story. These extra DFDL attributes say:

  • dfdl:lengthKind="delimited" - don’t expect XML tags - instead, the extent of the element is determined by delimiters, so expect it to be ‘bracketed’ with some custom strings
  • dfdl:initiator="vegetable:" - this is the string of text that denotes the start of the element
  • dfdl:terminator=";" - this is the string of text that denotes the end of the element
  • dfdl:representation="text" - the message will be in ‘plain text’ (not e.g. binary)
  • dfdl:encoding="ASCII" - in the end, all information sent between computers boils down to binary numbers - so, for each letter, an encoding says which number to turn it into. ‘ASCII’ is the name of one particular encoding.
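
To see what those properties ask of a parser, here is a hand-rolled Python sketch. It is not a DFDL processor and implements none of the DFDL specification; the function name parse_delimited and its parameters are hypothetical, chosen only to mirror the initiator, terminator and encoding properties listed above.

```python
def parse_delimited(data: bytes, initiator: str, terminator: str, encoding: str) -> str:
    """Extract the value found between an initiator and a terminator."""
    text = data.decode(encoding)            # encoding: bytes -> characters
    if not text.startswith(initiator):      # the initiator marks the start
        raise ValueError("initiator not found")
    end = text.find(terminator, len(initiator))
    if end == -1:                           # the terminator marks the end
        raise ValueError("terminator not found")
    return text[len(initiator):end]


value = parse_delimited(b"vegetable:potato;", "vegetable:", ";", "ascii")
print(value)
```

The same call works unchanged for carrot, parsnip and aubergine, which is the point: the description covers the format, not any one message.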

This is the tiniest example I can think of that begins to show how DFDL takes XML schema - then exploits it, partially ignores it (by ignoring the implied serialization to XML), and extends it (by adding tags that have a special meaning).

The title revealed

Talking about encodings, the title of this piece is the ASCII encoding of the word potato, using hexadecimal numbers:

p  o  t  a  t  o
70 6f 74 61 74 6f

ASCII encoding of potato using hexadecimal numbers
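
If you would like to check this yourself, a couple of lines of Python will do it. This is only a quick sketch using the standard library; the variable names are mine.

```python
word = "potato"
encoded = word.encode("ascii")

# Hex digits run together, as in the title.
print(encoded.hex())

# One pair of hex digits per letter, as in the table above.
print(" ".join(f"{byte:02x}" for byte in encoded))
```

Going the other way, bytes.fromhex("706f7461746f").decode("ascii") gives back the word potato.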

A confession

DFDL and XML Schema experts will have noticed that I have left out some important aspects of both XML Schema and DFDL in my snippet examples, but I hope this begins to convey the principle. There is a nice - and complete - example here.

A description is not enough

So: now I have my DFDL description. In order for it to be of any use, I need a piece of software that will actually do something useful with it. In the world of integration, that means two things:

  • Parsing: on receipt of a message, processing the bytes from the message and turning them into a representation that a program can understand, and…
  • Serialization: when sending a message, taking the program’s representation of a piece of information and turning it into the bytes that need to go into the message.
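
The two directions can be sketched as a pair of methods driven by one shared description. This is a toy in Python, not how any real DFDL processor is built; the DelimitedFormat class and its field names are hypothetical stand-ins for a DFDL description of the vegetable format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DelimitedFormat:
    """A toy stand-in for a format description."""
    initiator: str
    terminator: str
    encoding: str

    def serialize(self, value: str) -> bytes:
        # Program representation -> bytes for the wire.
        return f"{self.initiator}{value}{self.terminator}".encode(self.encoding)

    def parse(self, data: bytes) -> str:
        # Bytes from the wire -> program representation.
        text = data.decode(self.encoding)
        if not (text.startswith(self.initiator) and text.endswith(self.terminator)):
            raise ValueError("message does not match the format description")
        return text[len(self.initiator):len(text) - len(self.terminator)]


vegetable_format = DelimitedFormat("vegetable:", ";", "ascii")
wire_bytes = vegetable_format.serialize("aubergine")
print(wire_bytes)
print(vegetable_format.parse(wire_bytes))
```

Because both methods read the same description, changing the initiator or the encoding in one place changes parsing and serialization together.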

Once I have a parser/serializer (often contracted to ‘parser’) that can understand the format description provided by DFDL, and use it to parse and serialize a message, then the Message Format tower of Babel largely collapses - and the myriad of message formats becomes completely open to me.

Who cares?

The message format tower of Babel really exists as an everyday concern in making software communicate. When reading a file, when trying to connect two (or more!) pieces of software - each of them can be effectively shouting in or listening for a different message format ‘language’ - so DFDL combined with a parser is an extremely important tool to help them communicate.

Where to look for more

DFDL is the subject of a standards initiative at the Open Grid Forum.

An Open Source DFDL processor known as Daffodil is also under active development, with an initial release in spring 2013.

DFDL is implemented today in the IBM WebSphere Message Broker and is thoroughly integrated into its wider integration capabilities.

Rob Phippen
SOA and Integration

Baldy, geek or possibly boffin; coffee addict, cycling fanatic, terrible but hopefully improving at drawing and painting, tin whistle player