A simple introduction to XML and JSON

Two popular formats for storing structured data, both also commonly used in bioinformatics analyses.

Andrea Telatin
Published in #!/ngs/sh
Nov 27, 2018 · 5 min read

Bioinformatics and text files

Most bioinformatics file formats are plain text files; a famous example is the FASTA format for storing sequences. Historically, most file formats were designed ad hoc to address a specific need, resulting in a fragmented universe of formats.

Well-known bioinformatics formats include FASTA and FASTQ for sequences, SAM for the details of sequence mappings, VCF for the variants of an individual compared to a reference genome, and GFF and BED for features in a genome (e.g. genes, enhancers, binding sites…).

XML and JSON

Among the “general purpose” formats commonly used in computer science, two are XML (eXtensible Markup Language) and JSON (JavaScript Object Notation). The former was very popular at the beginning of the 2000s, while the latter gained popularity during the 2010s. Both are meant to encode structured information, and both can describe virtually any kind of document (not necessarily in an ideal way). XML is more formal and allows strict adherence to a defined structure, while JSON is a simpler data container. This simplicity helped make JSON popular in later years: the BIOM 1.0 format is an example of a widely adopted JSON-based format.

This short note is meant to gently introduce the two formats, and give at least a “visual” idea of the two.

JSON format

A JSON object is composed of items stored as key and value pairs. Values can be single (strings, integers, floating point numbers…) or lists of values (also referred to as arrays).

Suppose that each of us introduces themselves by giving a name, a surname and some hobbies. The hobbies are naturally a list, which can be empty or contain one or more values.

Here is the JSON notation I could use to introduce myself to a computer:
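As a minimal sketch in Python (standard json module), a record like the one described above can be built and printed as indented JSON. The name and surname are taken from the byline, while the hobbies are hypothetical placeholders, not necessarily the values used in the original post.

```python
import json

# Name and surname come from the byline; the hobbies are made-up placeholders.
person = {
    "name": "Andrea",
    "surname": "Telatin",
    "hobbies": ["reading", "cycling"],
}

# indent=2 adds the newlines and spaces that make the object easy to read.
pretty = json.dumps(person, indent=2)
print(pretty)
```

The curly brackets delimit the object, and the square brackets delimit the array of hobbies.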

Generally speaking, a JSON object is enclosed in curly brackets, and its items are “key: value” pairs separated by commas. Arrays (or lists) are enclosed in square brackets (again, comma separated). Note that I added spaces (indentation) and new lines to make the object nicer to read, but for a computer the following compact notation is equally valid and totally equivalent:
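A short Python sketch can show that whitespace does not change the data. The record below is illustrative (the hobbies are placeholders); the same object is serialized with indentation and in compact form, and both strings are checked to decode to the same thing.

```python
import json

# Hypothetical record (the hobbies are placeholders).
person = {"name": "Andrea", "surname": "Telatin", "hobbies": ["reading", "cycling"]}

pretty = json.dumps(person, indent=2)                # human-friendly layout
compact = json.dumps(person, separators=(",", ":"))  # no optional whitespace

print(compact)

# Both strings decode to exactly the same data.
same = json.loads(pretty) == json.loads(compact)
print(same)
```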

The format is hierarchical, meaning that lists and objects can be nested inside other lists and objects. Note that the order of the keys within an object is irrelevant (the surname is written after the name, but it is at the same level of the hierarchy), while the order of the values inside an array is preserved.
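The points about ordering and nesting can be checked with a small Python sketch. All names and hobbies here are invented for illustration.

```python
import json

# Same keys written in a different order: the decoded objects are equal.
a = json.loads('{"name": "Andrea", "surname": "Telatin", "hobbies": []}')
b = json.loads('{"surname": "Telatin", "hobbies": [], "name": "Andrea"}')
key_order_matters = a != b

# Arrays, on the other hand, keep their order.
c = json.loads('["reading", "cycling"]')
d = json.loads('["cycling", "reading"]')
array_order_matters = c != d

# Nesting: an object holding a list of objects, each with its own list.
group = json.loads(
    '{"members": [{"name": "Andrea", "hobbies": ["reading"]},'
    ' {"name": "Sam", "hobbies": []}]}'
)
first_hobby = group["members"][0]["hobbies"][0]

print(key_order_matters, array_order_matters, first_hobby)
```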

XML format

Let’s start by encoding the same dataset in one of its possible XML representations:

The typical piece of data in XML is represented with tags, like <id>192</id>: the value is enclosed between an opening tag (<id>) and a closing tag (</id>). There is no single way to encode a list. One option is simply to repeat the <hobbies> tag as many times as needed. A more formal structure could instead use a nested tag, with as many <hobby> child items as needed:
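Both ways of encoding a list can be sketched and read back with Python’s standard xml.etree.ElementTree module. The two documents below are illustrative reconstructions with placeholder hobbies, not the exact snippets from the original post.

```python
import xml.etree.ElementTree as ET

# Option 1: the same tag is simply repeated (hypothetical data).
flat = """<person>
  <name>Andrea</name>
  <surname>Telatin</surname>
  <hobbies>reading</hobbies>
  <hobbies>cycling</hobbies>
</person>"""

# Option 2: a parent tag with one <hobby> child per item.
nested = """<person>
  <name>Andrea</name>
  <surname>Telatin</surname>
  <hobbies>
    <hobby>reading</hobby>
    <hobby>cycling</hobby>
  </hobbies>
</person>"""

flat_root = ET.fromstring(flat)
nested_root = ET.fromstring(nested)

# findall() returns every matching child, in document order.
flat_hobbies = [h.text for h in flat_root.findall("hobbies")]
nested_hobbies = [h.text for h in nested_root.findall("hobbies/hobby")]

print(flat_hobbies)
print(nested_hobbies)
```

Either way the same list is recovered; the nested form just makes the grouping explicit.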

XML, like JSON, can be written without the newlines and indentation that are usually added to make it easier for a human to read. There are online tools to “pretty print” both XML and JSON. Each tag can also carry a set of properties, called attributes.

For example, if we want to keep track of the order of the hobbies:
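A possible encoding, sketched in Python, stores the position in a hypothetical order attribute (both the attribute name and the hobbies are assumptions made for illustration) and uses it to sort the items:

```python
import xml.etree.ElementTree as ET

# The "order" attribute is a made-up name used only for this example.
doc = """<hobbies>
  <hobby order="2">cycling</hobby>
  <hobby order="1">reading</hobby>
</hobbies>"""

root = ET.fromstring(doc)

# Attribute values are always strings, so convert before sorting.
ranked = sorted(root.findall("hobby"), key=lambda h: int(h.get("order")))
names = [h.text for h in ranked]

print(names)
```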

The ability to nest tags makes it particularly evident that XML documents, like JSON objects, represent data as a tree: a root with as many branches as there are top-level tags, each of which can branch again, and so forth.

The recent Microsoft Office formats all end with an x, like “.docx”: they are ZIP archives that bundle XML documents with extra files (like embedded images)!
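To make the tree structure visible, a small recursive function can print one indented line per tag. This is only a sketch on a hypothetical document; outline is a helper name introduced here.

```python
import xml.etree.ElementTree as ET

def outline(element, depth=0):
    """Return one indented line per element, each parent before its children."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

# Hypothetical document, written without newlines or indentation.
doc = (
    "<person><name>Andrea</name>"
    "<hobbies><hobby>reading</hobby><hobby>cycling</hobby></hobbies>"
    "</person>"
)

tree_lines = outline(ET.fromstring(doc))
print("\n".join(tree_lines))
```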

JSON and XML in bioinformatics

Neither JSON nor XML is very common for storing NGS-related bioinformatics data. A major criticism of XML, for example, is that repeating the same tags thousands of times results in a massive waste of space (even if workarounds have been proposed). The most common alternative to “structured” files are tabular files (CSV, TSV), which require far fewer “extra” characters but are more prone to parsing problems and ambiguities.

The advantage of using hierarchical and/or structured languages becomes clearer when we need to transfer self-describing data, like metadata.

I could make a tabular file to keep track of our hobbies with lines like:

In tabular files the sole description of each “field”, or column, is in the header (if present). In this case we have to know in advance that the third column is a comma-separated list of hobbies. Tabular files are better suited to storing single values than arrays, not to mention more complex (nested) structures, which are virtually impossible to store in a tabular file.

XML and JSON excel when it comes to metadata transfer and are a very popular “server response”. It is thus important to be able to retrieve specific fields from an XML or JSON dataset, especially to automate multiple queries against web servers.
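The extra knowledge needed to read such a table can be made explicit in a short Python sketch. The tab-separated rows below are invented; note that the parser returns the third column as a plain string, and splitting it into a list is a convention the reader has to know about.

```python
import csv
import io

# Hypothetical TSV content: a header line followed by two records.
table = (
    "name\tsurname\thobbies\n"
    "Andrea\tTelatin\treading,cycling\n"
    "Sam\tDoe\t\n"
)

rows = list(csv.DictReader(io.StringIO(table), delimiter="\t"))

# Nothing in the file says that "hobbies" is a list: we must split it ourselves.
for row in rows:
    row["hobbies"] = row["hobbies"].split(",") if row["hobbies"] else []

print(rows[0]["hobbies"])
print(rows[1]["hobbies"])
```

A hobby containing a comma would already break this convention, which is the kind of ambiguity structured formats avoid.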

Example: XML record from PubMed

If we know its PubMed ID, we can display a PubMed record using the usual URI:

By simply adding a string at the end, we can fetch the XML format of the same record (try it):

As you can see, it’s easy to retrieve such a record programmatically, simply by changing the PubMed ID in the above URI.

Here are some lines from the record above:
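As a hedged sketch of this kind of automation, the function below builds one URL per PubMed ID. It does not use the URI scheme mentioned above; instead it assumes the NCBI E-utilities efetch endpoint, a separate documented route to the same kind of XML record. The IDs are arbitrary placeholders, and no request is actually sent.

```python
from urllib.parse import urlencode

# Assumption: the NCBI E-utilities "efetch" endpoint, not the URI shown in the post.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def pubmed_xml_url(pmid):
    """Build the URL that asks efetch for one PubMed record in XML."""
    query = urlencode({"db": "pubmed", "id": str(pmid), "retmode": "xml"})
    return f"{EFETCH}?{query}"

# Placeholder IDs: only the ID changes from one URL to the next.
urls = [pubmed_xml_url(pmid) for pmid in (12345678, 23456789)]
for url in urls:
    print(url)
```

The resulting URLs could then be passed to any HTTP client, keeping within NCBI’s usage guidelines when sending many requests.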

An extract from the XML object describing a journal article

Do it yourself: to try “navigating” the complex tree of the record above, first copy all the XML code from the PubMed record URI, then paste it into the https://codebeautify.org/xmlviewer tool, and click “Tree View”. You’ll be able to collapse or expand the tree nodes of the document, which makes it easy to describe the path to the information you need. For example, if you want to extract the journal name from such a record, the path is:

XML → PubmedArticle → MedlineCitation → Article → Journal → Title
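Once the path is known, extracting the field takes one line. The sketch below runs on a tiny hand-written fragment that only mimics the nesting of a PubMed record (every value in it is made up). XML tag names are case-sensitive, and in PubMed’s XML the elements are spelled PubmedArticle and MedlineCitation.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment imitating the structure of a PubMed record;
# the ID, journal name and article title are all invented.
record = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000000</PMID>
      <Article>
        <Journal>
          <Title>Hypothetical Journal of Examples</Title>
        </Journal>
        <ArticleTitle>A made-up article title</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

root = ET.fromstring(record)

# The path mirrors the tree: each step goes one level down.
journal = root.findtext("PubmedArticle/MedlineCitation/Article/Journal/Title")
print(journal)
```

findtext returns None when any step of the path is missing, which is worth checking when processing many records.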

XML data from repositories

Both the NCBI SRA and EBI Metagenomics (and ENA) archives are accessible programmatically to mine data about experiments and sequencing runs. We’ll cover these aspects in separate notes.
