Parsing XML using Elixir (Mostly)

For the last couple of weeks I’ve been diving into Elixir. So far it’s been a lot of fun learning. However, yesterday I got bogged down trying to understand an XML parsing example, which I’ve finally managed to figure out. Woo for me!

As a beginner in the Elixir ecosystem, I find it a little bit tricky: the language is relatively young and still evolving, so coding examples that worked 12 months ago may now be broken in one or more places, all likely to trip up the novice (yes, I mean me). Additionally, Elixir makes use of Erlang modules (where it makes sense), so as a learner you’ll inevitably end up having to learn how to integrate the two. It sounds like a lot, I know, but it’s worth it.

Now enough chat! Let’s take a look at the example.

One Tiny Step for Mankind

To make it easy, as this is the very first step, we start off with some trivial XML markup embedded into the code.

@xml """
<html>
  <head>
    <title>XML Parsing</title>
  </head>
  <body>
    <p>Some interesting text</p>
  </body>
</html>
"""

This defines a so-called module attribute, which acts as a constant inside the enclosing module. What we’re aiming for is to pull out the contents of the title tag, ‘XML Parsing’.

To do this, we’ll make use of an Erlang module xmerl to parse the XML.

{doc, _} = @xml |> :binary.bin_to_list() |> :xmerl_scan.string()

This line has a lot going on. The short explanation is that the XML is piped into :binary.bin_to_list, which converts the Elixir string (a binary) into a list of bytes. The Erlang function string in module xmerl_scan requires that list form, because Erlang strings are lists rather than binaries. The output, a 2-tuple, is pattern matched to retrieve the parsed doc. We ignore the rest.
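To see the two stages separately, here is a minimal sketch that can be pasted into iex or run as a script. It uses a plain variable instead of a module attribute, and the one-line sample document is my own stand-in for the markup above. In a Mix project you may also need to list :xmerl under extra_applications; treat that as an assumption to check against your setup.

```elixir
# Assumed sample document, mirroring the structure used in this post.
xml =
  "<html><head><title>XML Parsing</title></head>" <>
    "<body><p>Some interesting text</p></body></html>"

# :xmerl_scan.string/1 expects an Erlang string (a list), not an Elixir binary.
chars = :binary.bin_to_list(xml)

# The result is a 2-tuple: the parsed root element and whatever input was left over.
{doc, rest} = :xmerl_scan.string(chars)

# The parsed document is a plain tuple: the first slot is the record tag,
# the second is the element name.
IO.inspect(elem(doc, 0))
IO.inspect(elem(doc, 1))
IO.inspect(rest)
```

Inspecting the first two slots is a quick sanity check that the root element is the one you expected before digging any deeper.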

What we have now is a parsed data structure that Erlang understands. Buried in there is the ‘XML Parsing’ text we want. Can you see it?

iex(3)> doc
{:xmlElement, :html, :html, [], {:xmlNamespace, [], []}, [], 1, [],
[{:xmlText, [html: 1], 1, [], ' ', :text},
{:xmlElement, :head, :head, [], {:xmlNamespace, [], []}, [html: 1], 2, [],
[{:xmlText, [head: 2, html: 1], 1, [], ' ', :text},
{:xmlElement, :title, :title, [], {:xmlNamespace, [], []},
[head: 2, html: 1], 2, [],
[{:xmlText, [title: 2, head: 2, html: 1], 1, [], 'XML Parsing', :text}],
[], 'e:/workspace/Elixir/xml_parsing', :undeclared},
{:xmlText, [head: 2, html: 1], 3, [],

The next small step is to use XPath to search this data structure to grab the title tag contents.

[ title_element ] = :xmerl_xpath.string('/html/head/title', doc)

The output of xmerl_xpath.string is a list with one element, which we name title_element. Using Elixir’s interactive shell, iex, we can see its contents:

iex(5)> title_element
{:xmlElement, :title, :title, [], {:xmlNamespace, [], []}, [head: 2, html: 1],
2, [], [{:xmlText, [title: 2, head: 2, html: 1], 1, [], 'XML Parsing', :text}],
[], 'e:/workspace/Elixir/xml_parsing', :undeclared}
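One thing worth knowing about the match above: xmerl_xpath.string returns a list of every node that matches, so the pattern [ title_element ] only succeeds when there is exactly one hit. The sketch below shows what happens with a path that matches nothing. The /html/head/meta path and the sample document are my own illustrative assumptions, and ~c"..." is just the sigil form of a single-quoted charlist.

```elixir
# Assumed sample document, mirroring the structure used in this post.
xml =
  "<html><head><title>XML Parsing</title></head>" <>
    "<body><p>Some interesting text</p></body></html>"

{doc, _rest} = xml |> :binary.bin_to_list() |> :xmerl_scan.string()

# A path that exists yields a list holding one :xmlElement tuple.
matches = :xmerl_xpath.string(~c"/html/head/title", doc)
IO.inspect(length(matches))

# A path that matches nothing yields an empty list, so `[x] = ...`
# would raise a MatchError here.
missing = :xmerl_xpath.string(~c"/html/head/meta", doc)

case missing do
  [] -> IO.puts("no match")
  [element | _] -> IO.inspect(elem(element, 1))
end
```

For a quick experiment the strict match is fine; for real input a case like this one avoids a crash on unexpected documents.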

To pull this data back into an Elixir data structure, we need to set up some form of mapping. We’ll need a couple of record definitions for this purpose.

import Record, only: [defrecord: 2, extract: 2]
defrecord :xmlElement, extract(:xmlElement, from_lib: "xmerl/include/xmerl.hrl")
defrecord :xmlText, extract(:xmlText, from_lib: "xmerl/include/xmerl.hrl")

In short, the Erlang record structures :xmlElement and :xmlText are defined in the library header ‘xmerl/include/xmerl.hrl’. Their field definitions are extracted from that file and redefined as Elixir records, also named :xmlElement and :xmlText. The import statement is required to pull in the defrecord macro and the extract function, both having arity 2.

Note, if you are experimenting in the interactive shell, records need to be declared inside a module; just wrap the lines above with:

iex(10)> defmodule XML do
<insert record definitions here>
...(10)> end
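Spelled out in full, the wrapped module might look like the sketch below. The round_trip function is a hypothetical helper of mine, added only to show that code inside the module can use the generated xmlText macro directly, both to build a record and to read a field back.

```elixir
defmodule XML do
  import Record, only: [defrecord: 2, extract: 2]

  # Pull the field definitions out of the Erlang header file and turn them
  # into Elixir record macros named xmlElement and xmlText.
  defrecord :xmlElement, extract(:xmlElement, from_lib: "xmerl/include/xmerl.hrl")
  defrecord :xmlText, extract(:xmlText, from_lib: "xmerl/include/xmerl.hrl")

  # Hypothetical helper: build an :xmlText record, then read its value field.
  def round_trip(chars) do
    text = xmlText(value: chars)
    xmlText(text, :value)
  end
end

IO.inspect(XML.round_trip(~c"hello"))
```

Because round_trip is an ordinary function, callers do not need to know that record macros are involved at all.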

Armed with these records it’s now possible to convert the data retrieved from the Erlang modules into Elixir records. In the shell we can see that the parsed data, title_element, does contain a valid :xmlElement.

iex(8)> import Record
iex(9)> Record.is_record(title_element, :xmlElement)

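We can now grab its contents with some pattern matching. Note here, :content is a field in the xmlElement record. Because defrecord generates macros, run require XML in the shell first, otherwise the call below will not compile.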

iex(12)> [content] = XML.xmlElement(title_element, :content)
[{:xmlText, [title: 2, head: 2, html: 1], 1, [], 'XML Parsing', :text}]

We’re almost there. All that is required is to grab the value from the :xmlText record. But first, let’s confirm it’s valid.

iex(13)> Record.is_record(content, :xmlText)

And we’re done!

iex(14)> XML.xmlText(content, :value)
'XML Parsing'
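Putting the shell steps together, a small module could look something like the sketch below. TitleParser and its title function are hypothetical names of my own, not part of xmerl or of the gist. It uses the record macros as patterns, which checks the record tags and extracts the value in one go, and it converts the resulting charlist into a regular Elixir string.

```elixir
defmodule TitleParser do
  import Record, only: [defrecord: 2, extract: 2]

  defrecord :xmlElement, extract(:xmlElement, from_lib: "xmerl/include/xmerl.hrl")
  defrecord :xmlText, extract(:xmlText, from_lib: "xmerl/include/xmerl.hrl")

  # Returns {:ok, title} as an Elixir string, or :error when the XPath does
  # not yield exactly one element whose only child is a text node.
  def title(xml) when is_binary(xml) do
    {doc, _rest} = xml |> :binary.bin_to_list() |> :xmerl_scan.string()

    # Record macros work in patterns: this matches a single :xmlElement whose
    # content is a single :xmlText, and binds that text node's value.
    with [xmlElement(content: [xmlText(value: value)])] <-
           :xmerl_xpath.string(~c"/html/head/title", doc) do
      {:ok, List.to_string(value)}
    else
      _ -> :error
    end
  end
end
```

With the sample markup from this post, TitleParser.title(xml) should give back {:ok, "XML Parsing"}, while a document without a title element gives :error instead of raising.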

Easy, eh?


We took a few small steps into XML parsing using Elixir with some help from an Erlang module, and explored how to prototype code using Elixir’s interactive shell. The gist can be found here.
