The Hidden XML Simplifier

Brujo Benavides
Erlang Battleground
Nov 15, 2016

Have you ever had to parse XML with Erlang? Have you used xmerl for that? Have you seen all the records returned by xmerl functions and thought “Woah, this is even more verbose than XML itself!”? Have you wished there was a better, nicer, simpler way to represent an XML document in Erlang terms?

No? Great! I’m so happy for you! You can safely skip this article altogether.

Now, if you have fought your battles with #xmlElement, #xmlNamespace and their friends and wished there was a better way to represent simple XML documents in Erlang terms… well… there is one! And it doesn’t involve any third-party library, although there are a couple of great ones around.

A simple XML document (by John Kerouac)

Introduction

So, as an example, let’s say you have to parse the following XML file…

family.xml

Unlike with JSON, there is an XML parser that comes bundled with OTP: xmerl. So, since you already have it, you decide to use it to parse the above file. After a little bit of googling, you determine that all you need is this function:

1> xmerl_scan:file("family.xml").
{{xmlElement,family,family,[],
{xmlNamespace,'http://world.com/family',[]},
[],1,
[{xmlAttribute,xmlns,[],[],[],
[{family,1}],
1,[],"http://world.com/family",false}],
[{xmlText,[{family,1}],1,[],"\n ",text},
{xmlElement,parents,parents,[],
{xmlNamespace,'http://world.com/family',[]},
[{family,1}],
2,[],
[{xmlText,
[{parents,2},{family,1}],
1,[],"\n ",text},
{xmlElement,person,person,[],{xmlNamespace,...},[...],...},
{xmlText,[{parents,2},{family,...}],3,[],[...],...},
{xmlElement,person,person,[],...},
{xmlText,[{...}|...],5,...}],
[],".",undeclared},
{xmlText,[{family,1}],3,[],"\n ",text},
{xmlElement,children,children,[],
{xmlNamespace,'http://world.com/family',[]},
[{family,1}],
4,[],
[{xmlText,[{children,4},{family,...}],1,[],[...],...},
{xmlElement,person,person,[],...},
{xmlText,[{...}|...],3,...},
{xmlElement,person,...},
{xmlText,...}],
[],".",undeclared},
{xmlText,[{family,1}],5,[],"\n",text}],
[],".",undeclared},
[]}

Oh, well… that was verbose! First of all, the docs state clearly that the result of xmerl_scan:file/1 is a tuple {xmlElement(), Rest} and xmlElement() = #xmlElement{}. So, let’s try adding the record definitions to our shell…

2> rr(code:lib_dir(xmerl) ++ "/include/xmerl.hrl").
[xmerl_event,xmerl_fun_states,xmerl_scanner,xmlAttribute,
xmlComment,xmlContext,xmlDecl,xmlDocument,xmlElement,
xmlNamespace,xmlNode,xmlNsNode,xmlObj,xmlPI,xmlText]
3> {Element, _} = xmerl_scan:file("family.xml"), Element.
#xmlElement{
name = family,expanded_name = family,nsinfo = [],
namespace =
#xmlNamespace{
default = 'http://world.com/family',nodes = []},
parents = [],pos = 1,
attributes =
[#xmlAttribute{
name = xmlns,expanded_name = [],nsinfo = [],namespace = [],
parents = [{family,1}],
pos = 1,language = [],value = "http://world.com/family",
normalized = false}],
content =
[#xmlText{
parents = [{family,1}],
pos = 1,language = [],value = "\n ",type = text},
#xmlElement{
name = parents,expanded_name = parents,nsinfo = [],
namespace =
#xmlNamespace{
default = 'http://world.com/family',nodes = []},
parents = [{family,1}],
pos = 2,attributes = [],
content =
[#xmlText{
parents = [{parents,2},{family,1}],
pos = 1,language = [],value = "\n ",type = text},
#xmlElement{
name = person,expanded_name = person,nsinfo = [],
namespace =
#xmlNamespace{default = 'http://world.com/family',...},
parents = [{...}|...],
pos = 2,...},
#xmlText{
parents = [{parents,2},{family,1}],
pos = 3,language = [],value = "\n ",type = text},
#xmlElement{
name = person,expanded_name = person,nsinfo = [],
namespace = {...},...},
#xmlText{
parents = [{parents,...},{...}],
pos = 5,language = [],...}],
language = [],xmlbase = ".",elementdef = undeclared},
#xmlText{
parents = [{family,1}],
pos = 3,language = [],value = "\n ",type = text},
#xmlElement{
name = children,expanded_name = children,nsinfo = [],
namespace =
#xmlNamespace{
default = 'http://world.com/family',nodes = []},
parents = [{family,1}],
pos = 4,attributes = [],
content =
[#xmlText{
parents = [{children,4},{family,1}],
pos = 1,language = [],value = "\n ",type = text},
#xmlElement{
name = person,expanded_name = person,nsinfo = [],
namespace = {...},...},
#xmlText{
parents = [{children,...},{...}],
pos = 3,language = [],...},
#xmlElement{name = person,expanded_name = person,...},
#xmlText{parents = [...],...}],
language = [],xmlbase = ".",elementdef = undeclared},
#xmlText{
parents = [{family,1}],
pos = 5,language = [],value = "\n",type = text}],
language = [],xmlbase = ".",elementdef = undeclared}

Well… That’s clearer but it’s still a lot of information. And I hear you, Erlang masters: We’re not supposed to inspect those records visually. We should use the functions provided in the xmerl modules to walk through them. The extra info provided by the #xml… records is there precisely for that.
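For the record, walking the records “the proper way” could look something like the sketch below. It uses xmerl_xpath:string/2 on a document shaped like family.xml, passed in as a string to keep things self-contained. The module and function names (family_xpath, names/1, origins/1) are made up for illustration; they are not part of xmerl.

```erlang
-module(family_xpath).
-export([names/1, origins/1]).

%% Brings in the #xmlText{} and #xmlAttribute{} record definitions.
-include_lib("xmerl/include/xmerl.hrl").

%% Hypothetical helpers (not part of xmerl): each one scans an XML
%% string and queries the resulting #xmlElement{} with XPath.

%% Returns the text found directly inside every <person> element.
names(Xml) ->
    {Doc, _Rest} = xmerl_scan:string(Xml),
    Nodes = xmerl_xpath:string("//person/text()", Doc),
    [Value || #xmlText{value = Value} <- Nodes].

%% Returns the value of the origin attribute of every <person> element.
origins(Xml) ->
    {Doc, _Rest} = xmerl_scan:string(Xml),
    Nodes = xmerl_xpath:string("//person/@origin", Doc),
    [Value || #xmlAttribute{value = Value} <- Nodes].
```

It works, but you still have to know which record each query returns and which field holds the data you care about.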

Nevertheless, more often than not, I find myself wanting a simpler representation of the XML, with just the basic data that I need, if possible in tuple format.

Enter xmerl_lib

Luckily for me, there is one! It’s just not documented. I don’t remember when or how I found it the first time, but if you google xmerl_lib, you will find its code in the OTP repository and some sort of documentation dating back to Erlang/OTP 17.4.1. If you try to find the same module in OTP 19.x docs, it will not be there anymore.

In the Erlang/OTP 17.4.1 docs you can find, along with a couple of other really cool-looking functions like foldxml/3 and mapxml/2, a nice function called simplify_element/1. Let’s try to use it, shall we?

4> {Element, _} = xmerl_scan:file("family.xml"),
4> xmerl_lib:simplify_element(Element).
{family,[{xmlns,"http://world.com/family"}],
        ["\n ",
         {parents,[],
                  ["\n ",
                   {person,[{gender,"male"},{origin,"Argentina"}],
                           ["Javier Lopez"]},
                   "\n ",
                   {person,[{gender,"female"},{origin,"Ecuador"}],
                           ["Elvira Perez"]},
                   "\n "]},
         "\n ",
         {children,[],
                   ["\n ",
                    {person,[{gender,"male"},{origin,"Argentina"}],
                            ["Armando Lopez"]},
                    "\n ",
                    {person,[{gender,"male"},{origin,"Guatemala"}],
                            ["Luciano Lopez"]},
                    "\n "]},
         "\n"]}

Hey!! Now we’re talking! It’s much closer to what we were actually looking for, right? We just need to get rid of that nasty whitespace. And we can do that with a cool trick that took me a couple of attempts and some actual code checking to discover…

5> f(Element),
5> {Element, _} =
5>   xmerl_scan:file("family.xml", [{space, normalize}]),
5> [Clean] = xmerl_lib:remove_whitespace([Element]),
5> xmerl_lib:simplify_element(Clean).
{family,[{xmlns,"http://world.com/family"}],
        [{parents,[],
                  [{person,[{gender,"male"},{origin,"Argentina"}],
                           ["Javier Lopez"]},
                   {person,[{gender,"female"},{origin,"Ecuador"}],
                           ["Elvira Perez"]}]},
         {children,[],
                   [{person,[{gender,"male"},{origin,"Argentina"}],
                            ["Armando Lopez"]},
                    {person,[{gender,"male"},{origin,"Guatemala"}],
                            ["Luciano Lopez"]}]}]}

That’s exactly what I was looking for. Each XML element is now represented by a tuple with 3 elements:

  • the tag name (e.g. person)
  • the attributes (e.g. [{gender, …}, …])
  • the content, which is a list of children elements and/or strings
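Once the document is in that shape, plain pattern matching is enough to walk it. Here is a minimal sketch that collects every person in a simplified tree; it assumes each person element contains exactly one string, as in the output above, and the module name family_simple is made up for illustration.

```erlang
-module(family_simple).
-export([people/1]).

%% Walks a simplified element ({Tag, Attributes, Content}) and returns
%% a {Name, Attributes} pair for every person element found in it.
%% Assumes a person's content is exactly one string.
people({person, Attrs, [Name]}) ->
    [{Name, Attrs}];
%% Any other element: look for persons among its children.
people({_Tag, _Attrs, Content}) ->
    lists:flatmap(fun people/1, Content);
%% Strings (text content) contain no persons.
people(Text) when is_list(Text) ->
    [].
```

Calling family_simple:people/1 on the tuple shown above should give you a flat list with one {Name, Attributes} pair per person, parents first.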

What’s going on here?

The only thing that I left unexplained today is the whole whitespace removal trick. I’ve got to say that with so many undocumented functions in xmerl_lib, there might be a better way to do it. In any case, I’ll explain the one that I found:

This is the definition of remove_whitespace/1 in OTP 19:

remove_whitespace([#xmlText{value = " "} | Data]) ->
    remove_whitespace(Data);
remove_whitespace([E = #xmlElement{content = Content} | Data]) ->
    [ E#xmlElement{content = remove_whitespace(Content)}
    | remove_whitespace(Data)
    ];
remove_whitespace([Other | Data]) ->
    [Other | remove_whitespace(Data)];
remove_whitespace([]) ->
    [].

There is no spec, but from the code we can see that it’s a recursive function that works on lists of #xml… records by basically removing all instances of #xmlText{value = " "}.
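To see those clauses in action, here is a small sketch that feeds the function a hand-built list of #xmlText records. The module name ws_demo and the sample values are made up for illustration, and the behaviour described assumes the definition quoted above.

```erlang
-module(ws_demo).
-export([kept_values/0]).

%% Brings in the #xmlText{} record definition.
-include_lib("xmerl/include/xmerl.hrl").

%% Builds three text nodes by hand and returns the values of the ones
%% that survive xmerl_lib:remove_whitespace/1.
kept_values() ->
    Nodes = [#xmlText{value = " "},            % exactly one space
             #xmlText{value = "\n  "},         % newline plus two spaces
             #xmlText{value = "Javier Lopez"}],
    [Value || #xmlText{value = Value} <- xmerl_lib:remove_whitespace(Nodes)].
```

Going by the first clause, only the node whose value is exactly one space is dropped; the other two nodes fall through to the [Other | Data] clause and are kept as they are.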

I didn’t have a list of #xml… records; I had just one #xmlElement. That’s why I wrapped it in a list. Since it is not an #xmlText, remove_whitespace/1 returns a list containing just that element, cleaned up, so I could pattern-match it out of the result and call it Clean.

But there was still one more catch. The whitespace I wanted to remove was not just " " (a single space character): it included newlines, tabs, consecutive spaces, etc. I had to find a way to normalize it. That is actually documented within the option_list() type (which is used by xmerl_scan:file/2):

{space, Flag}: 'preserve' (default) to preserve spaces, 'normalize' to accumulate consecutive whitespace and replace it with one space.

With that change, I could put all the pieces together and get the result that I needed in the first place.
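If you want all of that outside the shell, the steps can be wrapped in a tiny module. This is just a sketch: the module name xml_simple and its functions are hypothetical, and from_string/1 assumes xmerl_scan:string/2 treats the space option the same way xmerl_scan:file/2 does.

```erlang
-module(xml_simple).
-export([from_file/1, from_string/1]).

%% Hypothetical wrapper around the steps shown in this post.

%% Parses the XML file at Path and returns it in simple form.
from_file(Path) ->
    {Element, _Rest} = xmerl_scan:file(Path, [{space, normalize}]),
    clean(Element).

%% Same as from_file/1, but for XML already held in a string.
from_string(Xml) ->
    {Element, _Rest} = xmerl_scan:string(Xml, [{space, normalize}]),
    clean(Element).

%% Drops the normalized single-space text nodes, then converts the
%% remaining records into {Tag, Attributes, Content} tuples.
clean(Element) ->
    [Clean] = xmerl_lib:remove_whitespace([Element]),
    xmerl_lib:simplify_element(Clean).
```

With that in place, xml_simple:from_file("family.xml") should return the same tuple as the last shell session above. Keep in mind that it leans on undocumented functions, so it may need adjusting if xmerl_lib changes.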

Final Notes

If I manage to find time, I’ll try to send a PR with this to Federico Carrone’s erlang-katana. I wouldn’t mind if one of my readers beat me to it ;)

Thanks Harenson for pushing me to write this post after countless times googling and asking around for that xmerl function that returns tuples instead of records.

I created a playground in tryerl with what I showed in this post. There, if not on your own computer, you can play around with all the functions in xmerl_lib. If you find something interesting, please share it in the comments below :)
