From XML to Pandas dataframes

Roberto Preste
4 min readMay 29, 2019

--

XML is a markup language used to represent and distribute data structures which can be often difficult to create using more standard tabular formats.

Basically, the XML format is similar to HTML (which is another markup language, indeed), in that data are organised in elements, which define the type of information exposed, and each element contains the actual value in the form of content or attributes.

Photo by Florian Olivo on Unsplash

The XML page on Wikipedia offers an extensive overview of all the details and technicalities of this format, but the key concepts are simple. Each piece of information is delimited by a specific tag, like this:

<data>
<student name="John">
<email>john@mail.com</email>
<grade>A</grade>
<age>16</age>
</student>
<student name="Alice">
<email>alice@mail.com</email>
<grade>B</grade>
<age>17</age>
</student>
<student name="Bob">
<email>bob@mail.com</email>
<grade>C</grade>
<age>16</age>
</student>
<student name="Hannah">
<email>hannah@mail.com</email>
<grade>A</grade>
<age>17</age>
</student>
</data>

In this example, each student is represented by a <student> element, which has a name attribute containing the name of a specific student. Each of these elements has then sub-elements defined by the <email>, <grade> and <age> tags; between these tags the actual data content referring to the given student is present. Let’s say this data is saved in an XML file called “students.xml”.

We can think of this structure as a pandas DataFrame in which each student represents an observation, with its name attribute being the main identifier and the sub-elements being other features of the observation. A tabular representation of these data would be like this:

So we want to find a way to convert XML-structured data to a more functional table.

Given the structure of XML files, we can represent them as a tree, and this is the approach used by the xml.etree.ElementTree Python module. The parsing of our “students.xml” file starts at the root of the tree, namely the <data> element, which contains the entire data structure.

import xml.etree.ElementTree as et 

xtree = et.parse("students.xml")
xroot = xtree.getroot()

Now we can iterate through each node of the tree, which means we will get each student element and grab its name attribute and all of its sub-elements to build our dataframe.

for node in xroot: 
s_name = node.attrib.get("name")
s_mail = node.find("email").text
s_grade = node.find("grade").text
s_age = node.find("age").text

In order to get the name attribute, we use the attrib.get() function, while the text content of each element can be retrieved using the find() function of nodes.
Each iteration will return a set of data that can be thought as an observation in a pandas DataFrame; we can build this procedure as follows:

import pandas as pd 
import xml.etree.ElementTree as et

xtree = et.parse("students.xml")
xroot = xtree.getroot()

df_cols = ["name", "email", "grade", "age"]
rows = []

for node in xroot:
s_name = node.attrib.get("name")
s_mail = node.find("email").text if node is not None else None
s_grade = node.find("grade").text if node is not None else None
s_age = node.find("age").text if node is not None else None

rows.append({"name": s_name, "email": s_mail,
"grade": s_grade, "age": s_age})

out_df = pd.DataFrame(rows, columns = df_cols)

The downside to this approach is that you need to know the structure of the XML file in advance, and you have to hard-code column names accordingly.
We can try to convert this code to a more useful and versatile function, without having to hard-code any values:

import pandas as pd
import xml.etree.ElementTree as et

def parse_XML(xml_file, df_cols):
"""Parse the input XML file and store the result in a pandas
DataFrame with the given columns.

The first element of df_cols is supposed to be the identifier
variable, which is an attribute of each node element in the
XML data; other features will be parsed from the text content
of each sub-element.
"""

xtree = et.parse(xml_file)
xroot = xtree.getroot()
rows = []

for node in xroot:
res = []
res.append(node.attrib.get(df_cols[0]))
for el in df_cols[1:]:
if node is not None and node.find(el) is not None:
res.append(node.find(el).text)
else:
res.append(None)
rows.append({df_cols[i]: res[i]
for i, _ in enumerate(df_cols)})

out_df = pd.DataFrame(rows, columns=df_cols)

return out_df

An interesting piece of code is the dict comprehension in:

{df_cols[i]: res[i] for i, _ in enumerate(df_cols)}

where a dictionary in the form column name: node value is created and added to the list of rows. These will finally be used to create the complete dataframe.

If we apply our function to the “students.xml” file using parse_XML("students.xml", ["name", "email", "grade", "age"]), the result is precisely the table we saw above.

This is a more efficient implementation of the XML parsing function, although we still need to know the basic structure of the input XML document we want to parse.

--

--

Roberto Preste

I’m a Software Engineer, with a PhD in Bioinformatics, located in Cambridge (UK). I like writing about programming, data science and bioinformatics.