Parsing XML files in python

Prajwal Khairnar
Analytics Vidhya
Published in
4 min readOct 15, 2020

How to efficiently extract data from an XML file using simple python code in an easily manipulative form

XML (Extensible Markup Language) is a markup language which is very similar to HTML (Hypertext Markup Language). XML is used to structure data for transport and storage. It uses text and tags to describe the components in a file. XML files are a type of semi-structured data.

While working on a personal project in Python, I realized the need to extract the data from XML files into a suitable formats like CSV. It is easier to work with data present in such formats. Following is a simple way which I used to get the task done.

How to interpret the data in an XML file?

Above is a part of the XML file which we will be using for extracting data. You can download the original file from Kaggle.

This file has records for jobs in the UK, each with certain fields of information such as job title, job description, etc. A pre-requisite for this method is that we need to know the structure of the XML file. Here the <uniq_id> is used to mention a unique ID for the record. <url>, <job_title>, <category>, etc. tags contain the descriptive values for every job referred to by the current unique ID.

1. Importing necessary python libraries

import pandas as pd
import numpy as np
import xml.etree.cElementTree as et

we will be using the xml package from which we import etree.cElementTree as ‘et’ for working simplicity. ‘Element’ is a flexible container object which is used to store hierarchical data structures in the memory.

2. Parsing the XML file

tree=et.parse('/content/drive/My Drive/Colab Notebooks/Monster_UK_Jobs/monster_uk-job_sample.xml')root=tree.getroot()

Using et.parse() function we parse the XML file into object ‘tree’ (Replace the file path with the path to your XML file). The parsing of our file will start at the root of the tree. Using the .getroot() function, we get the root of the parsed tree. We can now use the root to iterate through our file.

3. Creating lists for record values

We can observe the structure of our XML file and determine the fields that are associated with each record. These fields are nothing but the values in tags like <url>, <job_title>, <category>, etc in our case. To store the extracted data, we create empty lists for each such tag value/field.

Timestamp = []
URL = []
Job_Title = []
Category = []
Company_Name = []
City = []
Post_Date = []
Job_Description = []
Company_Description = []
Job_Board = []
Geo = []
Location = []

4. Converting the data

Now using the root, we iterate through our file and store values in our lists that we created in the previous step.

for time in root.iter('crawl_timestamp'):
print(time.text)
print("step1")
Timestamp.append(time.text)
for url in root.iter('url'):
print(url.text)
print("step2")
URL.append(url.text)
for title in root.iter('job_title'):
print("step3")
Job_Title.append(title.text)
for cat in root.iter('category'):
print("step4")
Category.append(cat.text)
for cname in root.iter('company_name'):
print("step5")
Company_Name.append(cname.text)
for ct in root.iter('city'):
print("step6")
City.append(ct.text)
for pdate in root.iter('post_date'):
print("step7")
Post_Date.append(pdate.text)
for jd in root.iter('job_description'):
print("step8")
Job_Description.append(jd.text)
for cd in root.iter('company_description'):
print("step9")
Company_Description.append(cd.text)
for job_bd in root.iter('job_board'):
print("step10")
Job_Board.append(job_bd.text)
for go in root.iter('geo'):
print("step11")
Geo.append(go.text)
for loc in root.iter('location'):
print("LAST STEP")
Location.append(loc.text)

Using root.iter(‘__’) function by filling in value for each respective field/tag, we can write a separate for loop for each field. I have added a print statement in each for loop to understand the process flow during execution.

We have now parsed our data and stored the values in separate lists. The next step is to convert it into a pandas dataframe to make the data easily usable.

5. Creating a pandas dataframe

Jobs_df = pd.DataFrame(
list(zip(Job_Title, Category, Company_Name, Job_Description, Company_Description, Post_Date, Job_Board, City, Geo, Location, Timestamp, URL)),
columns=['Job_Title', 'Category', 'Company_Name', 'Job_Description', 'Company_Description', 'Post_Date', 'Job_Board', 'City', 'Geo', 'Location', 'Timestamp', 'URL'])

Selecting all the lists in desired order and adding column names for the respective columns, we used pd.DataFrame() function to create a dataframe.

After executing the above code, you can view the created dataframe, it will look as follows,

Dataframe Preview

6. Saving the extracted data (CSV Format)

We can now save our dataframe as a csv file for easy storage and usage later.

Jobs_df.to_csv("monster_uk-job_sample.csv")

We have successfully parsed the XML file and extracted data for use!

For complete code and data, you can also visit the following Github repository.

--

--

Prajwal Khairnar
Analytics Vidhya

Data Scientist | IT Engineer | Research interests include Statistics | NLP | Machine Learning, Data Science and Analytics, Clinical Trials