Processing large XML files

In the rail industry we frequently have to download, parse and save huge amounts of data. These data feeds are in a variety of formats (XML, JSON, CSV and custom flat files) and from an array of different sources like FTP, queues, API calls, copying over SSH.

Most of the back-end code we have is written in PHP. Both the services we expose to the public and the scripts we use to manage background jobs like ticket fulfilment, reporting and updating data. Having a ubiquitous language has many benefits and there are a lot of reasons we like PHP. Unfortunately all languages have their drawbacks and in the case of PHP performance can be an issue when it comes to crunching large amounts of data. PHP doesn’t have brilliant support for multithreading or a good API for streaming file input , not to say it can’t be done, you just have to go to quite a lot of extra effort.


Recently our tech team started a good natured competition to speed up the processing of some of this data, so naturally I decided I would take part. The issue we have is parsing a 2GB XML file. It frequently blocks a server for a full hour, causing our system admin quite a few headaches. I wanted to use this competition as a chance to experiment with multithreaded applications (or rather a script in this case) and also to try a relatively new language, Kotlin. I have used Java and Kotlin before but it was typically for mobile applications where the CPU and memory performance considerations are very different.

Task details and goal

Long story short: Every evening some company uploads us a few zip files which include the XML data for RCS (Retail Control Service). RCS basically tells the industry who can retail what fares and by what fulfillment methods. The zip we are interested in is not too big but it contains ~2GB xml file with ~20 000 000 chunks of data we want to save. Each part of this data is spread on few descending elements and their attributes. Here you have the structure we need to read:

rcs_flow.xml

It is very simple and we don’t have to do any processing before saving it to the DB. Just some date formatting and parsing fm attribute to DB set. The primary goal of the challenge is to make the script as fast as possible. Besides that, as you can imagine, it would be good not to use all of the resources, we need to allow others to operate somehow.

I almost forgot, we would like to make it robust and reliable so that our administrator can sleep well.

Setup local environment

Before I could start with the task I had to install a few things on my local machine. I know that we have tools nowadays like vagrant, docker and so on, but I didn’t want to spend any time on that so I installed MySQL server locally and that was it basically, because I already had the Java development environment for Android development and my favourite IDE, IntelliJ IDEA.

Division the problem

Every computer scientist, every programmer and most of the people know that it is much simpler to finish a large job by dividing it into as many parts as possible. So I divided my task into parts like that:

  • load the XML content,
  • parse it into some dead simple data objects,
  • save each of that object into database,
  • do the final optimization,
  • parametrize the program.

Load the XML

I had a choice of manually unzipping and reading the whole 2GB file or using the Java zip stream from the 30MB file. I figured I would try to use the zip file and fall back to the unzipped file if it was faster.

Yes, yes I’m lazy. I put the zip file right into the application resources, but I didn’t want to play more with it as it worked just fine. Kotlin apply function helped here with the code readability a bit.

Parsing

Parsing the XML file wasn’t the simplest part of the task. Java has a couple of APIs for processing XML and I wanted to try them all. As you can see in the snippet above I ended up with the StAX library, but before digging into that I will write a few words about the first approach and I think most common: DOM.


DOM (Document Object Model) was my first approach. I knew it would blow my memory usage and freeze everything but I wanted to give it a try. As you can imagine my RAM usage was huge and I couldn’t stop the Java process.

Then I read some tutorials about the SAX (Simple API for XML) approach and wanted to try it together with the DOM approach. I thought I could load chunks of the data using the SAX and parse them using the DOM and then have nicely described data objects using Kotlin DSL capabilities. I found one implementation of that approach and also nice Kotlin library for XML.

I started with the SAX itself but it turns out that this is not as performant a way of doing this task as I would like it to be. Memory usage was much smaller than before because of streaming the content and processing it when it comes and lack of the internal representation of the document tree.

There is one more problem that I didn’t think of before.

SAX simply sends data to the application as it is read; your application can then do whatever it wants to do with the data it sees.
(source)

It sends data to the application. This this is the push model. It basically means that the app cannot control when it comes, cannot pause the reading process to do some other job and not fill all of the remaining memory with pending xml parsing results.


StAX to the rescue. Streaming API for XML is the last API I tried to do the job and I think it fits the most in here.

To summarize, StAX provides a standard, bidirectional pull parser interface for streaming XML processing, offering a simpler programming model than SAX and more efficient memory management than DOM.
(source)

If you are looking for a better comparison of all this methods please take a look here: Comparing StAX to Other JAXP APIs

As you can see, every RcsFlow data object is put into rcsFlows which is LinkedBlockingQueue with defined capacity. It forces the reader to wait until another part of the program dequeue some of the objects from that queue.

Save to the database

We are almost there. Only one important task left to do. This is saving the data to the database.

Fortunately, I already had a library to manage the DB connections somewhere in another project configured and tested so that I could copy it here. Of course my first thought was to use some shiny ORM. Don’t worry, I haven’t done that. This is only one entity type, so I have done this with plain Java API prepared statement, setting the parameters and executing it.

First approach: Save each row separately executing all of the small insert statements, each in a separate transaction.

Second approach: Save couple of rows as batch statement and then commit and close connection. It was the best solution I could think of. I’m sure there are hundreds of better ways or configuration options for the database connection, but I was pretty happy to stay with my solution.

After tweaking the configuration of the HikariCP connection pool and refactoring the code a bit I focused on the concurrency issues and way of parametrizing the script.

Multithreading

At the beginning there was only one thread reading and saving to database. It was before introducing the StAX API for reading the xml that allowed me to use already mentioned LinkedBlockinQueue. Thanks to that I could run the script in multiple threads. One that reads the xml and the second that saves elements from the queue to the database. To achieve that I used Java Executor implementation.

This approach though required me to create another work queue that held the same objects. I wasn’t happy with copying the objects again so I’ve done this using the ThreadPoolExecutor. It is the same what Executor is but programmer has more control over the creation of the thread pool.

I spent a lot of time tweaking the way the database insert executions are performed. How many threads to use, how many batch inserts to do, how the database connection should be configured. Also if that threads should wait for each other not to blow the database. It was quite a challenge. Finally, I settled on:

Thread 1: Add data object to the data queue while reading the xml. This queue has capacity that is equal to the amount of entries put into the batch insert (800 by default).

Thread 2: Drain the data queue when it is full to the job queue creating the job objects(runnables).

Thread 3 .. 5: This is the thread pool I’ve written about before that takes the job queue and runs each of the runnable that falls into the queue.

Thread 6: Killer thread. I couldn’t find better solution but to create another thread that is responsible to check from time to time if thread pool is still doing its job and shut it down when there is nothing to do.

Packaging

All that was left was to create jar gradle task to pack all of the things together and send to the world. I had to create fat jar because of the zip inside the resources folder. I also added dead simple parameter parser to be able to control the size of the batch insert and the amount of threads handling the database inserts.

Final run

You may have noticed that Kotlin wasn’t the most important in all of this but it certainly made it a more enjoyable program to write, and in my opinion, an easier program to read.

After all these tweaks the best time I achieved was 149 seconds. During the run the peak memory allocation was ~2GB and all cores were used at ~80%. The program was configured with batch insert containing 500 rows and concurrent threads saving to the database equal to 6. Tests were run on a laptop with quad-core processor with 6MB shared L3 cache, 16GB of DDR3L onboard memory and 256GB PCIe-based flash storage.

A big improvement on the original 1 hour of processing. Let’s see how everyone else does.

Author: Mateusz Angulski