XML Parsing In Ruby

Arun Mathew Kurian
Published in Thoughts on Tech
7 min read · Oct 16, 2017

Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format which is both human-readable and machine-readable. It is defined by the W3C’s XML 1.0 Specification and by several other related specifications, all of which are free open standards.

The design goals of XML emphasize simplicity, generality, and usability across the Internet. It is a textual data format with strong support via Unicode for different human languages. Although the design of XML focuses on documents, it is widely used for the representation of arbitrary data structures such as those used in web services.

Several schema systems exist to aid in the definition of XML-based languages, while many application programming interfaces (APIs) have been developed to aid the processing of XML data.

There are two ways in which an XML document can be parsed. They are:

  • SAX Parsing
  • DOM Parsing

SAX Parser

SAX (the Simple API for XML) is an event-based parser for XML documents. Unlike a DOM parser, a SAX parser creates no parse tree. SAX is a streaming interface for XML, which means that applications using SAX receive event notifications about the XML document being processed, one element and attribute at a time, in sequential order, starting at the top of the document and ending with the closing of the root element.

  • Reads an XML document from top to bottom, recognizing the tokens that make up a well-formed XML document
  • Tokens are processed in the same order that they appear in the document
  • Reports to the application program the nature of the tokens that the parser has encountered, as they occur
  • The application program provides an “event” handler that must be registered with the parser
  • As the tokens are identified, callback methods in the handler are invoked with the relevant information

SAX parsing is best suited to large documents and to situations where memory is limited. You write callbacks for the events of interest and let the parser proceed through the document.
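The callback flow described above can be sketched with the stream parser in REXML, a Ruby library covered later in this post. This is only a minimal illustration under stated assumptions: the `ElementCounter` class and the inline XML string are made up for the example and are not part of any library.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical handler: counts how often each element name is opened.
class ElementCounter
  include REXML::StreamListener

  attr_reader :counts

  def initialize
    @counts = Hash.new(0)
  end

  # Invoked by the parser each time it encounters a start tag.
  def tag_start(name, _attributes)
    @counts[name] += 1
  end
end

# Illustrative input; any well-formed XML string or IO object would do.
xml = '<Movies><Movie name="A"><Actor>X</Actor><Actor>Y</Actor></Movie></Movies>'

counter = ElementCounter.new
REXML::Document.parse_stream(xml, counter)  # no tree is built here
counter.counts.each { |name, n| puts "#{name}: #{n}" }
```

Because the handler keeps only a small hash of counters, memory use stays roughly constant no matter how large the input is.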

DOM Parser

DOM parsing reads the entire file into memory and stores it in a hierarchical (tree-based) form that represents all the features of an XML document. The DOM is a common interface for manipulating document structures. One of its design goals is that code written for one DOM-compliant parser (Java code, for example) should run on any other DOM-compliant parser without changes.

The DOM defines several interfaces, which language bindings such as Java's expose under the same names. Here are the most common interfaces:

  • Node — The base datatype of the DOM.
  • Element — The vast majority of the objects you’ll deal with are Elements.
  • Attr — Represents an attribute of an element.
  • Text — The actual content of an Element or Attr.
  • Document — Represents the entire XML document. A Document object is often referred to as a DOM tree.
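As a rough Ruby illustration of these node types, the sketch below uses REXML, which is covered later in this post. Note that REXML is not a W3C DOM binding, and its class names differ slightly (for example `REXML::Attribute` rather than `Attr`). The one-line XML string is made up for the example.

```ruby
require 'rexml/document'

xml = '<Movie name="Interstellar"><Year>2014</Year></Movie>'

doc = REXML::Document.new(xml)                       # the whole document tree
movie = doc.root                                     # an Element: <Movie>
name_attr = movie.attributes.get_attribute('name')   # an Attribute node
year = movie.elements['Year']                        # a child Element: <Year>
year_text = year.get_text                            # a Text node

# Print the class of each object alongside its content.
puts "#{doc.class}: root is #{movie.name}"
puts "#{name_attr.class}: #{name_attr.value}"
puts "#{year_text.class}: #{year_text.value}"
```

Every object here lives in memory for as long as the document does, which is what makes arbitrary navigation possible.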

SAX streams through a document once, so it cannot offer the random access to already-parsed content that DOM provides. On the other hand, DOM holds the whole tree in memory, so using it exclusively can exhaust your resources, especially on large files.

SAX is read-only, while DOM allows changes to the XML document. Since the two APIs complement each other, there is no reason why you can't use both in a large project.
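To make the read-only versus modifiable distinction concrete, here is a small sketch of tree-style editing using REXML (covered later in this post). The document content is invented for the example; the point is only that an in-memory tree can be changed and then serialized again, which a streaming parser cannot do.

```ruby
require 'rexml/document'

doc = REXML::Document.new('<Movies><Movie name="Interstellar"/></Movies>')

# Add a new <Movie> element with an attribute and a child element.
movie = doc.root.add_element('Movie', 'name' => 'Mad Max Fury Road')
movie.add_element('Year').text = '2015'

# Change an attribute on the first existing <Movie> element.
doc.root.elements['Movie'].attributes['name'] = 'Interstellar (2014)'

# Serialize the modified tree back to a string.
output = doc.to_s
puts output
```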

Methods of XML Parsing in Ruby

There are many ways in which XML documents can be parsed in Ruby.

REXML

REXML is a pure-Ruby XML processor conforming to the XML 1.0 standard. It is a nonvalidating processor, passing all of the OASIS nonvalidating conformance tests. REXML was inspired by the Electric XML library for Java, which features an easy-to-use API, small size, and speed. REXML was designed with the same philosophy and aims for the same features. According to its documentation, the API is meant to be as intuitive as possible and follows Ruby conventions for method naming and code flow rather than mirroring the Java API.

REXML parser has the following advantages over other available parsers:

  • It is written 100 percent in Ruby.
  • It can be used for both SAX and DOM parsing.
  • It is lightweight: less than 2000 lines of code.
  • Its methods and classes are easy to understand.
  • It offers a SAX2-based API and full XPath support.
  • It ships with the standard Ruby installation (as a bundled gem since Ruby 3.0), so no separate installation is normally required.

Consider an XML document, movies.xml:

<Movies genre="Science Fiction">
  <Movie name="Interstellar">
    <Year>2014</Year>
    <Actor character="Cooper">Mathew McConaughey</Actor>
    <Actor character="Dr Brand">Michael Caine</Actor>
    <Director>Christopher Nolan</Director>
  </Movie>
  <Movie name="Mad Max Fury Road">
    <Year>2015</Year>
    <Actor character="Max">Tom Hardy</Actor>
    <Actor character="Furiosa">Charlize Theron</Actor>
    <Director>George .S .Miller</Director>
  </Movie>
</Movies>

Now consider this Ruby script:

#!/usr/bin/ruby -w
require 'rexml/document'
include REXML

xmlfile = File.new("movies.xml")
xmldoc = Document.new(xmlfile)

# Now get the root element
root = xmldoc.root
puts "Root element : " + root.attributes["genre"]

# This will output all the movie titles.
xmldoc.elements.each("Movies/Movie") do |e|
  puts "Movie Title : " + e.attributes["name"]
end

# This will output all the movie actors.
xmldoc.elements.each("Movies/Movie/Actor") do |e|
  puts "Movie Actor : " + e.text + " as " + e.attributes["character"]
end

# This will output all the movie directors.
xmldoc.elements.each("Movies/Movie/Director") do |e|
  puts "Movie Director : " + e.text
end

This will produce the following result:

Root element : Science Fiction
Movie Title : Interstellar
Movie Title : Mad Max Fury Road
Movie Actor : Mathew McConaughey as Cooper
Movie Actor : Michael Caine as Dr Brand
Movie Actor : Tom Hardy as Max
Movie Actor : Charlize Theron as Furiosa
Movie Director : Christopher Nolan
Movie Director : George .S .Miller

SAX-like Parsing:

To process the same data file, movies.xml, in a stream-oriented way, we define a listener class whose methods will be the target of callbacks from the parser.

#!/usr/bin/ruby -w
require 'rexml/document'
require 'rexml/streamlistener'
include REXML

class MyListener
  include REXML::StreamListener

  def tag_start(*args)
    puts "tag_start: #{args.map {|x| x.inspect}.join(', ')}"
  end
end

list = MyListener.new
xmlfile = File.new("movies.xml")
Document.parse_stream(xmlfile, list)
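The listener above only logs start tags. A slightly fuller sketch, assuming the same movies.xml structure (a fragment is inlined here so the example runs on its own), keeps state across the `tag_start`, `text`, and `tag_end` callbacks to pair each actor with a character. The `ActorListener` class name is hypothetical.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener that pairs each actor with the character played.
class ActorListener
  include REXML::StreamListener

  attr_reader :cast

  def initialize
    @cast = []
    @character = nil
    @buffer = nil   # non-nil only while inside an <Actor> element
  end

  def tag_start(name, attributes)
    return unless name == 'Actor'
    @character = attributes['character']
    @buffer = String.new
  end

  # Text can arrive in more than one chunk, so accumulate it.
  def text(data)
    @buffer << data if @buffer
  end

  def tag_end(name)
    return unless name == 'Actor'
    @cast << [@buffer.strip, @character]
    @buffer = nil
  end
end

xml = <<~XML
  <Movies genre="Science Fiction">
    <Movie name="Mad Max Fury Road">
      <Year>2015</Year>
      <Actor character="Max">Tom Hardy</Actor>
      <Actor character="Furiosa">Charlize Theron</Actor>
    </Movie>
  </Movies>
XML

listener = ActorListener.new
REXML::Document.parse_stream(xml, listener)
listener.cast.each { |actor, role| puts "#{actor} as #{role}" }
```

Since a stream parser never hands you a tree, any context you need (here, which element you are currently inside) has to be tracked by the listener itself.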

XPath and Ruby:

An alternative way to view XML is XPath. This is a kind of pseudo-language that describes how to locate specific elements and attributes in an XML document, treating that document as a logical ordered tree.

REXML has XPath support via the XPath class.

#!/usr/bin/ruby -w
require 'rexml/document'
include REXML

xmlfile = File.new("movies.xml")
xmldoc = Document.new(xmlfile)

# Info for the first movie found
movie = XPath.first(xmldoc, "//Movie")
p movie

# Print out all the movie actors
XPath.each(xmldoc, "//Actor") { |e| puts e.text }

# Get an array of all of the movie directors.
names = XPath.match(xmldoc, "//Director").map {|x| x.text }
p names

This will produce the following result:

<Movie name='Interstellar'> ... </>
Mathew McConaughey
Michael Caine
Tom Hardy
Charlize Theron
["Christopher Nolan", "George .S .Miller"]
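XPath expressions can also filter by attribute value or select attributes directly. The following is a small sketch of both, using a shortened copy of the movies data inlined as a string so that it runs standalone.

```ruby
require 'rexml/document'

xml = <<~XML
  <Movies genre="Science Fiction">
    <Movie name="Interstellar">
      <Actor character="Cooper">Mathew McConaughey</Actor>
      <Director>Christopher Nolan</Director>
    </Movie>
    <Movie name="Mad Max Fury Road">
      <Actor character="Max">Tom Hardy</Actor>
      <Director>George .S .Miller</Director>
    </Movie>
  </Movies>
XML
doc = REXML::Document.new(xml)

# Predicate on an attribute: the director of one specific movie.
director = REXML::XPath.first(doc, "//Movie[@name='Interstellar']/Director")
puts director.text

# Select attribute nodes directly; each match is a REXML::Attribute.
characters = REXML::XPath.match(doc, "//Actor/@character").map(&:value)
p characters
```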

XSLT and Ruby:

XSL stands for EXtensible Stylesheet Language and is a style sheet language for XML documents. XSLT stands for XSL Transformations. There are two XSLT processors available that Ruby can use. A brief description of each is given here:

Ruby-Sablotron:

This parser is written and maintained by Masayoshi Takahashi. This is written primarily for Linux OS and requires the following libraries:

Sablot
Iconv
Expat

You can find this module at Ruby-Sablotron.

XSLT4R

XSLT4R is written by Michael Neumann and can be found at the RAA in the Library section under XML. XSLT4R uses a simple command-line interface, though it can alternatively be used within a third-party application to transform an XML document.

XSLT4R needs XMLScan to operate, which is included within the XSLT4R archive and which is also a 100 percent Ruby module. These modules can be installed using the standard Ruby installation method (i.e., ruby install.rb).

XSLT4R has the following syntax:

ruby xslt.rb stylesheet.xsl document.xml [arguments]

If you want to use XSLT4R from within an application, you can include XSLT and input the parameters you need. Here is the example:

require "xslt"

# File.read returns the whole file as a single string.
stylesheet = File.read("stylesheet.xsl")
xml_doc = File.read("document.xml")
arguments = { 'image_dir' => '/….' }

sheet = XSLT::Stylesheet.new( stylesheet, arguments )

# output to StdOut
sheet.apply( xml_doc )

# output to 'str'
str = ""
sheet.output = [ str ]
sheet.apply( xml_doc )

Nokogiri

XML parsing can be done in Ruby with the help of a gem called Nokogiri. Nokogiri is an HTML, XML, SAX, and Reader parser. Among Nokogiri’s many features is the ability to search documents via XPath or CSS3 selectors.

Consider the file collection.xml:

<Collection version="2.0" id="100">
  <Name>Movies</Name>
  <Movie name="Interstellar">
    <Year>2014</Year>
    <Actor character="Cooper">Mathew McConaughey</Actor>
    <Actor character="Dr Brand">Michael Caine</Actor>
    <Director>Christopher Nolan</Director>
    <Keywords>
      <Keyword>Nolan</Keyword>
      <Keyword>Black Hole</Keyword>
      <Keyword>Cooper</Keyword>
    </Keywords>
  </Movie>
  <Movie name="Mad Max Fury Road">
    <Year>2015</Year>
    <Actor character="Max">Tom Hardy</Actor>
    <Actor character="Furiosa">Charlize Theron</Actor>
    <Keywords>
      <Keyword>Miller</Keyword>
      <Keyword>Furiosa</Keyword>
      <Keyword>Max</Keyword>
    </Keywords>
    <Director>George .S .Miller</Director>
  </Movie>
</Collection>

We open the file and parse it with Nokogiri:

require 'nokogiri'

f = File.open("/path/to/the/collection.xml")
doc = Nokogiri::XML(f)

The first thing we’d like to do is select the id attribute from the root. There are two ways you can do this. The first is:

doc.at_xpath("/*/@id")

which returns

#<Nokogiri::XML::Attr:0x3ff90e073644 name="id" value="100">

This is the XML attribute as a Nokogiri::XML::Attr object (which inherits from Node). You can use .value, .text, or .inner_text on the returned object to retrieve the actual value. Notice we’ve used the at_xpath method, which returns only the first match. The xpath method on its own would return a NodeSet (with just one node in this case).

The second way to get a root attribute is to select the root element first:

root = doc.root

which returns the root Element of the document (here, Collection).

Now we can access the id attribute either with a convenient hash-like notation, which returns the value immediately, or with the XPath expression for an attribute, which again returns an XML::Attr object from which we can retrieve the value.

root["id"]

=> "100"

Or

root.at_xpath("@id")

#<Nokogiri::XML::Attr:0x3ff90e073644 name="id" value="100">

Since we’re already positioned at the root element of the document, selecting elements beneath the root is also easy:

root.at_xpath("Name")

=> #<Nokogiri::XML::Element:0x3ff90e072dfc name="Name" children=[#<Nokogiri::XML::Text:0x3ff90e072bf4 "Movies">]>

You can use root.at_xpath("Name").text to retrieve the text value, but only if you’re absolutely sure the element is present. Otherwise at_xpath returns nil and you’ll get a NoMethodError (undefined method `text' for nil:NilClass).

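To select every movie beneath the root, use xpath, which returns all matches rather than just the first:

movies = root.xpath("Movie")
# You'll see the XML for our two movies output to the console.
movies.count

=> 2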

Another important feature is the // XPath search operator, which searches all levels of the document for a matching element name.

doc.xpath("//Keywords")

This returns a NodeSet of the Keywords elements found anywhere in the document (two in this file). To get the individual Keyword elements instead, search for "//Keyword".

Finally, we’ll close our file.

f.close

Of course, the better way to do this in code is to use a

File.open(path) do |f| end

block, which ensures that the file is closed at the end of our Nokogiri session.

And there you have it.

Hope this post helped you to understand the various techniques available in Ruby for XML parsing.
