Understanding NSXMLParser in Swift (Xcode 6.3.1)

Disclaimer

I’m a designer who started to learn Swift a few weeks ago, and I’ll be showing the pitfalls I encountered with the parser while trying to read from an XML file (in order to put this data into a tableView). Please pardon my lack of good code; comments on how to code better (and on how not to repeat myself) are very much welcome.

The basic setup

I’m assuming you have already set up your TableViewController to read from a source and populate the table (as well as setting the cell identifiers, methods for reusing cells, etc). But to parse the XML we need a few things first.

Set your Controller class as the delegate for parsing, and create a parser:

class PodcastTableViewController: UITableViewController, NSXMLParserDelegate {
    var xmlParser: NSXMLParser!
    // rest of code for your controller here
}

Create a function to parse your xml (also inside your Controller). Here you define the URL, create the request and send it asynchronously so the downloading and parsing of the xml doesn’t stop your app from responding:

func refreshPodcasts() {
    let url = NSURL(string: "http://www.blubrry.com/feeds/onorte.xml")
    let rssUrlRequest = NSURLRequest(URL: url!)
    let queue = NSOperationQueue()

    NSURLConnection.sendAsynchronousRequest(rssUrlRequest, queue: queue) {
        (response, data, error) -> Void in
        // data is nil when the request fails, so bail out before parsing
        if error != nil || data == nil {
            return
        }
        self.xmlParser = NSXMLParser(data: data)
        self.xmlParser.delegate = self
        self.xmlParser.parse()
    }
}

And, lastly, call that function in viewDidLoad:

override func viewDidLoad() {
    super.viewDidLoad()
    refreshPodcasts()
}

The XML file

I’ll be trying to make an app for the podcast of a friend, from oene.com.br. Here’s the XML file I’ll be using: http://www.blubrry.com/feeds/onorte.xml.

If you look at the xml file structure, you’ll notice the podcasts have several elements inside them. I’m only interested in some of them, so I define the basic structure for my podcasts in a class:

class Podcast: NSObject {
    var podcastTitle = String()
    var podcastDate = String()
    var podcastLinkInfo = PodcastLinkInfo()
    var podcastDuration = String()
    var podcastSubtitle = String()
    var podcastDescription = String()
    var podcastAudio = NSData()
}

Date is a string because I’m not interested in understanding the date, only showing it. podcastLinkInfo is a class with variables to hold the link, file type, etc of the podcast. podcastAudio will be the actual file data, which I still have no idea how to do. Also, I create an empty array to hold all my podcasts:

var podcasts = [Podcast]()

The file is poorly structured, which makes for an even better case. For example, we have <title> for the general podcast title, and we also have a <title> for each podcast. To solve this, we need to understand the delegate methods of the parser.

4 delegate methods to parse them all

The 4 delegate methods I’m focusing on are didStartElement, foundCharacters, didEndElement and didEndDocument. Here are some important things to know about these methods:

1: didStartElement happens every time the parser finds a <key>

So here:

<item>
<title>#11 - Mídia</title>
<link>http://www.blubrry.com/onorte/2635779/11-mdia/</link>

It calls didStartElement on <item>, then on <title>, then on <link>.

2: didEndElement happens every time the parser finds a </key>

Similar to the above.

3: foundCharacters happens every time the parser finds text inside a <key>

And it may happen more than once per key: foundCharacters tends to stop on line breaks and “special characters.” So in the same example as above:

<item>
<title>#11 - Mídia</title>
<link>http://www.blubrry.com/onorte/2635779/11-mdia/</link>

foundCharacters may be called on “#11 - M” and then on “ídia”. It happens with a lot of other characters too, such as escaped entities (&lt;, &gt;, &amp;), and the list goes on. This means that the contents of your key will very likely be parsed in fragments.
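To see the fragments for yourself, here is a minimal sketch that records every string handed to foundCharacters. It assumes the current Foundation names (XMLParser and XMLParserDelegate, Swift 5) rather than the Swift 1.2 ones used in this post, and ChunkCollector is a made-up helper. How many chunks you get is up to the parser, so the only thing worth relying on is the joined text.

```swift
import Foundation
#if canImport(FoundationXML)
import FoundationXML  // XMLParser lives in this module on Linux
#endif

// Hypothetical helper: keeps every chunk passed to foundCharacters.
final class ChunkCollector: NSObject, XMLParserDelegate {
    var chunks: [String] = []

    func parser(_ parser: XMLParser, foundCharacters string: String) {
        chunks.append(string)
    }
}

let titleXML = "<title>#11 - Mídia</title>"
let chunkCollector = ChunkCollector()
let chunkParser = XMLParser(data: Data(titleXML.utf8))
chunkParser.delegate = chunkCollector
let chunkParseSucceeded = chunkParser.parse()

// The chunk count is not guaranteed; the concatenation is the full text.
let joinedTitle = chunkCollector.chunks.joined()
print("chunks: \(chunkCollector.chunks.count), joined: \(joinedTitle)")
```

If the count printed is greater than 1, you are looking at exactly the fragmentation described above.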

4: The parser runs through the whole XML file once, and only once.

This might seem obvious, but it’s important to understand: the parser proceeds linearly through the xml document, and doesn’t “finish” an element before starting another. So in the following example:

<item>
<title>#11 - Mídia</title>
<link>http://www.blubrry.com/onorte/2635779/11-mdia/</link>
</item>

Here’s what will happen:

  1. didStartElement will be called on <item>
  2. didStartElement will be called on <title>
  3. foundCharacters will be called on “#11 - M”
  4. foundCharacters will be called on “ídia”
  5. didEndElement will be called on </title>
  6. didStartElement will be called on <link>
  7. and so on…

Once you understand this – the fact that it progresses linearly calling the methods – it becomes much easier to figure the parser out.
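A quick way to watch this linear progression is to log every callback. The sketch below is only an illustration: EventLogger is a made-up class, and it is written against the current Foundation names (XMLParser, XMLParserDelegate, Swift 5) instead of the Swift 1.2 ones in this post. Whitespace-only text between tags is skipped so the log stays readable.

```swift
import Foundation
#if canImport(FoundationXML)
import FoundationXML  // XMLParser lives in this module on Linux
#endif

// Hypothetical helper: appends one entry per delegate callback.
final class EventLogger: NSObject, XMLParserDelegate {
    var events: [String] = []

    func parser(_ parser: XMLParser, didStartElement elementName: String,
                namespaceURI: String?, qualifiedName qName: String?,
                attributes attributeDict: [String: String] = [:]) {
        events.append("start \(elementName)")
    }

    func parser(_ parser: XMLParser, foundCharacters string: String) {
        // Ignore the line breaks that sit between elements.
        if !string.trimmingCharacters(in: .whitespacesAndNewlines).isEmpty {
            events.append("characters")
        }
    }

    func parser(_ parser: XMLParser, didEndElement elementName: String,
                namespaceURI: String?, qualifiedName qName: String?) {
        events.append("end \(elementName)")
    }
}

let itemXML = """
<item>
<title>#11 - Mídia</title>
<link>http://www.blubrry.com/onorte/2635779/11-mdia/</link>
</item>
"""
let eventLogger = EventLogger()
let eventParser = XMLParser(data: Data(itemXML.utf8))
eventParser.delegate = eventLogger
let eventParseSucceeded = eventParser.parse()

// Start/end events only; "characters" entries may appear once or several times.
let structuralEvents = eventLogger.events.filter { $0 != "characters" }
print(eventLogger.events.joined(separator: "\n"))
```

The start and end entries always come out in document order; only the number of "characters" entries between them can vary.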

Parsing the data

First of all we need some variables for the data we’re parsing, as well as some variables to figure out where we are in the xml file. In my Controller I write:

var entryTitle: String!
var entryDate: String!
var entryURL: PodcastLinkInfo!
var entryDuration: String!
var entrySubtitle: String!
var entryDescription: String!
var currentParsedElement = String()
var weAreInsideAnItem = false

var currentParsedElement and var weAreInsideAnItem will become clear(er) in the following examples. Also, in order to keep the code shorter I will only be pasting the code for the entryTitle and entryDescription properties of the podcasts.

1: didStartElement

func parser(parser: NSXMLParser,
    didStartElement elementName: String,
    namespaceURI: String?,
    qualifiedName: String?,
    attributes attributeDict: [NSObject : AnyObject]) {
    if elementName == "item" {
        weAreInsideAnItem = true
    }
    if weAreInsideAnItem {
        switch elementName {
        case "title":
            entryTitle = String()
            currentParsedElement = "title"
        case "itunes:summary":
            entryDescription = String()
            currentParsedElement = "itunes:summary"
        default: break
        }
    }
}

You saw me declare weAreInsideAnItem = false above, before this method. This is because some keys I’m interested in (such as <title> and <itunes:summary>) also appear outside the podcast items. In order to ignore these keys outside the podcast items, I first wait for the parser to call didStartElement on “item”. Then I know weAreInsideAnItem = true, and I only handle the elementName “title” if weAreInsideAnItem.

So, when the parser starts the elementName “title”, if weAreInsideAnItem, it will empty out the entryTitle variable (which might not have been used yet, or which might be holding previous values) and set the currentParsedElement = “title”. It might be DRYer (or avoid spelling errors) if I said:

switch elementName {
case "title":
    entryTitle = String()
    currentParsedElement = elementName

But I’m writing out “title” and “itunes:summary” to make things better to read. Notice “title” and “itunes:summary” are references to the elementNames in the xml file.
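If you do want the DRYer route, one possible sketch (not the code used in the rest of this post) keeps a set of element names and one text buffer per name, so no element name is typed twice. BufferingDelegate, trackedElements and buffers are made-up names, it tracks <title> and <link> to stay with the snippet shown earlier, and it assumes the current Foundation names (XMLParser, XMLParserDelegate, Swift 5) rather than the Swift 1.2 ones.

```swift
import Foundation
#if canImport(FoundationXML)
import FoundationXML  // XMLParser lives in this module on Linux
#endif

// Hypothetical alternative: one buffer per tracked element name.
final class BufferingDelegate: NSObject, XMLParserDelegate {
    let trackedElements: Set<String> = ["title", "link"]
    var buffers: [String: String] = [:]
    private var currentParsedElement = ""

    func parser(_ parser: XMLParser, didStartElement elementName: String,
                namespaceURI: String?, qualifiedName qName: String?,
                attributes attributeDict: [String: String] = [:]) {
        guard trackedElements.contains(elementName) else { return }
        buffers[elementName] = ""           // empty out any previous value
        currentParsedElement = elementName  // reuse the name the parser gave us
    }

    func parser(_ parser: XMLParser, foundCharacters string: String) {
        guard !currentParsedElement.isEmpty else { return }
        buffers[currentParsedElement, default: ""] += string
    }

    func parser(_ parser: XMLParser, didEndElement elementName: String,
                namespaceURI: String?, qualifiedName qName: String?) {
        if elementName == currentParsedElement {
            currentParsedElement = ""
        }
    }
}

let dryXML = """
<item>
<title>#11 - Mídia</title>
<link>http://www.blubrry.com/onorte/2635779/11-mdia/</link>
</item>
"""
let bufferingDelegate = BufferingDelegate()
let dryParser = XMLParser(data: Data(dryXML.utf8))
dryParser.delegate = bufferingDelegate
let dryParseSucceeded = dryParser.parse()
print(bufferingDelegate.buffers)
```

The trade-off is the one mentioned above: less repetition, but the element names no longer appear next to the variables they fill.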

Also notice that if I wanted to capture attributes of the element, such as the following:

<enclosure url="http://media.blubrry.com/onorte/content.blubrry.com/onorte/ON-2015-04-11-007.mp3" length="68689765" type="audio/mpeg" />

I would do so now, inside didStartElement, because they are part of the “elementStart” (i.e. the opening <key>) instead of foundCharacters after the element started.
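Here is a minimal sketch of reading those attributes, assuming the current Foundation names (XMLParser, XMLParserDelegate, Swift 5), where attributeDict arrives as [String: String]. With the Swift 1.2 signature used in this post the dictionary is [NSObject : AnyObject], so each value would need a cast such as attributeDict["url"] as? String. EnclosureReader and its properties are made-up names.

```swift
import Foundation
#if canImport(FoundationXML)
import FoundationXML  // XMLParser lives in this module on Linux
#endif

// Hypothetical helper: picks the attributes off the <enclosure> opening tag.
final class EnclosureReader: NSObject, XMLParserDelegate {
    var audioURL: String?
    var audioType: String?
    var audioLength: Int?

    func parser(_ parser: XMLParser, didStartElement elementName: String,
                namespaceURI: String?, qualifiedName qName: String?,
                attributes attributeDict: [String: String] = [:]) {
        guard elementName == "enclosure" else { return }
        audioURL = attributeDict["url"]
        audioType = attributeDict["type"]
        // Attribute values are always strings; convert the length by hand.
        audioLength = attributeDict["length"].flatMap { Int($0) }
    }
}

let enclosureXML = """
<item>
<enclosure url="http://media.blubrry.com/onorte/content.blubrry.com/onorte/ON-2015-04-11-007.mp3" length="68689765" type="audio/mpeg" />
</item>
"""
let enclosureReader = EnclosureReader()
let enclosureParser = XMLParser(data: Data(enclosureXML.utf8))
enclosureParser.delegate = enclosureReader
let enclosureParseSucceeded = enclosureParser.parse()
print(enclosureReader.audioURL ?? "no url", enclosureReader.audioType ?? "no type")
```

Because <enclosure ... /> has no text content, foundCharacters never gets a chance to see any of this.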

2: foundCharacters

Here the currentParsedElement variable becomes clearer:

func parser(parser: NSXMLParser, foundCharacters string: String?) {
    if weAreInsideAnItem {
        switch currentParsedElement {
        case "title":
            entryTitle = entryTitle + string!
        case "itunes:summary":
            entryDescription = entryDescription + string!
        default: break
        }
    }
}

The check for if weAreInsideAnItem might be redundant. Anyone feel free to correct me on this, but: all “foundCharacters” are inside elements, so they will only be called inside an element that didStart. But later I might want to capture other elements (such as page title and description) that will be outside <item>, so I’m just trying to make my life easier when I expand my code.

In the previous block, didStartElement, we emptied out the variable for the element we’re using. In this block, we say

entryTitle = entryTitle + string!

Why? Because foundCharacters will break off at line breaks and various characters, handing us the text one piece at a time. We want to keep appending these pieces to our entryTitle and entryDescription, until the parser calls didEndElement.

3: didEndElement

Lastly, we check when the Elements end:

func parser(parser: NSXMLParser,
    didEndElement elementName: String,
    namespaceURI: String?,
    qualifiedName qName: String?) {
    if weAreInsideAnItem {
        switch elementName {
        case "title":
            currentParsedElement = ""
        case "itunes:summary":
            currentParsedElement = ""
        default: break
        }
        if elementName == "item" {
            var entryPodcast = Podcast()
            entryPodcast.podcastTitle = entryTitle
            entryPodcast.podcastDescription = entryDescription
            podcasts.append(entryPodcast)
            weAreInsideAnItem = false
        }
    }
}

Again, if weAreInsideAnItem is checked because, for this first version, I’m not interested in elements outside of the actual podcast items. In each case, I set the currentParsedElement = “”. Why is this?

didStartElement and didEndElement automatically check when elements start and end, but our foundCharacters doesn’t have access to the element name. Instead, we’re using the currentParsedElement to check that. Observe the following snippet of our xml file:

<item>
<title>#11 - Mídia</title>
<link>http://www.blubrry.com/onorte/2635779/11-mdia/</link>
<guid>http://www.blubrry.com/onorte/2635779/11-mdia/</guid>
<comments>http://www.blubrry.com/onorte/2635779/11-mdia/#comments</comments>
<dc:creator>Oene</dc:creator>
<category>Podcast</category>
<pubDate>Sat, 11 Apr 2015 18:35:51 -0400</pubDate>
<description>&lt;p&gt;As demissões recentes na Folha e Estadão, junto com a crise de confiança na mídia brasileira nos faz pensar: o que deu errado? Onde o jornalismo está mais ou menos dando certo? Pedro Burgos, Leandro Beguoci e Edu Acquarone discutem sobre o modelo americano — e como podemos encontrar nele algumas saídas.&lt;span style=&quot;white-space: pre;&quot;&gt; &lt;/span&gt;&lt;/p&gt;</description>
<enclosure url="http://media.blubrry.com/onorte/content.blubrry.com/onorte/ON-2015-04-11-007.mp3" length="68689765" type="audio/mpeg" />
<itunes:duration>1:00:26</itunes:duration>
<itunes:explicit>no</itunes:explicit>
<itunes:author>Oene</itunes:author>
<itunes:subtitle>As demissões recentes na Folha e Estadão, junto com a crise de confiança na mídia brasileira nos faz pensar: o que deu errado? Onde o jornalismo está mais ou menos dando certo? Pedro Burgos, Leandro Beguoci e Edu Acquarone discutem sobre o modelo …</itunes:subtitle>
<itunes:summary>As demissões recentes na Folha e Estadão, junto com a crise de confiança na mídia brasileira nos faz pensar: o que deu errado? Onde o jornalismo está mais ou menos dando certo? Pedro Burgos, Leandro Beguoci e Edu Acquarone discutem sobre o modelo americano — e como podemos encontrar nele algumas saídas. </itunes:summary>
</item>

Observe how the variable currentParsedElement will change:

  1. didStartElement will be called on <item>. Here, we set weAreInsideAnItem = true; currentParsedElement is still empty.
  2. didStartElement will be called on <title>. Here, we set currentParsedElement = “title”.
  3. foundCharacters will append various strings to entryTitle (because our currentParsedElement == “title”).
  4. didEndElement will be called on </title>. Unless we set currentParsedElement = “” here, every single foundCharacters after it (the link, the guid, the date and so on) will be added to our entryTitle, until currentParsedElement is re-set at the end of our snippet, inside didStartElement “itunes:summary”.

So, that is why we clear out the variable. Then, still inside didEndElement, if the element is “item” we know we’ve finished parsing one podcast entry. So we:

  1. instantiate our Podcast class with var entryPodcast = Podcast()
  2. append the title and description to our newly created Podcast
  3. append this podcast to our podcasts array, in our Model.
  4. set weAreInsideAnItem = false
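Putting the three methods together, here is a self-contained sketch of the whole flow that you can run outside the app. It follows the same idea as the code above but is not a copy of it: it assumes the current Foundation names (XMLParser, XMLParserDelegate, Swift 5), Episode and FeedDelegate are made-up stand-ins for Podcast and the controller, it tracks <title> and <link> instead of <itunes:summary>, and the channel-level title in the sample is invented to show that it gets ignored.

```swift
import Foundation
#if canImport(FoundationXML)
import FoundationXML  // XMLParser lives in this module on Linux
#endif

// Hypothetical stand-in for the Podcast class, reduced to two fields.
struct Episode {
    let title: String
    let link: String
}

final class FeedDelegate: NSObject, XMLParserDelegate {
    var episodes: [Episode] = []
    private var entryTitle = ""
    private var entryLink = ""
    private var currentParsedElement = ""
    private var weAreInsideAnItem = false

    func parser(_ parser: XMLParser, didStartElement elementName: String,
                namespaceURI: String?, qualifiedName qName: String?,
                attributes attributeDict: [String: String] = [:]) {
        if elementName == "item" { weAreInsideAnItem = true }
        guard weAreInsideAnItem else { return }
        switch elementName {
        case "title":
            entryTitle = ""
            currentParsedElement = "title"
        case "link":
            entryLink = ""
            currentParsedElement = "link"
        default:
            break
        }
    }

    func parser(_ parser: XMLParser, foundCharacters string: String) {
        guard weAreInsideAnItem else { return }
        switch currentParsedElement {
        case "title": entryTitle += string
        case "link": entryLink += string
        default: break
        }
    }

    func parser(_ parser: XMLParser, didEndElement elementName: String,
                namespaceURI: String?, qualifiedName qName: String?) {
        guard weAreInsideAnItem else { return }
        switch elementName {
        case "title", "link":
            currentParsedElement = ""  // stop collecting until the next tracked element
        case "item":
            episodes.append(Episode(title: entryTitle, link: entryLink))
            weAreInsideAnItem = false
        default:
            break
        }
    }
}

let feedXML = """
<rss><channel>
<title>Channel title (made up)</title>
<item>
<title>#11 - Mídia</title>
<link>http://www.blubrry.com/onorte/2635779/11-mdia/</link>
</item>
</channel></rss>
"""
let feedDelegate = FeedDelegate()
let feedParser = XMLParser(data: Data(feedXML.utf8))
feedParser.delegate = feedDelegate
let feedParseSucceeded = feedParser.parse()
for episode in feedDelegate.episodes {
    print(episode.title, episode.link)
}
```

Try commenting out the line that clears currentParsedElement and printing the title again; the line breaks that follow </title> end up glued to it.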

4: didEndDocument

Last thing we need to do is send a request back to our main queue when the parse is finished, in order to reload our table data with the model we’ve created:

func parserDidEndDocument(parser: NSXMLParser) {
    dispatch_async(dispatch_get_main_queue(), { () -> Void in
        self.tableView.reloadData()
    })
}
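The hop to the main queue is needed because parse() does its work, delegate calls included, on whatever thread called it, and in our case that is the background queue of the request. A small sketch of that behaviour, assuming the current Foundation names (XMLParser, XMLParserDelegate, Swift 5) and a made-up CompletionFlag class: by the time parse() returns for a well-formed document, parserDidEndDocument has already run.

```swift
import Foundation
#if canImport(FoundationXML)
import FoundationXML  // XMLParser lives in this module on Linux
#endif

// Hypothetical helper: remembers whether the end-of-document callback fired.
final class CompletionFlag: NSObject, XMLParserDelegate {
    var didFinish = false

    func parserDidEndDocument(_ parser: XMLParser) {
        // In the app, this is the place to dispatch the table reload to the main queue.
        didFinish = true
    }
}

let completionFlag = CompletionFlag()
let completionParser = XMLParser(data: Data("<item><title>#11 - Mídia</title></item>".utf8))
completionParser.delegate = completionFlag

// parse() blocks until the document has been read (or an error stops it).
let completionParseSucceeded = completionParser.parse()
print("parsed: \(completionParseSucceeded), finished: \(completionFlag.didFinish)")
```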

The End

I hope this helped someone with the NSXMLParser. I had a pretty hard time figuring out how it worked. I’m not getting into how to populate a tableView with the data (since most people coming here are those with problems with the parser, not the tableView), but there are plenty of tutorials on how to populate a tableView with data from an array. And after the instructions above you should have an array called podcasts in your model, filled with Podcast objects, so it shouldn’t be much trouble.

Keep in mind, although the solution above shows how to circumvent some pitfalls in XML Parsing, each XML file is a unique snowflake on the internet, and you will most likely have to adapt and improvise in order to correctly parse other XMLs.

Good luck to all, and don’t rage-quit.
