Stop Parsing (X)HTML with Regular Expressions

Don’t parse (X)HTML with regular expressions (regex); consider a premade HTML/XML parser instead

punyavashist
Aug 25, 2017 · 2 min read

What started out as a normal-looking answer on Stack Overflow, albeit a somewhat subjective and rant-y one, quickly devolved into textual art. The performance was pulled off so smoothly that no one dared to edit it, or even flag it as incorrect.

I’m talking about the famous Stack Overflow reply to the question “how to parse HTML using regex?”

While the post leans toward the sarcastic and humorous side, one message is repeated in it over and over again: you shouldn’t use regex to parse HTML.

I have received countless messages about this particular topic myself. It’s safe to say it’s all white noise nowadays, but I still have to repeat the same answer as if I were a voicemail.

Why using RegEx to parse (X)HTML is not a great idea

Sure, we have all been there. You just need to retrieve some information and store it somewhere else, and surely something as simple as regex can handle that. I mean, what could go wrong?

**Many. Many things could go wrong.**

The inherent danger of using regex to parse a nested, non-regular language like HTML is as clear as day. Regular expressions, as the name suggests, describe regular languages. HTML, on the other hand, lets elements nest to arbitrary depth, which makes it a totally different beast to handle.

It’s easy when you just want a quick-and-dirty regex to find a pattern as simple as a URL, for example, mostly because you probably know the basic forms a URL takes: it either starts with “http://”, “https://” or “www”, or it ends with a specific top-level domain (.com, .org, etc.).
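As a rough illustration of that quick-and-dirty approach, here is a minimal Python sketch. The pattern, the `find_urls` helper and the sample markup are all made up for this example and are deliberately loose; this is not a complete URL grammar.

```python
import re

# Hypothetical, deliberately loose pattern: "http://", "https://" or "www."
# followed by a run of characters that are not whitespace, quotes or angle brackets.
URL_PATTERN = re.compile(r"""(?:https?://|www\.)[^\s"'<>]+""")


def find_urls(text: str) -> list[str]:
    """Return every substring of text that looks like a URL."""
    return URL_PATTERN.findall(text)


sample = '<p>Docs at <a href="https://example.com/docs">here</a>, mirror: www.example.org</p>'
print(find_urls(sample))
# ['https://example.com/docs', 'www.example.org']
```

Note that the pattern has no idea where a match sits in the document. A URL inside an HTML comment or a script block is returned just like any other.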

However, when you try to be specific and say you want to match only URLs that are not commented out (<!-- like this -->), and the URL has to be inside a very specific element with a specific class name, that’s when you should start looking for alternatives.
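One such alternative is a real parser. The sketch below uses Python’s standard-library `html.parser` to collect links only from inside a `<div>` carrying a given class, while skipping anything that is commented out. The `ScopedLinkCollector` class, the `links` class name and the sample page are assumptions invented for this example.

```python
from html.parser import HTMLParser


class ScopedLinkCollector(HTMLParser):
    """Collect href values of <a> tags found inside <div> elements with a given class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.div_depth = 0  # > 0 while we are inside the target <div>
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            # Enter the target <div>, or go one level deeper if already inside it.
            if self.div_depth or self.class_name in classes:
                self.div_depth += 1
        elif tag == "a" and self.div_depth and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1

    # Comments are passed to handle_comment(), which does nothing by default,
    # so commented-out markup never reaches handle_starttag().


page = """
<div class="nav"><a href="https://example.com/nav">Nav</a></div>
<div class="links featured">
  <!-- <a href="https://example.com/old">Old</a> -->
  <div><a href="https://example.com/a">A</a></div>
  <a href="https://example.com/b">B</a>
</div>
<a href="https://example.com/outside">Outside</a>
"""

collector = ScopedLinkCollector("links")
collector.feed(page)
collector.close()
print(collector.links)
# ['https://example.com/a', 'https://example.com/b']
```

The parser does the tokenizing, so the code only has to express the rule itself: inside that element, not in a comment.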

Regex simply won’t cover all the possible HTML structures either. Since regex is unable to parse nested tags (or, more accurately, shouldn’t be used to), that alone should tell you not to use regex to parse HTML.
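To make the nesting problem concrete, here is a small Python sketch with made-up input. It shows a lazy and a greedy pattern each picking the wrong closing tag, next to a parser-based tracker (the hypothetical `DepthTracker` class) that simply counts how deep it is.

```python
import re
from html.parser import HTMLParser

nested = "<div>outer <div>inner</div> tail</div>"
siblings = "<div>one</div><div>two</div>"

# Lazy match: stops at the first </div>, which closes the INNER element.
lazy = re.search(r"<div>(.*?)</div>", nested).group(1)
print(repr(lazy))    # 'outer <div>inner'

# Greedy match: runs to the last </div>, swallowing the sibling element.
greedy = re.search(r"<div>(.*)</div>", siblings).group(1)
print(repr(greedy))  # 'one</div><div>two'


class DepthTracker(HTMLParser):
    """Record each piece of text together with the <div> depth it appears at."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        self.chunks.append((self.depth, data))


tracker = DepthTracker()
tracker.feed(nested)
tracker.close()
print(tracker.chunks)
# [(1, 'outer '), (2, 'inner'), (1, ' tail')]
```

Neither quantifier can know which closing tag pairs with which opening tag. The tracker only needs a counter, which is exactly the piece of state a plain regular expression lacks.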

You’d spend hours figuring out and trying to patch up every plausible pattern you can think of. That said, it is possible to parse HTML using regex, depending on your use case: some people know exactly what they need and can therefore quickly narrow the scope of their regex pattern.

thecyberfibre

The one and only community-based publication on Medium catered towards Developers and Designers. In order to have your work featured, contact @orvymm on Telegram, tweet at him or drop orvymm@protonmail.ch a mail.


Written by

@punyavashist on #twitter, #xda, #hackernoon, #telegram, #googleplus, #github and lemme know where else, i seem to have lost track.
