Stop Parsing (X)HTML with Regular Expressions
Don’t parse (X)HTML with regular expressions (regex); consider a premade HTML/XML parser instead
What started out as a normal, regular-looking answer on Stack Overflow, albeit a somewhat subjective and rant-y one, quickly devolved into textual art. The performance was pulled off so smoothly that no one dared to edit it, or even flag it as incorrect.

I’m talking about the famous reply on Stack Overflow, an answer to the question “how to parse HTML using regex?”
While the post leans toward the sarcastic and humorous, one message is repeated over and over: you shouldn’t use regex to parse HTML.
I have received countless messages about this particular topic myself. It’s safe to say it’s all white noise by now, but I still have to repeat the same answer as if I were a voicemail.
Why using RegEx to parse (X)HTML is not a great idea
Sure, we have all been there. You just need to retrieve some information and store it somewhere else, and something as simple as regex can definitely handle that. I mean, what could go wrong?
**Many. Many things could go wrong.**
The inherent danger of using regex to parse a context-free language (HTML) is as clear as day. Regex, as the name suggests, describes regular languages, and a regular language cannot keep track of arbitrarily deep nesting. HTML, where elements nest inside other elements, is a totally different beast to handle.
It’s easy when you just want to be quick and dirty and use regex to find a pattern as simple as a URL, for example. That works mostly because you probably know the basic forms a URL takes: it either starts with “http://”, “https://”, or “www”, or it ends with a specific domain (.com, .org, etc.).
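As a rough illustration of that quick-and-dirty case, here is a minimal Python sketch. The pattern and the `find_urls` helper are my own assumptions for the example, not a complete URL grammar; the pattern only covers the “http://”, “https://”, and “www” prefixes and ignores the "ends with a domain" form.

```python
import re

# Deliberately loose pattern (an assumption for this sketch, not a full URL grammar):
# a known prefix followed by anything that is not whitespace, a quote, or an angle bracket.
URL_PATTERN = re.compile(r"""(?:https?://|www\.)[^\s"'<>]+""")


def find_urls(text):
    """Return every substring of text that looks like a URL."""
    return URL_PATTERN.findall(text)


sample = 'Visit <a href="https://example.com/docs">the docs</a> or www.example.org today.'
print(find_urls(sample))
# ['https://example.com/docs', 'www.example.org']
```

This is fine as long as all you care about is "something URL-shaped somewhere in the text". The pattern has no idea whether the match sits inside a comment, an attribute, or a script.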
However, when you try to be specific and say you want to match only URLs that are not commented out (<!-- Like this -->), and the URL has to be inside some very specific element with a specific class name, that’s when you should start looking for alternatives.
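To make the alternative concrete, here is a small sketch using `html.parser` from Python’s standard library. The `ClassLinkCollector` class, the `content` class name, and the sample markup are all hypothetical, made up for this example. It also assumes reasonably well-formed markup: an unclosed non-void tag would throw the depth counter off, which a full HTML5 parser would handle for you.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassLinkCollector(HTMLParser):
    """Collect href values of <a> tags found inside an element with a given class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0   # greater than 0 while we are inside the target element
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth:
            if tag not in VOID_TAGS:
                self.depth += 1
        elif self.class_name in (attrs.get("class") or "").split():
            if tag not in VOID_TAGS:
                self.depth = 1
        if self.depth and tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1


page = """
<div class="nav"><a href="https://example.com/skip">skip</a></div>
<!-- <div class="content"><a href="https://example.com/hidden">hidden</a></div> -->
<div class="content main">
  <p>Read <a href="https://example.com/keep">this</a><br> and <a href="/also">that</a>.</p>
</div>
"""

collector = ClassLinkCollector("content")
collector.feed(page)
collector.close()
print(collector.links)
# ['https://example.com/keep', '/also']
```

The commented-out link never shows up because the parser hands comments to a separate `handle_comment` callback, so there is nothing to special-case. Doing the same with a single regex means encoding comments, attribute order, quoting styles, and nesting into one pattern.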
Regex simply won’t cover all the possible HTML structures either. Since regex is unable to parse nested tags (or, more accurately, shouldn’t be used to), that alone should tell you not to use regex to parse HTML.
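The nesting problem is easy to reproduce. The sketch below uses a made-up snippet with one nested and one sibling `<div>`, tries both a lazy and a greedy pattern, and then contrasts them with a tiny depth-tracking parser; `DepthTracker` is a hypothetical helper written for this example.

```python
import re
from html.parser import HTMLParser

nested = "<div>a <div>b</div> c</div><div>d</div>"

# Lazy match stops at the first closing tag and cuts the outer element short.
lazy = re.search(r"<div>(.*?)</div>", nested).group(1)
# Greedy match runs to the last closing tag and swallows the sibling element.
greedy = re.search(r"<div>(.*)</div>", nested).group(1)
print(lazy)    # a <div>b
print(greedy)  # a <div>b</div> c</div><div>d


class DepthTracker(HTMLParser):
    """Record each piece of text together with the nesting depth it appears at."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

    def handle_data(self, data):
        if data.strip():
            self.seen.append((self.depth, data.strip()))


tracker = DepthTracker()
tracker.feed(nested)
tracker.close()
print(tracker.seen)
# [(1, 'a'), (2, 'b'), (1, 'c'), (1, 'd')]
```

Neither pattern returns the contents of the first outer `<div>`, because neither can count how many tags are currently open. The parser keeps that count as it goes, which is exactly the piece of state a plain regular expression lacks.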
You’d spend hours figuring out and trying to patch up every plausible pattern you can think of. That said, it is possible to parse HTML using regex, depending on your use case: some people know exactly what they need and can narrow the scope of their regex pattern accordingly.


