Stand on the Shoulders of Giants
In the quest to become Above Average, it’s easy to give in to the temptation to do everything yourself. What better way to prove you’re better than average than by doing everything yourself, right? Just one problem: It’s a trap! And it’s one some coders never escape. The plain truth is, you can’t do it all, and you wouldn’t want to. To raise your programming game, learn to let go of the “Do it All” mentality and instead learn to Stand on the Shoulders of Giants.
So, let’s say you’re building a web application, and somewhere you discover that you need to parse some HTML. What do you do? Well, you’ve got 3 choices:
- Quit your job
- Write some code that parses HTML
- Use someone else’s code that parses HTML
Now, unless the programming language you’re coding in has no business being used for web development or you were already on the verge of walking off the job, you’d probably find that choice #3 is the quickest way out. You wouldn’t write your own HTML parser, would you? Parsing HTML correctly looks deceptively easy, but it’s tricky at best. It would take some serious guts to try to write your own HTML parser and still get your app coded in this decade.
And yet, this is exactly what I witnessed several years ago, when a Java developer I met decided that the HTML parsing routines that ship with Java’s standard library were somehow not good enough for his application. Today, HTML has progressed well beyond what the built-in Java HTML parsers can handle, but at the time, what was available would have been more than adequate. I suppose maybe there was something in particular he needed that none of the available libraries could provide, but it’s very hard to believe that this required a completely new parser implementation, rather than extending an existing one. Besides, several open-source parsers were available at the time, just waiting for contributors.
I don’t have to tell you, it didn’t end well for our hero. Several weeks in, he abandoned the overall effort and just started looking for angle brackets and quotation marks. So much for home-brew.
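To make the failure concrete, here is a minimal sketch of the “look for angle brackets” shortcut. The class name `NaiveTagScanner` and the sample markup are made up for illustration; this is not the developer’s actual code, just one plausible shape such a shortcut could take.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical home-brew "parser": anything between '<' and the next '>' is a tag.
final class NaiveTagScanner {
    private static final Pattern TAG = Pattern.compile("<([^>]*)>");

    // Returns the raw text found between each '<' and the first '>' after it.
    static List<String> findTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            tags.add(m.group(1));
        }
        return tags;
    }

    public static void main(String[] args) {
        // Simple markup works as hoped.
        System.out.println(findTags("<p>Hello</p>"));
        // prints [p, /p]

        // A '>' inside a quoted attribute value is legal HTML, but it ends
        // the "tag" early, so the href attribute is silently lost.
        System.out.println(findTags("<a title=\"a > b\" href=\"x.html\">link</a>"));
        // prints [a title="a , /a]
    }
}
```

The first call finds both tags. In the second, the scanner stops at the `>` inside the quoted `title` value, so the `href` never shows up in the result. Comments, scripts, and unclosed tags cause similar trouble, and each one needs another special case.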
Hang around professional programmers long enough and you’re likely to hear someone toss out the acronym NIH, which stands for “Not Invented Here.” Although it can apply to just about any professional discipline, in the context of software development it describes a distrust of code or concepts that developers did not create themselves. You’ll often find sufferers of NIH hiding behind concerns about security, maintenance, or compatibility. More often than not, these concerns are really just symptoms of deeper problems, such as mistrust, fear, or an inflated ego, which are just going to hold you back anyway. It’s time to let that stuff go.
So how can we effectively use this practice of “Standing on the Shoulders of Giants” (henceforth “SOTSOG,” to save me some typing)?
First of all, it helps to make sure we’re asking the right question. For example, maybe we should first ask: Do we really want to parse HTML at all? We answer that by surveying the landscape of our problem. HTML is tortuously complex with its rules, but there may be an alternative path to getting at the data we really want in that HTML document. Doing some research on our problem turns up the fact that HTML was originally defined as an application of SGML, which is also the predecessor of XML. Nearly every language worth its salt has an XML parser built in.
A little more research turns up a project called HtmlTidy, which, while intended to help make HTML more standards-compliant, also promises to be able to whip your HTML into XML-compliant shape. So perhaps, if we are lucky and have some say in the matter, we could run our HTML through the HtmlTidy processor and get XML out the other side.
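Assuming the tidy step really does hand us well-formed XML, the XML parser that ships with the JDK can take it from there. The sketch below shows one way that might look. `TidiedHtmlLinks` and the sample markup are hypothetical, and the code does not call HtmlTidy itself; it only consumes the kind of output we would hope to get from it.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: pulls link targets out of already-tidied (well-formed) markup.
final class TidiedHtmlLinks {
    static List<String> linkTargets(String xhtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            // The parser has already dealt with quoting and nesting for us.
            NodeList anchors = doc.getElementsByTagName("a");
            List<String> hrefs = new ArrayList<>();
            for (int i = 0; i < anchors.getLength(); i++) {
                hrefs.add(((Element) anchors.item(i)).getAttribute("href"));
            }
            return hrefs;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Untidied HTML (for example a bare <br>) ends up here.
            throw new IllegalStateException("Could not parse input as XML", e);
        }
    }

    public static void main(String[] args) {
        String tidied = "<html><body><p>See "
                + "<a title=\"a > b\" href=\"a.html\">A</a><br/>and "
                + "<a href=\"b.html\">B</a>.</p></body></html>";
        System.out.println(linkTargets(tidied));
        // prints [a.html, b.html]
    }
}
```

Note the catch: hand this the same markup with a bare `<br>` instead of `<br/>` and the XML parser rejects it outright, which is exactly why the tidy step matters.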
Let’s pretend for a moment, however, that we are the original HTML-parsing Java hero I just brought up and we are stuck with free-range HTML, not “tidied-up” XML. Where are the giants in this particular arena? Well, a simple Google search for “java html parser” turns up a whole lot of choices, some of which are:
- jsoup Java HTML Parser, with best of DOM, CSS, and jquery
- Jericho Java HTML Parser
- TagSoup
- VietSpider
- And the mother lode on Wikipedia: http://en.wikipedia.org/wiki/Comparison_of_HTML_parsers
Note that this is just a smattering of what comes up in such a search as of the time of my writing. If I were doing a real “market scan,” I’d have a closer look at everything. Chances are, your favorite isn’t here.
Now, just to keep things simple, let’s set aside any licensing concerns, and focus just on what these HTML parsing giants have built for us:
- jsoup is a GitHub-hosted project with 21 contributors. It has been around since 2010, with its latest release in late 2013. It has an active bug tracker and uses Stack Overflow as its discussion forum. The project provides full API documentation, a Cookbook with examples, and a live web-based demo of the parser’s capabilities.
- Jericho looks like a lone-wolf project (single developer “on a mission”) hosted on SourceForge, started in 2004, with its last release in late 2013. It seems to rely on DOS batch files in its release, which might be a turn-off if I were a non-Windows developer. SourceForge stats show decent utilization, and the developer responds to tickets and posts on the forums. Example code is provided.
- TagSoup looks to be self-hosted, with a Google Groups discussion forum that shows little activity. Can’t tell its age, although it claims to have had 20+ releases. Site disclaimer states it won’t build on the newer versions of Java without adjustments. Focuses on forgiving parsing techniques and uses an event-oriented approach called SAX.
- VietSpider appears to have been a loose match on my search, and seems to be a full-fledged web spider, rather than a low-level HTML parser. Looks interesting, though.
Based on the (admittedly cursory) analysis above, I’d probably go with jsoup or maybe Jericho for my first round of evaluation, unless I needed something more happy-go-lucky like TagSoup.
So, what did I gain from this little exercise? Well, I know there are a heck of a lot of HTML parsers out there for Java, several of which have a huge amount of coding time invested. Their community sizes are nothing to sneeze at, either. What’s more, I learned that there are different approaches to parsing HTML (I saw mentions of “nodes” and SAX) and I also found that people are writing their own web crawlers, which may come in handy later if I need one.
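The “nodes” versus SAX distinction is easier to see in code. A node-based parser hands back a whole tree to walk afterwards; a SAX parser instead calls you back as it meets each tag. Here is a minimal sketch of the event style, using the SAX parser built into the JDK on a small well-formed snippet. It does not use TagSoup or any of the libraries above, and `SaxEventDemo` is a made-up name.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo: records the events a SAX parser fires while reading markup.
final class SaxEventDemo {
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();

        // No tree is built; the parser calls these methods as it goes.
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                log.add("start " + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                log.add("end " + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // A parser may deliver one run of text in several chunks.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    log.add("text " + text);
                }
            }
        };

        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse input as XML", e);
        }
        return log;
    }

    public static void main(String[] args) {
        // One event per line, in document order.
        for (String event : events("<p>Hi <b>there</b></p>")) {
            System.out.println(event);
        }
    }
}
```

The events arrive in document order: the opening `p`, its text, the nested `b`, and then the closing tags. Because nothing is held in memory beyond what the handler chooses to keep, this style suits large documents, at the cost of tracking your own state.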
Just these few minutes of time have given me a much broader understanding of what’s out there and I also have some great candidates for my project needs.
And, finally, perhaps the biggest win of all when practicing SOTSOG: I don’t own the code.
Why’s that a good thing, you may ask? Simple. The more code I “own,” the more code I have to worry about. Sure, all code can have bugs, even code written by geniuses and reviewed by thousands, but it’s a darn sight more likely to work than code I wrote yesterday. If I choose wisely, the giant’s code will very likely get me to the finish line more quickly and with fewer bugs. And that means fewer support tickets or phone calls.
If the practice of SOTSOG was good enough for Isaac Newton, then it deserves serious consideration. Give it a try, and you’ll find that it does more than help you deliver on time and keep your phone from ringing. It will broaden your professional horizons and help make you an Above Average Programmer.