Faster browser scraping in C# using Selenium and HtmlAgilityPack
I made a small extension class that adds a few methods to the IWebDriver interface from Selenium and to the HtmlNode and HtmlNodeCollection classes from HtmlAgilityPack. They let you get HtmlNodes from Selenium and use By selectors with HtmlAgilityPack. I used the Css2XPath Reloaded library by Jon Humphrey to convert By objects to xpaths, which HtmlAgilityPack uses.
Selenium functions send requests to the driver, which introduces latency. If we make one request for the page source and then process it ourselves, we can scrape data much more quickly than if we run individual FindElement functions against the browser.
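To make the "process it ourselves" part concrete, here is a minimal sketch of querying a page source locally with HtmlAgilityPack. The class name LocalQueryDemo and the sample markup are made up for illustration. In a real run the string would come from a single driver.PageSource call.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LocalQueryDemo
{
    // Parses the source once, then runs every lookup in memory (no browser involved).
    public static (string Title, List<string> Links) Extract(string pageSource)
    {
        var document = new HtmlDocument();
        document.LoadHtml(pageSource);
        HtmlNode root = document.DocumentNode;

        // SelectSingleNode returns null when nothing matches.
        HtmlNode title = root.SelectSingleNode("//h1");

        // SelectNodes also returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = root.SelectNodes("//a[contains(@href, '/name/')]");

        var links = new List<string>();
        if (anchors != null)
        {
            foreach (HtmlNode anchor in anchors)
            {
                links.Add(anchor.GetAttributeValue("href", ""));
            }
        }

        return (title?.InnerText, links);
    }
}
```

Usage would look like var result = LocalQueryDemo.Extract(driver.PageSource); which costs one round trip to the browser no matter how many selectors you run afterwards.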
When is this efficient?
Getting the page source from Selenium takes about as long as executing FindElement(), so this process only makes sense when you need to find a bunch of data on a page. HtmlAgilityPack is much faster at finding elements once you have the source, but it won’t help you click them or do any other kind of interactive behavior. Running FindNode() on the IWebDriver instance is actually slower than FindElement(): it has to fetch the page source from the browser and load it into HtmlAgilityPack before looking for a node, whereas FindElement() looks for the node as part of its communication with the browser.
The extension functions are:
IWebDriver.GetDocumentNode(): Gets the document node from an IWebDriver instance by loading the page source with HtmlAgilityPack and selecting the root document node.
IWebDriver.FindNode(): Finds an HtmlNode in driver.GetDocumentNode() using an xpath or By object.
IWebDriver.FindNodes(): Finds an HtmlNodeCollection in driver.GetDocumentNode() using an xpath or By object.
HtmlNode.FindNode(): Alias of HtmlNode.SelectSingleNode(), named for consistency. This can also take a By object or xpath.
HtmlNode.FindNodes(): Alias of HtmlNode.SelectNodes(), also named for consistency. This can also take a By object or xpath.
By.ToXPath(): Gets the xpath for a By object.
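As a rough idea of what such a class could look like, here is a sketch. It is not the actual implementation. It assumes that By.ToString() returns descriptions such as "By.XPath: //h1" and "By.Id: main", and the helper DescriptionToXPath is a made-up name. It only handles XPath and Id selectors; the real class uses Css2XPath Reloaded for the CSS conversion, which is left out here.

```csharp
using System;
using HtmlAgilityPack;
using OpenQA.Selenium;

public static class AgilityExtensions
{
    // One browser round trip: fetch the source, parse it, return the root node.
    public static HtmlNode GetDocumentNode(this IWebDriver driver)
    {
        var document = new HtmlDocument();
        document.LoadHtml(driver.PageSource);
        return document.DocumentNode;
    }

    public static HtmlNode FindNode(this IWebDriver driver, string xpath) =>
        driver.GetDocumentNode().SelectSingleNode(xpath);

    public static HtmlNode FindNode(this IWebDriver driver, By by) =>
        driver.FindNode(by.ToXPath());

    public static HtmlNodeCollection FindNodes(this IWebDriver driver, string xpath) =>
        driver.GetDocumentNode().SelectNodes(xpath);

    public static HtmlNodeCollection FindNodes(this IWebDriver driver, By by) =>
        driver.FindNodes(by.ToXPath());

    // Aliases on HtmlNode, named to match the Selenium-style methods above.
    public static HtmlNode FindNode(this HtmlNode node, string xpath) =>
        node.SelectSingleNode(xpath);

    public static HtmlNode FindNode(this HtmlNode node, By by) =>
        node.SelectSingleNode(by.ToXPath());

    public static HtmlNodeCollection FindNodes(this HtmlNode node, string xpath) =>
        node.SelectNodes(xpath);

    public static HtmlNodeCollection FindNodes(this HtmlNode node, By by) =>
        node.SelectNodes(by.ToXPath());

    // Assumption: the By description has the form "By.<Kind>: <value>".
    public static string ToXPath(this By by) => DescriptionToXPath(by.ToString());

    // Hypothetical helper, kept separate so it can be checked without a browser.
    public static string DescriptionToXPath(string description)
    {
        const string xpathPrefix = "By.XPath: ";
        const string idPrefix = "By.Id: ";

        if (description.StartsWith(xpathPrefix, StringComparison.Ordinal))
        {
            return description.Substring(xpathPrefix.Length);
        }

        if (description.StartsWith(idPrefix, StringComparison.Ordinal))
        {
            // Does not escape single quotes inside the id value.
            return "//*[@id='" + description.Substring(idPrefix.Length) + "']";
        }

        throw new NotSupportedException("No xpath conversion in this sketch for: " + description);
    }
}
```

Parsing the description string is fragile, so treat it as a placeholder for a proper conversion step.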
Ok, enough talk about my simple class. Let’s see this process in action.
I’m going to use IMDB because it’s the classic web scraper test. Mr. Robot seems like a fitting option, right?
Test 1: Find one element using FindElement() and FindNode() on an IWebDriver instance.
We’ll look for the title element using the xpath "//div[@class='title_wrapper']/h1"
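A comparison like this one could be timed roughly as follows. This is a sketch, not the original test code: SingleElementTiming and its Time helper are made-up names, and the HtmlAgilityPack side is written with plain library calls so it does not depend on the extension class.

```csharp
using System;
using System.Diagnostics;
using HtmlAgilityPack;
using OpenQA.Selenium;

public static class SingleElementTiming
{
    private const string TitleXPath = "//div[@class='title_wrapper']/h1";

    // Runs an action once and returns the elapsed wall-clock time.
    public static TimeSpan Time(Action action)
    {
        var stopwatch = Stopwatch.StartNew();
        action();
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }

    public static void Compare(IWebDriver driver)
    {
        // Selenium: the browser resolves the selector during the round trip.
        TimeSpan selenium = Time(() => driver.FindElement(By.XPath(TitleXPath)));

        // HtmlAgilityPack: one round trip for the source, then a local parse and lookup.
        TimeSpan agility = Time(() =>
        {
            var document = new HtmlDocument();
            document.LoadHtml(driver.PageSource);
            document.DocumentNode.SelectSingleNode(TitleXPath);
        });

        Console.WriteLine($"FindElement: {selenium.TotalMilliseconds} ms");
        Console.WriteLine($"Page source + SelectSingleNode: {agility.TotalMilliseconds} ms");
    }
}
```

A single run is noisy, so repeating each measurement a few times and averaging would give a fairer picture.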
Test 2: Find many elements with the same selector using FindElements() and FindNodes() on an IWebDriver instance.
We’ll look for the actors listed under “Stars” using the xpath selector "//div[./h4[contains(text(), 'Stars')]]/a[not(contains(text(), 'See full cast & crew'))]". Think my xpaths are garbage? Come at me.
Test 3: Find many elements with different selectors using driver.FindElement(), driver.FindElements(), driver.FindNode() and driver.FindNodes() on an IWebDriver instance.
We’ll look for the title and stars using the same selectors as above, plus we’ll look for the link to every cast member using the xpath "//table[@class='cast_list']//td[not(@class='primary_photo')]/a[contains(@href, '/name/')]"
Test 4: Find many elements with different selectors using FindElement() and FindElements() on an IWebDriver instance and FindNode() on the result of IWebDriver.GetDocumentNode().
This will use the same selectors as test 3.
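The parse-once pattern in this last test might be sketched like this. ManySelectorsTiming and CountLocally are made-up names, and for brevity the Selenium side calls FindElements() for all three selectors instead of mixing FindElement() and FindElements().

```csharp
using System;
using System.Diagnostics;
using HtmlAgilityPack;
using OpenQA.Selenium;

public static class ManySelectorsTiming
{
    private static readonly string[] XPaths =
    {
        "//div[@class='title_wrapper']/h1",
        "//div[./h4[contains(text(), 'Stars')]]/a[not(contains(text(), 'See full cast & crew'))]",
        "//table[@class='cast_list']//td[not(@class='primary_photo')]/a[contains(@href, '/name/')]"
    };

    // Parses the source once and counts the matches for each selector locally.
    public static int[] CountLocally(string pageSource)
    {
        var document = new HtmlDocument();
        document.LoadHtml(pageSource);
        HtmlNode root = document.DocumentNode;

        var counts = new int[XPaths.Length];
        for (int i = 0; i < XPaths.Length; i++)
        {
            // SelectNodes returns null when nothing matches.
            HtmlNodeCollection nodes = root.SelectNodes(XPaths[i]);
            counts[i] = nodes?.Count ?? 0;
        }
        return counts;
    }

    public static void Compare(IWebDriver driver)
    {
        var stopwatch = Stopwatch.StartNew();
        foreach (string xpath in XPaths)
        {
            driver.FindElements(By.XPath(xpath)); // one browser round trip per selector
        }
        Console.WriteLine($"Selenium: {stopwatch.ElapsedMilliseconds} ms");

        stopwatch.Restart();
        CountLocally(driver.PageSource); // the only browser round trip
        Console.WriteLine($"HtmlAgilityPack: {stopwatch.ElapsedMilliseconds} ms");
    }
}
```

The more selectors you add to the array, the more the single round trip should pay off.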
Since the main time delay of using the built-in IWebDriver functions is the latency of communication with the browser, I expect that the first three tests will result in a slight decrease in speed, while the last test will be significantly faster using HtmlAgilityPack.
Just as expected, the first three tests show a slight decrease in speed when using the Agility functions while the fourth test shows a substantial increase in speed.
It seems that running a Selenium function against the driver basically always takes about a second. So if you call driver.FindNode(), which calls driver.GetDocumentNode() internally, it won’t be any faster than FindElement(). But if you get the document node first and then run several functions against the result, you will get a massive increase in speed.
I’ll update this with a link to a GitHub repository later today, but for now here are some screenshots so you can see what’s going on.