Crawling data from a website with Html Agility Pack (.NET / C#)

Beribey
Feb 25, 2019 · 3 min read

This is my first tutorial on Medium. The demand for data collection keeps growing. For some big sites such as Facebook, Google, and Steam, we can use the APIs they provide to get data. In many other cases, we end up extracting data manually (opening the website and copying data into Word, Excel, and similar files), which is tedious and takes a lot of time and effort.

This week, my teacher gave me a project: build a news reader application that gets its information from a website. Suppose the site is a fairly large forum and, of course, offers no API for getting data. Collecting the data manually is not practical here. The only real solution is to write software that extracts the data from the site's pages.
I will show you how to extract data using the Html Agility Pack and Fizzler libraries. Html Agility Pack is a powerful HTML parsing library. It is popular because it can parse most HTML, both valid and invalid (in practice a great many websites serve invalid HTML, which trips up some other libraries but not Html Agility Pack). The knowledge in this article will also be useful if you later need to extract information from another website. You can find more by searching for the keyword: web crawler.

Step 1: Create a new project. Here I’m creating a new Console App.

Step 2: Installing Fizzler and Html Agility Pack.
Go to Tools -> NuGet Package Manager (called Library Package Manager in older Visual Studio versions) -> Package Manager Console. Type the following command to install the library:
Install-Package Fizzler.Systems.HtmlAgilityPack
Alternatively, open the NuGet Package Manager, go to the “Browse” tab, search for Fizzler.Systems.HtmlAgilityPack, and install it for your solution. If three new references appear in your project after installation, the setup is complete.

I will explain some of the classes and methods of Html Agility Pack.
–HtmlDocument: This class represents an HTML document and holds information about it (such as its encoding and markup). We can load data into an HtmlDocument from a file or a string, or obtain one from a URL through the HtmlWeb class.

–HtmlNode: An HtmlNode corresponds to a tag (li, ul, div, etc.) in the HTML. The top-level node that contains all the others is DocumentNode. Some members of HtmlNode that we often use:

Name: The name of the node (div, ul, li).
Attributes: The list of the node's attributes (an attribute is information attached to the node, such as src, href, id, class …)
InnerHtml, OuterHtml: The markup inside the node, and the markup including the node's own tag.
SelectNodes (string xpath): Finds the nodes that match the given XPath, evaluated relative to the current node. By default it returns null when nothing matches.
SelectSingleNode (string xpath): Finds the first node that matches the given XPath, or null when nothing matches.
Descendants (string name): Returns all descendant nodes of the current node that have the given tag name. Note that the argument is a tag name, not an XPath.
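
To make the members above concrete, here is a minimal sketch that parses a small HTML string and calls each of them. The NodeDemo class and the sample markup are hypothetical, invented only for this illustration; the Html Agility Pack calls are the ones described in the list.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class NodeDemo
{
    // Hypothetical sample markup, used only for illustration.
    public const string SampleHtml =
        "<div id='news'><ul>" +
        "<li><a href='/a'>First</a></li>" +
        "<li><a href='/b'>Second</a></li>" +
        "</ul></div>";

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(SampleHtml); // parse from a string; doc.Load(path) reads a file instead

        // SelectSingleNode returns the first match, or null if nothing matches.
        HtmlNode list = doc.DocumentNode.SelectSingleNode("//div[@id='news']/ul");
        if (list != null)
        {
            Console.WriteLine(list.Name);      // tag name of the matched node
            Console.WriteLine(list.InnerHtml); // markup inside the matched node
        }

        // SelectNodes returns null (not an empty list) by default when nothing matches.
        var links = doc.DocumentNode.SelectNodes("//li/a");
        if (links != null)
        {
            foreach (var a in links)
                Console.WriteLine($"{a.InnerText} -> {a.GetAttributeValue("href", "")}");
        }

        // Descendants filters by tag name, not by XPath.
        int liCount = doc.DocumentNode.Descendants("li").Count();
        Console.WriteLine(liCount);
    }
}
```

The null checks matter in practice: a selector that silently stops matching after a site redesign is a common cause of NullReferenceException in scrapers.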

We can also use Fizzler to make selection easier. Fizzler adds CSS selector support on top of Html Agility Pack by providing the following two extension methods for HtmlNode:

QuerySelectorAll: Finds the descendant nodes of the current node that match the given CSS selector.
QuerySelector: Finds the first descendant node of the current node that matches the given CSS selector.
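
As a rough illustration of the two methods, the sketch below runs CSS selectors against an in-memory HTML string. The SelectorDemo class, the Hrefs helper, and the sample markup are assumptions made up for this example, not part of Fizzler.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack; // brings the QuerySelector extensions into scope

public static class SelectorDemo
{
    // Hypothetical markup for illustration only.
    public const string SampleHtml =
        "<div id='threads'>" +
        "<div class='title-wrap'><a href='/t/1'>Hello</a></div>" +
        "<div class='title-wrap'><a href='/t/2'>World</a></div>" +
        "</div>";

    // Returns the href of every link matched by the selector.
    public static List<string> Hrefs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        // Every <a> inside div.title-wrap, under the element with id "threads".
        return doc.DocumentNode
            .QuerySelectorAll("#threads div.title-wrap a")
            .Select(a => a.GetAttributeValue("href", ""))
            .ToList();
    }

    public static void Main()
    {
        foreach (var href in Hrefs(SampleHtml))
            Console.WriteLine(href);

        // QuerySelector returns only the first match (null if there is none).
        var doc = new HtmlDocument();
        doc.LoadHtml(SampleHtml);
        var first = doc.DocumentNode.QuerySelector("div.title-wrap a");
        Console.WriteLine(first?.InnerText);
    }
}
```

Unlike SelectNodes, QuerySelectorAll yields an empty sequence when nothing matches, so it can be chained with LINQ without a null check.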

Step 3: Write the code. In my experience, it is best to anchor your selector on an element that has an id and sits as close as possible to the data you need.

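using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack; // adds QuerySelector / QuerySelectorAll

// HtmlWeb downloads the page and returns the parsed HtmlDocument.
var document = new HtmlWeb().Load("your page"); // replace with the URL of your page
var items = new List<object>();

// Select every <div class="title-wrap"> on the page.
var threadItems = document.DocumentNode.QuerySelectorAll("div.title-wrap").ToList();
foreach (var item in threadItems)
{
    var anchor = item.QuerySelector("a");
    if (anchor == null) continue; // skip wrappers that contain no link
    var title = anchor.InnerText.Trim();
    var link = anchor.GetAttributeValue("href", "");
    items.Add(new { title, link });
}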

To learn more about CSS selectors, you can read the link below.
https://www.w3schools.com/cssref/css_selectors.asp

Step 4: Export the results somewhere. At this point the extraction is done, and you can save the results to a database or export them to a text file, depending on the intended use.
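
For the text-file option, one possible sketch is a tab-separated file with one line per item. The ExportDemo class, the SaveAsTsv helper, the threads.tsv file name, and the sample items are all hypothetical; only standard .NET file APIs are used.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class ExportDemo
{
    // Writes one "title<TAB>link" line per item.
    public static void SaveAsTsv(IEnumerable<(string Title, string Link)> items, string path)
    {
        var lines = items.Select(i => Clean(i.Title) + "\t" + Clean(i.Link));
        File.WriteAllLines(path, lines);
    }

    // Replaces tabs and line breaks with spaces so each item stays on one line.
    private static string Clean(string value) =>
        (value ?? "").Replace('\t', ' ').Replace('\r', ' ').Replace('\n', ' ').Trim();

    public static void Main()
    {
        // Hypothetical results standing in for the scraped items.
        var items = new List<(string Title, string Link)>
        {
            ("First thread", "/t/1"),
            ("Second\tthread", "/t/2"),
        };
        var path = Path.Combine(Path.GetTempPath(), "threads.tsv");
        SaveAsTsv(items, path);
        Console.WriteLine(File.ReadAllText(path));
    }
}
```

A tab-separated file opens directly in Excel and is easy to import into a database later, which keeps both options from this step open.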

I hope this article will be useful for your programming career.

Coderes

From coder to a developer. From my living with coding.
