Parsing HTML in Dart with the Html package

Jonathan Monga
Published in Flutter Community
6 min read · Mar 31, 2020

That is what the HTML code of a website looks like. Look again at the photo: I admit that searching for a yellow wreck would be difficult, and the same goes for searching for data in HTML code.

So when it comes to extracting data from a website with Dart, the Html package is a good solution. Html is an open-source Dart package used primarily for extracting data from HTML. It also allows you to manipulate HTML and produce HTML output. It is maintained by the Dart team, yet strangely we hardly ever talk about it. The Html package can also be used to parse and build XML. It’s a port of html5lib from Python.

In this tutorial, we will scrape a GitHub page to demonstrate the following features of the Html package:

  • Parsing: parsing the HTML into a Document;
  • Filtering: selecting the desired data into Elements and traversing it;
  • Extracting: obtaining attributes, text, and HTML of nodes;
  • Modifying: adding, editing, removing nodes and editing their attributes.

1. Dependencies

Add the Html and Http dependencies to your pubspec.yaml file:
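
The original snippet is not reproduced here, so the fragment below is only a sketch of what the dependencies section might look like. The version constraints are assumptions; check pub.dev for the current releases of both packages.

```yaml
dependencies:
  html: ^0.15.0   # assumed constraint, not taken from the article
  http: ^1.0.0    # assumed constraint, not taken from the article
```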

A quick word on the Http package: it provides a Future-based API for making HTTP requests.

Now install these packages by running dart pub get (or flutter pub get in a Flutter project).

2. Html package at a glance

The Html package does not load the HTML page; that is the job of the Http package. Its own work starts when you call the parse method, which builds the corresponding DOM tree. This tree works like the DOM in a browser and offers methods similar to those of jQuery and vanilla JavaScript for selecting and traversing nodes, for manipulating text, HTML, and attributes, and for adding and removing elements.

If you are comfortable with client-side selectors and DOM traversal and manipulation, you will find the Html package very familiar. Check how easy it is to print the number of paragraphs on a page:
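
As a minimal sketch of counting paragraphs, the example below parses an inline HTML string instead of a downloaded page. The helper name countParagraphs and the sample markup are made up for illustration.

```dart
import 'package:html/parser.dart' show parse;

/// Hypothetical helper: counts the <p> elements in an HTML string.
int countParagraphs(String html) {
  final document = parse(html);
  return document.querySelectorAll('p').length;
}

void main() {
  // Sample markup invented for this sketch.
  const html = '<body><p>One</p><p>Two</p><div>Not a paragraph</div></body>';
  print(countParagraphs(html)); // prints 2
}
```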

Keep in mind that the Html package interprets only HTML — it does not interpret JavaScript. Therefore, changes to the DOM that would normally occur after the page loads in a JavaScript-enabled browser will not be visible with the Html package.

3. Parsing

Parsing turns HTML into a Document. The Html package can parse any HTML, from the most invalid to the fully valid, just as a modern browser would.

Let’s parse the response’s body and get a document. Note that the response’s body comes from my GitHub page:
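
A sketch of this step, under some assumptions: the URL below is only a stand-in (the author's actual GitHub page is not given here), and the names fetchDocument and parseBody are hypothetical. The download is done by the Http package and the parsing by the Html package.

```dart
import 'package:html/dom.dart';
import 'package:html/parser.dart' show parse;
import 'package:http/http.dart' as http;

/// Hypothetical helper: turns a response body into a Document.
Document parseBody(String body) => parse(body);

/// Hypothetical helper: downloads [url] and parses the body.
Future<Document> fetchDocument(Uri url) async {
  final response = await http.get(url);
  if (response.statusCode != 200) {
    throw Exception('Request failed with status ${response.statusCode}');
  }
  return parseBody(response.body);
}

Future<void> main() async {
  // Placeholder URL, assumed for illustration only.
  final document = await fetchDocument(Uri.parse('https://github.com/dart-lang'));
  print(document.querySelector('title')?.text);
}
```

Keeping parseBody separate from the network call makes it possible to try the parsing step on a plain string, including invalid markup such as an unclosed tag.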

4. Filtering

Now that we’ve converted the HTML code to a document, it’s time to go through it and find what we’re looking for. This is where the resemblance to jQuery and JavaScript is most evident, since the selectors and traversal methods are similar. Let’s look at some of the most useful filters below.

4.1. Selecting

All of a document’s selection methods take a string representing the selector, using the same selector syntax as CSS or JavaScript. querySelectorAll returns the list of matching elements; this list can be empty but is never null. querySelector returns only the first match, or null if nothing matches.

Let’s take a look at some selection methods:
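
The following sketch shows querySelector and querySelectorAll with a few common CSS selectors. The sample markup and the helper textsOf are invented for illustration and do not reflect the real structure of any GitHub page.

```dart
import 'package:html/dom.dart';
import 'package:html/parser.dart' show parse;

// Sample markup invented for this sketch.
const sampleHtml = '''
<div id="main">
  <h1 class="title">Repositories</h1>
  <ul>
    <li class="repo"><a href="/user/alpha">alpha</a></li>
    <li class="repo"><a href="/user/beta">beta</a></li>
  </ul>
</div>
''';

/// Hypothetical helper: trimmed text of every element matching [selector].
List<String> textsOf(Document document, String selector) =>
    document.querySelectorAll(selector).map((e) => e.text.trim()).toList();

void main() {
  final document = parse(sampleHtml);

  // querySelector returns the first match, or null when nothing matches.
  final Element? title = document.querySelector('h1.title');
  print(title?.text); // Repositories

  // querySelectorAll returns every match as a list, possibly empty.
  print(textsOf(document, 'li.repo')); // [alpha, beta]
  print(textsOf(document, '#main li > a')); // [alpha, beta]
  print(textsOf(document, 'table')); // []
}
```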

You can always use more explicit methods inspired by the browser DOM.
Even though Element is not a superclass of Document, the two classes share almost the same methods. You can learn more about the selection methods in the API reference.
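
As a small sketch of the more explicit, browser-style methods, the example below calls them first on the Document and then on a single Element. The sample markup is invented for illustration.

```dart
import 'package:html/parser.dart' show parse;

// Sample markup invented for this sketch.
const sampleHtml = '''
<div id="profile">
  <span class="name">Ada</span>
  <span class="name nickname">ada42</span>
  <p>Bio</p>
</div>
''';

void main() {
  final document = parse(sampleHtml);

  // Lookups on the whole Document.
  final profile = document.getElementById('profile');
  print(document.getElementsByClassName('name').length); // 2
  print(document.getElementsByTagName('p').length); // 1

  // The same kind of lookups, scoped to one Element.
  print(profile?.getElementsByTagName('span').length); // 2
  print(profile?.querySelector('.nickname')?.text); // ada42
}
```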

4.2. Traversing

Traversing means navigating across the DOM tree. The Html package provides methods that operate on the Document, on a set of Elements, or on a specific Element, allowing you to navigate to a node’s parents or children.
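
A minimal sketch of moving from one element to its parent, its parent's children, and its siblings. The menu markup is invented for illustration.

```dart
import 'package:html/dom.dart';
import 'package:html/parser.dart' show parse;

// Sample markup invented for this sketch.
const menuHtml = '''
<ul id="menu">
  <li>Home</li>
  <li id="current">Blog</li>
  <li>About</li>
</ul>
''';

void main() {
  final document = parse(menuHtml);
  final Element current = document.querySelector('#current')!;

  // Up to the parent, then back down to its element children.
  print(current.parent?.id); // menu
  print(current.parent?.children.length); // 3

  // Sideways to the neighbouring elements (text nodes are skipped).
  print(current.previousElementSibling?.text); // Home
  print(current.nextElementSibling?.text); // About
}
```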

Also, you can jump to the first, the last, and the nth (using a 0-based index) Element in a set of Elements:
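
Since querySelectorAll returns an ordinary Dart list, jumping to a position can be sketched with the usual list accessors. The markup and the helper itemTexts are invented for illustration. Note that first and last throw on an empty list, so check isEmpty when a match is not guaranteed.

```dart
import 'package:html/dom.dart';
import 'package:html/parser.dart' show parse;

// Sample markup invented for this sketch.
const listHtml = '<ul><li>Home</li><li>Blog</li><li>About</li></ul>';

/// Hypothetical helper: text of every <li> in [html].
List<String> itemTexts(String html) =>
    parse(html).querySelectorAll('li').map((e) => e.text).toList();

void main() {
  final List<Element> items = parse(listHtml).querySelectorAll('li');

  print(items.first.text); // Home
  print(items.last.text); // About
  print(items[1].text); // Blog (index is 0-based)
}
```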

5. Extracting

We now know how to reach specific elements, so it’s time to get their content, namely their attributes, HTML, or child text.

Take a look at this example, which selects my name on my GitHub page:
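
The original example is not reproduced here, and the real class names on GitHub are not assumed. The sketch below instead uses invented markup that loosely resembles a profile card, to show how text, attributes, and HTML can be read from selected elements.

```dart
import 'package:html/dom.dart';
import 'package:html/parser.dart' show parse;

// Invented markup; the class names are assumptions, not GitHub's real ones.
const profileHtml = '''
<div class="vcard">
  <a class="avatar" href="/jane-doe">avatar</a>
  <span class="fullname">Jane <b>Doe</b></span>
</div>
''';

void main() {
  final document = parse(profileHtml);
  final Element name = document.querySelector('.fullname')!;
  final Element link = document.querySelector('a.avatar')!;

  print(name.text); // Jane Doe (text of the element and its descendants)
  print(name.innerHtml); // Jane <b>Doe</b>
  print(name.outerHtml); // <span class="fullname">Jane <b>Doe</b></span>
  print(link.attributes['href']); // /jane-doe
  print(link.attributes['title']); // null (attribute is absent)
}
```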

Here are some tips to bear in mind when choosing and using selectors:

  • Rely on the “View Source” feature of your browser and not only on the page DOM as it might have changed (selecting at the browser console might yield different results than the Html package)
  • Know your selectors as there are a lot of them and it’s always good to have at least seen them before; mastering selectors takes time
  • Be less dependent on page changes: aim for the smallest and least compromising selectors (e.g. prefer id-based selectors).

6. Modifying

Editing includes defining attributes, text, and HTML for elements, as well as adding and removing elements. This is done on the DOM tree previously generated by the Html package — the Document.
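
As a sketch of editing an element in place, the example below sets attributes, text, and inner HTML on a paragraph. The markup and the helper updateGreeting are invented for illustration.

```dart
import 'package:html/dom.dart';
import 'package:html/parser.dart' show parse;

/// Hypothetical helper: parses [html] and edits the element with id "intro".
Element updateGreeting(String html) {
  final document = parse(html);
  final Element intro = document.querySelector('#intro')!;

  intro.attributes['class'] = 'highlight'; // overwrite an attribute
  intro.attributes['data-lang'] = 'en'; // add a new attribute
  intro.text = 'Hello, Dart'; // replace the child text
  return intro;
}

void main() {
  final intro = updateGreeting('<p id="intro" class="old">Hi</p>');
  print(intro.attributes['class']); // highlight
  print(intro.text); // Hello, Dart

  // Setting innerHtml replaces the children with parsed markup.
  intro.innerHtml = 'Hello, <em>Dart</em>';
  print(intro.innerHtml); // Hello, <em>Dart</em>
}
```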

6.1. Add Element to DOM

Even though the Html package does not offer jQuery-style helpers for this kind of modification, you can add an element to the document like this:
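
One possible way to do it, sketched here with invented markup: create an element with Element.tag or Element.html, then attach it to a parent with append. The helper withNewItem is hypothetical, and this may differ from the approach in the original snippet.

```dart
import 'package:html/dom.dart';
import 'package:html/parser.dart' show parse;

/// Hypothetical helper: parses [html] and appends an <li> to the first <ul>.
Document withNewItem(String html, String label) {
  final document = parse(html);
  final Element list = document.querySelector('ul')!;

  // Build the element from a tag name, then set its text.
  final item = Element.tag('li')..text = label;
  list.append(item);
  return document;
}

void main() {
  final document = withNewItem('<ul id="todo"><li>Parse</li></ul>', 'Filter');
  final list = document.querySelector('#todo')!;

  // Alternatively, build the element from an HTML snippet.
  list.append(Element.html('<li class="done">Extract</li>'));

  print(list.children.map((e) => e.text).toList()); // [Parse, Filter, Extract]
}
```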

6.2. Removing Elements

To remove elements, first select them, then call the remove() method on each one:
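
A minimal sketch of the select-then-remove pattern. The helper withoutMatches and the sample markup are invented for illustration.

```dart
import 'package:html/dom.dart';
import 'package:html/parser.dart' show parse;

/// Hypothetical helper: parses [html] and removes every match of [selector].
Document withoutMatches(String html, String selector) {
  final document = parse(html);
  for (final element in document.querySelectorAll(selector)) {
    element.remove(); // detaches the element from its parent
  }
  return document;
}

void main() {
  const html = '<p>Keep me</p><p class="ad">Buy now</p><p class="ad">Sale</p>';
  final document = withoutMatches(html, '.ad');

  print(document.querySelectorAll('p').length); // 1
  print(document.querySelector('p')?.text); // Keep me
}
```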

6.3. Converting the Modified Document to HTML

Finally, since we were changing the Document, we might want to check our work.

To do this, we can explore the Document DOM tree by selecting, traversing, and extracting using the presented methods, or we can simply extract its HTML as a String using the outerHtml property of the Document:
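
As a sketch, the helper below (the name render is made up) parses a fragment and serializes the whole Document back to a String. Notice that the parser has added the html, head, and body elements that the input lacked.

```dart
import 'package:html/parser.dart' show parse;

/// Hypothetical helper: parses [html] and serializes the full document.
String render(String html) => parse(html).outerHtml;

void main() {
  print(render('<p>Hello</p>'));
  // <html><head></head><body><p>Hello</p></body></html>
}
```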

The resulting String is plain HTML.

7. Conclusion

I know the journey has been long, so I will end with these words: the Html package is an excellent library for scraping web pages. If you are using Dart and do not require browser-based scraping, it is a library to consider. It is familiar and easy to use because it builds on what you may already know from front-end development, and it follows good practices and design patterns.
Thanks for reading. I hope you liked this article, and please don’t hesitate to leave a comment below.

Jonathan Monga
Flutter Community

Java dev | Dart Dev | Android developer | Flutter developer | Speaker.