Converting HTML documents to Grakn knowledge bases (and back again!)

Nick Powell
Published in Vaticle · 6 min read · Sep 18, 2017
Image by Hilary Halliwell is licensed under CC BY 2.0

I wrote in my previous post about how Grakn does really well with information that is structured in a hierarchy. In such systems, you have data — let’s call each individual piece of data a node — that leads to more data nodes by way of special relationships. If you’ve ever taken an algorithms class you’re probably pretty familiar with trees of various kinds. In trees, parent nodes lead to child nodes, which are themselves parents of yet more nodes, all the way down the tree until you reach the leaves, which have no child nodes below. Nodes that have the same parent are called siblings.

Grakn models tree-like structures such as this quite well.

HTML documents follow this general architecture, where it is called the Document Object Model (DOM for short). In it, the various parts of the markup are structured much like the tree described above, with sections leading to other sections in a hierarchical manner. When your web browser loads the data from the server, it parses the markup into this tree and delivers it to you as a beautiful (or ugly, if the UI developer sucks) webpage.

I thought it would be cool to write a program that took an HTML document and encoded it into a Grakn knowledge base. You could then do analytics on that HTML document, or you could add and remove elements from it — much more quickly and elegantly, I might add, than manually poring through an HTML document and changing everything you can find. You could also change design patterns in the blink of an eye. Then, when you’re done, you could decode it back into HTML and load it in your browser. Pretty neat, right?

Disclaimer: the program is a proof of concept, and doesn’t always work for non-XHTML documents (i.e. where XML rules do not apply). For example, link and meta tags (and several other void elements) are not written with a closing slash in non-XML documents and never take an end tag, so the program thinks that every successive link is a level deeper in the DOM tree. To fix this, you would need to recognize the type of document you have and whether these elements take end tags.
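One possible way around the nesting problem, sketched here rather than taken from the program itself, is to keep a list of elements that never take an end tag and refuse to go a level deeper when one of them opens. The class and function names below (DepthTracker, tag_depths) are made up for illustration, and the sketch uses Python 3's html.parser module.

```python
from html.parser import HTMLParser

# Void elements from the HTML standard: they never take an end tag.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "param", "source", "track", "wbr",
}


class DepthTracker(HTMLParser):
    """Records (tag, depth) pairs without letting void elements nest."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, self.depth))
        if tag not in VOID_ELEMENTS:
            self.depth += 1  # only elements that will be closed go deeper

    def handle_endtag(self, tag):
        # Also reached for XHTML-style <link />, where it must be a no-op.
        if tag not in VOID_ELEMENTS:
            self.depth -= 1


def tag_depths(html):
    """Return the (tag, depth) pairs seen while parsing the given HTML."""
    tracker = DepthTracker()
    tracker.feed(html)
    tracker.close()
    return tracker.seen
```

With this in place, a run of link tags inside head all report the same depth, whether or not they are written XHTML-style with a trailing slash.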

The GitHub repo is below!

Encoding

Every Grakn project begins with defining a schema! Here, everything revolves around the HTML tags, which are the modular elements that give function to the various parts of the document. There are many tags out there — so many, in fact, that it would be unreasonable to define each one schematically. At the same time, you want to define as many as possible so that you can actually do meaningful things with the document in Grakn. The same thing goes with the tag attributes — some tags have many possible attributes, so it was prudent to define as resources only the commonly used ones, and preserve the others in a generic “other” attribute that could then be used to re-construct the original document.
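The split between modelled attributes and the generic “other” bucket can be sketched as follows. This is only an illustration: the attribute names in KNOWN_ATTRIBUTES and the function split_attributes are assumptions, not the list or the code the program actually uses.

```python
from html import escape

# Hypothetical set of attributes that get their own resource in the schema.
KNOWN_ATTRIBUTES = {"id", "class", "href", "src", "style", "title"}


def split_attributes(attrs):
    """Split (name, value) pairs into modelled ones and a raw 'other' string."""
    known = {}
    other = []
    for name, value in attrs:
        if name in KNOWN_ATTRIBUTES:
            known[name] = value
        elif value is None:
            other.append(name)  # valueless attribute such as "hidden"
        else:
            # Keep the attribute in source form so it can be written back out.
            other.append('{}="{}"'.format(name, escape(value, quote=True)))
    return known, " ".join(other)
```

The input has the same shape as the attrs list that Python's HTML parser hands to its start-tag handler, so the "other" string can be pasted back into the tag unchanged when the document is rebuilt.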

I ended up with 18 defined HTML tags subclassed from an entity called “global”. Since most HTML elements take attributes from a global superclass, this makes sense. I also included a “container” entity for all other tags, as well as a “data” entity for the text content of a node.
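As a rough sketch of how that fallback might look in code: a tag with its own entity type keeps it, and anything else becomes a “container”. The tag set here is a hypothetical subset, not the actual 18, and keeping the original tag name next to the entity type is an assumption about how the tag could be written back out later.

```python
# Hypothetical subset of the tags that have their own entity type.
DEFINED_TAGS = {"html", "head", "body", "div", "p", "a", "img", "span"}


def entity_type_for(tag):
    """Return (entity type, original tag name) for an HTML tag."""
    name = tag.lower()
    if name in DEFINED_TAGS:
        return name, name
    # Unknown tags share one generic type but remember what they were.
    return "container", name
```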

Adding Grakn support for tags or resources not included in this is very simple — you would just have to add it to the schema and the default_tags dictionary, and then you could query Grakn for information on that particular tag or attribute. Below you can see the schema’s default tags, as well as the default attributes:

Parsing the DOM Tree

I used Python, and specifically the HTMLParser module, to do the parsing. You feed it the document and it executes methods to handle each type of element it comes across — start tags, end tags, data content, even declarations.

I overrode these methods in domEncoder.py so that the program could properly insert each element into Grakn and perform the necessary housekeeping.
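To give a feel for the handler methods, here is a minimal sketch that overrides them to record events in a list instead of inserting anything into Grakn. It is not the code from domEncoder.py: EventRecorder and record_events are names made up for this example, and it uses Python 3's html.parser (the Python 2 module was called HTMLParser).

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Records each parsing event; a real encoder would write to Grakn here."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text between tags
            self.events.append(("data", data.strip()))

    def handle_decl(self, decl):
        self.events.append(("decl", decl))


def record_events(html):
    """Parse the given HTML and return the list of recorded events."""
    recorder = EventRecorder()
    recorder.feed(html)
    recorder.close()
    return recorder.events
```

Feeding it a doctype followed by a single paragraph yields one "decl" event, one "start", one "data" and one "end", in document order.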

Relationships

I took the parent-child relationship inherent to tree structures and made that a Grakn relationship. I also made siblinghood a relationship, so that a sequence of elements like the links below

would be siblings in the order they appeared in the document. This is because, for the decoding step, it matters in which order a node’s children are visited. Otherwise you would get elements of the webpage appearing all over the place!
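The two relationships can be sketched in plain Python, with tuples standing in for Grakn relationships. Everything here (RelationBuilder, build_relations, the integer node ids) is hypothetical, and the sketch assumes well-formed input where every element is closed.

```python
from html.parser import HTMLParser


class RelationBuilder(HTMLParser):
    """Collects parent-child and left-right sibling pairs as plain tuples."""

    def __init__(self):
        super().__init__()
        self.labels = []        # node id -> tag name
        self.parent_child = []  # (parent id, child id)
        self.siblings = []      # (left id, right id), in document order
        self._stack = []        # ids of the currently open elements
        self._last_child = {}   # parent id -> most recently added child id

    def handle_starttag(self, tag, attrs):
        node = len(self.labels)
        self.labels.append(tag)
        if self._stack:
            parent = self._stack[-1]
            self.parent_child.append((parent, node))
            if parent in self._last_child:
                # Link the previous child to this one to preserve order.
                self.siblings.append((self._last_child[parent], node))
            self._last_child[parent] = node
        self._stack.append(node)

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()


def build_relations(html):
    """Parse the given HTML and return the populated builder."""
    builder = RelationBuilder()
    builder.feed(html)
    builder.close()
    return builder
```

Because each sibling pair points from the earlier element to the later one, walking the chain from a parent's first child visits the children in the order they appeared.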

When you encode a page, the program hashes the URL you provide, and that hash becomes the name of the keyspace that holds your knowledge base.
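A minimal sketch of that step, assuming SHA-1 and a shortened digest. The post does not say which hash function is used, and the leading letter is only a guess at keeping the name usable as an identifier; keyspace_for is a hypothetical helper.

```python
import hashlib


def keyspace_for(url):
    """Derive a stable keyspace name from a URL (hash choice is an assumption)."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    # Prefix with a letter so the name never starts with a digit.
    return "k" + digest[:16]
```

The same URL always maps to the same name, so encoding a page twice lands in the same keyspace, while different URLs get different names for all practical purposes.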

Decoding

Now that you know how a website is encoded in Grakn, it’s not difficult to imagine how the decoding step works.

Essentially, using the Grakn Client Python (which you can find here!), we recursively step through the graph, following each node’s children and right siblings and extending a string that represents the HTML document as we go. Once every element in the keyspace has been un-parsed (if that’s a word), we have the final string, which we save as the HTML document.
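The traversal can be sketched without Grakn at all, using a dictionary in place of the knowledge base. The node layout (tag, attrs, text, first_child, right_sibling) and the render function are assumptions made for this example; the real program queries Grakn for the same information. Like the program, the sketch writes an end tag for every element.

```python
from html import escape


def render(node_id, nodes):
    """Render a node, its subtree, and then its right siblings as HTML."""
    parts = []
    while node_id is not None:
        node = nodes[node_id]
        if node["tag"] is None:
            # Text node: write the content, escaped for HTML.
            parts.append(escape(node["text"], quote=False))
        else:
            attrs = "".join(
                ' {}="{}"'.format(name, escape(value, quote=True))
                for name, value in node["attrs"].items()
            )
            parts.append("<{}{}>".format(node["tag"], attrs))
            parts.append(render(node["first_child"], nodes))  # descend first
            parts.append("</{}>".format(node["tag"]))
        node_id = node["right_sibling"]  # then move along the sibling chain
    return "".join(parts)


# A tiny hand-built document: a list with two items.
EXAMPLE_NODES = {
    0: {"tag": "ul", "attrs": {}, "text": None, "first_child": 1, "right_sibling": None},
    1: {"tag": "li", "attrs": {"class": "x"}, "text": None, "first_child": 2, "right_sibling": 3},
    2: {"tag": None, "attrs": {}, "text": "one", "first_child": None, "right_sibling": None},
    3: {"tag": "li", "attrs": {}, "text": None, "first_child": 4, "right_sibling": None},
    4: {"tag": None, "attrs": {}, "text": "two", "first_child": None, "right_sibling": None},
}
```

Calling render(0, EXAMPLE_NODES) walks down through each first child and across through each right sibling, which is why the sibling order recorded during encoding matters.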

A couple notes:

  • As mentioned above, documents that don’t follow XML specifications are likely to be messed up.
  • There are a lot of edge cases in HTML, and although I’ve tried to find as many as I could, I’m sure I haven’t found them all. Keep that in mind if the program doesn’t work (and consider it an exercise to find the edge cases and fix them!).

To Sum Up

It’s not much use for me to only explain how this program works — it’s something that you have to use yourself! Hope you like this idea. Overall, I think there are a lot of interesting things that could be accomplished with this, and would love to hear your feedback.

If you enjoyed this article, please hit the clap button below, so others can find it, too. Please get in touch if you’ve any questions or comments, either below, via our Community Slack channel, or via our discussion forum.

Find out more from https://grakn.ai.

Feature Image credit: “Tree” by Hilary Halliwell is licensed under CC BY 2.0
