Extracting Web Page Metadata with mget

Photo by Markus Spiske temporausch.com from Pexels

When exploring archives coming from online services for my Data Portability Kit project, one of the most common and basic tasks I faced was analysing and fixing URL-related content.

URLs are the basic building blocks of the Web. Google started out by exploring URLs to extract meaning from links. By itself, a link already conveys some basic pieces of information.

Later on, the World Wide Web Consortium (W3C) and the Semantic Web community worked on expanding the amount of information a link could convey. Since then, a large number of handy specifications have been produced to make the Web richer: RDF, RDFa, Microformats, etc.

The Semantic Web comes with quite a reputation. It is said to be complex or useless. However, the reality is that the technology is heavily used today. Facebook and Twitter use it to extract information about the links you share. Google and other search engines use it to make data indexing more relevant and to improve their search graph.

When you want to find an accurate title for a link that has been truncated by an online service, you can in most cases retrieve it from the page the URL points to. While you are at it, you can take the opportunity to extract the additional data those standards provide. I do not want to limit myself to title extraction.
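To give an idea of what recovering a title involves, here is a minimal sketch in Go. It is not mget's actual code: the function name pageTitle is hypothetical, and it relies on the standard library's encoding/xml decoder in its lenient, HTML-like mode as a stand-in for a real HTML parser, so it will likely stumble on messy pages (inline scripts, for instance). It takes the HTML as a string and leaves fetching the page out.

```go
package main

import (
	"encoding/xml"
	"strings"
)

// pageTitle (hypothetical helper) returns the text of the first <title>
// element found in the given HTML, or "" if there is none.
func pageTitle(html string) string {
	d := xml.NewDecoder(strings.NewReader(html))
	// Settings documented by encoding/xml for HTML-like input.
	d.Strict = false
	d.AutoClose = xml.HTMLAutoClose
	d.Entity = xml.HTMLEntity

	inTitle := false
	var title strings.Builder
	for {
		tok, err := d.Token()
		if err != nil {
			break // io.EOF or unparsable input: return what we have so far
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if strings.EqualFold(t.Name.Local, "title") {
				inTitle = true
			}
		case xml.CharData:
			if inTitle {
				title.Write(t)
			}
		case xml.EndElement:
			if strings.EqualFold(t.Name.Local, "title") {
				return strings.TrimSpace(title.String())
			}
		}
	}
	return strings.TrimSpace(title.String())
}
```

Called on a document whose head contains a title element, pageTitle returns that text with the surrounding whitespace trimmed. For a page without a title, it returns an empty string.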

That's what led me to write mget, a simple tool to explore the metadata hidden in web pages. The tool is pretty basic for now. It is limited to extracting global page-level information, but I will extend it as I keep on playing with the Semantic Web.

mget is like wget, but focused on extracting page metadata.

mget is a Go tool that you can easily install with the Go command-line tool:

$ go get -u github.com/processone/dpk/cmd/mget

Assuming you have ~/go/bin in your PATH, you can then run it with:

$ mget https://www.process-one.net
{
  "properties": {
    "description": "ProcessOne delivers rich Messaging, IoT and Push services that will help your business grow.",
    "og:description": "ProcessOne delivers rich Messaging, IoT and Push services that will help your business grow.",
    "og:image": "https://static.process-one.net/bootstrap/img/art/p1.jpg",
    "og:title": "Build Awesome Realtime Software with ProcessOne",
    "og:type": "product",
    "og:url": "https://www.process-one.net/en/",
    "title": "Build Awesome Realtime Software with ProcessOne",
    "twitter:card": "summary_large_image",
    "twitter:description": "ProcessOne delivers rich Messaging, IoT and Push services that will help your business grow.",
    "twitter:image": "https://static.process-one.net/bootstrap/img/art/p1.jpg",
    "twitter:site": "@processone",
    "twitter:title": "Build Awesome Realtime Software with ProcessOne"
  }
}
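
The output above is essentially a flat map from meta tag names to their content. As an illustration only (this is not how mget is implemented, and the names pageMetadata, collectMeta and toJSON are made up for the example), such a map could be assembled along these lines. The sketch uses the standard library's encoding/xml decoder in lenient mode in place of a proper HTML parser, and it only looks at meta tags, so the "title" entry is not covered.

```go
package main

import (
	"encoding/json"
	"encoding/xml"
	"strings"
)

// pageMetadata (hypothetical type) mirrors the shape of the JSON shown above.
type pageMetadata struct {
	Properties map[string]string `json:"properties"`
}

// collectMeta (hypothetical helper) gathers the content of every
// <meta name="..."> or <meta property="..."> tag found in the HTML.
func collectMeta(html string) pageMetadata {
	d := xml.NewDecoder(strings.NewReader(html))
	// Settings documented by encoding/xml for HTML-like input.
	d.Strict = false
	d.AutoClose = xml.HTMLAutoClose
	d.Entity = xml.HTMLEntity

	meta := pageMetadata{Properties: map[string]string{}}
	for {
		tok, err := d.Token()
		if err != nil {
			break // io.EOF or unparsable input: keep what was collected
		}
		start, ok := tok.(xml.StartElement)
		if !ok || !strings.EqualFold(start.Name.Local, "meta") {
			continue
		}
		var key, content string
		for _, a := range start.Attr {
			switch strings.ToLower(a.Name.Local) {
			case "name", "property": // Open Graph uses "property"
				key = a.Value
			case "content":
				content = a.Value
			}
		}
		if key != "" && content != "" {
			meta.Properties[key] = content
		}
	}
	return meta
}

// toJSON renders the metadata as indented JSON; encoding/json sorts map keys.
func toJSON(m pageMetadata) (string, error) {
	b, err := json.MarshalIndent(m, "", "  ")
	return string(b), err
}
```

Because encoding/json sorts map keys when marshalling, the properties come out in alphabetical order, as in the mget output.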

You can find the source code on GitHub.

Please let me know if you would like me to offer prebuilt mget binaries for download on GitHub.

Enjoy!


You can check out more advanced usage of mget in the follow-up article: Exploring mget: More fun with Web pages metadata.