Anatomy of the semantic web

Parijat Sardesai
Globant
Sep 22, 2022 · 9 min read

Co-Author: Akhilanand Singh

Once upon a time, in fairy tales, there was a magic mirror. The mirror held all knowledge and could answer any question put before it.

In our times, we have the World Wide Web, which we use for almost anything and everything. Search engines perform a keyword search and return a list of documents as the result. Reading through the documents, understanding them, and deriving an answer to our question is a task left to us, unlike with the magic mirror of the tales. Not very magical, right?

Can there be a smarter web where software agents, while roaming from page to page, connect the documents with the help of associated meta information and then infer meaningful information from it?

Finding techniques to make computers understand things on the web better has kept computer scientists busy for a very long time. Semantic Web (also referred to as Web 3.0), a vision by Tim Berners-Lee, provides an effective solution to the inadequacies of the current web.

Fancy term! But what can it actually do?

Semantic Web can help us solve some actual problems like:

Treatment for ailments & drug discovery
In the semantic web, medical professionals maintain their knowledge about symptoms, diseases, and treatments in machine-understandable documents on the web. Additionally, pharmaceutical companies keep their information about drugs, dosages, and allergies in a similar format.

Now imagine a hospital using a decision support system that combines the knowledge kept in these documents with the patient’s data. Such a system could suggest treatments and monitor drug efficacy and possible side effects. Cool, isn’t it?

Ticketing by virtual agents
Airlines keep their schedules, ticket prices, etc., in semantic form on the web. Using this data along with a company’s employee data, a virtual assistant can book tickets, taking into consideration the traveler’s preferences, the shortest route, and so on.

Sounds Interesting! So what is the Semantic Web?

The semantic web is an extension of the current web in which information is connected and given an unambiguous, well-defined meaning, enabling computers to infer or deduce facts from the wealth of documents available on it.

This article is an attempt to unravel the concept of the semantic web and what is required to realise this dream using W3C concepts and technologies like RDF, RDFS, RDFa, OWL, SPARQL, and Protégé.

Are we talking about reengineering the web?

On one hand, we have millions of documents spread all over the internet and on the other hand, we are talking about tweaking the software agents so that they become smart enough to give us processed inferred information.

There are broadly two approaches to solving this problem:

Solution Approach 1:

Leave the web documents as they are and make the search engines wiser.
It would be wonderful if we could just work on the machine agents and make them powerful enough to start deducing facts from one or more documents. This idea does not require changing anything in the vast set of web documents, hence it is very alluring.

The folly in this idea is the fact that a single word or sentence can have different meanings based on the context in which it is used. For example, the word “bank” has a very different meaning in the following sentences:

“John deposited his salary in the bank.”
“John rested on the bank of the river.”

We humans, while reading the above sentences, apply our cognitive capabilities to understand the context of each sentence, and thereby comprehend them correctly. For machines too, we would have to provide the context in some form so that they understand sentences the way we do. Now the real question arises: where do we store the context so that it is readily available to the machines while they read the sentence? The answer to this question lies in the second approach.

Solution Approach 2:

Add the contextual information in the documents and then tweak the search engines to decipher them.

For expressing the context of data in a sentence, the Semantic Web proposes a framework called RDF (Resource Description Framework). As per RDF, a sentence in a web document is logically divided into three parts — the subject, the predicate, and the object — and such a statement is called a triple.

For example, let’s take a look at the following two sentences:

“Europa is a satellite of Jupiter.”
“John is an acquaintance of Maria.”

The framework suggests giving unique URIs (Uniform Resource Identifier) to the subject, the predicate, and the object elements in sentences, to make them unambiguously identifiable.

The URI for “Europa” can be a web page describing what Europa is. Similarly, there can be a page for “Jupiter” which will have details about it. The predicate “is a satellite of” can be uniquely identified in a vocabulary that will be universally known. In the second sentence “John” and “Maria” can have their own unique pages identified by URIs and the predicate “is an acquaintance of” can be defined in another unique vocabulary that talks about relationships.
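As a sketch, the two triples could be written in RDF’s Turtle notation; the ex: URIs below are hypothetical placeholders, while foaf:knows is a real term from the FOAF vocabulary:

```turtle
@prefix ex:   <http://example.org/> .        # hypothetical URIs for illustration
@prefix foaf: <http://xmlns.com/foaf/0.1/> . # the Friend Of A Friend vocabulary

# subject     predicate        object
ex:Europa     ex:satelliteOf   ex:Jupiter .
ex:John       foaf:knows       ex:Maria .
```

Each line is one triple: a uniquely identified subject, predicate, and object, terminated by a full stop.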

Let’s talk about some actual code to embed the RDF data (context metadata)!

The metadata for the triples in a sentence will not be visible to a human reader, but it is kept in the web page as special HTML attributes to help search engines and other machines. The following web document is HTML enriched with RDF data added as attributes to its tags.
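The original post embedded an example page here; a minimal sketch of such a document, using FOAF terms via RDFa attributes (the names and URL are hypothetical), could look like this:

```html
<!-- The RDFa attributes (vocab, typeof, property) carry machine-readable
     data; a browser renders only the visible text. -->
<div vocab="http://xmlns.com/foaf/0.1/" typeof="Person">
  <a property="homepage" href="http://example.org/john/">
    <span property="name">John Doe</span>
  </a>
  knows
  <span property="knows" typeof="Person">
    <span property="name">Maria Roe</span>
  </span>.
</div>
```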

In the document above, a few new HTML attributes like vocab, property, and typeof can be observed. They come from a technology named RDFa, in which the ‘a’ stands for attributes. RDFa provides a set of attributes that allow RDF data to be incorporated into an HTML document. Note that the RDFa attributes are not for human consumption; they are there to express machine-readable data in web documents.

The values homepage, name, Person, and knows are terms that come from the “Friend Of A Friend” (FOAF) vocabulary, a well-defined vocabulary of machine-understandable terms located at the URI http://xmlns.com/foaf/0.1/.

Do we have different vocabularies for the semantic web?

A vocabulary is a knowledge model which defines a set of concepts and the relationships between those concepts within a specific domain. To put it simply, vocabularies are nothing but a glossary of terms for a specific domain, defined in a machine-understandable way. Vocabularies are also known as ontologies; the term is borrowed from philosophy, where ontology, a branch of metaphysics, deals with the study of being.

In the above example, the FOAF vocabulary is used to define the predicates in the sentences.

Ontologies are often domain specific. For example, the ontologies in the field of health care would be different from those defined by engineers. Medical professionals use their specific ontology to represent knowledge about symptoms, diseases, and treatments. Pharmaceutical companies use their own ontology to represent information about drugs, dosages, and allergies.

Two ontology-defining languages exist today: RDFS (RDF Schema) and OWL (Web Ontology Language).

Namespaces of RDFS and OWL respectively: http://www.w3.org/2000/01/rdf-schema#, http://www.w3.org/2002/07/owl#
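To give a feel for these languages, here is a sketch of a tiny vocabulary declaring two classes and a property; all the ex: terms are made up for this example, while the rdfs: and owl: terms are standard:

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix ex:   <http://example.org/astro#> .   # hypothetical domain vocabulary

ex:CelestialBody  a rdfs:Class .
ex:Satellite      a rdfs:Class ;
                  rdfs:subClassOf ex:CelestialBody .

ex:satelliteOf    a owl:ObjectProperty ;
                  rdfs:domain ex:Satellite ;
                  rdfs:range  ex:CelestialBody .
```

With such definitions in place, a machine reading the triple “Europa is a satellite of Jupiter” can infer that Europa is a satellite and that Jupiter is a celestial body, even though neither fact was stated explicitly.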

Let’s take a look at some of the popular ontologies:

FOAF: FOAF (an acronym of Friend Of A Friend) is an ontology describing persons, their activities, and their relations to other people and objects. It has terms such as foaf:name, foaf:surname, foaf:firstName, foaf:familyName, foaf:knows, etc.

DC: Dublin Core (DC) is an ontology which provides terms to create a digital “library card catalog” for the Web. It is made up of elements that offer expanded cataloging information and improved document indexing for search engines.

SIOC: The Semantically-Interlinked Online Communities (SIOC) ontology is used to describe online communities such as forums, blogs, mailing lists, and wikis. It complements FOAF by focusing on the description of the products of those communities (posts, replies, threads, etc.). SIOC is a lightweight ontology (17 classes, 61 object properties, 25 datatype properties).

What are the special skills we need to make the web Semantic?

The semantic web needs three kinds of developers.

1) The semantic data producers:

The first kind of developers are those who attach RDF data to web pages so that an ordinary web page becomes a semantic web page. They need to know RDFa, RDFS, OWL, and some popular vocabularies like the Friend Of A Friend (FOAF) vocabulary, the Dublin Core (DC) vocabulary, the Creative Commons (CC) vocabulary, etc. Given the vastness of the web pages we have on the internet today, the world needs a huge number of such developers.

2) The ontology miners:

These are the ones who create new domain-specific ontologies. For this, they use tools like Protégé, a visual tool for creating new vocabularies. These developers require a deeper understanding of OWL, along with deep domain knowledge.

3) The semantic data consumers:

The third type of semantic developers are the ones who enhance search engines and create other intelligent machine agents that parse RDF metadata in HTML documents, use the query language SPARQL to derive meaning from it, and perform inference.
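For instance, assuming FOAF-annotated pages have been harvested into an RDF store, a consumer could ask “whom does John know?” with a SPARQL query like the following (John’s URI is a hypothetical placeholder):

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# Find the names of everyone John knows
SELECT ?name
WHERE {
  <http://example.org/john/> foaf:knows ?person .
  ?person foaf:name ?name .
}
```

The query is pattern matching over triples: each line of the WHERE clause is a triple pattern, and variables like ?person link the patterns together.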

Alright! Is someone actually working on this?

The BBC website for the 2010 World Cup was a notable example of the adoption of Semantic Web technologies. Every player in every team had their own web page, and the ease with which search engines could traverse from one piece of content to the next was remarkable.

Facebook uses an abbreviated version of RDFa in its Open Graph protocol. It wants to make it as simple as possible for tens of thousands of developers to place semantic metadata inside their web pages.

Google’s search engine has used the Knowledge Graph since 2012, which is what allows us to enjoy the rich snippets in search results. Nowadays, Google search does not focus on matching strings of text but on identifying entities, i.e. real-world objects.

Airbnb conducts semantic content analysis of online hotel reviews to identify the main drivers of customer satisfaction and dissatisfaction.

In 2020, Bosch created new ontologies and used a SIB framework to semantically integrate its manufacturing data for the analysis of the surface mounting process.

Conclusion

The migration of the current web to the semantic web is slow because of the difficulty of representing abstract concepts, like natural emotions, in a software-comprehensible language. Other challenges are the issue of trust in and appropriateness of data, along with sheer unawareness of the technology.

And even with the intent, how can developers determine which ontologies are trustworthy? The ontologies for the semantic web must be developed, managed, and endorsed by communities that maintain them over the years.

A joint effort by Google, Microsoft, Yahoo, and Yandex proposed a solution called Schema.org to promote structured data in web pages with a common vocabulary. If the ever-growing user base of schema.org is any indication, migration to the semantic web is inevitable. It is the future of the present web and surely is a realizable dream.
