The power of knowledge graphs

philipp dohmen, qaecy
Sep 21, 2021 · 12 min read

The L-CDE project, Part 1.

OK, we have to start with some basics. I promised to keep the technical nerd stuff to a minimum, but there are some things we need to talk about to firm up a common understanding. This article is about a fundamental one: the knowledge graph. In summary: we believe that the future of cooperation in the construction industry will be based on data rather than on files. If we want to see things at the data level and work together, we need to connect data. The gap we want to fill is connectivity. At the same time, our approach still enables the handling and viewing of federated models. We take care of data ownership and intellectual property. And if nothing else, we strictly rely on open source and dramatically reduce costs.

Oh, and this is nothing we made up, nor is it just academic research; we use real industry standards (like RDF or LPG) and battle-proven technology (like graph databases).

A big apology in advance to all the experts: we sometimes trade precision for more accessible language. But we want ordinary people to be interested too, right? There are a lot of interesting discussions on the definitions of the Semantic Web, Linked Open Data and Knowledge Graphs, but we'll keep them out of the scope of this text.

Daily usage

Even though it affects our lives daily, most people have no idea what a knowledge graph is. When you ask what Google does, most people will say it gives you a nice list of websites (next to some advertisements)… but that was long ago. So if I google “carbon dioxide”, for example, what do I get back?

Google and you make use of a knowledge graph.

Even though I typed in the English term, Google answers me in German, because it knows where I live and what language I speak. I also get an entire box of information: that it is a greenhouse gas, its chemical formula and molar mass, that there are other gases like oxygen. And I get the information that other people looking into this were also searching for things like “climate change”… How is this possible?

What you are looking at is a knowledge graph, the backbone of today’s state-of-the-art information systems. All the context offered when searching for “carbon dioxide” may be provided by sources like DBpedia or Wikidata. (This is a simplification, of course; Google has billions of nodes and can calculate probabilities and importance to connect data from multiple sources, but let's stay with that simple, crystal-clear data model for the moment.) And this context improves everything from search results and question answering to recommender systems and explainable AI systems… Now stop for a second and think of getting this into the project you are working on!

DIKW

OK, how could we apply this to our work? We need to step back and do the theory before we go into details. So, a short breakout. You have all heard of the DIKW (Data, Information, Knowledge, Wisdom) pyramid. It's been used in so many forms that no one can keep track; even here on Medium there are some excellent explanations. And of course, everybody knows that value grows along with meaning. And maybe you will agree that we have a lot of data and some information, but we lack knowledge and wisdom in our industry?

The good old DIKW

But what is knowledge? And how do we represent it? Traditionally, it is explained as a justified subset of all true beliefs, and we will just keep it that way. And representation? Of course, language can be a way to represent knowledge, but as we are talking digitalisation here, we will focus on formal knowledge representation: so-called schemas or ontologies.

Formal knowledge representation is a part of AI (artificial intelligence) that captures the meaning of concepts, properties, relationships, and entities of specific knowledge domains (like construction) unambiguously, as structured data. This way, computers can understand formal knowledge representations and interpret them correctly. (If you want a peek at the Web of Data and how it is linked, have a look here: https://lod-cloud.net/.) And there are uniform metadata schemas for buildings too. Schemas like Brickschema are small ontologies, but we will leave that for now and come back to it in a separate episode.

As an extension of the current Web, the meaning of information (semantics), as it is formulated for a specific domain, is made explicit by these ontologies. This way, it is possible to automatically process the meaning of information, relate and integrate heterogeneous data, and deduce implicit information from existing information in an automated way. This idea has now been out there for twenty years! And it has gradually evolved into more mature approaches through its confluence with recent advances in other fields, such as machine learning on graphs.

“The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation.”

Tim Berners-Lee, James Hendler, Ora Lassila: The Semantic Web, Scientific American, 284(5), pp. 34–43 (2001)
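To give a tiny taste of what “deduce implicit information” means, here is a minimal sketch in Python using the rdflib library. The chemistry classes and the example.org namespace are invented for illustration; the pattern, though, is exactly the kind of subclass reasoning ontologies enable.

```python
from rdflib import Graph, Namespace, RDF, RDFS

EX = Namespace("http://example.org/chem#")  # illustrative, not a real ontology

g = Graph()
g.add((EX.CarbonDioxide, RDF.type, EX.GreenhouseGas))  # CO2 is a greenhouse gas
g.add((EX.GreenhouseGas, RDFS.subClassOf, EX.Gas))     # every greenhouse gas is a gas

# One RDFS-style rule: if ?x is of class ?c and ?c is a subclass of ?d,
# then ?x is also of class ?d.
for x, c in list(g.subject_objects(RDF.type)):
    for d in list(g.objects(c, RDFS.subClassOf)):
        g.add((x, RDF.type, d))

# The machine has deduced a fact nobody stated explicitly:
print((EX.CarbonDioxide, RDF.type, EX.Gas) in g)  # True
```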

If you still think the internet is a URL address with some HTTP to find it and some HTML so you can access it in a browser… that was 30 years ago. OK, now for the techy part, but we will focus on only one aspect this time.

http://bnode.org/blog/2009/07/08/the-semantic-web-not-a-piece-of-cake
Stuff being used for the Semantic Web, as described by Benjamin Nowack

Of course, there are techy things we could talk about, like the main difference between a URI and a URL, or what formats are out there, but there are tons of websites that explain them better than we could. There is one thing we do need to talk about, though, because we will use the concept many times: how do we store information? In a graph database that keeps the entities AND the relationships between them!

Triples and Graphs

To make use of the Semantic Web, we need to store stuff. Most of the time, data is stored either in a hierarchy (for example, XML) or in a relational database (for example, MySQL, MS SQL), but there is another way too: graph databases. These things have some advantages we love, and we strongly believe that we can benefit from them (see below), simply because planning and construction are too complex to be squeezed into columns and rows.

But first, let's talk about triples. I remember Mads Rasmussen speaking about triples back at the bSI Summit in London in 2017. As a recap, he explained how to use triples for the needs of an engineer, how to store them, and how to work with them as a kind of common data concept. The next presenter showed Excel again and talked about how complicated it was to get a consistent data set… No one started screaming: “Why the f``` didn’t you listen to what the guy just explained?!”

Triples have nodes and edges: one thing connected to another thing. Ready for our first triple? Coming back to our example of carbon dioxide, how could we represent simple facts? Let’s say: carbon dioxide is related to climate change.

subject: carbon dioxide — predicate: is subject — object: climate change

A triple: a primitive mini sentence that computers and people can read

With these mini sentences, a data graph is defined, and we can describe all kinds of things. Describing data as subject, predicate and object is the shift from storing data in relational or hierarchical models to storing it in graph models. Yeah, nice, but what do you do with it? And why should this be a benefit? Well, we focus here on two things: connectivity and speed.
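To make the idea concrete, here is a minimal sketch in plain Python: a graph is nothing more than a set of (subject, predicate, object) statements, and querying it is pattern matching over those statements. All the names are illustrative.

```python
# A "graph" as a plain set of (subject, predicate, object) triples.
triples = {
    ("carbon dioxide", "is subject of", "climate change"),
    ("cement production", "emits", "carbon dioxide"),
    ("climate change", "is related to", "greenhouse gases"),
}

# Querying is pattern matching: find everything touching "climate change".
for s, p, o in triples:
    if "climate change" in (s, o):
        print(s, p, o)
```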

Connectivity

There is a special kind of triple: RDF (the Resource Description Framework). RDF is one of the basic building blocks of the Semantic Web stack, and it is directly related to our so-called graph database. What makes RDF triples special is that EVERY PART of the triple has a URI associated with it. This way, our sentence “carbon dioxide is related to climate change” is connected to everything else on the subject of climate change.

URL, URI… hmm? Never mind:

“Just keep in mind a URL identifies what exists on the web, while a URI identifies on the web what exists.”

Prof. Dr. Harald Sack, FIZ Karlsruhe

When we use a common property and store that sentence using existing terms, we can express the very same thing as above in RDF like this:

<http://dbpedia.org/resource/carbon_dioxide> <http://purl.org/dc/terms/subject> <http://dbpedia.org/category/Climate_change> .

No, don't worry, you don't need to learn how to write like this; it can be done in a backend that you will never see. BUT by doing so, we connect our small sentence to a giant: the Semantic Web! This way, we get access to all those relations and metadata, just like Google does. So, for example, everything else that is related to Climate_change can now be found too. By linking into more domain-specific data (like cement), we can inherit everything related to cement and carbon dioxide. Connecting to these sources, we can get, for example, the greenhouse gas potential of cement, which is around 587 kg CO2 equivalents per ton… Remember the justified subset of all true beliefs? Wouldn't it be great to work together on a set of the very same justified, true beliefs? In planning, we tend to reinvent the wheel again and again. How about linking into existing data and making use of what has been discovered before?
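For the curious, here is roughly what that looks like in code, sketched with rdflib (an illustration, not the L-CDE backend). The triple above becomes three URIs in a graph that can be serialized and merged with any other RDF source using the same identifiers.

```python
from rdflib import Graph, URIRef

g = Graph()
# The exact triple from above: subject, predicate and object are all URIs.
g.add((
    URIRef("http://dbpedia.org/resource/carbon_dioxide"),
    URIRef("http://purl.org/dc/terms/subject"),
    URIRef("http://dbpedia.org/category/Climate_change"),
))

# Because every part is a global identifier, this one-triple graph can be
# merged with DBpedia, Wikidata or a project model that reuses the same URIs.
print(g.serialize(format="turtle"))
```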

Using triples, we can state that there is something called an area, and that it is 30 m².

OK, now that we have things described this way, we can do some astonishing things. In principle, we can create new BIM elements WITHOUT modelling. We can deduce new information from existing information. We can trace all attributes or properties back to their source, identify outdated calculations, and determine who is affected if things change. Oh yes, and we can split between public and private to protect intellectual property. But one thing after the other; let's start with creating things.

Benefit: Creation

If we have one thing, we can create others from it.

If we have one of these nodes, we can create new things by rules. For example, each area named “office” bigger than 30 m² needs three chairs, a desk and a light. Changing the size or usage will, of course, affect all those objects. One of the problems we always run into is that BIM is mainly about geometry and modelling, and many things we put in as graphics in BIM don't necessarily make sense there. Still, it would be nice to keep track of them or count them, right? With rules and data organised this way, we can create any object without modelling.
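Here is a sketch of how such a rule could look with rdflib and a SPARQL CONSTRUCT query. The ex: namespace and all property names (usage, area, requiresChairs, …) are invented for this example, not the actual L-CDE schema.

```python
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/building#")  # illustrative namespace

g = Graph()
g.bind("ex", EX)
g.add((EX.room1, RDF.type, EX.Room))
g.add((EX.room1, EX.usage, Literal("office")))
g.add((EX.room1, EX.area, Literal(35.0)))

# The rule: every office bigger than 30 m2 requires three chairs,
# a desk and a light.
rule = """
PREFIX ex: <http://example.org/building#>
CONSTRUCT {
    ?room ex:requiresChairs 3 ;
          ex:requiresDesk   1 ;
          ex:requiresLight  1 .
}
WHERE {
    ?room a ex:Room ;
          ex:usage "office" ;
          ex:area  ?area .
    FILTER (?area > 30)
}
"""

# The derived objects now exist in the graph, without any modelling.
for triple in g.query(rule):
    g.add(triple)
```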

Benefit: Inference

Besides new elements, we have other things that depend on each other. For example, moving a wall by a metre doesn’t affect anything, right…? Well, one room gets smaller and the other bigger, so how about things like heat demand?

“Who the hell changed the size of that room from 30 m² to 40 m² without telling anybody?” It doesn't clash, so it probably won't be noticed immediately. Did you ever try to trace back a project in which new models or new schedules were uploaded every week? Were you able to figure out: “This attribute was changed, by that party, on that day, because of this meeting”? Lucky you; most projects can't do this.

Changing CurrentState to a new value without losing the outdated information

What if we could have a history on each of the nodes (say, each attribute)? What if we just changed the status, kept all the old entries, and enhanced transparency this way? (Oh, and don't worry about too much data; nowadays we speak of trillions of triples.) Maybe we could also bring back a little bit of trust, and if we go crazy (do I really have to say it out loud?), we could easily put this on a blockchain. Fun fact: it is straightforward to keep track of changes if you handle data this way.
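A sketch of that pattern with rdflib (again, names like ex:, hasArea and currentState are invented for illustration): instead of overwriting a value, each change adds a new node and demotes the old one.

```python
from datetime import datetime, timezone
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/building#")  # illustrative namespace

def set_area(graph, room, value, author):
    """Record a new area value without deleting the old ones."""
    # Demote every previous value of this room to "outdated"...
    for old in list(graph.objects(room, EX.hasArea)):
        graph.set((old, EX.currentState, Literal("outdated")))
    # ...then attach the new value as a fresh node with its own metadata.
    node = EX["area_" + datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S%f")]
    graph.add((room, EX.hasArea, node))
    graph.add((node, RDF.type, EX.AreaValue))
    graph.add((node, EX.value, Literal(value)))
    graph.add((node, EX.changedBy, Literal(author)))
    graph.add((node, EX.currentState, Literal("current")))

g = Graph()
set_area(g, EX.room1, 30.0, "architect A")
set_area(g, EX.room1, 40.0, "architect B")  # the 30 m2 entry stays, marked "outdated"
```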

Benefit: Intellectual property

This is one of the things that always pops up. And how about bringing back trust and cooperation…? “Ah no, we are good, we do claim management, thanks a lot.” We need to share in order to work together, but of course, we don't need to share EVERYTHING. Size, colour, position: okay, but price? No way! So now you have your lovely BIM model with all the intelligence put into it; how do you share parts of it so things can be coordinated? Do you keep one model and delete stuff before sending, or do you maintain two, one for internal use and one for coordination? Or do you do the fancy work outside, in some other software?

By tagging things as “private”, we can filter what goes out and what stays within our firewall.

What if we could have, just like the status, a tag on each node we don't want to share, making it private this way? What if we had ONE data model to work on, which you could completely control while still working together with other parties?
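Here is a sketch of that filter with rdflib (the ex: namespace and the visibility tag are invented for illustration): before the graph leaves the firewall, every statement touching a node tagged as private is dropped.

```python
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/building#")  # illustrative namespace

def public_view(graph):
    """Return a copy of the graph without anything tagged as private."""
    private = set(graph.subjects(EX.visibility, Literal("private")))
    shared = Graph()
    for s, p, o in graph:
        # Keep only statements that neither start nor end at a private node.
        if s not in private and o not in private:
            shared.add((s, p, o))
    return shared

g = Graph()
g.add((EX.wall1, EX.colour, Literal("white")))        # fine to share
g.add((EX.wall1, EX.hasCost, EX.cost1))
g.add((EX.cost1, EX.value, Literal(1200)))
g.add((EX.cost1, EX.visibility, Literal("private")))  # the price stays in-house

coordination_model = public_view(g)  # ONE data model, filtered on the way out
```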

Benefit: Speed

First of all, if you use a graph, it performs very, very, very well, even on large datasets! So we could just stop here; there is a lot of information out there you can dig into. But for all who want to know more, we first need to explain the difference, OK?

The different concepts of databases (gif by giphy.com)

In a traditional database, each entry is a row in a table. If you need to make two or three SQL joins (table connections), that still works, but the more you add, the more it slows down. A graph database is much faster than a relational database at graph traversal because the way relationships between entities are stored is entirely different. In a graph database, relationships are stored at the individual record level and don't need a predefined structure, a.k.a. table definitions. Relational databases are fast when handling huge numbers of records whose structure is known ahead of time… Now think of our construction industry, where every day something new pops up that we didn't think of right from the beginning and want to add later… Sounds familiar, right?

Graph databases don’t have a predefined structure for the data, so each record can be examined individually during a query to determine the structure of the data. This way, it is easy to add nodes or links to other data sources. OK, enough tech talk. So, “How fast is fast?”

Well, in informatics, a simple friends-of-friends query is often used as a benchmark. With it, you try to find someone through the relationships they have with other people. For computers, this is kind of annoying, because they need to check every relation, and you know how the numbers explode if you put one grain of rice on the first square of the chessboard, then two on the second, and so on… And this is where the wheat is separated from the chaff.

Five relations and things got weird.

If it is just a friend-of-a-friend query, it takes a split second, so who cares. For friends of friends of friends, a graph database is around 180 times faster. And for depth-four queries, it is more than 1000 times faster… To be fair, this test was designed for a graph database, but it fits the needs of our industry exactly, because we face the very same problems. We have thousands of relations and dependencies across different sources, and that is why we believe the world is too complex to put into rows and columns, but can be beautifully described in graphs.
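For a feel of what such a traversal looks like as a query, here is a toy friends-of-friends example with rdflib and a SPARQL property path (the people and the ex: namespace are made up). In a graph store, this is a walk along edges rather than a chain of table joins.

```python
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/people#")  # illustrative namespace

g = Graph()
g.bind("ex", EX)
for a, b in [("anna", "ben"), ("ben", "carla"), ("carla", "dan"), ("dan", "eva")]:
    g.add((EX[a], EX.knows, EX[b]))

# Depth-four query: everyone reachable from anna in exactly four "knows" hops.
q = """
PREFIX ex: <http://example.org/people#>
SELECT ?friend WHERE {
    ex:anna ex:knows/ex:knows/ex:knows/ex:knows ?friend .
}
"""
for row in g.query(q):
    print(row.friend)  # -> http://example.org/people#eva
```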

What does this mean in real life? If you ever ran a model check on a big project, you could not only grab a coffee but order a pizza too before the checking was done, right? So we did a stress test with a million nodes to compute (equivalent to a 500 MB IFC file), and the response time (yes, for the nerds: it was a warm start) was 0.1 seconds!

Wait till you see the powers of graph, Jack-Jack (gif by giphy.com)

With this power, you can not only query a project; you can query a portfolio of all assets at once! And the best thing is: if we start to describe our projects as graphs, we can actually query a whole portfolio, because the very same logic sits behind all of them. Which query language that is, we will look into in one of the following articles.

So what do you think of a data-driven way to cooperate? A way that is based on open source and brings the power of graphs to the AEC industry, making construction open and connected?

We hope this article gave you some interesting insights. Questions are welcome; there is so much more we want to talk about…

The Amberg digital team


philipp dohmen
qaecy

Architect and strategist for information technology in construction. I love spreading ideas and innovations for a data-driven AEC industry.