Access the unstructured

philipp dohmen
qaecy
7 min read · Nov 9, 2021

In the last article, we looked into the general setup and how a data-driven approach could work. But maybe you are (like almost everyone) primarily working with documents. I have heard of organisations that do everything in SAP, but I have never met one. So if you are one of those, I guess you can stop reading. But if you're not, just think for a moment about that archive your organisation surely has and how to access it….

Paper is ok, but think of 12,053 file types. Photo by C M on Unsplash

Documents are great because we run software to do stuff and get a result that we can put somewhere. This way, we store the efforts of our work in documents as data, but from the viewpoint of a machine they carry little information.

The problem

Ok, it is not only one problem; there are many. Whether it is Word, Excel or some fancy BIM software, we have data in documents, but we don't own the data structure behind it. Just as a comparison: if you take an open standard like .ifc, you get both the data model and the data itself.

And speaking of documents, there are a lot of them. There are websites filled with nothing but lists of the document types that exist. To be fair, some are accessible and well structured, but one website counts no fewer than 12,053 file types.

And one more thing: it is tough to access the content within a document. To indicate what is in a file, we often use naming conventions like the one below that hint at what is hiding inside.

PR1-AMG-V2-01-M3-C-15:05:15-0001-SP1-P02

"This one says it's project one, provided by Amberg group working on volume, as a 3d model, discipline is civil engineer, it is suitable for coordination, and it is the second preliminary revisions… and guess what: that is individual in every project, right?"

Naming conventions can be complicated (gif by https://giphy.com/)

This is how a common CDE (Common Data Environment) organises documents as files. Anyone who has worked this way knows how crazy it is to upload such a thing and how much care that naming needs. Years ago, as a test, I uploaded empty files with the right names… no one noticed! And even when we do it properly, we just label the box; we don't really look into it.

Entity extraction

In a straightforward scenario, let’s suppose that we have thousands of PDF files and want to look up a specific word, like “Philipp”. That is an easy task, as we can simply use a find method that iteratively processes all the PDF files and detects the requested word instances, together with their frequency of occurrence.
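As a rough illustration, a small script along these lines could do the counting. It is only a sketch: it assumes the PDFs live in a local archive/ folder, uses the open-source pypdf library, and ignores scanned documents that contain no extractable text.

```python
from pathlib import Path
from pypdf import PdfReader

SEARCH_TERM = "Philipp"   # the word we are looking for
counts = {}               # file path -> number of occurrences

for pdf_path in Path("archive").glob("**/*.pdf"):
    reader = PdfReader(pdf_path)
    # Concatenate the text of all pages; scanned pages yield little or nothing.
    text = " ".join(page.extract_text() or "" for page in reader.pages)
    hits = text.count(SEARCH_TERM)
    if hits:
        counts[str(pdf_path)] = hits

# Print the files with the most hits first.
for path, hits in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{hits:4d}  {path}")
```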

But what if we want to look up all the people mentioned in the same files? This would require a certain level of machine intelligence that is able to understand that Philipp, John, Christina, Nan etc. are all people. The same goes for other entity types, like organisations, dates, places, or any categorisation one can come up with. The technology that tries to deal with these problems in the domain of Natural Language Processing (NLP) is called Named Entity Recognition or Entity Extraction.

Very simply, a learning algorithm is “trained” to identify the entity type of a word instance by encoding its context in the sentences in which it has appeared. Then, when the instance appears in a completely new sentence, the algorithm can determine the probability that this word (e.g. Philipp) is of this type (e.g. Person).

Entity Extraction: SOURCE
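To make this a bit more concrete, here is a minimal sketch using the open-source spaCy library and one of its small pretrained English models; the sentence is just an invented example.

```python
import spacy

# Requires the model to be installed first: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Philipp met Christina in Zurich on 9 November 2021 to discuss the tender.")

# Each detected entity carries a text span and a predicted type.
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
# Typical output: Philipp -> PERSON, Christina -> PERSON, Zurich -> GPE, 9 November 2021 -> DATE
```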

Thinking that such algorithms require enormous amounts of data can be demotivating. But that’s where Language Models (like BERT or GPT-x) come into play. These are huge pretrained neural networks, which are open source and freely available to download (…or at least most of them). One can fine-tune a Language Model towards a specific downstream task (like Entity Extraction, Question-Answering etc.) by using much smaller amounts of data, which are far more manageable to access and process.
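As a sketch of how low the entry barrier has become, the open-source Hugging Face transformers library wraps such a fine-tuned BERT-style model behind a one-liner; which model gets downloaded by default is an implementation detail and may change over time.

```python
from transformers import pipeline

# Downloads a transformer model that has already been fine-tuned for NER.
ner = pipeline("ner", aggregation_strategy="simple")

for entity in ner("Philipp is working at Amberg Group in Zurich."):
    # Each result contains the matched text, the entity type and a confidence score.
    print(entity["word"], entity["entity_group"], f"{entity['score']:.2f}")
```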

Entity Linking: SOURCE

And the nice thing is that it doesn’t stop here. You can extract relations between entities as well! For example, in the sentence “Philipp is working at Amberg Group”, it is not only possible to identify that Philipp is a Person and Amberg Group is an Organisation, but also that (Philipp: Person)-[WORKS_FOR]->(Amberg Group: Organisation). If this structure reminds you of a triple, it’s because it is one ;-) Remember the first article? By parsing multiple PDFs, we can extract thousands of those triples, which can be composed into a single Knowledge Graph that we can use to access information in a structured way. We could also go a step further by using Entity Linking to connect our extracted entities to external web-sourced information!
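Here is a minimal sketch of what such a triple could look like in code, using the open-source rdflib library; the namespace and the WORKS_FOR property are made up for illustration and would normally come from your schema.

```python
from rdflib import Graph, Namespace, Literal, RDF

EX = Namespace("https://example.org/kg/")   # hypothetical namespace for our graph

g = Graph()
g.bind("ex", EX)

# (Philipp: Person)-[WORKS_FOR]->(Amberg Group: Organisation)
g.add((EX.Philipp, RDF.type, EX.Person))
g.add((EX.AmbergGroup, RDF.type, EX.Organisation))
g.add((EX.Philipp, EX.WORKS_FOR, EX.AmbergGroup))
g.add((EX.AmbergGroup, EX.name, Literal("Amberg Group")))

print(g.serialize(format="turtle"))
```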

What does this BERT thing do? (gif by https://giphy.com/)

Now imagine you had an archive of PDFs that includes relevant, location-dependent information you want to use in a tender to calculate a new project at a given location! You could dig through thousands of documents, open folder after folder, try to understand whether a document is related to your current project, read each file and try to extract what interests you… Yeah, well, maybe there is a better way. Just four small steps:

Step 1: Docs often come in different qualities. Think of regular PDFs, scanned paper, or those corrupt PDFs. Humans can easily read all three kinds, but only the first is easily accessible to machines. Luckily, there are tools out there like OCR to make a scanned document or a corrupt PDF exported as a picture accessible. Additionally, there are open-source algorithms that help figure out the different layout areas (title, table, picture etc.) in a PDF file.
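As a small sketch of the OCR step, the snippet below rasterises a scanned PDF and runs Tesseract over each page; it assumes the open-source pdf2image and pytesseract packages (plus the underlying poppler and tesseract binaries) are installed, and the file name is just a placeholder.

```python
from pdf2image import convert_from_path
import pytesseract

# Render each page of the (scanned) PDF as an image, then OCR it.
pages = convert_from_path("scanned_report.pdf", dpi=300)

full_text = []
for page_number, image in enumerate(pages, start=1):
    text = pytesseract.image_to_string(image, lang="eng")
    full_text.append(text)
    print(f"Page {page_number}: {len(text)} characters recognised")

# The recognised text can now be fed into the entity extraction step.
document_text = "\n".join(full_text)
```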

Step 2: Now, let’s try entity extraction on this one. For soil mechanics, we are looking for a certain quantity, say X MN/m2. Computers don’t know what that means, but we can teach them to identify it by providing examples to fine-tune an Entity Extraction algorithm. We can also look for people or addresses and store it all in our database. Then, having defined a schema upfront, we can map each extracted entity to its schema equivalent and gradually build our graph (a small sketch follows the schema example below). How to build a simple schema probably deserves an article of its own. Just as a reminder, there are existing schemas and ontologies out there, like schema.org on a general level or brickschema.org for buildings and HVAC. But you can also edit or build an individual one using editors like Protégé or gra.fo.

Example of a simple schema
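Before investing in fine-tuning, a rule-based stand-in can already show the idea. The sketch below uses a plain regular expression to pull quantities like “30 MN/m2” out of the text and turns them into triples against the hypothetical namespace from before; a trained model would replace the regex, not the mapping.

```python
import re
from rdflib import Graph, Namespace, Literal, RDF

EX = Namespace("https://example.org/kg/")   # same hypothetical schema namespace as before

# Very naive pattern for values like "30 MN/m2" or "12.5 MN/m2".
QUANTITY = re.compile(r"(\d+(?:[.,]\d+)?)\s*MN/m2")

def extract_soil_moduli(text: str, doc_id: str, graph: Graph) -> None:
    """Map every matched quantity onto the schema as a SoilModulus entity."""
    for i, match in enumerate(QUANTITY.finditer(text)):
        value = float(match.group(1).replace(",", "."))
        node = EX[f"{doc_id}_modulus_{i}"]
        graph.add((node, RDF.type, EX.SoilModulus))
        graph.add((node, EX.value, Literal(value)))
        graph.add((node, EX.unit, Literal("MN/m2")))
        graph.add((node, EX.sourceDocument, EX[doc_id]))

g = Graph()
extract_soil_moduli("The plate load test gave 30 MN/m2 at 2 m depth.", "report_0001", g)
print(g.serialize(format="turtle"))
```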

Step 3: We take a second source to add information we didn’t have before. The same way we use Google Maps to look up an address, we can use it to give us longitude and latitude for that address. Given such a source, the Google Maps API returns coordinates that we again store in our database. By reverse-geocoding coordinates to municipalities, we can connect them to cities, cantons or countries!
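A minimal sketch of that lookup against the Google Maps Geocoding API; it assumes you have an API key of your own, and the address string is just an example.

```python
import requests

GOOGLE_API_KEY = "YOUR_API_KEY"   # placeholder, use your own key
address = "Bahnhofstrasse 1, Zurich, Switzerland"   # example address extracted from a document

response = requests.get(
    "https://maps.googleapis.com/maps/api/geocode/json",
    params={"address": address, "key": GOOGLE_API_KEY},
    timeout=10,
)
result = response.json()["results"][0]

location = result["geometry"]["location"]
print(location["lat"], location["lng"])   # coordinates to store in our database

# The same response also carries municipality, canton and country
# as "address_components", which is what the reverse mapping builds on.
```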

Step 4: Now we can display each value taken from thousands of files on a map, maybe put a fancy frontend on top, so people just need to hover over a map and get all the information related to that position, taken from our archives (a small sketch follows below). We can include it on our intranet page or in our BI tool if we want to. One simple window gives you access to valuable information taken from thousands of documents. 1+1=Many.

Information on a map taken from files
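A tiny sketch of such a map view using the open-source folium library; the coordinates and popup values are placeholders for what was extracted and geocoded in the previous steps.

```python
import folium

# Entities extracted from the archive, already geocoded (placeholder values).
records = [
    {"lat": 47.3769, "lng": 8.5417, "label": "report_0001: 30 MN/m2"},
    {"lat": 46.9480, "lng": 7.4474, "label": "report_0042: 12.5 MN/m2"},
]

m = folium.Map(location=[46.8, 8.2], zoom_start=8)   # centred on Switzerland
for record in records:
    folium.Marker(
        location=[record["lat"], record["lng"]],
        popup=record["label"],
    ).add_to(m)

m.save("archive_map.html")   # open in a browser or embed in the intranet page
```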

This is just one example of one specific use case. But what if we could access multiple sources of information, extract what is valuable and tailor a schema that expressively represents how things work in our organisation?

We could create a meaningful enterprise-wide Knowledge Graph. We could also expose parts of it to our collaborators or clients using Linked Data principles or simply GraphQL. We can gradually decouple our data from applications. We can make our data smarter by giving it context (1+1=Many). This approach makes it possible to run queries across different databases without transforming, translating, or interfacing our data. This way, our data becomes more meaningful, manageable and insightful across our application landscape. By linking to others, we can even query their data and add value to our own.
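To make the “query across” idea concrete, here is a minimal SPARQL sketch against the rdflib graph built above; it assumes we serialised that graph to a Turtle file, and the property names come from the hypothetical schema used earlier.

```python
from rdflib import Graph

g = Graph()
g.parse("knowledge_graph.ttl", format="turtle")   # the graph we built and saved earlier

query = """
PREFIX ex: <https://example.org/kg/>
SELECT ?doc ?value WHERE {
    ?modulus a ex:SoilModulus ;
             ex:value ?value ;
             ex:sourceDocument ?doc .
    FILTER(?value > 20)
}
"""

# Find every document that reports a soil modulus above 20 MN/m2.
for row in g.query(query):
    print(row.doc, row.value)
```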

Remember the problem described above? You get the data but not the data model? Think of it on a company level. Now you can create your company data model AND have all the data… This is one small step towards separating data from applications, towards becoming independent and future-proof!

We hope this article gave you some interesting insights. Questions are welcome; we will continue…


Architect and strategist for information technology in construction. I love spreading ideas and innovations for a data-driven AEC industry.