Micha Verhagen
SeMI Technologies
Published in
7 min read · Jun 14, 2021


A new use case showing Weaviate in action for the Royal Netherlands Standardization Institute (NEN)

Before you can offer a product or service in any country, you have to make sure that it complies with the rules and regulations that apply in that country. Standards ensure that products meet national and international requirements for quality, safety, and reliability.

Semantic search and Q&A through 34k complex standardization documents with Weaviate, in less than 50 milliseconds

One of the main challenges is finding your way through this large and complex collection of unstructured documents. There are many standards and standardization publications. Some of these standards are voluntary guidelines providing technical specifications, others are a mandatory requirement to comply with specific laws.

Generally, each country has its own standards organization that can help you navigate these regulations. In the Netherlands, that organization is NEN, the Royal Netherlands Standardization Institute. NEN manages over 34,000 standards, including international, European, and Dutch standards. “NEN is the knowledge network within the Netherlands for standards development and application at both national and international level.” People contact NEN for support for many reasons, ranging from questions on certification and enrolling in training to very detailed questions on specific topics.

Weaviate enables out-of-the-box semantic search

In this article, I will describe how Weaviate can be used to navigate this large collection of complex standardization documents. What makes Weaviate so suitable for exploring documents is that it uses vector indexing to represent the data. Vector indexing places documents, or parts of documents, in a vector space, which enables semantic exploration.

Take for example the data object that describes a dough mixer:

{ "data": "dough mixers are used separately or in a line in the food industry and shops (pastry-making, bakeries, confectionery, etc.) for manufacturing of dough by mixing flour, water, and other ingredients." }

Storing this data object in a traditional search engine means that, in order to retrieve it, you need to search for exact keywords such as “dough mixers” or “manufacturing of dough”. But what if you have vast amounts of data and you want to find this data object about dough mixers, but you search for “pasta machine”? A traditional search engine will not return any results, but a vector search engine like Weaviate will. By placing data objects in the vector space, we can explore the data objects that sit near “pasta” and “dough” in that space. Weaviate returns results that are semantically close to the query and intuitively correct.
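To make the idea of “nearby in the vector space” concrete, here is a minimal sketch in Python. The three-dimensional vectors are made up by hand purely for illustration (real embeddings come from a language model and have hundreds of dimensions), and the helper names are hypothetical. This is not how Weaviate is implemented internally; it only shows the principle of ranking by vector similarity instead of keyword overlap.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors (assumption): not produced by any real model.
toy_vectors = {
    "dough mixers": [0.9, 0.8, 0.1],
    "refrigerating systems": [0.1, 0.2, 0.9],
}

# Stands in for the embedding of the query "pasta machine" (assumption).
query_vector = [0.8, 0.9, 0.2]

def nearest(query, vectors):
    """Return the key whose vector is most similar to the query vector."""
    return max(vectors, key=lambda name: cosine_similarity(query, vectors[name]))

best_match = nearest(query_vector, toy_vectors)
print(best_match)
```

Although the query shares no keyword with “dough mixers”, its toy vector points in nearly the same direction, so that object ranks first.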

Storing the data in Weaviate

A standardization document typically includes a scope section, which indicates for which products the standard is relevant, and a set of technical requirements that we can represent in a Weaviate schema. Below is an example of the scope and one of the requirements from the standard: NEN-EN 453 (en) Food processing machinery — Dough mixers — Safety and hygiene requirements

Scope:

This European Standard specifies safety and hygiene requirements for the design and manufacture of dough mixers with rotating bowls of capacity greater than or equal to 5 L and less than or equal to 500 L.

These dough mixers are used separately or in a line in the food industry and shops (pastry-making, bakeries, confectionery, etc.) for manufacturing of dough by mixing flour, water and other ingredients. These machines can be fed by hand or mechanically.

These machines are sometimes used in other industries (e.g. pharmaceutical industry, chemical industry, printing), but hazards related to these uses are not dealt with in this standard.

This European Standard deals with all significant hazards, hazardous situations and events relevant to the transport, installation, adjustment, operation, cleaning, maintenance, dismantling, disassembling and scrapping of dough mixers, when they are used as intended and under the conditions of misuse which are reasonably foreseeable by the manufacturer (see Clause 4).

Requirement:

The distance between the frame and the outside wall of the bowl shall be at least 50 mm (b). The distance between the outside of the rim and the frame shall be at least 30 mm (a), and the height of the rim is less than or equal to 30 mm (c) (see Figure 3). The outside of the bowl shall be smooth.

We store the standard documents in a simple schema, where documents are stored in the “Document” class. Each document has Sections, which are used to represent the chapter structure of the document. And each Section in turn has Paragraphs, which are used to store the text in the documents. Below is a (simplified) representation of the way the documents are organized in Weaviate.
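As a rough sketch, the hierarchy described above could be written down as a Weaviate schema like the one below, expressed as a Python dictionary. Only the class names “Document” and “Paragraph” and the Paragraph property “text” are taken from this article. The class name “Section” and every other property name are assumptions made for illustration, and the small checker function is a hypothetical helper, not part of Weaviate.

```python
# Schema sketch: Document -> Section -> Paragraph.
# Property names other than Paragraph.text are hypothetical placeholders.
schema = {
    "classes": [
        {
            "class": "Document",
            "properties": [
                {"name": "title", "dataType": ["string"]},
                # Cross-reference: a Document points to its Sections.
                {"name": "hasSections", "dataType": ["Section"]},
            ],
        },
        {
            "class": "Section",
            "properties": [
                {"name": "heading", "dataType": ["string"]},
                # Cross-reference: a Section points to its Paragraphs.
                {"name": "hasParagraphs", "dataType": ["Paragraph"]},
            ],
        },
        {
            "class": "Paragraph",
            "properties": [
                # The text that gets vectorized and searched.
                {"name": "text", "dataType": ["text"]},
            ],
        },
    ]
}

PRIMITIVE_TYPES = {"string", "text", "int", "number", "boolean", "date"}

def dangling_references(schema):
    """List cross-references that point to a class the schema does not define."""
    class_names = {cls["class"] for cls in schema["classes"]}
    missing = []
    for cls in schema["classes"]:
        for prop in cls["properties"]:
            for data_type in prop["dataType"]:
                if data_type not in PRIMITIVE_TYPES and data_type not in class_names:
                    missing.append((cls["class"], prop["name"], data_type))
    return missing

print(dangling_references(schema))
```

Keeping the text at the Paragraph level means that a search returns a specific passage, and the cross-references let you walk back up to the section and the document it belongs to.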

Our test data set consisted of a wide variety of standards documents, ranging from medical equipment for human body measurements to explosion prevention in explosive atmospheres to refrigerating systems and heat pumps: quite a diverse set of standards.

Finding the right standards using Semantic Search

One of the challenges for NEN customers is to identify the standards that are relevant to the products they offer. We can solve this problem by identifying which paragraphs are semantically close to the concept we are exploring.

In the GraphQL query below we search for the concept “machines to make pasta” in the vector space containing all standard documents and retrieve the paragraphs that lie close to the position of that concept. How close a paragraph must be is set by the “certainty” parameter in the query, a score between 0 and 1 derived from the distance between the query vector and the result vector (higher means closer). Here it is set to 0.6, and we ask Weaviate to return just the first result with a certainty of at least 60%.
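A query of this kind can be sketched as follows. The snippet builds the GraphQL text in Python and wraps it in the JSON body that would be sent to Weaviate's GraphQL endpoint; it does not contact a server. The function name is hypothetical, and the choice of returned fields (the “text” property plus the certainty of each hit) is an assumption about what is useful to display.

```python
import json

def near_text_query(concept, certainty=0.6, limit=1):
    """Build a GraphQL Get query that searches Paragraph objects by concept."""
    # json.dumps gives a double-quoted, escaped list literal that GraphQL accepts.
    concepts = json.dumps([concept])
    return (
        "{ Get { Paragraph("
        f"nearText: {{concepts: {concepts}, certainty: {certainty}}}, "
        f"limit: {limit}"
        ") { text _additional { certainty } } } }"
    )

query = near_text_query("machines to make pasta")

# JSON body for an HTTP POST to the /v1/graphql endpoint of a Weaviate instance.
payload = json.dumps({"query": query})

print(query)
```

Raising the certainty threshold narrows the results to paragraphs that sit closer to the concept; lowering it casts a wider net.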

Suppose you are a manufacturer of equipment that is used in kitchens and you want to understand which standards apply to your equipment. These are the first results returned by Weaviate of the exploration of the concept “machines to make pasta”.

Traditional search engines would have failed to identify these standards, because the search terms used do not appear in the text of these results.

Question Answering (Q&A) on specific requirements

We can take the exploration of these standard documents one step further: suppose your question is not just to find the appropriate standard, but to also find a specific requirement within that standard. Weaviate has an optional module that can answer specific questions — assuming the answer is actually in the documents in Weaviate.

The optional question answering (Q&A) module for Weaviate uses BERT-related models for finding and extracting answers. This module can be used as a search filter in GraphQL Get{…} queries. The QnA-transformers module tries to find an answer in the data objects of the specified class — in our case, it tries to find an answer in the text property of the class Paragraph.

Assume, for example, that we are looking for a specific safety requirement for dough mixers, the kind of mixer you see in the (many) baking TV shows nowadays. We can first explore the landscape of all machinery that has to do with kitchen equipment. In the search below we explored the concept of “dinner”, a word that would not appear in the standards but that does have a semantic relation to the content of the documents:

  • NEN-EN 15774 Food processing machinery — Machines for processing fresh and filled pasta (tagliatelle, cannelloni, ravioli, tortellini, orecchiette and gnocchi) — Safety and hygiene requirements
  • NEN-EN 453 Food processing machinery — Dough mixers — Safety and hygiene requirements
  • NEN-EN 13621 Food processing machinery — Salad dryers — Safety and hygiene requirements

The specific safety requirement that we are looking for is the minimum distance between the wall of the bowl and the frame of the mixer. A GraphQL query to find the distance would look like this:
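The sketch below builds such a query in Python, using the “ask” search filter of the Q&A module on the “text” property of the Paragraph class. The exact wording of the question is an assumption made for illustration, as is the function name; the snippet only assembles the query text and does not contact a server.

```python
import json

def ask_query(question, properties=("text",), limit=1):
    """Build a GraphQL Get query that asks the Q&A module a question."""
    question_literal = json.dumps(question)
    properties_literal = json.dumps(list(properties))
    return (
        "{ Get { Paragraph("
        f"ask: {{question: {question_literal}, properties: {properties_literal}}}, "
        f"limit: {limit}"
        ") { text _additional { answer { result certainty hasAnswer } } } } }"
    )

# Hypothetical phrasing of the question (assumption).
question = "What is the minimum distance between the frame and the outside wall of the bowl?"
qna_query = ask_query(question)

print(qna_query)
```

The answer, when one is found, comes back under “_additional” next to the paragraph it was extracted from, so the result can always be traced to its source text.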

The abbreviated answer by Weaviate is given below. Note that Weaviate returns more information on the specific result than what is shown below, but for clarity, we just show a snippet. In the response by Weaviate, you can see that it is 76% certain that it found the correct answer: 50 mm. The answer is found in paragraph 5.2.3 of the standard document NEN-EN 453:2014 en, with the title: “Food processing machinery — Dough mixers — Safety and hygiene requirements”.
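Reading the answer out of such a response could look like the sketch below. The sample response is a hand-written stand-in for the shape of the data, not the actual output; only the answer (“50 mm”) and the certainty (0.76) are taken from the result described above. The function name and the “min_certainty” parameter are hypothetical.

```python
import json

# Hand-written stand-in for a Q&A response (assumption about the shape).
sample_response = json.loads("""
{
  "data": {
    "Get": {
      "Paragraph": [
        {
          "_additional": {
            "answer": {"hasAnswer": true, "result": "50 mm", "certainty": 0.76}
          }
        }
      ]
    }
  }
}
""")

def extract_answers(response, class_name="Paragraph", min_certainty=0.0):
    """Collect (answer, certainty) pairs from objects that contain an answer."""
    answers = []
    for obj in response["data"]["Get"][class_name]:
        answer = obj["_additional"]["answer"]
        if answer["hasAnswer"] and answer["certainty"] >= min_certainty:
            answers.append((answer["result"], answer["certainty"]))
    return answers

print(extract_answers(sample_response))
```

Filtering on a minimum certainty is one way to avoid showing users an answer that the model itself is not confident about.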

Conclusion

Semantic search makes exploring complex unstructured documents a lot easier because it does not rely on the exact matching of terms. You can discover documents, or parts of documents, that are literally and figuratively in the neighborhood of the concept that you are exploring.

So if you are looking for a way to offer your customers or your employees a better way to navigate large collections of unstructured documents, contact SeMI for more information. Weaviate is Open Source, so if you want to discover the power of semantic search firsthand, the documentation is a great place to start.

The Weaviate setup

For this setup, we have used:

  • Weaviate 1.4.0
  • Weaviate vectorizer module with Sentence Embedding Model for MS MARCO
  • Weaviate Q&A module with DistilBERT enabled
  • Google Cloud n1-standard-4 (4 vCPU, 15 GB memory) and NVIDIA Tesla T4 GPU (estimated costs are $286.85 per month)

Create your own Weaviate setup here.

Disclaimer

Because the standards contain proprietary data, we can’t open-source the dataset. But there are similar datasets available through our documentation.
