Locally hosted LLMs for entity and relationship extraction

Comparing the quality of responses generated by a self-hosted Mistral-7b against OpenAI's ChatGPT

Ana Areias · Published in Kineviz · Feb 6, 2024

A huge advantage of large language models (LLMs) is that we can efficiently extract entities and relationships from unstructured data (such as the text in PDFs). Once that’s done, we can visualize and work with the resulting live knowledge map directly in Kineviz SightXR.

OpenAI works well for this, but it’s not a good solution when you need the input data to remain private or confidential.

As an alternative, I looked into the performance of locally hosted, open-source solutions compared to OpenAI. After trying various LLMs, I found that Mistral:Instruct 7b can be made to work quite well.

The source data for comparison was the recent article, The Cases Against Trump: A Guide — The Atlantic.

It turns out that, compared to OpenAI GPT 4 and 3.5, Mistral:Instruct 7b does a good job of extracting the main relationships between the people, organizations, locations, and events mentioned in the article about Trump’s legal troubles, and of preserving the main points of the article. This is all the more impressive because we are using a 4-bit quantized version of the model running on a machine with 32 GB of RAM and an RTX 3060 GPU with 8 GB of VRAM.

Smaller chunks of the source text are necessary to ensure quality results, so generating the full response takes longer and requires a higher number of calls to the LLM. This might be an artifact of quantization. But because the calls are free, this shouldn’t be a problem as long as you are not time-constrained.

Process for testing a locally-hosted LLM

1. Define ground truth

The first step was to read the source article and select the main points that must be preserved in order to understand the article, along with the POLE (Person, Organization, Location, Event) entities that should be found in each of those points. I identified 8 important points that a complete knowledge map of the test article should contain.
The best-performing LLM will be the one that matches the most of this ‘ground truth’.
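
As a rough sketch of how that comparison can be scored (the point labels and matching logic here are hypothetical, just to illustrate the idea):

```python
# Hypothetical labels for the ground-truth points; the real list has 8 entries
# distilled from reading the article.
GROUND_TRUTH_POINTS = [
    "classified documents indictment",
    "2020 election interference charges",
]

def coverage(llm_output: str, ground_truth=GROUND_TRUTH_POINTS) -> float:
    """Fraction of ground-truth points mentioned anywhere in the LLM's output."""
    text = llm_output.lower()
    hits = sum(1 for point in ground_truth if point.lower() in text)
    return hits / len(ground_truth)
```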

2. Prepare the prompt and examples

I included some of the techniques that can be used to improve LLM performance. I provide the LLM with 10 examples of what the output must look like (“few-shot” prompting). I also ask the LLM to add an “explanation” field, which acts as a chain-of-thought-style prompt that makes the LLM justify its answer. Finally, I provide definitions for the entities.

Few-shot examples with chain-of-thought style “explanation” field
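
For illustration, one few-shot example in that shape might look like the snippet below; the field names and example text are my own sketch, not the exact schema used in the tests.

```python
import json

# A hypothetical few-shot example: the "explanation" field nudges the model
# to justify each extraction, chain-of-thought style.
example = {
    "text": "Special counsel Jack Smith filed charges in Washington, D.C.",
    "entities": [
        {"name": "Jack Smith", "type": "Person"},
        {"name": "Washington, D.C.", "type": "Location"},
    ],
    "relationships": [
        {"source": "Jack Smith", "target": "Washington, D.C.", "type": "FILED_CHARGES_IN"},
    ],
    "explanation": "Jack Smith is a person who filed charges; the filing took place in Washington, D.C.",
}

# Ten such examples are serialized and appended to the prompt.
few_shot_block = json.dumps(example, indent=2)
```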

I also tried two different prompts:

Prompt 1. I simply ask the LLM to provide the entities and the relationships.

Prompt 2. I ask that the LLM first consider the most salient points of the article, and taking those points into account, only then extract the entities and relationships. This worked better both for GPTs and Mistral.
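
Paraphrased, the two prompts differ only in the extra “salient points” instruction (this is a sketch of the wording, not the verbatim prompts):

```python
PROMPT_1 = (
    "Extract the people, organizations, locations and events mentioned in the "
    "text, and the relationships between them. Respond in the format shown in "
    "the examples."
)

PROMPT_2 = (
    "First, identify the most salient points of the text. Then, taking those "
    "points into account, extract the people, organizations, locations and "
    "events and the relationships between them. Respond in the format shown "
    "in the examples."
)
```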

3. Choose the LLMs in the running

The models compared in this first exploration are:

  • OpenAI GPT 4
  • OpenAI GPT 3.5 turbo 16k
  • Mistral:Instruct 7b quantized

A note on infrastructure

Ollama is an easy way to run open-source LLMs on consumer-grade hardware. By default, Ollama uses 4-bit quantization and, where possible, off-loads layers onto the GPU for faster processing.

LiteLLM makes it possible to call all of these LLMs using the OpenAI API format, so your code doesn’t have to change depending on which LLM provider you are using.
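
A minimal sketch of that setup, assuming Ollama is serving Mistral:Instruct on its default local port (the model tags and prompt text are illustrative):

```python
from litellm import completion

prompt = "Extract the POLE entities and relationships from the following text: ..."

# The same call shape works for OpenAI-hosted models and for the local model
# served by Ollama; only the model string (and endpoint) changes.
response = completion(
    model="ollama/mistral:instruct",      # or "gpt-4" / "gpt-3.5-turbo-16k"
    messages=[{"role": "user", "content": prompt}],
    api_base="http://localhost:11434",    # Ollama's default endpoint; omit for OpenAI models
)
print(response.choices[0].message.content)
```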

Results

Extracting structured information from unstructured text in order to generate a live knowledge map is quite workable with a locally hosted LLM. The resulting knowledge map visually highlights key points, entities, and their relationships, and as a visual guide, it helps in exploring complex connections in the information.

In comparing the performance of an open-source model like Mistral:Instruct vis-à-vis GPT 4 and 3.5, an important consideration is the context window: the maximum amount of text the model can consider at any one time when generating a response. This includes both the prompt provided by the user and the model’s generated text. The models I worked with have the following context window sizes:

  • GPT 4: 8,192 tokens [1]
  • GPT 3.5 turbo 16k: 16,385 tokens [2]
  • Mistral:Instruct 7b quantized: 8,192 tokens [3]

Using the tiktoken library and the Mistral:Instruct tokenizer hosted on HuggingFace, we can calculate how many tokens the full input (prompt + examples + article text) takes up. Both tokenizers count just over 4k tokens. This means that even for GPT 4 and Mistral:Instruct, feeding in the full text of my test article all at once is well within the context window: the LLM uses the first half of the 8k tokens for input and has the remaining half to generate the response.
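
The count itself is straightforward; here is a sketch using tiktoken for the GPT side and a HuggingFace tokenizer for the Mistral side (the exact Mistral checkpoint name is an assumption):

```python
import tiktoken
from transformers import AutoTokenizer

full_input = "..."  # prompt + few-shot examples + article text

# GPT-side token count.
gpt_tokens = len(tiktoken.encoding_for_model("gpt-4").encode(full_input))

# Mistral-side token count via the HuggingFace tokenizer (checkpoint assumed).
mistral_tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
mistral_tokens = len(mistral_tokenizer.encode(full_input))

# For the actual prompt + article, both counts come in just over 4k.
print(gpt_tokens, mistral_tokens)
```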

However, in practice I find that Mistral:Instruct fails to follow the formatting instructions when working with the full text, and only returns a properly formatted response when I separate the source text into multiple “chunks”. Following the format is important as it lets us properly parse the response for further downstream applications. I generated the best Mistral responses with 3 or 4 chunks. Again, this might be an artifact of the quantization.
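
A simple way to produce those chunks is to split on paragraph boundaries into roughly equal pieces (this is a sketch of the idea, not necessarily the exact splitting used):

```python
def split_into_chunks(text: str, n_chunks: int = 4) -> list[str]:
    """Split text into roughly n_chunks pieces, breaking on paragraph boundaries."""
    paragraphs = text.split("\n\n")
    target_size = max(1, len(text) // n_chunks)
    chunks, current = [], ""
    for paragraph in paragraphs:
        if current and len(current) + len(paragraph) > target_size:
            chunks.append(current)
            current = paragraph
        else:
            current = f"{current}\n\n{paragraph}" if current else paragraph
    if current:
        chunks.append(current)
    return chunks

# Each chunk is sent to the LLM separately and the extracted entities and
# relationships are merged afterwards.
```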

Also, the results are a tad more “noisy” than those generated by GPT 4 and 3.5. For example, sometimes Mistral:Instruct returns the same relationship more than once in a slightly different formulation. This will result in a more visually cluttered knowledge map.

Testing the two different prompts on the LLMs showed that in general, the second prompt performed better. In particular, Prompt 2 performed better in making sure the main points of the article were included in the knowledge map.

With Prompt 1, which simply asks for entities of the types people, organizations, locations, and events and the relationships between them, GPT 4 and 3.5 manage to identify almost all the points of the article and generate entities for them, but miss one and two points respectively. Again, Mistral:Instruct only manages to generate useful results when we chunk the text into smaller pieces for piecemeal consideration of the article. Splitting the text into four chunks gives good coverage but also a higher number of relationships (23 compared to the 13 returned by GPT 4), including many that refer to the same incident. So while the content is preserved, we end up with a more visually cluttered knowledge map. Including a second step that prunes relationships which are semantically very similar could be a solution for the noisier results returned by Mistral:Instruct.
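
One way to implement that pruning step is to embed each relationship string and drop near-duplicates; here is a sketch with sentence-transformers (the model choice and similarity threshold are assumptions, not part of the pipeline described above):

```python
from sentence_transformers import SentenceTransformer, util

def prune_similar(relationships: list[str], threshold: float = 0.85) -> list[str]:
    """Keep only relationships whose phrasing is not nearly identical to one already kept."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    kept, kept_embeddings = [], []
    for relationship in relationships:
        embedding = model.encode(relationship, convert_to_tensor=True)
        if all(util.cos_sim(embedding, e).item() < threshold for e in kept_embeddings):
            kept.append(relationship)
            kept_embeddings.append(embedding)
    return kept

# Near-duplicate phrasings of the same incident should collapse into one edge,
# leaving a less cluttered knowledge map.
```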

With Prompt 2, where I ask the LLM to first identify the most important points of the article and only then find the entities and their relationships, the LLMs perform better than with Prompt 1.

All three LLMs return very complete coverage (their result sets include all the salient points of the article), but Mistral:Instruct with 4 chunks returns 20 relationships, while GPT 4 returns 13 and GPT 3.5 returns 11.

The tables below show in detail the results for the two different prompts, models, and chunking strategies.

Prompt 1 Results

Prompt 2 Results

Conclusion

There are many reasons to choose an open source LLM instead of relying on the OpenAI API. The most pressing is the ability to self-host and maintain data privacy, but cost considerations also come into play. Generally, the trade-offs identified in tests of a locally hosted LLM seem acceptable for users working with confidential data.

Evaluating LLM output and performance is an active area of research. This first attempt to quantitatively measure the difference in performance between GPT 4, 3.5 and Mistral:Instruct, one of the open-source alternatives, indicates that further investigation and benchmarking is definitely worthwhile.
