Creating Accurate AI: Coreference Resolution with FastCoref

11 min read · Oct 15, 2023

Introduction

Have you noticed that Large Language Models (LLMs) can output wrong information — even when you provide the written context? When I initially began creating chatbots, I noticed that the chatbots would sometimes say the opposite of the provided context. I also noticed that generative AI would sometimes wrongly conflate facts from two different contexts when creating news articles and blog posts — resulting in seemingly made up information. Thus, I began a quest with the lofty goal of making AI so accurate that it finally exceeds what humans are capable of — taking a giant step forward toward Artificial General Intelligence (AGI) and Artificial Superintelligence (ASI).

One essential component of building accurate chatbots is using a Natural Language Processing (NLP) technique called Coreference Resolution. In this article, you will learn:

What is coreference resolution?
What are the limitations common to all coreference resolution models?
What is FastCoref?
How to optimize FastCoref — overcoming the standard limitations of coreference resolution models.
How to deploy FastCoref as a REST API for use in creating chatbots and generative AI projects in any language.
A discussion on state-of-the-art coreference resolution.

Using coreference resolution is a quick way to dramatically improve the accuracy of your chatbots and other generative AI projects. Even if you are a non-Python programmer, you will learn to create a REST API to incorporate state-of-the-art coreference resolution into your AI projects using any programming language of your choice. So let’s get started!

AI Hallucinations Due to Losing Track of Context

One large problem with AI language models is context. While the models are trying to predict the next appropriate word, it’s easy for them to lose track of the current context. And once the model loses track of context, it starts to spit out what appears to be “hallucinations”—utterly false statements written as if they are true.

One major reason the model loses context is the nature of language itself. Consider pronouns as one example. A text may mention a person such as George Washington. Afterwards, the text can continue referring to Mr. Washington as ‘he’ or ‘him.’ If these pronouns are used in a very lengthy text, Large Language Models (LLMs) such as ChatGPT may eventually lose track of who ‘he’ and ‘him’ refer to. When this happens, the model will invent a ‘he’ or ‘him’ to generate intelligent-sounding output, but it won’t be the correct ‘he’ or ‘him.’

Pronouns aren’t the only challenge. Consider another text regarding the movie Jaws. This text may repeatedly refer to Jaws via the phrase ‘the movie.’ Once again, in a lengthy text, the model may lose track of what ‘the movie’ is referring to.

Synonyms present yet another complication. Consider a text that begins: “Yesterday, I saw the movie Jaws.” However, the text might later refer to Jaws as a ‘flick,’ a synonym for movie. When the text uses the term ‘the flick’ later in the document, the model may lose track of the fact that this term is a reference to ‘the movie Jaws.’

Now for the final issue, one that’s paramount for achieving AI results beyond human capability. LLMs can only process a certain number of tokens per exchange. In other words, the combined question and answer cannot exceed a given length. To accommodate this limitation, large texts are often split up into chunks. As a result, LLMs often have access to only some of the chunks of the document, not the entire text, when generating their response.
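To make the chunking step concrete, here is a minimal sketch. It assumes whitespace-separated words are a good enough stand-in for tokens (real LLM tokenizers count differently), and the function name and sample text are made up for illustration.

```python
from typing import List

def split_into_chunks(text: str, max_words: int) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]

# Made-up sample document; only the first sentence names the movie.
document = (
    "Yesterday I saw the movie Jaws. "
    "The flick was incredible. "
    "Roger Ebert thinks the movie is great."
)

chunks = split_into_chunks(document, max_words=6)
for number, chunk in enumerate(chunks, start=1):
    print(number, chunk)
```

Notice that only the first chunk contains the word ‘Jaws.’ Every later chunk has lost the name of the movie it is talking about.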

Now consider the following all-too-common scenario:

A large document that must be split into five chunks: Chunk A, Chunk B, Chunk C, Chunk D, and Chunk E.
The title of the movie Jaws is only contained in Chunk A.
Chunks B through E refer to Jaws as ‘the movie,’ ‘the flick,’ and so on.
Chunk E discusses Roger Ebert’s review of ‘the flick.’
A chatbot must answer: “What did Roger Ebert think of the movie Jaws?”

In such a case, Chunk E might not even be discovered as having information since the word ‘Jaws’ is not contained within it. That’s one problem. However, even if Chunk E is chosen to be sent to the LLM because of the phrase ‘Roger Ebert thinks,’ there’s still the problem that the word ‘Jaws’ is not contained in Chunk E. Therefore, the LLM might say something like “the provided context does not say what Roger Ebert thinks of Jaws.”
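The retrieval half of this problem can be sketched in a few lines. This is a simplification under stated assumptions: it uses plain keyword matching (many real systems use embeddings instead), and the chunk texts and the find_chunks helper are hypothetical.

```python
from typing import Dict, List

def find_chunks(chunks: Dict[str, str], keyword: str) -> List[str]:
    """Return the names of the chunks whose text contains the keyword."""
    return [
        name for name, text in chunks.items()
        if keyword.lower() in text.lower()
    ]

# Hypothetical chunks of one document, before coreference resolution.
original = {
    "A": "Yesterday I saw the movie Jaws.",
    "E": "Roger Ebert thinks the flick is great.",
}

# The same chunks after each reference is replaced with the entity itself.
resolved = {
    "A": "Yesterday I saw the movie Jaws.",
    "E": "Roger Ebert thinks the movie Jaws is great.",
}

print(find_chunks(original, "Jaws"))  # ['A']
print(find_chunks(resolved, "Jaws"))  # ['A', 'E']
```

With the original chunks, a search for ‘Jaws’ never reaches Chunk E. Once the references are resolved, Chunk E is found and it names the movie explicitly.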

Even worse, consider the existence of other chunked documents that contain ambiguous references to other movies. Some of these chunks also contain the phrase ‘Roger Ebert thinks.’ But these chunks refer to different movies altogether. For example, maybe Roger Ebert loved the movie Jaws, but abhorred the movie We Bought a Zoo (sorry Matt Damon). If the ambiguous chunk regarding We Bought a Zoo is sent to the LLM, then the LLM might infer from the question that the provided context is about Jaws. Therefore, the LLM will incorrectly write that Roger Ebert detested Jaws. Moreover, it could write that Ebert detested the scene in Jaws where the family tried to renovate a countryside zoo.

Of course that’s not the plot of Jaws. But I’ve personally seen ChatGPT make this exact type of error. ChatGPT makes assumptions regarding the topic based on the words in the question itself. Any ambiguity in the provided context can wrongly be conflated with the topic of the prompt itself.

One very fast way to reduce the occurrence of this error in your AI projects is to use a Natural Language Processing (NLP) technique called coreference resolution.

What is Coreference Resolution?

Coreference resolution is the task of finding all linguistic expressions (called mentions) in a given text that refer to the same entity. In practical terms, it refers to replacing the ambiguous references with the identity of the entity itself. For example:

Before: Review by Michael Wood. Yesterday I saw the movie Jaws. It was incredible. The movie left a lasting impression on me.
After: Review by Michael Wood. Yesterday Michael Wood saw the movie Jaws. The movie Jaws was incredible. The movie Jaws left a lasting impression on Michael Wood.

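In the After version, the explicit names (‘Michael Wood’ and ‘the movie Jaws’) were inserted by the coreference resolution process in place of ‘I,’ ‘It,’ ‘The movie,’ and ‘me.’ Notice how some ambiguous sentences are even converted into self-standing facts. For example: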

Ambiguous: The movie left a lasting impression on me.
Self-standing fact: The movie Jaws left a lasting impression on Michael Wood.

Coreference resolution is one technique for carrying context forward throughout the document. This context is even carried into the chunks that are created when the document is split apart due to token limitations.
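To show the mechanics of the rewrite, here is a minimal sketch that works on hand-written clusters instead of a real model. It assumes each cluster is a list of (start, end) character offsets and that the first mention in each cluster is the entity's name. Both the resolve function and the cluster data are illustrative, not the output of any library.

```python
from typing import List, Tuple

Cluster = List[Tuple[int, int]]  # (start, end) character offsets, end exclusive

def resolve(text: str, clusters: List[Cluster]) -> str:
    """Replace every later mention in a cluster with the cluster's first mention."""
    replacements = []
    for cluster in clusters:
        head_start, head_end = cluster[0]
        head = text[head_start:head_end]
        for start, end in cluster[1:]:
            replacements.append((start, end, head))
    # Apply the replacements from right to left so earlier offsets stay valid.
    for start, end, head in sorted(replacements, reverse=True):
        text = text[:start] + head + text[end:]
    return text

text = "Yesterday I saw the movie Jaws. It was incredible."

# Hand-written cluster: "the movie Jaws" and "It" refer to the same entity.
head = "the movie Jaws"
head_at = text.index(head)
it_at = text.index("It")
clusters = [[(head_at, head_at + len(head)), (it_at, it_at + len("It"))]]

print(resolve(text, clusters))
# Yesterday I saw the movie Jaws. the movie Jaws was incredible.
```

This naive version does not fix capitalization or possessives, and it blindly trusts the first mention in each cluster. Those are the kinds of details a real implementation has to handle.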

Common Coreference Resolution Limitations

As stated above, coreference resolution involves carrying the context forward. But what happens when the pronoun is used first? In other words, what happens when the pronoun is used before the entity itself is named? This is known as the cataphora problem.

Many coreference resolution models cannot detect when this occurs. Other coreference resolution models, such as Allen NLP, recognize the occurrence but often handle it wrongly, assigning the wrong entity and making the problem even worse. After all, the only thing worse than an ambiguous reference is using the wrong explicit reference. And that’s what Allen NLP too often does.
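The following sketch illustrates why cataphora trips up a naive rewrite. It assumes the mentions of one entity are already grouped and hand-labeled with coarse part-of-speech tags. The two helper functions are hypothetical, although the second follows the same idea as the get_cluster_head function shown later in this article: prefer the first mention that contains a noun.

```python
from typing import List, Optional, Tuple

Mention = Tuple[str, str]  # (mention text, coarse part-of-speech tag)

NOUN_TAGS = ("NOUN", "PROPN")

def first_mention_head(cluster: List[Mention]) -> str:
    """Naive choice: treat whichever mention appears first as the entity's name."""
    return cluster[0][0]

def first_noun_head(cluster: List[Mention]) -> Optional[str]:
    """Safer choice: use the first mention tagged as a noun or proper noun."""
    for text, pos in cluster:
        if pos in NOUN_TAGS:
            return text
    return None  # no noun in the cluster, so its mentions should be left alone

# Hand-labeled mentions, in document order, for the made-up sentence:
# "Before he became president, George Washington led the army. People admired him."
cluster = [("he", "PRON"), ("George Washington", "PROPN"), ("him", "PRON")]

print(first_mention_head(cluster))  # he
print(first_noun_head(cluster))     # George Washington
```

The naive choice would replace every mention with ‘he,’ which resolves nothing. Picking the first noun mention gives the rewrite an actual name to carry forward, and backward.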

There are other common issues as well. Fortunately, Neurosys wrote elegant Python code for resolving the common issues. Their technique is a post-processing algorithm that corrects the Allen NLP output.

However, Allen NLP’s coreference resolution is extremely slow. Too slow to be used in many production projects. Nevertheless, if you are using Allen NLP, you can add the aforementioned Neurosys code to greatly improve it.

The Neurosys post-processing fix is as follows (don’t worry if you don’t understand the code yet):

from typing import List

from spacy.tokens import Doc

def get_span_noun_indices(doc: Doc, cluster: List[List[int]]) -> List[int]:
    spans = [doc[span[0]:span[1]+1] for span in cluster]
    spans_pos = [[token.pos_ for token in span] for span in spans]
    span_noun_indices = [i for i, span_pos in enumerate(spans_pos)
        if any(pos in span_pos for pos in ['NOUN', 'PROPN'])]
    return span_noun_indices

def get_cluster_head(doc: Doc, cluster: List[List[int]], noun_indices: List[int]):
    head_idx = noun_indices[0]
    head_start, head_end = cluster[head_idx]
    head_span = doc[head_start:head_end+1]
    return head_span, [head_start, head_end]

def is_containing_other_spans(span: List[int], all_spans: List[List[int]]):
    return any([s[0] >= span[0] and s[1] <= span[1] and s != span for s in all_spans])

def core_logic_part(document, coref, resolved, mention_span):
    # NOTE: improved_replace_corefs calls this helper, but the listing does not define it.
    # This version is a reconstruction that follows the replacement logic of Allen NLP's
    # replace_corefs, so check it against the Neurosys source before relying on it.
    final_token = document[coref[1]]
    if final_token.tag_ in ["PRP$", "POS"]:
        # possessive mention (e.g. "his"): keep the possessive form
        resolved[coref[0]] = mention_span.text + "'s" + final_token.whitespace_
    else:
        resolved[coref[0]] = mention_span.text + final_token.whitespace_
    # blank out the remaining tokens of the replaced mention
    for i in range(coref[0] + 1, coref[1] + 1):
        resolved[i] = ""
    return resolved

def improved_replace_corefs(document, clusters):
    resolved = list(tok.text_with_ws for tok in document)
    all_spans = [span for cluster in clusters for span in cluster]  # flattened list of all spans

    for cluster in clusters:
        noun_indices = get_span_noun_indices(document, cluster)

        if noun_indices:
            mention_span, mention = get_cluster_head(document, cluster, noun_indices)

            for coref in cluster:
                if coref != mention and not is_containing_other_spans(coref, all_spans):
                    core_logic_part(document, coref, resolved, mention_span)

    return "".join(resolved)

Note: For non-Python programmers, below is a link to an install script that automatically creates a REST API endpoint. No Python skills needed.

In short, the last function, improved_replace_corefs, makes use of the helper functions above it. You call it with the spaCy-processed “document” as well as the coreference resolution “clusters” output from Allen NLP. It then rewrites the document, replacing the ambiguous references with the identity of the entity itself. (See above for an example of before and after.) The modified text is returned as a single string.

If you are a Python programmer already using Allen NLP in your project then the above code should prove very helpful. However, if you are using FastCoref then the above code needs to be wrapped because FastCoref uses a different annotation for its coreference resolution clusters. The full code and full explanation (for non-Python programmers) is provided below.
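To illustrate the difference between the two annotations, here is a small sketch. Judging from the conversion code in this article, FastCoref reports each mention as character offsets, while the Allen NLP style code expects inclusive token indices. The regex tokenizer below is only a stand-in for spaCy, and both function names are hypothetical.

```python
import re
from typing import List, Optional, Tuple

Token = Tuple[str, int, int]  # (token text, start_char, end_char)

def tokenize_with_offsets(text: str) -> List[Token]:
    """Split text into words and punctuation marks, keeping character offsets."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]

def char_span_to_token_span(tokens: List[Token], start_char: int,
                            end_char: int) -> Optional[List[int]]:
    """Map a character span (end exclusive) to [first_token, last_token] (inclusive)."""
    covered = [i for i, (_, start, end) in enumerate(tokens)
               if start >= start_char and end <= end_char]
    if not covered:
        return None  # the span does not cover any whole token
    return [covered[0], covered[-1]]

text = "Yesterday I saw the movie Jaws. It was incredible."
tokens = tokenize_with_offsets(text)

mention = "the movie Jaws"
start = text.index(mention)
print((start, start + len(mention)))                                    # (16, 30)
print(char_span_to_token_span(tokens, start, start + len(mention)))    # [3, 5]
```

The same mention is (16, 30) in character offsets and [3, 5] in token indices. The wrapper functions later in this article perform this conversion with spaCy itself.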

Introducing F-Coref (aka FastCoref)

FastCoref lives up to its name. It’s an extremely fast coreference resolution library. In fact, FastCoref is 29 times faster than Allen NLP. Yet, despite its speed, it’s also almost as accurate as Allen NLP.

FastCoref achieves accurate coreference resolution at a speed often needed in customer-facing, production applications. However, FastCoref has the same limitations as Allen NLP.

FastCoref Optimization

In my pursuit of accurate AI, I’ve been working on a project called NLPKit. I chose FastCoref as the initial coreference resolution library. Then I ported the Neurosys Allen NLP solutions to FastCoref.

The code (without the functions) is as follows:

# import packages
import spacy
from spacy.tokens import Doc, Span
from fastcoref import FCoref

# instantiate nlp and model objects
nlp = spacy.load('en_core_web_sm')
model = FCoref()

# sample text to transform
text = "We want to take our code and create a game. Let's remind ourselves how to do that."

# rewrite the text with optimized FastCoref in three simple steps
doc = nlp(text) # Step 1: Apply Spacy NLP model to create the doc
clusters = get_fastcoref_clusters(doc, text) # Step 2: pass the Spacy doc and the text itself to get the FastCoref clusters AND convert them to the same annotation as AllenNLP
coref_text = improved_replace_corefs(doc, clusters) # Step 3: pass the doc and the converted clusters to the NeuroSys function (provided above)

It’s just that easy to rewrite text with optimized FastCoref coreference resolution.

Here are the remaining functions needed:

def get_fast_cluster_spans(doc, clusters):
    # convert FastCoref character offsets into inclusive token indices (the Allen NLP annotation)
    fast_clusters = []
    for cluster in clusters:
        new_group = []
        for start, end in cluster:
            span = doc.char_span(start, end)
            if span is None:
                # char_span returns None when the offsets don't line up with token boundaries
                continue
            new_group.append([span.start, span.end - 1])
        fast_clusters.append(new_group)
    return fast_clusters

def get_fastcoref_clusters(doc, text):
    preds = model.predict(texts=[text])
    fast_clusters = preds[0].get_clusters(as_strings=False)
    fast_cluster_spans = get_fast_cluster_spans(doc, fast_clusters)
    return fast_cluster_spans

Deploying Optimized FastCoref as a REST Endpoint

Python’s Flask library makes it easy to create a REST endpoint. The code is as follows:

# import packages
from flask import Flask, request, jsonify

# instantiate app object
app = Flask(__name__)

# define the route
@app.route('/coreference', methods=['POST'])
def coreference():
    content = request.get_json() # get the post body as 'content'
    text = content["text"] # get the text (the value of the posted 'text' key)
    doc = nlp(text) # step one (see above)
    clusters = get_fastcoref_clusters(doc, text) # step two (see above)
    coref_text = improved_replace_corefs(doc, clusters) # step three (see above)
    return jsonify(coref_text) # send the rewritten text back

# launch the endpoint on port 5005 (or any other port of your choosing)
if __name__ == '__main__':
    print('running the app')
    app.run(host='0.0.0.0', port=5005)

Naturally there needs to be error checking in the REST endpoint to make it robust for production. The accompanying GitHub repo for this article contains a more robust implementation of the REST API along with step-by-step instructions on how to implement it on an Ubuntu 22.04 server.
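As a rough idea of what that error checking might look like, here is a minimal sketch of input validation. It is not the implementation from the repo. The extract_text helper and its error messages are assumptions made for illustration.

```python
from typing import Any, Optional, Tuple

def extract_text(content: Any) -> Tuple[Optional[str], Optional[str]]:
    """Validate the posted JSON body and return (text, error).

    Exactly one of the two returned values is None.
    """
    if not isinstance(content, dict):
        return None, "Request body must be a JSON object."
    text = content.get("text")
    if not isinstance(text, str) or not text.strip():
        return None, "The 'text' field must be a non-empty string."
    return text, None

print(extract_text({"text": "He left."}))  # ('He left.', None)
print(extract_text({"txt": "He left."}))
print(extract_text(None))
```

Inside the route, this could be called as text, error = extract_text(request.get_json(silent=True)), returning jsonify({"error": error}) with a 400 status code whenever error is set.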

Step-by-Step Installation Instructions for Non-Python Programmers

The following installation instructions don’t require any Python knowledge. The Readme file also contains a link to a video showing how to set up and use the REST API.
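Once the endpoint is running, any language that can send an HTTP POST can use it. Here is a minimal client sketch using only Python's standard library. It assumes the server from the Flask code above is reachable at localhost on port 5005. The function names are hypothetical, and the same request can be reproduced in any other language.

```python
import json
import urllib.request

DEFAULT_URL = "http://localhost:5005/coreference"  # assumed host; port from the Flask code

def build_request(text: str, url: str = DEFAULT_URL) -> urllib.request.Request:
    """Build the POST request: a JSON body with a single 'text' key."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def resolve_remote(text: str, url: str = DEFAULT_URL) -> str:
    """Send the text to the endpoint and return the rewritten text."""
    with urllib.request.urlopen(build_request(text, url), timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))

# Example call (requires the server to be running):
# print(resolve_remote("Yesterday I saw the movie Jaws. It was incredible."))
```

The Content-Type header matters here: Flask's request.get_json() only parses the body when the request declares itself as JSON.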

State-of-the-Art Coreference Resolution

Coreference resolution accuracy is often measured against the OntoNotes corpus. F1 scores for state-of-the-art coreference resolution packages include:

Allen NLP: 79.6%
F-Coref: 78.5% (29 times faster than Allen NLP)
LingMess: 81.4% (twice as fast as Allen NLP)
SpanBert: 79.6% (used in Spark NLP)

Given that LingMess is twice as fast as Allen NLP, and more accurate as well, LingMess represents the state-of-the-art for coreference resolution. Therefore, I will be adapting the Neurosys post-processing code for use with LingMess as well. I’ll include the full code in a future article.
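For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The sketch below shows the arithmetic only. The precision and recall values are made up for illustration and are not taken from any of the packages above. Published OntoNotes scores are also typically an average of F1 over several coreference metrics, which this sketch does not attempt to reproduce.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical values, chosen only to show the calculation.
precision = 0.80
recall = 0.77
print(round(f1_score(precision, recall), 3))  # 0.785
```

Because it is a harmonic mean, F1 is pulled toward the lower of the two values, so a model cannot score well by being strong on precision alone or recall alone.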

I hope you enjoy using the FastCoref REST API for dramatically improving the accuracy of your AI projects.

Coreference Resolution Is Not a Panacea

While coreference resolution is an essential component of accurate AI, it is far from the only component needed.

First, coreference resolution is only one component needed to address the problem of AI hallucinations. Additional components are needed to fully eliminate hallucinations.

Second, hallucinations are only one issue in terms of AI accuracy. The top five issues that must be resolved are listed immediately below.

Accuracy: The Essence of Artificial General Intelligence (AGI) and Artificial Superintelligence (ASI)

With regard to generative AI, the top five accuracy issues are:

Hallucinations: Going beyond coreference resolution to finally eliminate this issue.
Misquotes: Quotes get altered during the text generation process, making them unusable.
Time Distortion: LLMs often write about past events as if they are still to come in the future (when the source text was written prior to the event itself).
Extraneous Data: Focusing on irrelevant parts of the provided context.
Input Structure Constraints: LLMs base their responses on both the content of the input prompts and the structure of the input prompts as well.

The first four are likely self-explanatory. However, the last one may need some clarification. The first time that I encountered this problem was creating a React/Node.js app that automatically transforms video interviews into publishable news articles. Naturally, the first step was to convert the video into a written transcript. However, when ChatGPT was asked to create a news article from the transcript, ChatGPT would often frame its response based on the input structure being a transcript. For example, it would output things such as:

“The host of the show, Karen Webster, sat down with John Smith to discuss…”
“John Smith said … Then he said…”
Etc.

In other words, even though ChatGPT was tasked with writing a news article, it could not escape the structure of the input. That’s because the prompt and the provided context are both converted into tokens and the output is generated from that combined token set.

Therefore, I spent many evenings and weekends exploring how to de-structure input to free LLMs from any implied structure, as well as finding solutions to the other four problems.

Update: 100% Accurate AI Finally Here (July 31, 2024)

Since writing this article, an exciting breakthrough has finally occurred. A full solution to eliminating hallucinations has been developed. The following video explains how to finally create chatbots with 100% accurate, hallucination-free responses.

The Road Ahead

I look forward to sharing more practical tips and tricks to improve the accuracy of your AI projects — sharing insights on resolving all five of the above issues. I also look forward to sharing how to create other REST API endpoints for a wide variety of AI, ML, and NLP tasks so that you can incorporate state-of-the-art AI methods in any language of your choosing.

Thank you for reading.