Building a machine-learning model to recognise key contract clauses using SpaCy

Adam K. Hackländer
15 min read · Apr 13, 2023


This article provides an overview of building a machine-learning model to recognise contractual clauses using SpaCy, focusing on the concepts that guide the machine learning process.


What is SpaCy?

SpaCy is an open-source library designed to be an efficient, fast, and user-friendly way of processing natural (human) languages for machine-learning purposes. The branch of AI concerned with helping computers make sense of natural languages is called Natural Language Processing (NLP).

In coding, a library is conveniently packaged code that we can import into our own programs, meaning that SpaCy is not a standalone piece of software or a web application. The beauty of using such a library is that developers can save a lot of time by leveraging the pre-built models, algorithms, and functions it includes rather than having to build everything from scratch.
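As a small illustration of what importing and using the library looks like in practice, here is a minimal sketch, assuming the small English pipeline has been installed with python -m spacy download en_core_web_sm:

import spacy

# load a pre-trained English pipeline and run it on a sentence
nlp = spacy.load("en_core_web_sm")
doc = nlp("This Agreement shall be governed by the laws of the State of New York.")
print([(token.text, token.pos_) for token in doc])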

This article covers:

1) Finding data to train on

2) Annotating data manually

3) Weak-labelling using patterns

4) Training the model

5) Testing the model

1. Finding suitable contractual data

Machine Learning starts with data. A computer model has to be provided with a large amount of data from which it will derive patterns for analysis and prediction. However, it also has to be provided with a way to make sense of such data, since computers natively have no way of understanding the data we provide them.

Although the rise of machine learning across many industries has acted as a catalyst for properly structuring and managing data, acquiring a sufficient amount of data to enable accurate performance is still one of the biggest challenges in building AI models. This is especially the case with legal data, which is often confidential and not publicly available.

Where to find contractual data to train on?

One of the easiest ways to gather a substantial amount of relevant data is by searching for datasets online. Typically, data within a dataset is well-organized in a structured format, often in a table layout, which can make it easier for us to access and process the data further. There are several platforms where users can upload their datasets, such as Kaggle, providing a vast collection of free datasets on a variety of topics. Additionally, you can find many of the known legal datasets available in this GitHub repository.

For my project, I utilized the Contract Understanding Atticus Dataset (CUAD). This corpus consists of 510 commercial contracts in both .txt and .pdf formats, with each contract’s clauses accurately labeled and grouped into 41 categories such as Effective Date, Termination, and Governing Law. These labeled clauses were then saved in an Excel table.
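A quick way to get a feel for the labeled clauses is to load that table with pandas. The snippet below is a minimal sketch; the file name and column names are assumptions, so adjust them to however the CUAD clauses were exported:

import pandas as pd

# load the table of labeled clauses (hypothetical file name)
clauses = pd.read_excel("cuad_clauses.xlsx")
print(clauses.columns)

# peek at one clause category, e.g. Anti-assignment
print(clauses["Anti-assignment"].dropna().head())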

2. Labeling and annotating data for the model to train on

What is data annotation?

The next step is to annotate our contractual data. Data annotation is the process of labeling data to show the model the outcome of what we would like it to predict. For example, if we were to train a model to detect fruit within a text, we would first want to go over as many texts as possible and label them for any occurrence of a fruit.

The farmer picked a basket of fresh peaches from the orchard.

We know that the word ‘peaches’ refers to a fruit, and so we would annotate it as one. From this, the model can learn to associate the word “peaches” with other words and concepts that appear in close proximity to it and that are related to fruit, such as “juicy,” “sweet,” “orchard,” and so on. This allows the model to develop a richer understanding of language and make more accurate predictions and recommendations. I explain the training process in more detail below.

There are many annotation tools available online, both free and paid, in which we can highlight and annotate text with our chosen labels and then export it to a structured format from which a machine-learning model can learn.

https://tecoholic.github.io/ner-annotator/

Once we export the annotation into a JSON format, we see that it is structured as follows:

{
  "classes": [
    "FRUIT"
  ],
  "annotations": [
    [
      "The farmer picked a basket of fresh peaches from the orchard.",
      {
        "entities": [
          [
            36,
            43,
            "FRUIT"
          ]
        ]
      }
    ]
  ]
}

JSON, short for JavaScript Object Notation, is a lightweight data-interchange format that is easy for humans to read and write, and for machines to parse and generate. It is widely used in web applications for data transmission, and in various other contexts where a simple and flexible format for structured data is needed. It is also one of the ways data can be structured for machine learning purposes. In our example, note the numbers 36 and 43; they are the start and end character offsets of the word we labeled as FRUIT.
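We can check those offsets programmatically. The snippet below is a minimal sketch that assumes the export above has been saved as annotations.json:

import json
import spacy

nlp = spacy.blank("en")

with open("annotations.json", encoding="utf8") as f:
    data = json.load(f)

# each annotation is a pair of [text, {"entities": [[start, end, label], ...]}]
for text, annot in data["annotations"]:
    doc = nlp.make_doc(text)
    for start, end, label in annot["entities"]:
        span = doc.char_span(start, end, label=label)
        print(span.text, span.label_)  # -> peaches FRUIT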

3. Creating patterns to annotate our data


While manually annotating data is one of the most precise ways of producing training data, it is also the most time-consuming. With our fruit example, we could simply create a list of a couple of hundred examples of fruits (something that can easily be found online) and then write code to go over each fruit, find it within a text, and annotate it as ‘fruit’. This is called ‘weak labeling’ and can, in many instances, be the easiest way to obtain annotated data. With our use case, manually annotating hundreds or thousands of contracts would take months, and, therefore, I made use of weak-labeling patterns.

Because legal contracts are some of the most formulaic documents, it is not difficult to find frequently occurring patterns. Although this method can be less precise than manual annotation, it can still yield useful results and can be particularly effective when dealing with large datasets.

As an example, the Governing Law clause in contracts often contains the phrases shall be governed by or shall be governed in accordance with. This means that instead of going over every contract we have at hand and annotating every Governing Law clause, which would also require a lot of time reading the documents, a simple Python script can go over all sentences matching our patterns and annotate them accordingly.

To first confirm what the commonly occurring patterns are, I wrote the following code to go through all of our clauses structured in an Excel table provided by CUAD and find some of the common wordings for each clause. In the below example, I will be looking for the most common wordings from the ‘Anti-assignment’ clauses in our CUAD Excel table dataset.

import nltk
import pandas as pd

# read the clause text exported from the CUAD Excel table
with open("find.txt", "r", encoding="utf8") as textopen:
    text = textopen.read()

# count every n-gram that is between 4 and 15 words long
lengths = range(4, 16)
ngrams_count = {}
for n in lengths:
    ngrams = tuple(nltk.ngrams(text.split(" "), n=n))
    ngrams_count.update({" ".join(i): ngrams.count(i) for i in ngrams})

# sort the phrases by how often they occur and save the result
df = pd.DataFrame(list(zip(ngrams_count, ngrams_count.values())),
                  columns=["Ngram", "Count"]).sort_values(["Count"], ascending=False)
print(df)
df.to_csv("output.csv", index=False)

What this gives us is the most commonly used phrases that are between four and fifteen words long.

Below, we can see that the phrase ‘prior written consent’ has been used 247 times, but because the phrase could easily occur in other clauses as well, I selected some of the more specific wordings relevant to the Assignment clause, such as ‘may assign this agreement’, occurring 59 times.

Once I decide which patterns to use, I can use SpaCy’s SpanRuler to find phrases matching my patterns in text, annotate such phrases and convert them to a suitable training format. Some of the patterns used in this model are:

# assumes a pipeline has already been loaded, e.g. nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("span_ruler")
patterns = [{"label": "Governing Law", "pattern": "the laws of the"},
            {"label": "Governing Law", "pattern": "shall be governed by"},
            {"label": "Governing Law", "pattern": "governed in accordance with"},
            {"label": "Governing Law", "pattern": "in accordance with the laws of"},
            {"label": "Assignment", "pattern": "may be assigned"},
            {"label": "Assignment", "pattern": "may not be assigned"},
            {"label": "Assignment", "pattern": "shall not assign"},
            {"label": "Assignment", "pattern": "shall not be assigned"},
            {"label": "Assignment", "pattern": "shall not be assignable"},
            {"label": "Assignment", "pattern": "the right to assign"},
            {"label": "Assignment", "pattern": "no assignment"},
            {"label": "Pricing", "pattern": "calculated as follows"},
            {"label": "Pricing", "pattern": "the price shall be"},
            {"label": "Pricing", "pattern": "shall pay"},
            {"label": "Pricing", "pattern": "undertakes to pay"},
            {"label": "Notices", "pattern": "Notices under this"},
            {"label": "Notices", "pattern": "any notice required"},
            {"label": "Notices", "pattern": "any notice served"},
            {"label": "Notices", "pattern": "all notices provided"},
            {"label": "Term", "pattern": "shall commence on the Effective Date"},
            {"label": "Term", "pattern": "come into force on the date"},
            {"label": "Term", "pattern": "effective until terminated"},
            {"label": "Term", "pattern": "this agreement commences"},
            ...

ruler.add_patterns(patterns)

This, therefore, means that the code runs over all 510 contracts, matches any of the above patterns, and labels the matched text accordingly. As such, the phrase this agreement commences will be labeled as ‘Term’. In our fruit example above, this would all be done manually word by word.

However, with this method, only the exact phrase contained in the pattern will be annotated, and so the machine-learning model would view this agreement commences as a Term clause rather than the whole sentence it appears in. Therefore, I wrote the code below to extend the annotation after matching the phrase so that it covers the whole sentence.

 doc.spans["test"] = SpanGroup(doc)
db = DocBin()
for sentence in doc.sents:
for span in doc.spans["ruler"]:
if span.start >= sentence.start and span.end <= sentence.end:
doc.spans["test"] += [
Span(doc, start=sentence.start, end=sentence.end, label=span.label_)

]
doc.set_ents(entities=[span], default="unmodified")

After this, the whole sentence ‘This Governing Law Agreement shall be governed by and construed in accordance with the laws of the State of New York, without giving effect to New York conflict laws’, rather than just the phrase shall be governed by, will be labeled as ‘Governing Law’. This will give the model an idea of what the entirety of the sentence looks like.

More accurate patterns can be developed by incorporating additional linguistic information, such as dependency relations between words and part-of-speech tagging. These techniques can help identify not only the specific text that needs to be annotated but also the context in which it appears, providing a more accurate and contextualized annotation.
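As a rough sketch of what such a linguistically informed pattern could look like (a hypothetical example rather than one of the patterns actually used for this model), lemmas and part-of-speech tags can stand in for exact wording, so that variations such as is governed by and shall be governed in accordance with all match:

# assumes the span_ruler pipe from earlier is available as `ruler`
patterns = [{"label": "Governing Law", "pattern": [
    {"POS": "AUX", "OP": "*"},           # optional auxiliaries: "shall", "be", "is"
    {"LEMMA": "govern", "POS": "VERB"},  # "governed", "governs"
    {"LOWER": {"IN": ["by", "in"]}},
    {"LOWER": "accordance", "OP": "?"},
]}]
ruler.add_patterns(patterns)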

Let’s take a look at the pattern below, which is designed to find instances of “Effective Date” in a text.

patterns = [{"label": "Effective Date", "pattern": [
{"ENT_TYPE": "DATE", "OP": "{4,}"},
{"TEXT": '('},
{"TEXT": 'the'},
{"TEXT": '"'},
{"TEXT": 'Effective'},
{"TEXT": 'Date'},
{"TEXT": '"'},
{"TEXT": ')'}
]}]

The pattern is using the already-labeled ‘DATE’ entity from SpaCy’s pre-trained models that are able to recognise, for example, names, dates, and organization names. This means that we do not have to train the model to recognise dates, as this is one of the entities that a downloadable SpaCy model can recognise. It is often the case that one machine learning model is used to help annotate the data of another.

Therefore, with this pattern, we are able to find a date in the text that is four or more tokens (words) long (e.g. 20th of January 2024) and only annotate it if the phrase (the "Effective Date") follows right after it. This pattern makes sure not to annotate any other occurrences of Effective Date and/or a date.

This also illustrates that we could use some of the other pre-labeled entities (such as organization) to distinguish parties of a contract.

Using the DisplaCy visualisation tool, we then find that this pattern indeed correctly recognised and annotated the Effective Date as follows:
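A minimal sketch of how that visualisation can be produced in code (the example sentence is made up, and the snippet assumes the pattern above was added via the span_ruler pipe to a pipeline that can recognise DATE entities, such as en_core_web_sm):

from spacy import displacy

text = 'This Agreement is entered into on the 20th of January 2024 (the "Effective Date").'
doc = nlp(text)

# the span_ruler stores its matches in doc.spans["ruler"] by default
displacy.serve(doc, style="span", options={"spans_key": "ruler"})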

It is also important to note that with the advent of Large Language Models (LLMs) such as the renowned GPT-3.5 and now GPT-4, the need for manually collecting and processing data may soon become obsolete. In short, because LLMs are pre-trained on large amounts of data and are able to recognise a large range of language patterns and linguistic structures without human labeling assistance (so-called unsupervised learning), they are able to recognise and extract meaningful features from text. Therefore, a model like GPT could recognise legal clauses when given a prompt similar to “please find and label any clause discussing Termination”, which, in many cases, could be far more precise than using strict patterns.

Source: https://ljvmiranda921.github.io/notebook/2023/03/24/llm-annotation/

An interesting, more in-depth article discussing this concept further: How can language models augment the annotation process?

4. Training

After matching our patterns against all 510 contracts, annotating the data, and outputting the training data in the .spacy format required to train a SpaCy model, we can begin the training using the SpaCy library.

How does SpaCy work?


Computers do not understand natural languages but instead operate only on sequences of 1’s and 0’s called binary code (think of the 1’s and 0’s as on and off switches that allow or block the passing of electric current to perform logical functions; processors contain hundreds of millions of such switches). In the early days of computing, programmers wrote code directly in binary, but as computers became more powerful and tasks more complex, it was clear that using binary code to write programs was neither practical nor efficient. As a result, modern programming languages were developed to provide a layer of abstraction over the underlying binary code, allowing programmers to work with more human-readable code that is then translated (or compiled) into binary code. Therefore, SpaCy first converts natural-language text into numerical values so that it can be processed mathematically, and only then is that work translated into binary code for the computer to execute.

One example of such a processing method is GloVe — Global Vectors for Word Representation — a learning algorithm for obtaining vector representations for words that SpaCy used until recently. GloVe learns the meaning of words by looking at how often they appear together in a large corpus of text and represents each word as a vector of numbers that capture its meaning based on the other words it appears with. This allows GloVe to understand the relationships between words and use that knowledge to perform NLP tasks.

The example below compares pairs of male-female nouns. The closer the words are semantically (in meaning) to nobility the closer they appear to the top of the graph (y-axis), and the more they refer to a female title the closer they are to the right (x-axis). Note how the male titles of nobility tend to cluster around the upper right corner and note that the vector differences (the distances) between each pair are roughly equal.

In the pictures below, the distance between the pairs woman and king and man and queen is slightly bigger, and the two pairs are also oriented differently. This could, among other things, mean that the words woman and king have a low probability of occurring together in text compared to, for example, queen and empress.

This is all the result of statistically comparing words based on co-occurrence with other words, meaning that the words queen and empress both tend to co-occur with a common collection of other words (e.g. the pronouns she and her, nouns like royalty, and verbs like reign), and as such the distance between them is small. Based on the above graph, if we trained our model to recognise ‘male royalty titles’, it may then assign the word ‘king’ a confidence score of, let’s say, 0.94664 (meaning it is 94% confident that it is a ‘male royalty title’ in our two-dimensional space).

However, instead of just one property (one meaning) being represented for each word, SpaCy will represent each word in a 300-dimensional vector space. Think of dimensions as scales that, when put together, capture the meaning of a word. One scale could evaluate where a word falls within the formal-informal range, another whether it is big or small, another what the sentiment of the word is (positive-negative), and another 297 features that the model would try to extract from each word. As such, the word ‘king’ would be represented by 300 confidence scores that together suggest a meaning of the word within the sentence. The first 26 dimensions for the word ‘king’ in the Google News vector dataset, trained on about 100 billion words, are represented as follows:


KING [ 1.25976562e-01 2.97851562e-02 8.60595703e-03 1.39648438e-01
-2.56347656e-02 -3.61328125e-02 1.11816406e-01 -1.98242188e-01
5.12695312e-02 3.63281250e-01 -2.42187500e-01 -3.02734375e-01
-1.77734375e-01 -2.49023438e-02 -1.67968750e-01 -1.69921875e-01
3.46679688e-02 5.21850586e-03 4.63867188e-02 1.28906250e-01
1.36718750e-01 ...]

Unfortunately, there is no way for us to tell what each of these dimensions refers to (such as the big-small or king-queen concepts I highlighted above). However, they all capture unique correlations in word statistics, which then helps the model make sense of words.
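In SpaCy, these vectors can be inspected and compared directly. A minimal sketch, assuming the en_core_web_md pipeline (which ships with 300-dimensional word vectors) is installed:

import spacy

nlp = spacy.load("en_core_web_md")
king, queen, orchard = nlp("king queen orchard")

print(king.vector.shape)         # (300,)
print(king.similarity(queen))    # relatively high: related words
print(king.similarity(orchard))  # lower: unrelated words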

Therefore, in short, SpaCy will pre-process text by splitting it into smaller units (or tokens) and then statistically processing the tokens into numbers, or vectors. Mathematically, the vectors can then be used to obtain the syntactic dependencies between words.

Setting up SpaCy

The SpaCy documentation explains in detail how to set up a config.cfg file in which we can set all the attributes for our training. The config file allows us to adjust or add components to improve the accuracy and/or efficiency of the training. Most importantly, the config specifies our training and validation data directories. Training data is the data our model trains on, while validation data is the data the model makes predictions on to check what it learned during training. So far we have not split our .spacy training data into those two directories, so for the purposes of this model I split it in a 70:30 (training:validation) ratio. This means that the model will train on 70% of our annotated material and then test itself on the other 30%.
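A minimal sketch of that 70:30 split, assuming docs is the list of annotated Doc objects produced during weak labeling and that the config points at the two hypothetical files created here:

import random
from spacy.tokens import DocBin

random.shuffle(docs)
cut = int(len(docs) * 0.7)

# 70% of the annotated docs for training, 30% for validation
DocBin(docs=docs[:cut]).to_disk("train.spacy")
DocBin(docs=docs[cut:]).to_disk("dev.spacy")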

One of the components I used for the training of this model is the Sentence Suggester. A suggester is a function that will, in our case, propose candidate clause wordings that the model will then predict labels for. This means that the text will be split into sentences, and all such sentences will then be evaluated by the model, and the ones receiving the highest confidence score will be classified with the corresponding label.

Author: https://explosion.ai/

Once we run the below command, our model will start training. The trained model will be saved to the /models directory in our training folder.

python -m spacy train config.cfg --output ./models

The training will be visualised using SpaCy’s built-in scorer and loss functions.

The closer the loss is to 0 and the score to 1, the better our model is at making predictions. The next three columns (SPANS_TEST_F, P, and R) display the F-score, precision, and recall, the metrics used to evaluate the performance of our model. The F-score is a metric that combines the precision and recall values. Precision, using the ‘Governing Law’ clause as an example, measures how many of the sentences the model labeled as ‘Governing Law’ are actually correct, while recall measures how many of all the ‘Governing Law’ clauses in our data the model correctly identified. Because all of our data is annotated, the model first predicts and then checks its predictions against the annotations we made.
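A toy worked example of the three metrics (the numbers are made up and not taken from this model):

# hypothetical counts for the 'Governing Law' label on the validation data
tp, fp, fn = 40, 6, 10   # correctly labeled, wrongly labeled, missed clauses

precision = tp / (tp + fp)                                # 40 / 46 ≈ 0.87
recall = tp / (tp + fn)                                   # 40 / 50 = 0.80
f_score = 2 * precision * recall / (precision + recall)   # ≈ 0.83
print(precision, recall, f_score)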

The overall score (the rightmost column) of our model peaked at around 0.94, which is very high, but that does not mean our model is 94% perfect; rather, based on the validation dataset and our annotations, it was 94% correct in its predictions. For this reason, the more data we have and the more precise our patterns are, the more we can trust the overall score as a reflection of actual performance.

5. Testing our model

Finally, once our model finishes training, we can test it using SpaCy’s visualisation tool DisplaCy. The images below show some of the predictions our model made on a contract it was not trained on.
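A minimal sketch of how such a test can be run (the contract file name is hypothetical, and the snippet assumes the trained span categoriser writes its predictions to the default "sc" spans key):

import spacy
from spacy import displacy

# load the best model saved by the training run
nlp = spacy.load("./models/model-best")

with open("unseen_contract.txt", encoding="utf8") as f:
    doc = nlp(f.read())

# render the predicted clause spans in the browser
displacy.serve(doc, style="span", options={"spans_key": "sc"})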
