Organizing Portuguese Legal Documents through Topic Discovery

Daniela Vianna
Jusbrasil Tech
Feb 3, 2023 · 6 min read

A significant challenge in the legal domain is to organize and summarize a constantly growing collection of legal documents, uncovering hidden topics, or themes, that can later support tasks such as legal case retrieval and legal judgment prediction. This massive amount of digital legal documents, combined with the inherent complexity of judiciary systems worldwide, presents a promising scenario for Machine Learning solutions, especially those that take advantage of recent advances in Natural Language Processing (NLP).

Brazil, a Portuguese-speaking country, has a very large and complex judiciary system, with over 27.7 million new legal cases in 2021 alone. It is in this scenario that Jusbrasil, the largest legal tech company in Brazil, with around 2 million accesses per day and a collection of billions of legal documents, operates. Besides the ever-growing corpus, an additional challenge when dealing with documents from the legal domain is the complexity and uniqueness of legal language.

As an important step towards building efficient intelligent tools for legal data, we explore the challenging problem of organizing and summarizing Jusbrasil’s large and complex legal document collection. One possible approach to this problem relies on topic discovery techniques, a research area that has resurfaced in recent years, boosted by advances in neural topic models and in contextual representations for text data, such as transformer-based autoencoder models. More specifically, in this article we study a topic discovery approach called BERTopic. To deal with the complexity of legal language, we have the flexibility to adopt different text representation models, including Doc2Vec and BERT.


Topic Modeling

Topic Modeling and Topic Discovery enable us to organize and summarize a large and unstructured collection of documents by discovering the abstract topics (themes) that occur in this collection. In this context, a topic consists of a cluster of words that together have some meaning and allow data interpretation and organization. A document in a collection may exhibit multiple topics in different proportions. Topic modeling is an unsupervised approach: we assume no previous knowledge about the themes that compose a collection, making it impossible to label documents in advance.


BERTopic is a topic discovery approach that tries to find dense clusters of documents from an embedded representation of words and documents. To extract topic representations (keywords), BERTopic proposes a class-based TF-IDF (c-TF-IDF) procedure that treats all documents in a cluster as a single document and then applies TF-IDF to find the importance of words in each cluster (topic) instead of in individual documents. To learn more about BERTopic, two great articles can be found in (1) and (2).
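To make the idea more concrete, below is a rough sketch of a class-based TF-IDF computation. It follows the description above rather than BERTopic’s exact implementation; the helper name and the use of scikit-learn’s CountVectorizer are choices of this sketch:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def class_tfidf_sketch(docs_per_topic):
    # docs_per_topic: one string per cluster, built by joining all
    # documents assigned to that cluster into a single document
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(docs_per_topic).toarray()
    # term frequency within each cluster-level document
    tf = counts / counts.sum(axis=1, keepdims=True)
    # class-based IDF: terms concentrated in few clusters score higher
    avg_words_per_class = counts.sum() / counts.shape[0]
    idf = np.log(1 + avg_words_per_class / counts.sum(axis=0))
    return tf * idf, vectorizer.get_feature_names_out()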

The main purpose of this article is to show how a topic discovery approach can efficiently organize a legal collection using the syllabus (in Portuguese, ementa jurisprudencial) of each court decision, as it concisely summarizes the main points presented by the entire decision.

Dataset Loading and Preprocessing

The dataset is composed of 2864 Brazilian court decisions from the Jusbrasil document collection. For 2439 of those documents, we have no previous knowledge about the topics they cover. The remaining 425 documents are distributed among six reference collections whose topics were defined in advance by legal specialists from the Jusbrasil staff. These reference collections are used to evaluate the quality of the topics found by the models. They are:

  • Collection 1 (RC1), composed of seven legal decisions about the legality of companies outsourcing their main activities;
  • Collection 2 (RC2), composed of eight legal decisions about an institute of Civil Law, the abusive use of legal personality;
  • Collection 3 (RC3), composed of eight legal decisions about the retirement system for public employees;
  • Collection 4 (RC4), composed of 18 legal decisions about a specific tax charged on properties located in rural areas of Brazil, the ITR, which stands for “Imposto sobre Propriedade Territorial Rural” and can be translated as “Tax on Rural Land Property”;
  • Collection 5 (RC5), composed of 184 legal decisions related to moral and material damages caused by delays and cancellations of flights;
  • Collection 6 (RC6), composed of 200 legal decisions associated with Brazilian environmental issues, such as the deforestation of the Amazon rainforest.

To illustrate, below we load the data (the 2864 syllabi from Brazilian court decisions) using pandas:

import pandas as pd

df = pd.read_csv('referencecollection.csv')
print(df.columns)


Index(['id', 'corpus', 'label'], dtype='object')

The pandas dataframe df contains 3 columns: id (the document id), corpus (the syllabus text), and label (from 0 to 6: 0 for the documents with unknown topics, and 1 to 6 for each of the six reference collections).
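As a quick, hypothetical sanity check, the label distribution should reproduce the split described earlier (2439 documents with label 0, and 425 spread over labels 1 to 6):

# number of documents per label (0 = unknown topic, 1-6 = reference collections)
print(df['label'].value_counts().sort_index())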

After loading the data, each document can undergo a variety of preprocessing steps, from basic cleaning to more targeted procedures such as the extraction of URL content and the removal of plaintiffs’ names. Notice that transformer-based models usually flourish with minimal data cleaning.

In the example below, the preprocessing steps are represented by a call to the method prepare_text_for_topicmodel(), which may execute basic cleaning procedures such as removing punctuation, removing extra white spaces, lower-casing terms, and HTML parsing.

documents = df['corpus'].tolist()
preprocessed_corpus = prepare_text_for_topicmodel(documents)

The following code snippet shows a possible implementation of the method prepare_text_for_topicmodel(). In our experiments, we achieved good results by also applying scikit-learn’s CountVectorizer() with the parameters max_df=0.75 and ngram_range=(1, 2). The parameter max_df=0.75 means that terms appearing in more than 75% of the documents are ignored; ngram_range=(1, 2) sets the lower and upper boundaries of the range of n-values for the word n-grams to be extracted (unigrams and bigrams). A sketch of how to plug these settings into BERTopic follows the snippet.

import re
import nltk

# requires the NLTK stopwords corpus: nltk.download('stopwords')
pt_stopwords = nltk.corpus.stopwords.words('portuguese')
punctuation = r'[/.!$%^&#*+\'\"()-.,:;<=>?@[\]{}|]'

def prepare_text_for_topicmodel(documents):
    # remove punctuation
    pp_documents = [re.sub(punctuation, ' ', doc)
                    for doc in documents]
    # convert to lowercase
    pp_documents = [doc.lower() for doc in pp_documents]
    # remove stopwords and single-character tokens
    pp_documents = [' '.join([w for w in doc.split()
                              if len(w) > 1 and w not in pt_stopwords])
                    for doc in pp_documents]
    return pp_documents
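If you also want the CountVectorizer() settings mentioned above, BERTopic accepts a custom vectorizer through its vectorizer_model parameter. A minimal sketch:

from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

# ignore terms present in more than 75% of the documents and
# extract unigrams and bigrams when building topic representations
vectorizer_model = CountVectorizer(max_df=0.75, ngram_range=(1, 2))
model = BERTopic(vectorizer_model=vectorizer_model)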

Topic Discovery

After loading and preprocessing the data, different topic discovery techniques can be applied to uncover unknown themes, represented by clusters of words and documents. In this article, the topic discovery approach explored is BERTopic.

As can be seen in the example below, before calling BERTopic we load a contextualized embedding model (load_embedding_model()) and generate embeddings for each preprocessed document in our collection (get_embedding()). The embedding model and the final embeddings are passed to the BERTopic model. It is important to notice that they are not strictly necessary: we could instead have set the parameter language='multilingual' or language='portuguese', and a default multilingual language model would have been used by BERTopic. However, the calls to load_embedding_model() and get_embedding() allow us to experiment with custom text representations, including language models tuned on Portuguese legal text.

from bertopic import BERTopic

embedding_model = load_embedding_model()
embeddings = get_embedding(documents, embedding_model)

model = BERTopic(embedding_model=embedding_model,
                 n_gram_range=(1, 2))
topics, probabilities = model.fit_transform(documents=documents,
                                            embeddings=embeddings)
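In the snippet above, load_embedding_model() and get_embedding() are placeholders for whatever text representation we want to experiment with. As one possibility, here is a minimal sketch that loads BERTimbau through sentence-transformers (the model name is real, but wrapping it this way, with the default pooling added by the library, is an assumption of this sketch; a gensim Doc2Vec model trained on the corpus would be another option):

from sentence_transformers import SentenceTransformer

def load_embedding_model():
    # BERTimbau, a Portuguese BERT model, wrapped by sentence-transformers
    return SentenceTransformer('neuralmind/bert-base-portuguese-cased')

def get_embedding(documents, embedding_model):
    # one dense vector per document, returned as a numpy array
    return embedding_model.encode(documents, show_progress_bar=True)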

A call to fit_transform() results in two outputs, topics and probabilities. The output topics contains the topic assigned to each document, and probabilities contains the likelihood of each document being associated with each of the possible topics.
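With those outputs in hand, we can attach the discovered topics back to the dataframe and inspect an overview of each topic using BERTopic’s get_topic_info():

# topic assigned to each document
df['topic'] = topics
# overview of the discovered topics: id, size, and top words
print(model.get_topic_info())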

An execution of BERTopic using Doc2Vec as the embedding model resulted in 39 topics, with BERTopic able to group the majority of documents from each individual collection into a single topic. For instance, for RC5, BERTopic found relevant bigrams such as ‘material damage’, ‘moral damage’, and ‘air transport’, in addition to relevant unigrams such as ‘flight’, ‘delay’, ‘baggage’, and ‘cancellation’. Besides finding the six expected topics, the model was also able to uncover previously unknown themes (topics) present in the collection. Similar results were obtained using BERTimbau, a BERT-like model pre-trained on BrWaC (Brazilian Web as Corpus), a large Portuguese corpus.
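One simple, hypothetical way to check this kind of agreement between the reference collections and the discovered topics is a contingency table over the labeled documents (using the topic column attached earlier):

import pandas as pd

# rows: reference collections (labels 1 to 6); columns: discovered topics.
# a collection grouped into a single topic appears as one dominant cell per row
labeled = df[df['label'] > 0]
print(pd.crosstab(labeled['label'], labeled['topic']))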

A similar technique has been applied to a variety of collections from different areas of the legal domain, with sizes up to around 40 thousand documents, successfully uncovering sets of unexpected and meaningful topics that can later support tasks such as legal case retrieval, recommendation, and legal judgment prediction.

Reference:

D. Vianna and E. Moura. Organizing Portuguese Legal Documents through Topic Discovery. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2022.
