Benchmarking Session-based Recommender Systems on the Legal Domain

Marcos Domingues
Jusbrasil Tech
8 min read · Mar 15, 2024

Searching for legal documents online is part of the daily routine of law students, legal professionals, and ordinary citizens. Legal search platforms offer valuable support to lawyers by facilitating tasks like legal research, court decision tracking, and the drafting of pleadings. Similarly, ordinary citizens use them to follow updates on their own cases, while students supplement theoretical learning with real-world case law.

In Brazil, we have Jusbrasil, the most prominent and widely used legal search platform. Jusbrasil crawls, indexes, and structures legal documents from numerous Brazilian courts, enabling fast and efficient legal information retrieval. However, with the ever-growing number of legal documents, even a search platform can struggle to mitigate the information overload problem. In such cases, recommender systems can assist search platforms, since they anticipate users’ information needs and thereby mitigate information overload.

In this article, we show, step by step, how to benchmark session-based recommender systems for a legal search platform: more specifically, for the Jusbrasil search platform.

Session-based Recommender Systems

We start by explaining session-based recommender systems at a glance. In such systems, both the input and the training data consist of user-item interactions grouped within specific time boundaries, referred to as sessions. Each session may encompass activities such as a music listening session, an e-commerce shopping spree, or, in our case, a legal document reading session within a search portal. We illustrate the session-based recommendation process in the figure below, where each sequence of blocks represents a session: green or blue blocks denote interactions (e.g. clicks, views, or listens) between users and items within the session, while white blocks represent recommendations. In summary, prior anonymous sessions (a) are used as training data for the recommendation models. Then, based on the interactions within the ongoing session (b), the model suggests the next item, additional session content, or even entirely new sessions (c). There exist several different algorithms based on this recommendation process, and we will use three of them in our benchmark.
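Before moving on, a minimal illustrative sketch in Python may help. It predicts the next item of an ongoing session from pairwise item co-occurrences in prior sessions; the data and function names are hypothetical, and this is not one of the benchmarked algorithms (although sr, described later, refines this very idea).

from collections import defaultdict

# Toy training sessions: each session is an ordered list of item ids
# (hypothetical data, just to illustrate the idea).
train_sessions = [
    ['a', 'b', 'c'],
    ['a', 'b', 'd'],
    ['e', 'b', 'c'],
]

# Count how often one item directly follows another across all sessions.
follows = defaultdict(lambda: defaultdict(int))
for session in train_sessions:
    for prev, nxt in zip(session, session[1:]):
        follows[prev][nxt] += 1

def recommend_next(ongoing_session, k=3):
    # Rank candidate next items by how often they followed the last item.
    candidates = follows[ongoing_session[-1]]
    return sorted(candidates, key=candidates.get, reverse=True)[:k]

print(recommend_next(['e', 'a']))  # ['b'], since 'b' followed 'a' twice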

Next, we present the JusBrasilRec dataset, and show how to run a benchmark of session-based models by using session-rec, a Python-based framework for building and evaluating recommender systems. The framework implements a suite of state-of-the-art algorithms and baselines for session-based and session-aware recommendation models.

JusBrasilRec Dataset

JusBrasilRec is a 30-day dataset collected from Jusbrasil between 2021-02-23 and 2021-03-24. We chose this period to avoid recommendation bias, since it covers the last 30 days before Jusbrasil launched its first recommender system.

An excerpt of the dataset is showcased below, where each line denotes a distinct interaction containing six data fields separated by semicolons:

  • session_id (Integer): session identifier; a new session starts after 30 minutes of user inactivity (illustrated in the sketch after the excerpt below);
  • user_id (Integer): user identifier;
  • user_type (String): type of user, i.e. logged or unlogged user;
  • doc_id (Integer): accessed legal document identifier;
  • doc_type (String): type of the document, i.e. precedents, articles, news, models and pleadings, and doctrines;
  • timestamp (Timestamp): date and hour of the access.

The dataset contains a total of 22,442,232 accesses from 2,310,247 users over 4,225,874 items. The accesses are grouped into 5,415,623 sessions.

session_id;user_id;user_type;doc_id;doc_type;timestamp
5246659;931737;logged;21075;articles;2021-03-11T20:41:26.156-03:00
5246659;931737;logged;2708666;precedents;2021-03-11T20:51:41.120-03:00
5246659;931737;logged;2711326;precedents;2021-03-11T20:55:08.363-03:00
5246836;931781;unlogged;4821696;news;2021-03-11T11:31:55.895-03:00
5246836;931781;unlogged;4798653;news;2021-03-11T11:40:04.105-03:00
5246836;931781;unlogged;4880210;news;2021-03-11T11:40:20.909-03:00
5246840;931781;unlogged;60661;articles;2021-03-23T11:26:15.195-03:00
5246840;931781;unlogged;4813493;news;2021-03-23T11:55:00.665-03:00
5246799;931772;logged;4344438;precedents;2021-03-01T13:10:23.705-03:00
5246799;931772;logged;2228117;precedents;2021-03-01T13:10:38.374-03:00
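As an illustration of the 30-minute inactivity rule mentioned above, the following pandas sketch shows one way such session identifiers could be derived from raw access logs. The column names follow the excerpt, but the code is our own illustration, not Jusbrasil's actual sessionization pipeline.

import pandas as pd

# Raw accesses: one row per (user, document, time) event.
logs = pd.read_csv('jusbrasilrec_dataset.csv', sep=';',
                   parse_dates=['timestamp'])
logs = logs.sort_values(['user_id', 'timestamp'])

# A new session starts whenever the gap to the user's previous
# access exceeds 30 minutes (or a new user begins).
gap = logs.groupby('user_id')['timestamp'].diff()
new_session = gap.isna() | (gap > pd.Timedelta(minutes=30))

# A cumulative sum over the boundary flags yields a session identifier.
logs['session_id'] = new_session.cumsum()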

The JusBrasilRec dataset was used in our benchmark, and it is available for download here: https://zenodo.org/record/8401278.

The session-rec Framework

The session-rec framework can be installed and used with Anaconda or Docker. In this article, we show how to install and use it with Anaconda. If you prefer Docker, see the instructions at https://github.com/rn5l/session-rec.

1. Installation

First, we download and install Anaconda following the instructions on its website (https://www.anaconda.com). Next, we download the session-rec framework from its repository (https://github.com/rn5l/session-rec.git), and unzip the file.

From the session-rec main folder, we must run:

  • If we have a GPU: conda install --yes --file environment_gpu.yml (or conda env update --name srec37 --file environment_gpu.yml);
  • If we don’t have a GPU or we are using the Windows Operating System: conda install --yes --file environment_cpu.yml (or conda env update --name srec37 --file environment_cpu.yml).

2. Configuration

Before running the framework, we need to set up two configuration files. The first one is the preprocessing file, which we create and name jusbrasil_preprocessing.yml. There, we set the type of experiment (i.e. single train and test, or sliding window train and test) and its parameters, the preprocessing filters, and the input and output folders.

type: window # single | window
mode: session_based # session_based | session_aware
preprocessor: rsc15

data:
  folder: data/jusbrasilrec/raw/
  prefix: jusbrasilrec_dataset

filter:
  min_item_support: 5
  min_session_length: 2

params:
  days_test: 1
  days_train: 5 # only window
  num_slices: 5 # only window
  days_offset: 0 # only window
  days_shift: 6 # only window

output:
  folder: data/jusbrasilrec/slices/

According to the configuration in our preprocessing file, a dataset jusbrasilrec_dataset.dat must be in the input folder data/jusbrasilrec/raw/ with only the session_id, timestamp, and doc_id columns, separated by commas and without a header. Additionally, the timestamp column must be converted to the %Y-%m-%dT%H:%M:%S.%fZ format. For JusBrasilRec, we can obtain such a dataset by using the following Python code snippet.

import pandas as pd
from datetime import datetime

# Keep only the three columns expected by the preprocessor, in this order.
col = ['session_id', 'timestamp', 'doc_id']

data = pd.read_csv('jusbrasilrec_dataset.csv', usecols=col, sep=';')
data = data.loc[:, col]

# Parse the original timezone-aware timestamps and re-serialize them
# in the %Y-%m-%dT%H:%M:%S.%fZ format required by session-rec.
data['timestamp'] = data.timestamp.apply(lambda x: datetime.strptime(x, '%Y-%m-%dT%H:%M:%S.%f%z'))
data['timestamp'] = data.timestamp.apply(lambda x: datetime.strftime(x, '%Y-%m-%dT%H:%M:%S.%fZ'))

# Write a comma-separated .dat file without header or index.
data.to_csv('jusbrasilrec_dataset.dat', index=None, header=None, sep=',')

Note that we set the type to window in the preprocessing file, which means that the models will be trained from the ground up by applying a sliding window protocol, i.e., the sessions S1, S2, …, Sm will be split into several slices of equal size in days to train and test the models. As an example, the following figure illustrates the protocol applied to 15 sessions to generate three slices of 5 days each: four days for training and one day for testing. Thus, the protocol allows us to evaluate the variability of the models’ performances by considering distinct data splits (i.e. slices) while preserving the chronological order of user-item interactions.
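To see how the window parameters interact, here is a small sketch of our own that prints the training and test day ranges of each slice; it mirrors the protocol described above, but it is not session-rec's internal implementation.

from datetime import date, timedelta

# Window parameters from jusbrasil_preprocessing.yml.
days_train, days_test = 5, 1
num_slices, days_offset, days_shift = 5, 0, 6

start = date(2021, 2, 23)  # first day of JusBrasilRec
for s in range(num_slices):
    begin = start + timedelta(days=days_offset + s * days_shift)
    train_end = begin + timedelta(days=days_train)
    test_end = train_end + timedelta(days=days_test)
    print(f'slice {s}: train [{begin}, {train_end}), test [{train_end}, {test_end})')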

After running the preprocessing step, the preprocessed files for the benchmark will be in the output folder. Since JusBrasilRec contains 30 days of data, we decided to split the dataset into 5 slices of 6 days each, with 5 days of data for training and one day for testing.

Next, we create the jusbrasil_benchmark.yml file to set up the benchmark parameters. There, we must define the type of evaluation and its metrics, as well as the session-based recommendation algorithms and their parameters. In our configuration file, we set up four evaluation metrics:

  • HitRate: the ratio of session interactions where the next item is part of the recommendation list;
  • NDCG: Normalized Discounted Cumulative Gain;
  • Coverage: the proportion of different items in the catalog that ever appear in the recommendation lists;
  • Popularity: whether high accuracy values are correlated with the tendency of a model to recommend popular items.

We also set up three session-based algorithms for our benchmark:

  • sr: a variant of Association Rules that quantifies pairwise item co-occurrences within the training sessions;
  • vsknn: a variant of Session-based KNN that puts more weight on the more recent interactions of the current session when computing the session similarities;
  • gru4rec: a model that uses Gated Recurrent Units to deal with the vanishing gradient problem in order to predict the probability of the subsequent interactions given the current session.

In our configuration file, we set the algorithms to use their default parameters. In the folder algorithms, we can see the full list of session-based algorithms and parameters available in the session-rec framework.

type: window # single | window
key: models # added to the csv names
evaluation: evaluation_next_multiple # evaluation | evaluation_last

data:
  name: jusbrasil # added in the end of the csv names
  folder: data/jusbrasilrec/slices/
  prefix: jusbrasilrec_dataset
  slices: 5 # only window

results:
  folder: results/jusbrasilrec/

metrics:
- class: accuracy.HitRate
  length: [1,3] # length of the recommendation lists
- class: accuracy_multiple.NDCG
  length: [1,3] # length of the recommendation lists
- class: coverage.Coverage
  length: [3] # length of the recommendation list
- class: popularity.Popularity
  length: [3] # length of the recommendation list

algorithms:
- class: baselines.sr.SequentialRules
  params: {}
  key: sr
- class: knn.vsknn.VMContextKNN
  params: {}
  key: vsknn
- class: gru4rec.gru4rec.GRU4Rec
  params: {}
  key: gru4rec

More details about the parameters for both configuration files can be found at https://github.com/rn5l/session-rec.git.
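To make the accuracy metrics concrete, the sketch below computes HitRate@N and NDCG@N for a single next-item prediction. It is a simplified rendition of the definitions given above, not session-rec's implementation, and the example data is hypothetical.

import math

def hit_rate_at_n(recommended, next_item, n):
    # 1 if the true next item appears in the top-n list, else 0.
    return int(next_item in recommended[:n])

def ndcg_at_n(recommended, next_item, n):
    # With a single relevant item, DCG reduces to 1/log2(rank + 1),
    # and the ideal DCG is 1, so NDCG equals DCG here.
    if next_item in recommended[:n]:
        rank = recommended.index(next_item) + 1
        return 1.0 / math.log2(rank + 1)
    return 0.0

# Hypothetical example: the model ranked three documents, and the
# user actually accessed document 42 next.
recs, truth = [7, 42, 13], 42
print(hit_rate_at_n(recs, truth, 3))  # 1
print(ndcg_at_n(recs, truth, 3))      # 1/log2(3), approximately 0.63

Averaging these values over all test interactions yields the HitRate@N and NDCG@N figures reported by the framework.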

3. Running

Finally, to perform the benchmark, we must run the following commands in a terminal from the session-rec main folder:

  • On Linux with a GPU:
conda activate srec37

python run_preprocessing.py jusbrasil_preprocessing.yml

THEANO_FLAGS="device=cuda0,floatX=float32" CUDA_DEVICE_ORDER=PCI_BUS_ID python run_config.py jusbrasil_benchmark.yml

conda deactivate
  • On Windows, or on Linux without a GPU:
conda activate srec37

python run_preprocessing.py jusbrasil_preprocessing.yml

python run_config.py jusbrasil_benchmark.yml

conda deactivate

After running the benchmark, the five result files, i.e. one file for each slice, will be in the folder results/jusbrasilrec. As can be seen below, each file contains the metrics computed for each session-based recommendation algorithm. Thus, we can compute the average and standard deviation values for each algorithm, or even compare the algorithms by using a statistical test.

File 1

Metrics;HitRate@1: ;HitRate@3: ;NDCG@1: ;NDCG@3: ;Coverage@3: ;Popularity@3:
sr;0.48824666;0.61540815;0.53545323;0.43865759;0.17500210;0.02816248
vsknn;0.50621478;0.65752697;0.56019160;0.47439661;0.17420346;0.03219068
gru4rec;0.46153598;0.60414587;0.50836419;0.42768151;0.22427339;0.01841212

File 2

Metrics;HitRate@1: ;HitRate@3: ;NDCG@1: ;NDCG@3: ;Coverage@3: ;Popularity@3:
sr;0.48944881;0.61615966;0.53825920;0.44026530;0.16414151;0.02892650
vsknn;0.50364833;0.65676688;0.55952505;0.47380831;0.16166904;0.03335590
gru4rec;0.46087144;0.60553417;0.50922857;0.42932515;0.20786254;0.01967656

File 3

Metrics;HitRate@1: ;HitRate@3: ;NDCG@1: ;NDCG@3: ;Coverage@3: ;Popularity@3:
sr;0.48164393;0.61504407;0.53036589;0.44231697;0.29496801;0.02912842
vsknn;0.49517732;0.65669901;0.55072311;0.47652975;0.30576006;0.03290822
gru4rec;0.45366887;0.60372278;0.50234395;0.43084820;0.35981201;0.01888675

File 4

Metrics;HitRate@1: ;HitRate@3: ;NDCG@1: ;NDCG@3: ;Coverage@3: ;Popularity@3:
sr;0.47471330;0.60489583;0.52485728;0.43491087;0.33298287;0.02695349
vsknn;0.48625031;0.64847450;0.54326018;0.47012268;0.34923885;0.03176573
gru4rec;0.44907711;0.59367135;0.49900161;0.42395539;0.40346956;0.01817907

File 5

Metrics;HitRate@1: ;HitRate@3: ;NDCG@1: ;NDCG@3: ;Coverage@3: ;Popularity@3:
sr;0.48112233;0.60893597;0.53050567;0.43655931;0.33799109;0.02378432
vsknn;0.49328199;0.65210877;0.54982052;0.47159094;0.35694654;0.02754076
gru4rec;0.45321555;0.59935983;0.50293941;0.42641119;0.40921109;0.01603101

As an example, in the chart below we plot the average and the standard deviation for HitRate@3, i.e. the HitRate for lists with three recommendations. There, we can compare the three algorithms and see that vsknn provides the best results in terms of HitRate@3. Similar analyses can also be conducted with the other evaluation metrics.
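For reference, the aggregation itself takes only a few lines of pandas, as sketched below. The glob pattern is hypothetical, since session-rec derives the actual csv file names from the key and data name set in the benchmark file; adjust it to the files produced in your run.

import glob
import pandas as pd

frames = []
# Hypothetical pattern; adjust to the csv names session-rec produced.
for path in sorted(glob.glob('results/jusbrasilrec/*.csv')):
    df = pd.read_csv(path, sep=';')
    # Normalize headers such as 'HitRate@3: ' to 'HitRate@3'.
    df.columns = [c.strip().rstrip(':') for c in df.columns]
    frames.append(df)

# Stack the five slices and summarize one metric per algorithm.
results = pd.concat(frames)
print(results.groupby('Metrics')['HitRate@3'].agg(['mean', 'std']))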

To conclude this article, it is worth mentioning that, by following the step-by-step process presented here, we ran a large-scale benchmark for Jusbrasil with several algorithms, metrics, analyses, and conclusions, which can be found at https://doi.org/10.1007/s10506-023-09378-3.

Reference:

Marcos Aurélio Domingues, Edleno Silva de Moura, Leandro Balby Marinho and Altigran da Silva. A Large Scale Benchmark for Session-based Recommendations on the Legal Domain. Artificial Intelligence and Law, 2023.

Marcos Domingues
Jusbrasil Tech

Professor at the State University of Maringá, and Postdoc researcher in Intelligent Methods for Lawtech at the Federal University of Amazonas and Jusbrasil.