Exploring the capabilities of Natural Language Processing (NLP) in conducting legal analysis: An experiment using POCSO Judgments

Published in CivicDataLab, Mar 27, 2023

The legislative department of the Government of India maintains a chronological list of central acts. The oldest act dates back to 1836. In fact, there are more than 60 such acts which have been in existence for more than 150 years. But how often are these laws enforced? Do we still need these acts or should they be repealed? Legal researchers study orders and judgments published by the courts for cases under specific acts to answer such questions. But conducting such studies takes a lot of time and resources as researchers have to go through these lengthy judgements manually to look for any patterns that might help them in analysing these laws. Let’s try to understand this process using an example.

The current approach of performing a quantitative analysis of law

In 2022, activists and researchers who work to strengthen the child rights ecosystem got together to evaluate the impact of a law that had completed 10 years: the Protection of Children from Sexual Offences (POCSO) Act, 2012, which came into effect to tackle the rise in child sexual abuse cases. But has the law managed to address this problem, or are there areas of concern which should be addressed through amendments? The first step in the research process is to curate judgments of cases filed under the POCSO Act. Researchers then go through the judgements to find details such as the time taken by the courts to dispose of cases, how victims testify, the number of effective hearings, how courts appreciate the evidence and determine the age of the child, and the factors responsible for conviction and acquittal in such cases. While reviewing the judgements, they often annotate or highlight the variables, as depicted in the following figure.

Figure 1: Manual annotation of a sample POCSO case

You can see that the researcher who annotated this copy has an eye for certain parts of the judgement: the sections involved [4 of POCSO, 366(A) IPC], the age of the victim (16), the type of crime committed (kidnapping), and so on. These specific variables are essential for a researcher to perform a comprehensive analysis. Researchers have long used such annotations of court orders and judgments to analyse the implementation of statutes. But nearly 47,000 POCSO cases were registered in 2021 alone, and there are not enough legal researchers to annotate all of them. This is where artificial intelligence (AI) can aid legal research.

Aiding legal research through OpenNyAI

Using orders and judgments that have been manually annotated in the past, computers can be trained to identify similar patterns in newer cases. This way, the laborious task of manually annotating judgements or orders can be left to the machine, while the researcher focuses on more creative pursuits such as identifying relationships between the variables and forming new hypotheses. The developers of OpenNyAI did exactly this: they trained computers to recognise the crucial parts of Indian legal texts.

Currently, up to 14 variables or “named entities” can be extracted from Indian legal texts using OpenNyAI. These include COURT, JUDGE NAME, STATUTE, and PROVISION among others. More about these entities can be found here: Link

The attached image shows how OpenNyAI annotated the sample POCSO judgement considered.

Figure 2: OpenNyAI’s Entity Recognition, showing the entities annotated by OpenNyAI’s Named Entity Recognition model
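Once entities have been extracted, they are usually grouped by label before any analysis. The following is a minimal sketch, assuming the NER model returns a spaCy-style document whose `ents` expose `label_` and `text` attributes. That interface, the function name `group_entities`, and the stand-in document are assumptions made for illustration, not OpenNyAI's documented API.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_entities(doc):
    """Group entity texts by label, keeping the first occurrence of each text."""
    grouped = defaultdict(list)
    for ent in doc.ents:
        if ent.text not in grouped[ent.label_]:
            grouped[ent.label_].append(ent.text)
    return dict(grouped)


# Hypothetical stand-in for a model's output. In a real run, `doc` would be
# produced by the NER model rather than built by hand.
sample_doc = SimpleNamespace(ents=[
    SimpleNamespace(label_="STATUTE", text="POCSO Act"),
    SimpleNamespace(label_="PROVISION", text="Section 4"),
    SimpleNamespace(label_="STATUTE", text="IPC"),
    SimpleNamespace(label_="PROVISION", text="Section 366(A)"),
    SimpleNamespace(label_="PROVISION", text="Section 4"),  # repeated mention
])

entities_by_label = group_entities(sample_doc)
```

Grouping by label makes it easy to pull out only the STATUTE and PROVISION mentions, which are the ones that matter for the comparison with metadata discussed later.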

As of today, some of these variables are also available on the eCourts portal for most cases. Many researchers rely on this information to conduct quantitative research. Here several important details related to a case such as Acts, Sections, Case History, Filing and Registration Dates, etc are made available for each case. The image below shows how case-related information is displayed on the portal.

Figure 3: Metadata of the sample POCSO case, as shown on the official website

As we can see from the image above, some of the variables which were marked in the judgment are also available here. For example, Section 4 of POCSO, which is mentioned in the judgement [Figure 1], also appears under the Acts heading.

However, the metadata is neither perfect nor exhaustive. We can observe that the metadata did not include Sec 366(A) IPC. Similarly, we observed in multiple other cases that the metadata did not capture all the POCSO Sections under which charges were framed. We even observed typographical errors in the metadata for a few cases.

Figure 4: Errors in metadata (CNR: ASSN010000912018). A section not present in POCSO is identified as a POCSO section; there is no Section 46 in POCSO.

Cases like these suggest that relying on metadata alone may not be sufficient for sound quantitative research.

But can we use OpenNyAI to generate more relevant data points than the metadata provides?

We designed an experiment to check whether OpenNyAI can be used to identify relevant POCSO Sections which are part of the judgment but are not present in the case details on the eCourts website. For this study, we curated POCSO judgements delivered between 2017 and 2019 by district courts in Assam. We limited our research to one state since judgements published within a state might be similar in the way they are written. The more similarities there are, the better, especially when we are trying to train machines to find patterns. In technical terms, we refer to this task as Natural Language Processing (NLP). We also had access to these judgments from Assam because of our past work analysing POCSO cases, which was another important reason to select Assam for this experiment.

There were 1,764 POCSO judgements available in Assam during the time period considered — most of them delivered in the districts of Barpeta, Sonitpur, Dhubri and Sivasagar. Out of the 1,764 judgements, we filtered judgements from the above four districts. Then we selected 51 judgements from the dominant case types (Special (POCSO) Case, POCSO Act) under which the judgements were delivered.

We define “Relevant POCSO Sections” as those POCSO Sections under which the charges were framed and the court order was given. The charges can be identified in the “preamble” part of the judgement and the court order can be identified in the “decision” part of the judgement. However, the judgements do not have the “preamble” and “decision” parts clearly delineated. So, we used another OpenNyAI algorithm called the Extractive Summarizer, which divides the judgment into five parts: preamble summary, facts summary, issue summary, analysis summary and decision summary.

Figure 5: Preamble Summary identified by the OpenNyAI Extractive Summarizer, showing how relevant POCSO sections are identified

Figure 6: Decision Summary identified by the OpenNyAI Extractive Summarizer, showing how relevant POCSO sections are identified
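To make the idea concrete, picking out the POCSO sections from the preamble and decision summaries and comparing them with the metadata might look like the sketch below. It is a minimal illustration, not the code used in the experiment: it assumes both summaries are already available as plain strings, and the regular expression, the function names `extract_pocso_sections` and `compare_with_metadata`, and the sample sentences are all hypothetical.

```python
import re

# Hypothetical pattern covering phrasings such as "Section 4 of POCSO",
# "Sec. 6 of the POCSO Act" and "u/s 8 of the POCSO Act".
# Real judgments phrase this in many more ways.
SECTION_PATTERN = re.compile(
    r"(?:sections?|sec\.?|u/s)\s*(\d+)\s*(?:of\s+)?(?:the\s+)?POCSO",
    re.IGNORECASE,
)


def extract_pocso_sections(text: str) -> set[str]:
    """Return the POCSO section numbers mentioned in a piece of text."""
    return set(SECTION_PATTERN.findall(text))


def compare_with_metadata(preamble: str, decision: str, metadata_sections) -> dict:
    """Compare sections found in the two summaries with those in the metadata."""
    in_judgment = extract_pocso_sections(preamble) | extract_pocso_sections(decision)
    metadata = {str(section) for section in metadata_sections}
    return {
        "missing_from_metadata": sorted(in_judgment - metadata),
        "only_in_metadata": sorted(metadata - in_judgment),
    }


# Made-up sentences, for illustration only.
preamble = "Charges were framed under Section 4 of POCSO Act and Section 366(A) IPC."
decision = "The accused is convicted u/s 8 of the POCSO Act."
result = compare_with_metadata(preamble, decision, metadata_sections=["4"])
```

In this made-up example the metadata lists only Section 4, so Section 8 from the decision is reported as missing from the metadata. The IPC section is ignored because the pattern only looks for POCSO mentions.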

We then used these sub-sections within a judgment to identify the relevant POCSO sections. The results of the experiment are as follows:

1. Metadata captured all relevant POCSO Sections for only 28/51 judgements

In these 28/51 judgements, the POCSO sections mentioned in the metadata are consistent with the POCSO Sections under which the charges were framed (preamble summary) and with the POCSO Sections mentioned in the court order (decision summary). The sample POCSO judgement discussed in this blog is one of these 28 judgements. We can see that only Section 4 of the POCSO Act is in the metadata, preamble summary and decision summary.

2. In 20/51 judgements, we identified more sections than are mentioned in the metadata.

Sections which were part of the court order: Metadata only has the POCSO Section under which the charges were framed. But these charges often change during the trial, and the conviction or acquittal might happen under different POCSO Sections. Researchers who use metadata alone would miss the POCSO Section under which the accused was actually convicted or acquitted. The OpenNyAI model helped us identify those sections.
Sections which might have been part of the chargesheet but not the FIR: Metadata often has the POCSO Section that is mentioned in the FIR, not even the charges. It is well known that the police include an exhaustive list of sections in the FIR. Not all of them make their way to the chargesheet or to the final charges framed by the Court, upon which the actual prosecution happens. Legal research based on the POCSO Sections mentioned in FIRs would not be sound research.

3. There are data entry mistakes in 3/51 judgements’ metadata

Sections that do not even exist in the POCSO statute are entered in the metadata in 3 judgements.
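The three outcomes above can be expressed as a simple classification rule. The sketch below is an assumption about how such a tally could be coded, not the experiment's actual script. The function `classify_judgment`, the outcome labels, and the order in which the checks are applied are hypothetical. The caller is expected to supply `valid_sections`, the sections that exist in the statute; the values used here are placeholders only.

```python
from collections import Counter


def classify_judgment(metadata_sections, judgment_sections, valid_sections):
    """Assign a judgment to one of three outcome categories."""
    metadata = set(metadata_sections)
    judgment = set(judgment_sections)

    # Outcome 3: the metadata contains a section that is not in the statute.
    if metadata - set(valid_sections):
        return "data_entry_error"
    # Outcome 1: metadata agrees with the preamble and decision summaries.
    if metadata == judgment:
        return "metadata_consistent"
    # Outcome 2: the judgment and the metadata list different sections.
    return "sections_differ"


# Placeholder values for illustration; this is NOT the full list of sections.
valid = {"4", "6", "8", "10", "12"}

# Each pair is (sections in metadata, sections found in the judgment).
sample = [
    ({"4"}, {"4"}),
    ({"4"}, {"4", "8"}),
    ({"99"}, {"4"}),  # "99" stands in for a section that does not exist
]

tally = Counter(classify_judgment(m, j, valid) for m, j in sample)
```

Checking for invalid sections first is a design choice: a judgment whose metadata contains a data entry mistake is counted once under that outcome rather than also being counted as a mismatch.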

The code we wrote for the experiment can be accessed here: Link

So, we can say that OpenNyAI helped us not only in identifying the missing sections but also in detecting discrepancies in the data. But we need a few more data-intensive studies, across more cases and states, to further test and validate the feasibility of using OpenNyAI for important legal research use cases.

The above discussion covers only one of the many use cases of OpenNyAI. The list of use cases is ever-growing and is being crowdsourced by OpenNyAI here: Link. You can go through the use cases, vote for the ones you like, and suggest more.

How can you use OpenNyAI?

As a programmer:

Since OpenNyAI is an AI model, automating its use across multiple judgements generally requires some knowledge of Python programming. OpenNyAI’s GitHub repository is a good place to start practising.

There are a few challenges in programmatically using OpenNyAI models:

The input to the models is judgement text in TXT format. However, judgements are available in PDF format, and often they are not machine-readable, especially at the district court level. So a lot of time goes into converting PDFs into TXTs, and minor errors often creep into the text during conversion.
The model is computationally intensive. With large judgement texts, I saw the model fail to execute a couple of times.
The model can only run on the entire judgement text. If you input only a part of the judgement text, the model often fails to execute.
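Some of the conversion noise mentioned in the first point can be reduced with a small cleanup pass before the text is given to the model. The sketch below is only an assumption about what such a pass might look like. The function name `clean_judgment_text` and the specific rules are illustrative and would need tuning against real judgments.

```python
import re


def clean_judgment_text(raw: str) -> str:
    """Apply a few conservative fixes to text extracted from a PDF."""
    # Form feeds are often emitted at page breaks by PDF-to-text tools.
    text = raw.replace("\x0c", "\n")
    # Rejoin words split by a line-end hyphen ("accu-\nsed" -> "accused").
    # Caveat: this also merges genuinely hyphenated words broken at a line end.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Remove stray spaces around line breaks.
    text = re.sub(r" ?\n ?", "\n", text)
    # Keep at most one blank line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


# Made-up fragment, for illustration only.
raw_fragment = "The accu-\nsed  was   charged\x0c\n\n\n under Section 4."
cleaned = clean_judgment_text(raw_fragment)
```

The rules are deliberately conservative: they tidy whitespace and line breaks but leave the wording of the judgment untouched, so anything the model extracts can still be traced back to the original document.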

As a legal researcher:

It is expected that the open-source community would build various applications on top of OpenNyAI. This is akin to how ChatGPT and various other bots are being built on top of the GPT-3 AI model, thereby democratising the application of AI. Similarly, an application called the Judgement Explorer has been built on top of OpenNyAI. All you need is a judgement link from indiankanoon.org and you can run OpenNyAI to get judgement summaries and to identify entities like COURT, JUDGE, STATUTE, PROVISION etc., in a judgement.

We can hope that several other legal-tech and EdTech applications will be built in a similar way, enabling us to engage with Indian judicial data much more effectively.

Final thoughts

This experiment allowed us to scratch the surface of this humongous archive, which remains mostly untouched and unstructured.

The experiment also reminded me of the importance of interdisciplinarity in any work. Engineers can often fall into the trap of technosolutionism, the naïve belief that technology (AI, of late) can fix any problem. While I used an AI model for this experiment, it would have remained meaningless if not for the insights from Sri Harsha Kandukuri.

Harsha is a Child Rights Research Consultant at CivicDataLab. His expertise in the POCSO Act is what gave this project its meaning. He helped design the experiment, defined the scope outside of which the model would not be useful, and provided the context that connected the dots.

I’m glad that we could build an intersection of AI, data and law through this experiment. In the next few months, we will deepen this intersection further to datafy the archives of the Child Rights ecosystem in India.

Stay tuned for more on it!