An Intelligent Search Tool for the Legal and Compliance Industry — eDiscovery and Due Diligence

Kaushik Shakkari
Cognistx’s SQUARE/DIP Blog
5 min read · Nov 20, 2022


Check out my Open Domain Question Answering series for more context and technical details regarding question-answering modeling.

Photo by author: Stunning view at Avalanche Lake, Glacier National Park, Montana (Sept 2022)

Organizations across the legal and compliance industry deal daily with complex unstructured documents such as contracts, invoices, and agreements. Managing these documents, including searching for and validating information, is essential to reducing risk and staying compliant.

Traditional search tools use simple keyword matching to find information in a document (for example, using Ctrl+F in a PDF viewer like Adobe Acrobat to search for or validate relevant information). However, exact keyword matching is an inefficient way to search for information.


Limitations of exact keyword matching:

  1. Users might not know the exact keywords used in documents. For example, in the privacy domain, users might search with the keyword “personal data” and see zero exact matches in a document, because the document uses “personal information” instead of “personal data.”
  2. Some keywords have many matches across different documents, but their meaning differs with context. For example, in the information security and compliance domain, “access” in “secure remote access to prevent unauthorized access to data” means something different from “access” in “physical access to enter the office building.”
  3. Keyword-based searching is not scalable. Imagine searching manually using keywords in a PDF viewer across millions of documents 😰.
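The first two limitations are easy to reproduce. Below is a minimal sketch of Ctrl+F-style matching; the function name `exact_matches` and both document snippets are my own illustrations, not text from real documents.

```python
import re

def exact_matches(keyword: str, text: str) -> list[int]:
    """Return the start offset of every case-insensitive exact match, as Ctrl+F would."""
    pattern = re.escape(keyword)
    return [m.start() for m in re.finditer(pattern, text, flags=re.IGNORECASE)]

# Hypothetical snippets, written only for this illustration.
privacy_doc = "We collect personal information only with the user's consent."
security_doc = (
    "Secure remote access prevents unauthorized access to data. "
    "Badge readers control physical access to the office building."
)

# Limitation 1: the topic is covered, but the exact keyword is absent.
print(exact_matches("personal data", privacy_doc))
print(exact_matches("personal information", privacy_doc))

# Limitation 2: every hit looks the same to the matcher, whatever it means.
print(len(exact_matches("access", security_doc)))
```

The matcher has no way to tell that the first search missed a relevant sentence, or that the hits for “access” mix remote and physical access. Closing that gap is the job of the semantic models described next.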

Today’s state-of-the-art AI models enable search tools to use the semantics (meaning) and layout (structure) of documents to find information accurately.

How can AI models use semantics?

It is essential to map synonyms and acronyms while extracting information. However, explicitly adding rules to map synonyms and acronyms is tedious at scale. Today, many AI language models can capture semantic relationships much as humans do. They know that “apple” the company is different from “apple” the fruit. These language models have complex neural architectures and are trained on large text corpora to mimic how a human brain understands language. As of November 2022, BLOOM is the world’s largest open-source language model. It has 176 billion parameters and was trained for 3.5 months on 384 A100 80GB GPUs.
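To make the idea concrete, here is a toy sketch of how a semantic search ranks passages: each text is turned into a vector (an embedding), and passages are ordered by cosine similarity to the query. The three-dimensional vectors below are hand-made stand-ins that I chose for illustration. A real system would obtain embeddings from a trained language model, and this is not a description of how any particular product does it.

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# ASSUMPTION: hand-made toy vectors, not output of a real model.
# Dimensions, loosely: [privacy/identity, biology, fruit/food].
TOY_EMBEDDINGS = {
    "Is DNA considered personal data?": [0.7, 0.7, 0.0],
    "Genetic data should be defined as personal data.": [0.8, 0.6, 0.0],
    "Apples are harvested in the autumn.": [0.0, 0.2, 0.9],  # hypothetical distractor
}

def rank(query: str, passages: list[str]) -> list[tuple[str, float]]:
    """Order passages by similarity to the query, most similar first."""
    q = TOY_EMBEDDINGS[query]
    scored = [(p, cosine(q, TOY_EMBEDDINGS[p])) for p in passages]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranking = rank(
    "Is DNA considered personal data?",
    ["Apples are harvested in the autumn.",
     "Genetic data should be defined as personal data."],
)
for passage, score in ranking:
    print(f"{score:.2f}  {passage}")
```

The query and the top passage share no keyword except “personal data”, yet “DNA” and “genetic data” land close together in the vector space, which is what lets a semantic search connect them without hand-written synonym rules.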

Screenshot from author: Cognistx’s SQUARE DEMO

The above screenshot shows an example from Cognistx’s search tool, SQUARE, where a user asked, “Is DNA considered personal data?” about a GDPR document from the Official Journal of the European Union. SQUARE understood that “DNA” is related to “genetic data” and retrieved a relevant answer, “Genetic data should be defined as personal data.”

How can AI models use layout?

Structured Data Extraction using AWS Textract

Often, unstructured documents such as contracts, bills, and standard forms contain some structured information in tables and form fields. The location of that structured information on the page helps to label it correctly (for example, Amount = $ 552,500 in the above figure). Models like LayoutLM can use this layout information to answer a question asked in natural language (for example, “What is the base loan amount?”).
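As a rough intuition for why position matters, the sketch below pairs a form label with the nearest text to its right on the same row. This is a simple geometric heuristic of my own, not how LayoutLM or AWS Textract work internally. The `Word` class, the coordinates, and the “Term” row are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Word:
    text: str
    x: float  # left edge of the bounding box, in page units
    y: float  # vertical centre of the bounding box

def value_right_of(label: str, words: list[Word], row_tol: float = 5.0) -> Optional[str]:
    """Return the text nearest to the right of `label` on the same row, if any."""
    anchor = next((w for w in words if w.text == label), None)
    if anchor is None:
        return None
    same_row = [
        w for w in words
        if w is not anchor and abs(w.y - anchor.y) <= row_tol and w.x > anchor.x
    ]
    return min(same_row, key=lambda w: w.x).text if same_row else None

# Hypothetical OCR output for a two-row form (coordinates are made up).
words = [
    Word("Amount", 40, 100), Word("$ 552,500", 200, 101),
    Word("Term", 40, 130), Word("30 years", 200, 129),
]

print(value_right_of("Amount", words))
```

A layout-aware model learns this kind of spatial relationship from data instead of relying on a fixed rule, so it can cope with forms whose labels sit above, below, or far away from their values.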

An intelligent search tool can improve organizations’ productivity and reduce costs by efficiently automating the manual process of searching information across documents.


Semantic and layout-aware AI models for search can empower many use cases in the legal and compliance industry. The following describes the top use cases that can benefit from these models.

Screenshot from author

Open Search for eDiscovery:

In the open search use case, a centralized repository stores data extracted from trusted sources, whether by crawling the internet, querying relevant databases, or manually adding relevant documents in different formats. Users are often new to the domain and not sure what questions to ask. They interact with a Google-search-like interface (as shown in the above SQUARE GDPR example) and get relevant answers to their natural language queries. Often, users explore with this interface, get answers, and frame new questions based on what they learned from previous answers. The tool helps users understand the domain better and reduces spending on consulting domain experts. It can also help avoid bias introduced by different consultants. Finally, the centralized repository can be updated every day or every hour so that users always get the latest information.
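Keeping the repository fresh comes down to deciding which sources are due for another crawl. Here is a minimal sketch under my own assumptions: the function name `sources_due`, the source names, and the timestamps are all hypothetical, and a production crawler would also handle failures and rate limits.

```python
from datetime import datetime, timedelta

def sources_due(
    last_crawled: dict[str, datetime], now: datetime, interval: timedelta
) -> list[str]:
    """Return sources whose last crawl is at least `interval` old, oldest first."""
    due = [(ts, src) for src, ts in last_crawled.items() if now - ts >= interval]
    return [src for ts, src in sorted(due)]

# Hypothetical crawl log: source name -> time of the last successful crawl.
crawl_log = {
    "regulator-website": datetime(2022, 11, 20, 11, 30),
    "internal-database": datetime(2022, 11, 19, 9, 0),
    "uploaded-contracts": datetime(2022, 11, 20, 10, 0),
}

now = datetime(2022, 11, 20, 12, 0)
print(sources_due(crawl_log, now, timedelta(hours=1)))  # hourly refresh
print(sources_due(crawl_log, now, timedelta(days=1)))   # daily refresh
```

Choosing the interval is a trade-off: an hourly refresh keeps answers current for fast-changing sources, while a daily refresh is cheaper and enough for documents that rarely change.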

Constrained Search for Due Diligence:

In the constrained search use case, a template of entities is created in consultation with the customer. For example, the entities for a lease contract due-diligence process include the licensee name, revenue share percentage, and buy-out amount. These pre-defined entities need to be extracted and validated. The users of constrained search are often domain experts. However, manually extracting and verifying entities at scale in a limited time is not feasible. Hence, these tools help domain experts such as attorneys validate pre-defined entities within a constrained time.
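One way to picture such a template is as a mapping from each entity to a format check, so that extracted values can be flagged for an expert's attention. The sketch below is only an illustration: the field names, the checks, and the sample values are assumptions of mine, and real validation rules would come from the customer.

```python
import re
from typing import Callable

def is_nonempty(value: str) -> bool:
    return bool(value.strip())

def is_percentage(value: str) -> bool:
    """Accept values like '12.5%' that fall between 0 and 100."""
    m = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*%\s*", value)
    return bool(m) and 0 <= float(m.group(1)) <= 100

def is_amount(value: str) -> bool:
    """Accept dollar amounts with thousands separators, like '$1,200,000'."""
    return re.fullmatch(r"\s*\$\s?\d{1,3}(?:,\d{3})*(?:\.\d{2})?\s*", value) is not None

# Hypothetical template for a lease contract review.
TEMPLATE: dict[str, Callable[[str], bool]] = {
    "licensee_name": is_nonempty,
    "revenue_share_percentage": is_percentage,
    "buy_out_amount": is_amount,
}

def validate(extracted: dict[str, str]) -> dict[str, str]:
    """Return a status per template entity: 'ok', 'invalid', or 'missing'."""
    report = {}
    for entity, check in TEMPLATE.items():
        if entity not in extracted:
            report[entity] = "missing"
        else:
            report[entity] = "ok" if check(extracted[entity]) else "invalid"
    return report

# Made-up extraction result: one malformed value and one missing entity.
print(validate({
    "licensee_name": "Acme Towers LLC",
    "revenue_share_percentage": "twelve percent",
}))
```

A report like this lets an attorney skip the entities marked “ok” and spend the limited review time on the ones marked “invalid” or “missing”.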


I have explained the open and constrained search use cases for the legal and compliance domain. However, the same approaches also apply to other industries such as aerospace, healthcare, and information security.

More about Cognistx’s SQUARE (Scalable Question Answering and Recommendations Engine)

Team SQUARE at AI for Legal and Compliance event, Duquesne Club, Pittsburgh

SQUARE is our scalable, production-ready intelligent search system that takes user queries and provides granular results across millions of documents in just a few seconds.

Open Search Use Case (Gaming Compliance):

SQUARE powers Odds on Compliance’s PlayBook AI platform and enables its customers to ask natural language questions and get accurate granular responses for gaming compliance documents and websites across multiple states in the United States.

Analytical Objectives:

  • Auto scalable intelligent search; end-to-end platform
  • Deep-learning models tuned for the sports betting, iGaming, and gambling regulatory compliance domain
  • Daily crawling of relevant websites to keep the searchable information up to date

Constrained Search Use Case (Legal Compliance):

SQUARE powers Solvaire and enables its attorneys to avoid a fully human-dependent review process. It extracts and validates information from various lease documents so that Solvaire can conduct its operations efficiently within a constrained time.

Analytical Objectives:

  • Scalability and efficiency of the AI platform to extract information with ease
  • Periodically improve the accuracy and reliability of the process by reducing human errors (feedback loop and self-learning pipeline)

Add me on LinkedIn. Happy Learning!




Senior Data Scientist | My biggest success to date has been turning my passion into my profession (Data Science 🚀)