Are legal professionals a thing of the past? Human lawyers took on Alibaba’s AI to help us find out…
Artificial intelligence (AI) is gradually being implemented in numerous domains and applications, from autonomous vehicles and medical diagnosis to games such as chess and Go. Alibaba is playing a very active role in using AI to assist in the preparation and review of legal documents, and its approach has achieved considerable success.
To test the effectiveness of their AI system, Alibaba invited eight lawyers to review over 600 online legal agreements alongside the AI to see who was faster and, crucially, more accurate.
In the end, the AI seemed to win out over the human lawyers. What took all eight lawyers an entire week took the AI system just one second to complete. Furthermore, the AI marked up errors in the agreements with an accuracy of over 94%. This intelligent contract diagnosis system, independently developed by the engineers at Alibaba, uses AI to verify online agreements.
This study clearly shows that AI can successfully be implemented in the legal domain and that it can even improve accuracy when dealing with large volumes of legal documents when compared to professional lawyers.
This article introduces the potential applications of AI in the legal domain and what the key technologies are behind Alibaba’s unique approach.
A Quick Legal Background
The protection of consumer rights on the Internet has become a new focus of public concern. This includes problems with customer service agreements, user privacy agreements, and other online agreements signed between consumers and operators. Alibaba has a large number of business lines, and the reviewing and updating of such agreements is a huge undertaking.
Currently, a lawyer needs about 30 minutes on average to review an online agreement manually. Because of the large amount of text and the many stipulations involved, manual inspections often miss details and are not uniform. Is it possible to use AI to perform legal document reviews instead of lawyers?
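As a rough sanity check, the figures quoted in this article are consistent with one another. The sketch below assumes exactly 600 agreements and exactly 30 minutes per agreement (the article only says "over 600" and "about 30 minutes"), so the result is an estimate rather than a reported number.

```python
# Back-of-the-envelope estimate of the manual review effort.
# Assumptions: exactly 600 agreements at exactly 30 minutes each;
# the article only gives "over 600" and "about 30 minutes".
AGREEMENTS = 600
MINUTES_PER_AGREEMENT = 30
LAWYERS = 8

total_hours = AGREEMENTS * MINUTES_PER_AGREEMENT / 60
hours_per_lawyer = total_hours / LAWYERS

print(f"Total effort: {total_hours:.0f} hours")
print(f"Per lawyer:   {hours_per_lawyer:.1f} hours")
```

Under these assumptions each lawyer spends roughly 37.5 hours, which is about one working week and matches the week reported for the eight-lawyer team.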
Although using AI to overcome these issues may seem intuitive and straightforward, a system that highlights potential risks and gives recommendations still faces many challenges, including:
· Legal vs. natural language
Language tone is an important aspect of natural language processing (NLP) applications. Social networking language, for example, is colloquial, whereas legal language is much more formal and dry. Legal language usually has its own explicit terms and logic in specific areas, which sets it apart from everyday natural language. As a result, existing research results cannot be applied directly to the legal domain; models must first be adapted to the way language is used in that domain.
· Differences between technology and business scenarios
Without knowledge of the legal domain and specific laws, even an advanced NLP system would not be able to provide accurate results. Abstracting the needs of the legal domain and building them into a system is very challenging and requires close cooperation between professionals from multiple disciplines.
· Lack of annotation data
In the legal field, data is scarce and usually involves sensitive information and trade secrets, so it cannot be shared or reused across domains. Moreover, only a small amount of annotated data is available for certain scenarios.
· High accuracy requirements
The legal field places high demands on algorithm metrics. In particular, some scenarios have strict requirements on recall, because omitting key information may cause significant legal risk. In addition, many legal scenarios require the algorithm to be highly interpretable: it must report not only what the problem is but also why.
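To make the recall requirement concrete, here is a minimal sketch of how precision and recall could be computed for a clause-flagging system. The clause IDs and the function name `precision_recall` are hypothetical and only serve as an illustration.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for two collections of flagged clause IDs."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical example: clauses 1-4 are truly risky,
# while the system flags clauses 1, 2, 3, 7 and 9.
precision, recall = precision_recall(predicted={1, 2, 3, 7, 9}, actual={1, 2, 3, 4})
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

In this example the two false alarms (clauses 7 and 9) only cost reviewer time, whereas the missed clause 4 is the kind of omission that creates legal risk. That asymmetry is why recall is weighted so heavily here.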
Having identified these issues and potential barriers, Alibaba created the Intelligent Contract Diagnosis System to analyze legal agreements and reduce inconsistencies and potential legal risks. The system focuses on two tasks:
· Determine what content should not appear in the agreement (e.g., violations of laws and regulations, infringement of consumer rights, fuzzy expressions, etc.).
· Determine what content should appear in the agreement but is missing, recommend the missing wording, and give suggestions on how the document should be modified.
Intelligent Contract Diagnosis System
When setting up the system, the first step was to establish an industry thesaurus and knowledge map for the legal domain. Only by first teaching the system legal terminology, rather than general natural language, could it be trained to understand legal concepts properly. Phrase mining applies large-scale unsupervised methods to a large number of Alibaba’s online agreements, contracts, lawsuits, and other legal documents, automatically extracting relevant phrases such as “including but not limited to”, “power of attorney”, and “negligent tort”.
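The article does not say which phrase mining method Alibaba uses. As an illustration of the general idea, the sketch below ranks adjacent word pairs by pointwise mutual information (PMI), one common unsupervised signal that two words form a phrase. The function `mine_phrases`, its parameters, and the toy corpus are all assumptions.

```python
import math
from collections import Counter


def mine_phrases(sentences, min_count=2, top_k=5):
    """Rank adjacent word pairs by pointwise mutual information (PMI)."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())
    scored = []
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = count / total_bigrams
        p_w1 = unigrams[w1] / total_unigrams
        p_w2 = unigrams[w2] / total_unigrams
        scored.append((math.log(p_pair / (p_w1 * p_w2)), f"{w1} {w2}"))

    scored.sort(reverse=True)
    return [phrase for _, phrase in scored[:top_k]]


# Toy corpus, invented for illustration.
corpus = [
    "the power of attorney is valid",
    "a power of attorney was signed",
    "the agreement is valid",
    "the user signed the agreement",
]
phrases = mine_phrases(corpus)
print(phrases)
```

On this tiny corpus the pairs "power of" and "of attorney" both surface, and a real system would go on to merge overlapping pairs into longer phrases such as "power of attorney". It would also need further filtering, since frequent but uninteresting pairs such as "is valid" pass the same test.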
Meanwhile, legal experts formulated business rules based on specific business scenarios. For online agreements, for example, a list of prohibited phrases should be marked out and their corresponding recommendations made available for machine learning. For instance, the recommendation for “The announcement is issued with immediate effect” is “It will come into effect 7 days after the announcement”. The majority of the input legal rules can be technically resolved into the points on the knowledge map and converted into a vector representation that can be processed by a computer.
Vector Representation of Words
A general-purpose word vector trained on a large corpus is useful for almost all NLP tasks. Because the legal domain has its own specific features, Alibaba additionally trained word vectors on a large body of legal documentation, so that the learned vectors better reflect legal usage, as shown below.
General word vector models are mostly trained with Word2Vec or GloVe, which assign each word a single fixed vector. A context-based word vector, by contrast, is essentially the internal representation of a language model: it is a function not only of the word itself, but also of the other words in the sentence and of the word’s position in the sequence.
In the legal domain, Alibaba uses Embeddings from Language Models (ELMo) to obtain word vectors. Recent work shows that these context-based vectors can further enhance multiple NLP tasks, and they also improved the performance of Alibaba’s model.
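The difference between a fixed word vector and a context-based one can be shown with a toy example. The sketch below is not ELMo, which derives its vectors from a trained bidirectional language model. Here the "contextual" vector is simply an average over neighbouring words, and all vector values are made up for illustration.

```python
# Toy 2-d static embeddings; the values are invented for illustration.
STATIC = {
    "takes": [0.2, 0.8],
    "immediate": [1.0, 0.0],
    "effect": [0.0, 1.0],
    "family": [0.5, 0.5],
    "member": [0.9, 0.1],
}


def static_vector(tokens, i):
    """A static model returns the same vector whatever the context."""
    return STATIC[tokens[i]]


def contextual_vector(tokens, i, window=1):
    """Toy contextual vector: average the word with its neighbours."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    neighbours = [STATIC[token] for token in tokens[lo:hi]]
    return [sum(dim) / len(neighbours) for dim in zip(*neighbours)]


legal = ["takes", "immediate", "effect"]
everyday = ["immediate", "family", "member"]

# Same word, same static vector in both sentences...
print(static_vector(legal, 1) == static_vector(everyday, 0))
# ...but different contextual vectors.
print(contextual_vector(legal, 1), contextual_vector(everyday, 0))
```

The static lookup cannot tell "immediate effect" from "immediate family", while even this crude context-dependent version assigns the two occurrences different vectors.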
Besides the vector representation of words, another important aspect is the annotation data used in the AI system.
Generating Annotation Data
Annotation data is one of the most important elements of machine learning, and models can only be trained effectively if there is sufficient annotation data. In the legal domain, however, acquiring annotation data is expensive because it requires legal professionals to do the annotating.
To balance efficiency and cost, the tech team at Alibaba first built an automatic annotation service based on the rules and knowledge maps that were input by legal professionals, which can automatically annotate existing data.
In this service, keywords are used to generate annotated examples automatically, and each keyword is then replaced with similar words to produce further variants. For instance, in the expression “The announcement is issued with immediate effect”, the word “immediate” can be replaced with “instant” and other similar words, which can generate a large amount of annotation data.
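The replacement step described above can be sketched as follows. The synonym table is hypothetical: "instant" comes from the example in the text, while "notice" is an assumed extra entry, and the function name `augment` is also an assumption.

```python
from itertools import product

# Hypothetical synonym table; in practice it could be drawn from the
# industry thesaurus. "notice" is an assumed entry for illustration.
SYNONYMS = {
    "immediate": ["immediate", "instant"],
    "announcement": ["announcement", "notice"],
}


def augment(template, label):
    """Yield (sentence, label) pairs for every synonym combination."""
    tokens = template.split()
    choices = [SYNONYMS.get(token.lower(), [token]) for token in tokens]
    for combination in product(*choices):
        yield " ".join(combination), label


samples = list(augment("The announcement is issued with immediate effect", "violation"))
for sentence, label in samples:
    print(label, "|", sentence)
```

Two keywords with two alternatives each already give four labeled sentences from one seed expression, and the count grows multiplicatively as the synonym table grows.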
Although automatic annotation solves part of the problem, rule-generated data alone cannot improve the generalization ability of the model, because the rules themselves are limited. Manual annotation therefore remains a vital requirement, and active learning is an important way of making it effective.
To reduce the cost of manual annotation, active learning methods can be adopted so that common cases are handled automatically, while manual annotation is reserved for the more uncertain cases. Alibaba’s active learning model for generating annotation data is outlined above.
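The article does not state which query strategy is used to pick samples for manual labeling. One widely used option is uncertainty sampling, sketched below under that assumption. The clause IDs, probabilities, and the function `select_for_annotation` are hypothetical.

```python
def select_for_annotation(predictions, budget):
    """Pick the `budget` samples whose predicted probability is closest to 0.5."""
    ranked = sorted(predictions, key=lambda item: abs(item[1] - 0.5))
    return [sample_id for sample_id, _ in ranked[:budget]]


# Hypothetical model outputs: (clause id, predicted probability of a violation).
predictions = [("c1", 0.97), ("c2", 0.52), ("c3", 0.08), ("c4", 0.41), ("c5", 0.66)]
to_label = select_for_annotation(predictions, budget=2)
print(to_label)
```

Clauses the model is already confident about (such as "c1" and "c3") are left to automatic labeling, and the limited annotation budget goes to the clauses where a human label is likely to teach the model the most.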
Recent advances in AI and deep learning have been applied to text classification. Currently, the most common solutions are based on recurrent neural network (RNN) sequence models, convolutional neural network (CNN)-based models, and their variants, such as models that combine pre-trained word embeddings with the attention mechanism. An outline of how to integrate multiple model types in this way is illustrated below.
In the vertical legal domain, Alibaba used the ELMo model to construct word vectors, which serve as the input features of the classification model. To review online agreements, the C-GRU model was developed as a combination of CNN and RNN. This model not only captures the relationship between core and surrounding words, but also addresses the long-range dependency problem of long sentences.
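The article does not give the C-GRU architecture in detail. The sketch below shows one plausible reading: a convolution over adjacent word vectors (capturing core and surrounding words) whose outputs feed a GRU (carrying information along the sentence). The layer sizes, the ordering of the two stages, and the random untrained weights are all assumptions.

```python
import math
import random


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def conv1d_relu(sequence, kernels, width):
    """CNN stage: each kernel looks at `width` adjacent word vectors."""
    features = []
    for start in range(len(sequence) - width + 1):
        window = [x for vector in sequence[start:start + width] for x in vector]
        features.append([max(0.0, a) for a in matvec(kernels, window)])
    return features


def gru_step(x, h, params):
    """One GRU update (biases omitted for brevity)."""
    update = [sigmoid(a) for a in matvec(params["update"], x + h)]
    reset = [sigmoid(a) for a in matvec(params["reset"], x + h)]
    gated_h = [r * hi for r, hi in zip(reset, h)]
    candidate = [math.tanh(a) for a in matvec(params["candidate"], x + gated_h)]
    return [(1 - z) * hi + z * c for z, hi, c in zip(update, h, candidate)]


def c_gru_score(sequence, model):
    """Run the CNN stage, then the GRU, and return a score in (0, 1)."""
    features = conv1d_relu(sequence, model["kernels"], model["width"])
    hidden = [0.0] * model["hidden"]
    for feature in features:
        hidden = gru_step(feature, hidden, model["gru"])
    logit = sum(w * h for w, h in zip(model["output"], hidden))
    return sigmoid(logit)


def build_model(embed_dim=4, width=3, filters=5, hidden=6, seed=0):
    """Create random, untrained weights with hypothetical layer sizes."""
    rng = random.Random(seed)
    return {
        "width": width,
        "hidden": hidden,
        "kernels": random_matrix(filters, embed_dim * width, rng),
        "gru": {name: random_matrix(hidden, filters + hidden, rng)
                for name in ("update", "reset", "candidate")},
        "output": [rng.uniform(-0.5, 0.5) for _ in range(hidden)],
    }


# Seven random 4-d vectors stand in for the ELMo vectors of a 7-word clause.
rng = random.Random(1)
sentence = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(7)]
model = build_model()
score = c_gru_score(sentence, model)
print(f"violation score (untrained): {score:.3f}")
```

Because the weights are random, the printed score is meaningless; the point is only the data flow, in which local n-gram features from the convolution are summarized across the whole sentence by the recurrent stage.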
Although the deep learning model can better solve the classification problem of violation statements, it still remains difficult to interpret. The intelligent review of an online agreement needs to be able to identify irregular clauses, locate the specific words that cause violations, and find out how to change the terms as recommended.
Therefore, Alibaba uses the deep learning model as a first, high-recall stage that detects all clauses that may cause violations, as shown in the accompanying figure. The flagged clauses are then parsed through syntactic analysis and rules, the specific position of each violation is located, and a recommendation is made. This approach has two advantages: deep learning increases the recall rate, and rules provide precise positioning.
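The two-stage idea can be sketched as follows. This is a simplified illustration rather than Alibaba's implementation: the classifier is a stand-in function, the second stage uses a regular expression instead of full syntactic analysis, and the rule table, threshold, and function names are assumptions. The single rule mirrors the example given earlier in this article.

```python
import re

# Hypothetical rule table: (pattern, recommendation).
RULES = [
    (re.compile(r"with immediate effect", re.IGNORECASE),
     "It will come into effect 7 days after the announcement"),
]


def classifier_score(clause):
    """Stand-in for the trained model; returns a fake violation probability."""
    return 0.9 if "effect" in clause.lower() else 0.1


def diagnose(clauses, threshold=0.3):
    """Stage 1 keeps anything suspicious (low threshold favours recall);
    stage 2 uses rules to pinpoint the offending span."""
    findings = []
    for clause in clauses:
        if classifier_score(clause) < threshold:
            continue
        for pattern, recommendation in RULES:
            match = pattern.search(clause)
            if match:
                findings.append({
                    "clause": clause,
                    "span": match.span(),
                    "text": match.group(0),
                    "recommendation": recommendation,
                })
    return findings


agreement = [
    "The announcement is issued with immediate effect",
    "Users may contact customer service at any time",
]
for finding in diagnose(agreement):
    print(finding["span"], finding["text"], "->", finding["recommendation"])
```

Keeping the threshold low in the first stage trades some precision for recall, and the rule stage then supplies what the classifier alone cannot: the exact character span and a concrete replacement.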
The online agreement AI diagnostic system greatly improves the efficiency of reviewing agreements while maintaining a high level of accuracy. The average accuracy rate is over 94%, and the time saved is equivalent to 130 working days per year.
In recent years, AI technology, deep learning, and natural language processing have made tremendous progress and have begun to be applied in the field of legal intelligence, attracting widespread attention from both the academic and industrial communities. Intelligent contract diagnosis is Alibaba’s first step towards an intelligent legal domain.
In terms of technology, Alibaba’s Machine Intelligence Technology (MIT) will be used to strengthen research into industry knowledge mapping, machine reading comprehension, and information extraction as applied to the legal domain. By accumulating basic data resources in the legal field and building natural language processing platforms, this work could then serve a diverse range of legal activities.
In addition to natural language processing, Alibaba will also invest more in audio and visual technologies, such as image recognition, optical character recognition (OCR), handwritten character recognition, and automatic speech recognition (ASR).
These could handle different types of legal materials and solve the problem of multi-source information input upstream of natural language processing. Alibaba’s ultimate goal is to build end-to-end, full-capability legal AI that serves the needs of general users, lawyers, courts, and other legal practitioners.
For a full technical description of the Intelligent Contract Diagnosis technology, please follow this link.
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer. Deep Contextualized Word Representations. ICLR’18.
Burr Settles. Active Learning Literature Survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2010.
(Original article by Liu Min刘敏 and Tian Ou田欧)