Processing and analyzing documents is often a tedious and labor-intensive process for many enterprises. The goal of Euclid-Annotate is to provide an easy tool that automatically annotates important parts of the documents with domain-specific labels, accelerating the process of extracting relevant knowledge from large sets of text.
One of the largest components of Euclid-Annotate is Named-Entity Recognition (NER): the ability to identify specific aspects of a larger body of text, including but not limited to named entities (real-world objects such as people, places, and organizations), specific vocabulary, and various actions. NER is the central capability of Euclid-Annotate and allows the user to label the entities in a document that they deem relevant or useful to their ultimate goal. For our model, we chose a large public library of legal case files, since legal cases contain a great deal of domain-specific terminology that we hoped to identify. For our particular application, we use NER to label the judge, attorney, plaintiff, defendant, and so on. However, the ultimate goal of Euclid-Annotate is to function across a broad spectrum of data and academic fields. While SpaCy already has a built-in NER function, we felt it was too generic and inflexible for our purposes. Instead, we opted to train our own NER model using other SpaCy functionalities and supervised machine learning techniques. To do this, we worked with a database of public case files from the state of Illinois.
The process we took to build our models can be broken down into three steps: pre-labeling, manual correction, and training. Our pre-labeling functions consisted of looping through the entire case file folder, splitting each file into a list of words, and running every word through SpaCy’s original NER function. The result was very rudimentary; only a fraction of the critical entities in each document were labeled, because that was the extent of SpaCy’s default list of entity labels. We followed this by manually labeling over 100 individual case files with a new list of labels, including not only defaults such as “PERSON” and “GPE” but also more document-specific labels such as “DEFENDANT,” “CRIME,” and “VERDICT.” Finally, we wrote a brief script to train our SpaCy model with our manually labeled files, adding new labels to the pipe if they did not already exist in the SpaCy NER library. Our biggest challenge was accurately labeling cases and laws, which were often very long (e.g., “Teutonia Insurance Co. v. Mueller, 77 Ill. 22”) and difficult for our model to detect. This transfer learning process of training and retraining models (all starting from SpaCy’s default English models) allowed us, with little data, to rapidly train performant domain-specific NER models. Below is a visualization of the learning process for language models.
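A minimal sketch of the training step, using spaCy 3’s training API on a blank English pipeline (the project instead started from spaCy’s pretrained English models; the data, labels, and iteration count below are illustrative toys):

```python
import random

import spacy
from spacy.training import Example

# Toy training data in spaCy's (text, annotations) format; the real
# project used over 100 manually corrected case files.
TRAIN_DATA = [
    ("John Smith was found guilty of burglary.",
     {"entities": [(0, 10, "DEFENDANT"), (31, 39, "CRIME")]}),
    ("Jane Doe was acquitted of fraud.",
     {"entities": [(0, 8, "DEFENDANT"), (26, 31, "CRIME")]}),
]

# Blank pipeline for illustration; transfer learning would instead start
# from a pretrained model, e.g. spacy.load("en_core_web_sm").
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")

# Register the custom, document-specific labels before initializing.
for _, ann in TRAIN_DATA:
    for _, _, label in ann["entities"]:
        ner.add_label(label)

optimizer = nlp.initialize()
for _ in range(20):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, ann in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), ann)
        nlp.update([example], sgd=optimizer, losses=losses)
```

With real data, each correction pass produces more `(text, annotations)` pairs, and the loop above is simply re-run from the previous checkpoint.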
Because the documents in different projects will span a range of domains, they will also require different NER models attuned to their individual topics. One naive way to do this would be to load a default English NER model (such as the models provided by SpaCy) and train it on some manually annotated files in the project folder. However, it is more efficient to start from a relevant model that has already been trained. The service could automatically label entities for users, who then simply have to correct it or add more labels that they deem important for their purposes. Therefore, we decided to add the functionality to automatically recognize which pre-trained NER model is most relevant to a given project and then suggest it to the user. Behind the scenes, this requires the ability to classify documents by domain or use. This was the primary motivation for a document classification model. In addition, we envisioned adding a tool in our application that allows users to train the service to automatically separate their documents into categories, given some training data.
Document classification in natural language processing requires several steps. Any classification scheme in machine learning requires the input data to be a collection of numerical vectors. This presents the largest challenge in the classification problem: finding a way to represent a document, or a collection of text, as a list of numbers. This feature vector, as it is called, needs to convey both the content of the text and its characteristics. It can then be passed into a classifier model. The document classification process can be visualized as follows:
In addition to the steps in the image above, a vital part of the pipeline is pre-processing the input. To do this, we converted our documents into strings and then: tokenized the text, ran a script to remove stop-words (uninformative fillers such as “the” and “and”), stemmed the words, and removed words that occurred either too rarely (in only a couple of documents) or too often (in more than 70% of the documents in a given corpus). For these steps, we used the NLTK and Gensim packages.
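As a rough illustration, the pre-processing steps can be sketched in plain Python (the project used NLTK and Gensim; the stop-word list and trailing-“s” stemmer below are toy stand-ins for NLTK’s versions):

```python
import re
from collections import Counter

# Toy stand-in for NLTK's English stop-word list.
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "in", "was", "is"}

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())            # tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]     # drop stop-words
    # Crude stemming stand-in; NLTK's PorterStemmer does this properly.
    return [t[:-1] if t.endswith("s") else t for t in tokens]

def filter_by_doc_freq(docs, min_docs=2, max_frac=0.7):
    """Drop words that occur in too few or too many documents."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))            # document frequency, not term frequency
    n = len(docs)
    keep = {w for w, c in df.items() if c >= min_docs and c / n <= max_frac}
    return [[w for w in doc if w in keep] for doc in docs]
```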
In our project, we explored several “feature extractors,” or methods of converting text to vectors. The simplest one, bag-of-words, disregards the grammatical structure of the text and the ordering of the words. Each number in the resulting vector is associated with a certain word in the corpus (collection) of documents, and the value corresponds to the frequency with which that word occurs in the document. Term Frequency-Inverse Document Frequency (TF-IDF) uses a similar idea to weight words by how frequently they occur. The formula below, where w(i,j) represents the TF-IDF score for word i in document j, shows how the values in the feature vector are calculated:
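In its standard form, the score multiplies the term frequency by the log of the inverse document frequency:

```latex
w_{i,j} = \mathrm{tf}_{i,j} \times \log\left(\frac{N}{\mathrm{df}_i}\right)
```

where tf(i,j) is the number of times word i occurs in document j, df(i) is the number of documents containing word i, and N is the total number of documents in the corpus. A word that appears in nearly every document thus gets a weight near zero, regardless of how often it occurs.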
In addition, SpaCy has built-in document-embedding functionality (in the spirit of doc2vec): the user passes in a string of text, and it is converted to a numerical embedding that conserves some of its semantic meaning. SpaCy’s individual word embeddings (word2vec) can also be used to create document embeddings by averaging all of the word vectors or by aggregating them using TF-IDF weights. Another method is to look at n-grams, or n-sized sets of consecutive words in the text, vectorize each n-gram, and aggregate the results. Finally, we tried Latent Dirichlet Allocation, a topic modeling algorithm that uses unsupervised learning to split the texts into different “topics,” or clusters, and then uses the scores of how well a document fits each topic as a feature vector.
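The averaging approach can be sketched with NumPy; `word_vectors` and `tfidf_weights` below are hypothetical inputs (in practice they would come from a trained embedding model and a fitted TF-IDF model):

```python
import numpy as np

def doc_embedding(tokens, word_vectors, tfidf_weights=None):
    """Average the word vectors of known tokens, optionally TF-IDF-weighted."""
    vecs, weights = [], []
    for t in tokens:
        if t in word_vectors:  # skip out-of-vocabulary tokens
            vecs.append(word_vectors[t])
            weights.append(1.0 if tfidf_weights is None
                           else tfidf_weights.get(t, 0.0))
    if not vecs:
        # No known tokens: fall back to a zero vector of the right size.
        return np.zeros(next(iter(word_vectors.values())).shape)
    return np.average(np.array(vecs), axis=0, weights=weights)
```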
Reading about each method and then testing them, we concluded that one of the simpler ones, TF-IDF, carried the most semantic meaning about the original text for our purposes. The problem was that the resulting vectors were large and sparse: since there is a value for each word in the whole corpus, a feature vector can have tens of thousands of entries, many of which are zeroes. This is true even though we removed a majority of the words in our pre-processing steps. To reduce the dimensions of these vectors, and in turn condense the feature vectors, we explored Principal Component Analysis (PCA) and Singular Value Decomposition (SVD). These dimensionality reduction techniques yielded mediocre results in our classifiers, so we decided to stick with the original sparse vectors as input.
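For reference, such a reduction step might look like the following with scikit-learn (the post does not name the library used; the corpus here is a toy example). TruncatedSVD is the usual choice for TF-IDF matrices because, unlike plain PCA, it operates on sparse input directly:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the judge ruled the defendant guilty of burglary",
    "the defendant appealed the verdict to a higher court",
    "the insurance company disputed the claim in court",
    "the attorney filed a motion on behalf of the plaintiff",
]

# Sparse TF-IDF matrix: one column per vocabulary word.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

# Project each document onto a small number of latent dimensions.
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X)
print(X.shape, "->", X_reduced.shape)
```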
Eventually, we found that the best models for our classifiers were Naive Bayes and Support Vector Machines (SVM). This is mainly because Naive Bayes and SVM usually perform best with small training sets and we assume many of the datasets that this tool will be deployed on will be small (a relative term). After extensive testing and hyperparameter tuning, we settled on using a Multinomial Naive Bayes classification model. Multinomial Naive Bayes uses a multinomial distribution as the prior for the feature vectors, in contrast to Gaussian Naive Bayes, which uses a multivariate normal distribution.
A potential issue with our classification model is that, since we depend on input data from different users, the number of training documents per class could vary widely, creating an imbalanced dataset. This is dangerous because it would lead the model to over-predict the more common classes. Common ways to solve this include over- and under-sampling; we implemented a custom synthetic minority over-sampling technique (SMOTE) to ensure a reasonably balanced dataset.
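The core of SMOTE can be sketched in a few lines of NumPy (the function name and parameters below are illustrative, not the project’s actual implementation): each synthetic sample is placed on the line segment between a minority sample and one of its nearest minority-class neighbors.

```python
import numpy as np

def smote(X_min, n_synthetic, k=3, rng=None):
    """Generate synthetic minority samples by interpolating toward
    randomly chosen nearest neighbors (basic SMOTE)."""
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    # Pairwise distances among minority samples; exclude self-matches.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :min(k, n - 1)]
    out = []
    for _ in range(n_synthetic):
        i = rng.integers(n)                                  # random minority sample
        j = neighbours[i][rng.integers(neighbours.shape[1])] # random near neighbor
        gap = rng.random()                                   # position along the segment
        out.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(out)
```

Because every synthetic point is a convex combination of two real minority samples, the oversampled class stays inside the region the minority data already occupies.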
Therefore, for an input text document, the classification pipeline is as follows:
- Convert the document to a string by extracting the text
- Use the NLTK package to tokenize the text (split into words), throw out uninformative words, stem the words, and then lemmatize them.
- Use a Term Frequency-Inverse Document Frequency (TF-IDF) model to create a feature vector from the list of pre-processed words
- Pass the feature vector through our Multinomial Naive Bayes model
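Put together, the pipeline above can be sketched with scikit-learn (the corpus and labels below are toy examples; the real pre-processing uses NLTK as described, whereas TfidfVectorizer here handles only tokenization and stop-word removal):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labelled corpus; real training data comes from the user's project folder.
train_texts = [
    "the judge sentenced the defendant for burglary",
    "the attorney argued before the court",
    "the patient was diagnosed with influenza",
    "the doctor prescribed medication for the patient",
]
train_labels = ["legal", "legal", "medical", "medical"]

# TF-IDF feature extraction feeding a Multinomial Naive Bayes classifier.
clf = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    MultinomialNB(),
)
clf.fit(train_texts, train_labels)

print(clf.predict(["the court found the defendant guilty"]))  # -> ['legal']
```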
We, of course, hope that after our internship ends, the tool we have spent the summer on ends up being used at some point and that someone or some organization is able to benefit from our efforts. The four of us came in with different experiences and qualifications, but we can all say without a doubt that this was a tremendous learning experience.
This blog was written by several interns in The Hive’s 2019 Summer Internship Program. To find out more about The Hive’s internship program please contact firstname.lastname@example.org.