In the field of natural language processing (NLP), much recent groundbreaking work has been done with massive language models that exhibit a high degree of understanding of the language tasks they are deployed in. These models are in use across many types of technology solutions: the vast majority of technologies that work with language meaning or interpretation can benefit from a model like this.
The success of these models has revolutionized the field and played a huge part in the development of NLP-enabled tech solutions. One of these models, BERT, was developed by Google and was one of the first to truly redefine the field. The architecture and training tasks it relies on are what make it uniquely effective.
The BERT model is built on the Transformer architecture and trained on two tasks: masked language modeling and next-sentence prediction. Masked language modeling, in short, inserts blanks into sentences, and BERT is tasked with filling the blanks correctly — adjusting the weights of the Transformer network to maximize correct answers. Next-sentence prediction asks BERT to predict whether one sentence occurs directly after another given sentence.
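As a rough illustration of how masked language modeling examples are constructed, here is a toy sketch. The 15% masking rate follows the BERT paper, but the whitespace tokenizer and the simple mask-only rule are simplifications — real BERT uses WordPiece subword tokens and sometimes swaps or keeps a token instead of masking it:

```python
import random

MASK = "[MASK]"

def make_mlm_example(sentence, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with [MASK]; the model's
    training objective is to recover the original tokens."""
    rng = random.Random(seed)
    tokens = sentence.split()  # toy whitespace tokenizer, not WordPiece
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)   # the model must predict this token
        else:
            inputs.append(tok)
            labels.append(None)  # unmasked positions are ignored in the loss
    return inputs, labels

inputs, labels = make_mlm_example(
    "the model fills in the blanks to learn contextual representations",
    mask_prob=0.3, seed=1)
print(" ".join(inputs))
```

During pre-training, BERT sees millions of examples like this and adjusts its weights so that its prediction at each masked position matches the hidden original token.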
Both tasks are complex to solve: there are roughly 30,000 tokens in BERT's WordPiece vocabulary that could possibly fill in a blank, and a great deal of information, much of it contextual, determines whether one sentence follows another. To allow the model to learn enough about a given language, it is given a huge number of parameters to adjust, roughly 340 million for BERT-large, the larger of the two original models.
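A back-of-the-envelope count shows where those parameters live, using BERT-large's published hyperparameters (24 layers, hidden size 1024, feed-forward size 4096, ~30K WordPiece vocabulary). Bias and layer-norm terms are omitted for simplicity, so the total lands slightly under the headline figure:

```python
# Back-of-the-envelope parameter count for BERT-large
# (hyperparameters from the BERT paper; biases and layer-norms omitted).
vocab_size = 30522      # WordPiece vocabulary
max_positions = 512     # maximum sequence length
hidden = 1024           # hidden size
ffn = 4096              # feed-forward inner size
layers = 24             # Transformer layers

embeddings = (vocab_size + max_positions + 2) * hidden  # token + position + segment
attention_per_layer = 4 * hidden * hidden               # Q, K, V, and output projections
ffn_per_layer = 2 * hidden * ffn                        # up- and down-projections
total = embeddings + layers * (attention_per_layer + ffn_per_layer)
print(f"{total / 1e6:.0f}M parameters")  # → 334M parameters
```

Most of the weight budget sits in the stacked Transformer layers rather than the embeddings, which is why deeper, wider models grow so quickly.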
On top of the sheer size of the model, the training data needs to be a comprehensive representation of the English language, allowing BERT to learn contextual information about as many word tokens and combinations of tokens as possible. This training data is typically a huge dataset; for the original pre-trained BERT models it consists of the entirety of English-language Wikipedia (2.5 billion words) and the BooksCorpus dataset (800 million words).
The original model has general knowledge of the English language, including a robust approximation of the meanings of words and sentences, that is well-suited for use by other machine learning models and systems. It is highly successful at many tasks — but there are ways to make a BERT model even better. An intuitive way to improve a BERT model for a specific task is to increase its exposure during training to a certain kind of data, so it learns to produce more robust representations for that type. For example, someone training a BERT model specifically for scientific papers might include large corpora of papers in the training set, or perhaps even train exclusively on papers. Retraining can be computationally expensive, but the benefits are more often than not worth the cost.
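A minimal sketch of what weighting the training mix toward a domain might look like. The corpora and sampling ratio here are invented for illustration; in practice the sampled documents would then be fed through BERT's masked language modeling objective:

```python
import random

def build_training_mix(general_docs, domain_docs, domain_fraction=0.7, n=10, seed=0):
    """Sample a training set that over-represents domain documents,
    so the model sees more of the specialized language."""
    rng = random.Random(seed)
    mix = []
    for _ in range(n):
        # With probability domain_fraction, draw from the domain corpus.
        pool = domain_docs if rng.random() < domain_fraction else general_docs
        mix.append(rng.choice(pool))
    return mix

# Hypothetical corpora: a general one and a scientific-paper one.
general = ["wiki article on rivers", "novel excerpt", "news story"]
papers = ["abstract on gene editing", "methods section on catalysts"]
mix = build_training_mix(general, papers, domain_fraction=0.7, n=8, seed=3)
print(sum(doc in papers for doc in mix), "of", len(mix), "sampled docs are papers")
```

Turning `domain_fraction` up to 1.0 corresponds to training exclusively on papers, the more aggressive option mentioned above.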
At Resultid, we use a version of BERT that we have specialized and tested on our proprietary dataset, ensuring that the results our platform and services generate are as robust as possible for the specialized language of documents about cutting-edge technology. We process data about companies, grant filings, patents, and more — some of it incredibly specific and technical — so it is vital that our model be well-suited to that kind of information.
BERT is only the beginning, however — the first truly successful massive pre-trained language model, it has cemented a place for such models in the industry. As innovators and researchers design newer, bigger models, we look forward to seeing what the next big advance will be.
BERT Paper (Devlin et al.): https://arxiv.org/abs/1810.04805