nerman: Named Entity Recognition System Built on AllenNLP and Optuna
This article is a translation of the Japanese blog post authored by Makoto Hiramatsu of Cookpad Inc.
In this article, I introduce nerman, a named entity recognition system for recipe texts. The name nerman is derived from ner (named entity recognition) + man. It is a system that automatically extracts cooking terminology from recipes posted on Cookpad, built with a combination of AllenNLP and Optuna. (Some parts of the actual code are simplified, as it would be difficult to explain everything in full.)
Automatic extraction of cooking terminology
A wide variety of cooking terms appears in recipes. Not only ingredients and cookware but also cooking actions and ingredient quantities can be considered cooking terms. Take the cooking action of “cutting”: there are many ways of cutting, such as “cutting into chunks”, “slicing into rounds”, and “chopping into fine pieces”, depending on the purpose. If we can extract these culinary terms from recipes, we can apply them to tasks such as information extraction from recipes and question answering.
In this case, we use machine learning for the automatic extraction of cooking terms. Among the many tasks in natural language processing, there is one called named entity recognition.
Named entity recognition is the task of extracting named entities, such as the names of people, places, and organizations, from natural language text (the target documents are often newspapers and the like).
This task can be formulated as a problem called sequence labeling. In named entity recognition via sequence labeling, input sentences are split into words, and each word is tagged with a named entity tag. Extracting the tagged word sequences yields the named entities.
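As a concrete illustration of sequence labeling with BIO-style tags, the step of extracting tagged word sequences can be sketched in Python as follows (the tag names and the example sentence here are my own illustrations, not the actual tag set of the corpus described later):

```python
def extract_entities(tokens, tags):
    """Collect (entity_text, label) spans from BIO-tagged tokens."""
    entities = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):  # beginning of a new entity
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_label:
            current_tokens.append(token)  # continuation of the current entity
        else:  # "O" tag or an inconsistent continuation
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:  # flush the last entity, if any
        entities.append((" ".join(current_tokens), current_label))
    return entities

# Hypothetical example: "Chop the onion finely"
tokens = ["Chop", "the", "onion", "finely"]
tags = ["B-Ac", "O", "B-F", "O"]  # Ac = action, F = food (labels are assumptions)
print(extract_entities(tokens, tags))  # → [('Chop', 'Ac'), ('onion', 'F')]
```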
This time, instead of person names and place names, we train models that treat the names of ingredients, cooking utensils, and cooking actions as named entities. Detailed definitions of the named entity tags are described in the next chapter.
Training machine learning models requires training data. At Cookpad, we worked with experts to develop annotation guidelines and build an annotated corpus.
Extraction of named entities from recipes is also being researched at the Mori Laboratory of Kyoto University (see the paper; a PDF file will open). With reference to the named entity tags defined in that study, we defined the following named entity tags as extraction targets, in accordance with actual use cases at Cookpad.
Based on this definition, we annotated 500 recipes posted on Cookpad with named entities. This dataset is named the Cookpad Parsed Corpus and is managed in an internal GitHub repository. Additionally, data that was pre-processed (e.g., reformatted) for use in the machine learning model has been uploaded to S3.
We are currently compiling a paper on the Cookpad Parsed Corpus. The paper has been accepted at the Linguistic Annotation Workshop (LAW), a workshop on language resources held at the 28th International Conference on Computational Linguistics (COLING), an international conference on natural language processing. 🎊
The title of our paper is as follows:
Cookpad Parsed Corpus: Linguistic Annotations of Japanese Recipes
Jun Harashima and Makoto Hiramatsu
The recipes contained in the Cookpad Parsed Corpus are annotated with morphemes and dependencies as well as named entities, and we are preparing to make the corpus available to researchers affiliated with universities and research institutions.
Preparation: Named Entity Recognition Model Using AllenNLP
In nerman, the models are implemented using AllenNLP.
AllenNLP is a framework for NLP developed by the Allen Institute for Artificial Intelligence (AllenAI) that offers a convenient library for building neural networks for NLP based on the latest machine learning methods. You can install AllenNLP with:
> pip install allennlp
In AllenNLP, model definitions and training settings are described in Jsonnet-format files. The following is the configuration file (config.jsonnet) used to train the named entity recognition model (here, a BiLSTM-CRF model is adopted).
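The original configuration is not reproduced here, but a minimal sketch of what such a BiLSTM-CRF configuration might look like is shown below. The dataset reader name `cookpad2020` comes from this article; the `crf_tagger` model type is AllenNLP's CRF sequence tagger, while the dimensions, bucket names, and other values are placeholders of my own:

```jsonnet
{
  dataset_reader: { type: 'cookpad2020' },
  // S3 URLs for the pre-processed Cookpad Parsed Corpus (paths are placeholders)
  train_data_path: 's3://<bucket>/cpc/train.conll',
  validation_data_path: 's3://<bucket>/cpc/valid.conll',
  model: {
    type: 'crf_tagger',  // BiLSTM-CRF: a CRF layer on top of a BiLSTM encoder
    text_field_embedder: {
      token_embedders: { tokens: { type: 'embedding', embedding_dim: 128 } },
    },
    encoder: { type: 'lstm', input_size: 128, hidden_size: 200, bidirectional: true },
  },
  data_loader: { batch_size: 32, shuffle: true },
  trainer: { optimizer: { type: 'adam', lr: 0.001 }, num_epochs: 20 },
}
```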
The model, data, and training settings are specified respectively. As the path to a dataset, AllenNLP accepts a URL as well as a local file path; currently, it appears to support the http, https, and s3 schemes (the code that handles this is somewhere around here). In nerman, the training and validation data of the Cookpad Parsed Corpus are placed on S3 (after being pre-processed for use in nerman), and their S3 URLs are specified in train_data_path and validation_data_path.
AllenNLP provides the components required for loading datasets of well-known NLP tasks. However, if you want to use a dataset you built yourself, as in this case, you need to write a class (a dataset reader) that parses the dataset on your own. `cookpad2020` is the dataset reader for loading the Cookpad Parsed Corpus. The official tutorial explains how to create a dataset reader, so please refer to it for details.
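The actual reader is not shown in this article, but the parsing at the core of such a reader might look like the following pure-Python sketch. The CoNLL-style "token TAB tag" format is an assumption on my part (the pre-processed format of the Cookpad Parsed Corpus is not public); a dataset reader's `_read` method would wrap logic like this and convert each sentence into an AllenNLP `Instance`:

```python
def read_conll(lines):
    """Parse CoNLL-style "token<TAB>tag" lines into (tokens, tags) pairs.

    Blank lines separate sentences.  The format here is an assumption used
    for illustration only.
    """
    tokens, tags = [], []
    for line in lines:
        line = line.strip()
        if not line:  # blank line = sentence boundary
            if tokens:
                yield tokens, tags
                tokens, tags = [], []
        else:
            token, tag = line.split("\t")
            tokens.append(token)
            tags.append(tag)
    if tokens:  # the last sentence may lack a trailing blank line
        yield tokens, tags

sentences = list(read_conll(["にんじん\tB-F", "を\tO", "切る\tB-Ac", ""]))
print(sentences)  # → [(['にんじん', 'を', '切る'], ['B-F', 'O', 'B-Ac'])]
```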
After the configuration file is created, training starts by executing a command such as `allennlp train config.jsonnet --serialization-dir result`. All the information required for training is packaged in the configuration file, and this ease of experiment management is one of AllenNLP's strengths. I will explain `--serialization-dir` later in this article.
The `allennlp` command also includes very useful subcommands such as `allennlp predict` and `allennlp evaluate`, which I will not cover in this article. For more details, please refer to the official documentation.
The entire structure of nerman
The following chart indicates an overview of nerman:
The system consists of three major batches and the roles of the batches are as follows:
(1) Hyperparameter optimization
(2) Training models
(3) Named entity recognition (prediction) from real data (recipes)
In this article, I will explain them in the following order: training models => prediction on real data => hyperparameter optimization.
The training batch executes a shell script like the following:
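The actual script is not reproduced here. Based on the description below (training followed by uploading the model archive and metrics to S3), a minimal sketch might look like this, with the bucket name and key prefix as placeholders:

```shell
#!/bin/bash
set -eu

# Train the NER model; all settings live in config.jsonnet.
allennlp train config.jsonnet --serialization-dir result

# Upload the model archive and the metrics file for tracking.
aws s3 cp result/model.tar.gz s3://<bucket>/nerman/model/model.tar.gz
aws s3 cp result/metrics.json s3://<bucket>/nerman/model/metrics.json
```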
As described in the preparation chapter, we use the allennlp train command to train models. The directory specified by --serialization-dir stores the model archive (in tar.gz format), which contains the model weights, as well as data such as the standard output, standard error output, and metrics of the trained model.
Once training is done, the model archive generated by the `allennlp train` command and the metrics are uploaded to S3. With the archive file, you can instantly restore the trained model; uploading the metrics file at the same time makes it possible to track the performance of each model.
The training job is executed on EC2 instances. In this case, the dataset is relatively small (500 recipes in total) and the BiLSTM-CRF network is not very large, so training can be performed on instances of roughly the same size as those used for normal batch jobs. As the execution environment does not require resources such as GPUs or large amounts of memory, we could follow the normal batch development flow. This enabled us to build the training batch while reducing infrastructure development costs by drawing on our accumulated in-house knowledge of batch operations.
All nerman batches are built on the assumption that they run on spot instances. Spot instances cost less than on-demand instances, but in return they may be forcibly shut down during execution (a spot interruption). Since model training can simply be retried even if it is forcibly terminated, you can reduce costs by using spot instances as long as the training time is not too long.
(However, the longer the training takes, the more likely you are to encounter spot interruptions. Note that the total execution time, including retries, may end up costing more than running on a normal instance.)
Predictions on real data
We run a shell script like the following to perform the prediction:
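The actual script is not reproduced here. A sketch consistent with the description below (parallel partitions driven by kuroko2's environment variables, with results uploaded to S3) might look like this; the variable names come from the article, while the values, flags, and paths are placeholders:

```shell
#!/bin/bash
set -eu

# Variables controlling the partitioning (values are placeholders).
export LAST_RECIPE_IDX=3400000
# kuroko2 sets KUROKO2_PARALLEL_FORK_SIZE / KUROKO2_PARALLEL_FORK_INDEX
# for each forked process.

# Run our custom AllenNLP subcommand on this process's partition,
# then upload the resulting CSV.
allennlp custom-predict ...
aws s3 cp "result/prediction.${KUROKO2_PARALLEL_FORK_INDEX}.csv" \
  s3://<bucket>/nerman/prediction/
```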
The prediction batch loads the model generated by the training batch and analyzes recipes to which named entities have not yet been assigned. The prediction batch also supports parallel execution. Over 3.4 million recipes have been posted on Cookpad, and it is not easy to analyze all of them in one go. For this reason, we divide the recipes into multiple groups and analyze each group in parallel.
We specify the target recipes for analysis with LAST_RECIPE_IDX and set the degree of parallelism with the environment variable KUROKO2_PARALLEL_FORK_SIZE. Each process executed in parallel is given the variable KUROKO2_PARALLEL_FORK_INDEX, which tells the process which of the parallel partitions it is responsible for. Parallelization itself is achieved with parallel_fork, the parallel execution feature of kuroko2, the job management system used in our company.
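The way each process determines its share of recipes might be sketched as follows. The helper function and the even-split strategy are my own assumptions for illustration, not Cookpad's actual implementation:

```python
import os

def my_partition(last_idx: int, fork_size: int, fork_index: int) -> range:
    """Return the range of recipe indices this process should analyze.

    Recipes 0..last_idx-1 are split into `fork_size` near-equal chunks and
    chunk number `fork_index` is assigned to this process.
    """
    chunk = -(-last_idx // fork_size)  # ceiling division
    start = fork_index * chunk
    return range(start, min(start + chunk, last_idx))

# Each forked process reads its slice parameters from the environment
# (kuroko2 sets the FORK_* variables; the defaults here are for local runs).
fork_size = int(os.environ.get("KUROKO2_PARALLEL_FORK_SIZE", "1"))
fork_index = int(os.environ.get("KUROKO2_PARALLEL_FORK_INDEX", "0"))
print(my_partition(10, 4, 1))  # → range(3, 6)
```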
The custom-predict command divides the target recipes using the variables defined above and extracts named entities with the AllenNLP model. AllenNLP allows you to register your own subcommands, so all the processing can be executed from the allennlp command. You can define a subcommand by writing a Python script (custom_predict.py) like the following:
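The article's script is not reproduced here; a minimal skeleton using AllenNLP's `Subcommand` registration API might look like the sketch below. The argument names and the body of the handler are my assumptions:

```python
import argparse

from allennlp.commands.subcommand import Subcommand


@Subcommand.register("custom-predict")
class CustomPredict(Subcommand):
    def add_subparser(self, parser: argparse._SubParsersAction) -> argparse.ArgumentParser:
        subparser = parser.add_parser(self.name, description="Predict named entities on recipes.")
        subparser.add_argument("--model-path", type=str, help="S3 or local path to model.tar.gz")
        subparser.set_defaults(func=run_prediction)
        return subparser


def run_prediction(args: argparse.Namespace) -> None:
    # Load the model archive and run prediction over the assigned recipes.
    ...
```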
model_path specifies the path to the model archive file. This path is handed to load_archive, a method provided by AllenNLP that lets you easily restore a saved trained model. load_archive supports the s3 scheme as well as local file paths, so you can use the same path that you specified as the upload destination in the training batch (the official document for load_archive is here).
To feed strings into the model, we use a mechanism called Predictor implemented in AllenNLP (the official documentation is here). We define the following KonohaSentenceTaggerPredictor class, which inherits from SentenceTaggerPredictor, a convenient class for handling the prediction results of sequence labeling models. If you pass a string you want to analyze to its predict method, it outputs the model's prediction results.
In nerman, we use konoha, a Japanese language processing tool, to process the Japanese recipe data. KonohaTokenizer is the AllenNLP integration provided by Konoha: it receives a Japanese string, performs word segmentation or morphological analysis, and outputs a sequence of AllenNLP tokens. We adopt MeCab as the morphological analyzer and mecab-ipadic as the dictionary.
Next, import the created module in __init__.py. In this example, we have placed custom_predict.py in the directory nerman/commands, so we edit nerman/commands/__init__.py as follows:
Once the command is defined and imported, create a file named .allennlp_plugins in the repository root so that the allennlp command actually recognizes the subcommand.
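For reference, .allennlp_plugins simply lists the names of the packages AllenNLP should import at start-up, one per line; for this example it would contain just the package name:

```
nerman
```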
Following the above operation, the subcommand becomes ready to be executed via the allennlp command. You can confirm that the command you created is available by executing allennlp --help (or by using the allennlp test-install command).
The obtained prediction results are saved in a CSV file and uploaded to S3 once the prediction is complete.
Next, the prediction results uploaded to S3 are loaded into a database. The data is eventually placed on Amazon Redshift (hereafter, Redshift), but our architecture routes it through Amazon Aurora (hereafter, Aurora) in order to use Aurora's LOAD DATA FROM S3 statement. The LOAD DATA FROM S3 statement can be used in SQL queries such as the following:
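The article's original query is not reproduced here; based on Aurora MySQL's documented syntax, such a query might look like the following, with the table name, column names, and S3 path as placeholders:

```sql
LOAD DATA FROM S3 's3://<bucket>/nerman/prediction/prediction.csv'
INTO TABLE named_entities
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(recipe_id, name, category);
```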
By executing this query, you can import a CSV file uploaded to S3 directly into Aurora. For details about LOAD DATA FROM S3, please refer to the official AWS documentation. This is very useful when loading large-scale data into a database, as it eliminates the effort of tuning batch sizes and commit timing.
The prediction results loaded into Aurora are periodically imported into Redshift via an in-house system called pipelined-migrator. With pipelined-migrator, you can move data from Aurora to Redshift with only a few steps of configuration on its administration screen. Combining loading from S3 with pipelined-migrator thus gives us a low-effort data ingestion flow.
An alternative way to provide staff with the analysis results would have been a prediction API rather than a database. However, the goal of this task is to automatically extract cooking terms from recipes that have already been posted, which can be computed in advance by batch processing. For this reason, we adopted the policy of performing predictions in batch, without running an API server.
We also wanted our non-engineering staff to be able to try the prediction results. Since most Cookpad staff members, not just engineers, can write SQL, storing the prediction results in a queryable form in the database was a cost-efficient option. The following is an example of a query that uses the prediction results:
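The original query is not shown here; assuming a hypothetical table and tag value, a query retrieving frequently used cooking tools might look like:

```sql
-- Count distinct cooking tools found in recipes.
-- The table name and the tag value 'T' (= tool) are assumptions.
SELECT name, COUNT(*) AS freq
FROM named_entities
WHERE category = 'T'
GROUP BY name
ORDER BY freq DESC
LIMIT 100;
```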
By executing this query on Redshift, we can retrieve the list of cooking tools that appear in the recipes.
Distributed hyperparameter optimization using Optuna
Finally, I will explain hyperparameter optimization.
nerman performs hyperparameter optimization using Optuna. Optuna is a hyperparameter optimization library developed by Preferred Networks (PFN). Installation can be completed by executing `pip install optuna` in a terminal.
Optuna stores optimization results in a storage backend, an abstraction over RDBs, Redis, and so on. By using a backend engine (an RDB or Redis) that can be reached from every instance, hyperparameter optimization can be carried out in a distributed fashion across multiple instances. When performing distributed optimization across instances, MySQL or PostgreSQL is recommended as the storage backend (Redis is also available as an experimental feature); for details, please refer to the official documentation. In this case, we use MySQL (on Aurora) as the storage.
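As an aside, the Optuna API behind this setup is roughly the following sketch; the study name, connection URL, and trial count are placeholders, and the objective body is omitted:

```python
import optuna

def objective(trial: optuna.Trial) -> float:
    ...  # train a model with trial-suggested hyperparameters, return a metric

# Every instance connects to the same RDB storage; load_if_exists=True makes
# them share (and resume) the study instead of failing on a name collision.
study = optuna.create_study(
    study_name="nerman",
    storage="mysql://<user>:<password>@<aurora-host>/optuna",
    load_if_exists=True,
)
study.optimize(objective, n_trials=20)
```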
Optuna includes an integration module for AllenNLP. However, when using this integration module, you need to write the optimization script in Python yourself. We therefore developed a tool called allennlp-optuna for smoother collaboration with AllenNLP. After installing `allennlp-optuna`, users can perform hyperparameter optimization with Optuna via a command named `allennlp tune`. This command is highly compatible with the `allennlp train` command, which allows users already familiar with AllenNLP to try hyperparameter optimization smoothly.
First of all, install allennlp-optuna with pip install allennlp-optuna to make the allennlp tune command available. Then, edit .allennlp_plugins as follows:
Then, execute allennlp --help; the installation is going well if you can confirm the tune command in the output, like this:
If you can also see best-params, your installation is successful!
Now allennlp-optuna has been installed successfully. Next, I will explain the preparations necessary for using allennlp-optuna.
Modification of configuration files
First, we rewrite the config.jsonnet that we created in the preparation chapter as follows:
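Only the idea is sketched here: the hyperparameters to be tuned are replaced with Jsonnet external variables. The variable names follow this article; everything else is a placeholder:

```jsonnet
// Hyperparameters are received as external variables.  std.extVar returns a
// string, so numeric values must be converted explicitly.
local lr = std.parseJson(std.extVar('lr'));            // float
local dropout = std.parseJson(std.extVar('dropout'));  // float
{
  // ... the rest of the configuration references lr and dropout ...
  trainer: { optimizer: { type: 'adam', lr: lr }, num_epochs: 20 },
}
```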
The hyperparameters we wish to optimize are turned into variables, as in local lr = std.parseJson(std.extVar('lr')). The return value of std.extVar is a string; since hyperparameters in machine learning models are usually integers or floats, they require type conversion. For integer values, you can use std.parseInt. Jsonnet does not provide std.parseFloat, so you have to use std.parseJson for float values.
Definition of search space
Next, we define the search space for the hyperparameters. In allennlp-optuna, the search space is defined in a JSON file (hparams.json) like the following:
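Based on the format described in the allennlp-optuna documentation, such a file might look like this; the ranges here are placeholders:

```json
[
  {
    "type": "float",
    "attributes": { "name": "lr", "low": 1e-4, "high": 1e-1, "log": true }
  },
  {
    "type": "float",
    "attributes": { "name": "dropout", "low": 0.0, "high": 0.5 }
  }
]
```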
In this example, the learning rate and the dropout ratio are the optimization targets, and we set lower and upper limits for each. Note that "log": true is set for the learning rate so that it is sampled from a log-scale distribution.
The optimization batch executes a shell script like the following:
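The actual script is not reproduced here. A sketch consistent with the description below (a shared study, at most 20 trials, a two-hour time limit) might look like this; the flag names are modeled on allennlp-optuna's interface and, like the storage URL, should be treated as assumptions (`allennlp tune --help` lists the actual options):

```shell
#!/bin/bash
set -eu

# Every instance runs the same command and shares the study "nerman"
# through the MySQL (Aurora) storage.
allennlp tune \
  config.jsonnet \
  hparams.json \
  --serialization-dir result/hpo \
  --study-name nerman \
  --storage "mysql://<user>:<password>@<aurora-host>/optuna" \
  --skip-if-exists \
  --n-trials 20 \
  --timeout 7200
```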
Distributed hyperparameter optimization is carried out simply by running this command on multiple instances. The optimization progress is shared among the instances by specifying the --skip-if-exists option. Optuna usually creates a new experiment (called a study) on every run and performs the hyperparameter search in it; if a study with the same name already exists in the storage, an error occurs. However, with --skip-if-exists enabled, if a study with the same name exists in the storage, that study is loaded and the search resumes from where it left off. This mechanism lets multiple instances share a single study simply by starting their searches with --skip-if-exists enabled. Using the script above, the optimization batch performs at most 20 trials within a given time limit (set to 2 hours).
Thus, thanks to Optuna's remote storage feature, we realized distributed optimization by simply executing the same command on multiple instances! For more details on Optuna's distributed hyperparameter optimization and more advanced usage, check out the Optuna developers' documentation.
The shell script utilizes the retrain command provided by allennlp-optuna. The allennlp retrain command retrieves the optimization results from storage and passes the obtained hyperparameters to AllenNLP to perform model training. Like tune, the retrain command provides almost the same interface as the train command.
The metrics of the retrained models are listed below:
Compared to the model trained in the training models chapter, the F1 score on the test data (test_f1-measure-overall) rose from 86.0 to 89.3, an improvement of 3.3 points. If you roughly specify the search spaces for the hyperparameters and let Optuna optimize them, you can obtain well-performing hyperparameters, which is convenient.
Optuna not only optimizes hyperparameters but also provides powerful experiment management features, including visualization of how metrics change during optimization and of hyperparameter importance, as well as export of the optimization results as a pandas DataFrame.
In this article, we introduced nerman, a named entity recognition system built with AllenNLP and Optuna. Nerman enables model training and real-world data application using AllenNLP, reduction in time and effort for data input using Amazon Aurora, and scalable hyperparameter search using Optuna. We hope that this article will be helpful to you as an example of a machine learning system using AllenNLP and Optuna.