Question-Answering Model Fine-Tuned for Portuguese
This post introduces qaptnet, an instance of Google’s BERT model, fine-tuned for performing question-answering tasks in Portuguese text.
The model is trained on a dataset of contexts, questions, and corresponding answers written in Portuguese. The output of the model is the span of text in the context that answers the proposed question. For example, given the following
context = 'Arquitetonicamente, a escola tem um caráter católico. No topo da cúpula de ouro do edifício principal é uma estátua de ouro da Virgem Maria. Imediatamente em frente ao edifício principal e de frente para ele, é uma estátua de cobre de Cristo com os braços erguidos com a lenda Venite Ad Me Omnes. Ao lado do edifício principal é a Basílica do Sagrado Coração. Imediatamente atrás da basílica é a Gruta, um lugar mariano de oração e reflexão. É uma réplica da gruta em Lourdes, na França, onde a Virgem Maria supostamente apareceu a Santa Bernadette Soubirous em 1858. No final da unidade principal (e em uma linha direta que liga através de 3 estátuas e da Cúpula de Ouro), é um estátua de pedra simples e moderna de Maria.'
And the following
question = 'A quem a Virgem Maria supostamente apareceu em 1858 em Lourdes, na França?'
The model infers the span in the context that yields the answer to the question:
>>> ptnet.query(context = context, question = question)
'Santa Bernadette Soubirous'
Bidirectional Encoder Representations from Transformers (BERT) is a state-of-the-art model for many NLP tasks, e.g. classification, named entity recognition, and question answering. In a nutshell, the model is divided into two major sections: 1) a pre-trained transformer that acts as a language model, and 2) a section fine-tuned for the task at hand, as illustrated by the following figure.
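For question answering, the fine-tuned section essentially reduces to two learned vectors that score each token's hidden state as a possible start or end of the answer span. A minimal NumPy sketch of that idea (the dimensions and random weights here are purely illustrative, not the actual BERT parameters):

```python
import numpy as np

# Hidden states from the pre-trained transformer:
# one vector of size `hidden` per input token.
seq_len, hidden = 6, 8
rng = np.random.default_rng(0)
H = rng.normal(size=(seq_len, hidden))

# The fine-tuned QA head: two learned vectors scoring each
# token as a candidate start or end of the answer span.
w_start = rng.normal(size=hidden)
w_end = rng.normal(size=hidden)

start_logits = H @ w_start  # shape (seq_len,)
end_logits = H @ w_end      # shape (seq_len,)

# The predicted span is the (start, end) pair with the
# highest combined score, subject to start <= end.
best = max(
    ((s, e) for s in range(seq_len) for e in range(s, seq_len)),
    key=lambda p: start_logits[p[0]] + end_logits[p[1]],
)
print(best)
```

In the real model the start and end scores come from a linear layer on top of the final transformer layer, and the span search is restricted to tokens belonging to the context rather than the question.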
The input to the model is the context and the question, and the output is the starting and ending token of the answer span in the context. Google provides different flavours of pre-trained weights for the model, either single-language or multilingual, and with different numbers of hidden layers. To train qaptnet, the multilingual, case-sensitive, 12-hidden-layer weights were used to bootstrap the model, and the training part of the dataset described in the next section was used to fine-tune it. Implementation, training, and deployment are done using the PyTorch-Transformers package.
To train the model for Portuguese, a collection of questions and contexts, with the corresponding answers, is needed. To build this dataset we translated the SQuAD dataset from English to Portuguese using the Google Translate API. The original dataset contains around 60,000 question-answer pairs, covering around 400 topics from Wikipedia. In general, the automatic translation was good enough for a first version, mainly because the dataset's snippets of text are small enough to yield good results from an automatic translation, and the type of text (Wikipedia articles) is not very complex or abstract.
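The translation step amounts to walking the SQuAD JSON and translating every context, question, and answer string. A rough sketch of that traversal, where the `translate` callable stands in for a Google Translate API client (not shown here); note that a real pipeline would also need to re-align the answer character offsets after translation, which this sketch ignores:

```python
def translate_squad(squad, translate):
    """Translate every text field of a SQuAD-style dataset in place.

    `squad` follows the SQuAD JSON layout; `translate` is any
    callable mapping a source-language string to a target one.
    """
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            paragraph["context"] = translate(paragraph["context"])
            for qa in paragraph["qas"]:
                qa["question"] = translate(qa["question"])
                for answer in qa["answers"]:
                    answer["text"] = translate(answer["text"])
    return squad
```

Any callable works as the translator, so the traversal can be tested with a trivial stub before plugging in the real API client.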
The final version of the model achieves around 50% accuracy on the development part of the dataset, which is promising for a first exploratory attempt. Further improvements to the dataset translation, along with tuning of the model hyper-parameters, should improve the accuracy score. PLN.pt provides access to the current version of the model via its API. Examples of how to query the API are available from the qaptnet repository.
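For reference, an accuracy figure like the one above is usually an exact-match style metric: the predicted span must equal one of the reference answers after light normalisation. A sketch of how such a score could be computed (the normalisation rules here are assumptions for illustration, not necessarily the ones used to evaluate qaptnet):

```python
import re
import unicodedata


def normalize(text):
    # Lowercase, strip accents (relevant for Portuguese),
    # drop punctuation, and collapse whitespace.
    text = unicodedata.normalize("NFD", text.lower())
    text = "".join(c for c in text if unicodedata.category(c) != "Mn")
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def exact_match(predictions, references):
    # Fraction of predictions equal to some reference answer
    # after normalisation.
    hits = sum(
        normalize(pred) in {normalize(ref) for ref in refs}
        for pred, refs in zip(predictions, references)
    )
    return hits / len(predictions)
```

For example, a prediction of 'Santa Bernadette Soubirous' matches a reference of 'santa bernadette soubirous' under this normalisation.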