Frame-based Metric for Automatic Machine Translation Evaluation: My Journey in GSoC 2019

Debanjana Kar
Aug 25, 2019 · 8 min read

This post is a brief introduction to the project I was assigned during my tenure as a student developer with FrameNet Brasil in Google Summer of Code (GSoC) 2019, and a summary of the work accomplished during this period.

Overview of the task

Machine Translation is one of the most essential and most researched tasks in Natural Language Processing. Various metrics are available to evaluate translation quality, but most of them compute the similarity between an MT hypothesis and a reference translation based on character or word N-grams. In this task, using the relations between frames given in the Berkeley FrameNet Data Release 1.7 [1], an automated metric for machine translation was developed that measures the frame distance between sentence pairs in two languages.

Figure 1: Example of frame-annotated parallel sentence pairs

Annotated Corpus

The frame-annotated transcripts of the TED Talk “Do schools kill creativity?” [2] by Sir Ken Robinson were used as the annotated corpus for this task. It is a parallel corpus available in multiple languages, but for our purpose, only English, German and Portuguese were considered.

Figure 2: Pie chart showing the data distribution in terms of language pairs. (Deutsch refers to German)

The data set consisted of 282 English-Portuguese sentence pairs and 55 English-German sentence pairs, i.e. a total of 337 sentence pairs (as shown in the figure above). Each of these sentence pairs required a manually assigned score denoting the quality of the translation, and we only had ~30 such scored pairs for each language pair! That posed the greatest challenge in this task: producing meaningful results with very limited data.

Figure 3: Bar chart showing the frequency distribution of the scored data (~60 sentence pairs in total). Since the original data does not follow a Gaussian distribution, transformation functions were put to the test, but none of them yielded one. Hence we proceeded with the original data set that was given to us.
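For illustration, the snippet below shows the kind of normality check and power transform that can be applied here; the scores and the choice of a Box-Cox transform are stand-ins, not necessarily the exact transformation functions we tried.

```python
import numpy as np
from scipy import stats

# Stand-in array of human-assigned translation quality scores in (0, 1].
scores = np.array([0.95, 0.90, 0.85, 0.80, 0.75, 0.92, 0.88, 0.70, 0.65, 0.98])

# Shapiro-Wilk test on the raw scores: a small p-value suggests the
# distribution is not Gaussian.
stat_raw, p_raw = stats.shapiro(scores)

# Box-Cox requires strictly positive inputs, which holds for scores in (0, 1].
transformed, lmbda = stats.boxcox(scores)
stat_tr, p_tr = stats.shapiro(transformed)

print(f"raw:     W={stat_raw:.3f}, p={p_raw:.3f}")
print(f"box-cox: W={stat_tr:.3f}, p={p_tr:.3f} (lambda={lmbda:.2f})")
```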

Embeddings: The Real Power lies within thee.

To evaluate the quality of machine translation based on the frame distance between the source and target language sentences, the first step was to carefully handcraft features and encode them using appropriate embedding networks. We considered the following features for this task: frames, frame elements, lexical units triggering the corresponding frames, lexical units triggering the corresponding frame elements, and sentence representations.

Figure 4: A heatmap showing the cosine similarity of sentence pairs at the frame and frame-element level for 28 Portuguese and 29 English sentences, using FastText feature embeddings. The diagonal has the highest cosine similarity, showing that true sentence pairs are identified correctly. The pairs that appear completely black despite being true alignments of each other are sentences with no frames or frame elements annotated in them. (X-axis: English sentence ids, Y-axis: Portuguese sentence ids)

In the first phase, all the information except the sentences was encoded using pretrained FastText embeddings [3]. FastText was chosen because it offers word embeddings for multiple languages and computes them at the subword level. Subword information helps eliminate the problem of out-of-vocabulary words, as any FastText word embedding is essentially an aggregation of the embeddings of the word's character n-grams.
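As a rough sketch of this phase, the snippet below embeds frame-level features with FastText and compares them with cosine similarity. It assumes the `fasttext` Python package and the pretrained `cc.en.300.bin` vectors; the frame names are illustrative, and comparing language-specific lexical units across languages would additionally require aligned vectors, which is not shown here.

```python
import numpy as np
import fasttext
import fasttext.util

# Download and load pretrained English FastText vectors with subword
# information (an assumption: the project's exact vectors may differ).
fasttext.util.download_model('en', if_exists='ignore')
ft_en = fasttext.load_model('cc.en.300.bin')

def embed_feature(model, tokens):
    """Average the subword-based word vectors of the tokens that make up one
    feature set (e.g. the frames evoked in a sentence)."""
    return np.mean([model.get_word_vector(t) for t in tokens], axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Illustrative frame names for an aligned English/Portuguese sentence pair.
# Frame names are shared across the FrameNets, so they can be compared in a
# single embedding space; subword n-grams also cover out-of-vocabulary names.
frames_en = ['Education_teaching', 'Killing', 'Capability']
frames_pt = ['Education_teaching', 'Killing']

sim = cosine(embed_feature(ft_en, frames_en), embed_feature(ft_en, frames_pt))
print(f"frame-level cosine similarity: {sim:.3f}")
```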

To accurately measure the semantic distance between parallel sentences in different languages, we realized we needed to capture the contextual information of each word. Hence, in the next phase of experiments, the lexical units and the sentences were embedded using Bidirectional Encoder Representations from Transformers (BERT) [4], which captures contextual semantic information well. Multilingual cased pretrained embeddings were used, mapping all the information into a common vector space. This was an important step, as it let us use all the sentence pairs together irrespective of their source-target language pair, thus increasing the data set to some extent.
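A minimal sketch of this step, assuming the Hugging Face transformers library with the bert-base-multilingual-cased checkpoint and simple mean pooling over the last hidden layer (the exact checkpoint and pooling strategy used in the project are assumptions):

```python
import torch
from transformers import BertModel, BertTokenizer

# Multilingual cased BERT maps text from all its languages into one vector space.
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model = BertModel.from_pretrained('bert-base-multilingual-cased')
model.eval()

def embed_sentence(sentence: str) -> torch.Tensor:
    """Return a fixed-size sentence vector by mean-pooling the last hidden
    layer over tokens (one simple pooling choice among several)."""
    inputs = tokenizer(sentence, return_tensors='pt', truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

en_vec = embed_sentence("Do schools kill creativity?")
pt_vec = embed_sentence("As escolas matam a criatividade?")

# Because both vectors live in the same multilingual space, their cosine
# similarity can serve directly as a cross-lingual feature.
cos = torch.nn.functional.cosine_similarity(en_vec, pt_vec, dim=0)
print(f"sentence-level cosine similarity: {cos.item():.3f}")
```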

Figure 5: The above figures show the variation of the Pearson Correlation Coefficient and the Root Mean Squared Error (RMSE) with respect to the features. 61 sentence pairs with BERT multilingual feature embeddings were trained and tested using a simple linear regression model. While the RMSE is comparable for all the different feature combinations, adding features shows a significant improvement in the Pearson Correlation Coefficient. Hence, all the features were used to determine the metric in the final experiments.

“A small step is a giant leap”: the Learning Phase

Baseline:
A simple linear regression model was used as the baseline. This model was trained in different feature and data set settings to identify the optimum configuration before proceeding to more advanced learning procedures.
Before shifting to BERT, individual models for each language pair were experimented with. With only ~30 scored sentence pairs per language pair, these models obviously performed quite poorly. As can be observed in Figure 3, the scores range mainly from 0.6 to 1.0. An attempt at data augmentation was made by introducing negative samples: random sentence pairs that are not correct alignments of each other were generated for each language pair and assigned a score of 0. Although this increased the number of samples, performance decreased rather than improved. Hence, we combined the sentence pairs from both language pairs to get a total of 61 scored pairs and performed further experiments with those.
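The sketch below illustrates this baseline setup with scikit-learn. The feature matrix and scores are random stand-ins for the real per-level cosine-similarity features and human scores, and the 40/21 split mirrors the one used later in the error analysis.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# X: one row per sentence pair, one column per feature-level similarity
# (frames, frame elements, frame LUs, FE LUs, sentence embeddings).
# y: human-assigned translation quality scores in [0, 1].
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(61, 5))
y = np.clip(0.6 + 0.4 * X.mean(axis=1) + rng.normal(0, 0.05, 61), 0.0, 1.0)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=21, random_state=0)

baseline = LinearRegression().fit(X_train, y_train)
pred = baseline.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, pred))
r, _ = pearsonr(y_test, pred)
print(f"RMSE: {rmse:.3f}  Pearson r: {r:.3f}")
```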

Figure 6: The above figures show the variation of the Pearson Correlation Coefficient and Root Mean Squared Error (RMSE) with respect to different data sets and feature embeddings, using a simple linear regression model. (ende: English-German sentence pairs, enpt: English-Portuguese sentence pairs)

From the figures above (Figure 6), and considering the trade-off between RMSE and Pearson Correlation values, a combined data set of 61 annotated sentence pairs from both language pairs with BERT feature embeddings is the optimum setting to proceed with.

Semi-Supervised Approach:
Using the optimum settings inferred from the baseline models, a semi-supervised approach was adopted to increase the training sample size and build a better learning model. In this phase of experiments, the unscored data is divided into n chunks. Each chunk is scored using a model already trained on the scored data. Each newly scored chunk is then added to the existing training set and evaluated separately with 5-fold cross-validation. Out of the n chunks, the one whose addition yields the lowest mean squared error is permanently added to the training set. The whole procedure is then repeated with the remaining chunks, the augmented data set acting as the existing training set each time, until no chunks remain. In our case, n = 6, with each chunk containing 46 samples.
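A sketch of this iterative procedure, assuming scikit-learn; the stand-in data below only mirrors the sizes involved (61 scored and 276 unscored pairs split into 6 chunks of 46), and the regressor and its hyperparameters are illustrative.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

def semi_supervised_augment(model, X_lab, y_lab, X_unlab, n_chunks=6):
    """Iteratively grow the training set with model-scored chunks of unlabeled
    data, each round adding the chunk whose addition yields the lowest
    5-fold cross-validated MSE (as described above)."""
    chunks = np.array_split(X_unlab, n_chunks)
    X_train, y_train = X_lab.copy(), y_lab.copy()

    while chunks:
        model.fit(X_train, y_train)
        best_idx, best_mse, best_y = None, np.inf, None

        for i, chunk in enumerate(chunks):
            pseudo_y = model.predict(chunk)           # score the unscored chunk
            X_cand = np.vstack([X_train, chunk])
            y_cand = np.concatenate([y_train, pseudo_y])
            mse = -cross_val_score(model, X_cand, y_cand, cv=5,
                                   scoring='neg_mean_squared_error').mean()
            if mse < best_mse:
                best_idx, best_mse, best_y = i, mse, pseudo_y

        # Permanently add the best chunk and drop it from the pool.
        X_train = np.vstack([X_train, chunks[best_idx]])
        y_train = np.concatenate([y_train, best_y])
        del chunks[best_idx]

    return X_train, y_train, model.fit(X_train, y_train)

# Illustrative usage with random stand-in features and scores.
rng = np.random.default_rng(0)
X_scored, y_scored = rng.random((61, 5)), rng.random(61)
X_unscored = rng.random((276, 5))
X_aug, y_aug, fitted = semi_supervised_augment(SVR(), X_scored, y_scored, X_unscored)
print(f"training set grew from {len(y_scored)} to {len(y_aug)} samples")
```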

Results & Comparisons

Using the advanced approach, I experimented with the following models, each with different hyperparameter tuning:

  • simple linear regressor
  • support vector regressor
  • random forest regressor
  • multi-layer perceptron regressor

The models were tested in the baseline setting with and without the advanced approach. The metrics used to evaluate the experiments were the root mean squared error, a standard metric for regression tasks, and the Pearson Correlation Coefficient, which shows the correlation between human-annotated scores and machine-assigned scores. The results can be observed below:
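For reference, a minimal sketch of this comparison with scikit-learn; the hyperparameters and the random stand-in data are illustrative, not the tuned values used in the project.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

# Hypothetical hyperparameters; the project's tuned values are not shown here.
models = {
    'LR': LinearRegression(),
    'SVR': SVR(kernel='rbf', C=1.0, epsilon=0.05),
    'MLP_R': MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
    'RF_R': RandomForestRegressor(n_estimators=100, random_state=0),
}

# Random stand-in data in place of the real feature vectors and human scores.
rng = np.random.default_rng(1)
X_train, y_train = rng.random((40, 5)), rng.random(40)
X_test, y_test = rng.random((21, 5)), rng.random(21)

for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    r, _ = pearsonr(y_test, pred)
    print(f"{name}: RMSE={rmse:.3f}  Pearson r={r:.3f}")
```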

Table 1: Comparison of the baseline models with the models trained using the semi-supervised approach. The Support Vector Regressor appears to be the best-performing model in this experiment.
Figure 7: The figure above presents a comparison between models trained with and without the semi-supervised approach. The results are comparable for both approaches, except for the Random Forest Regressor, which performs worse without iterative learning. [LR: Linear Regressor, SVR: Support Vector Regressor, MLP_R: Multi-Layer Perceptron Regressor, RF_R: Random Forest Regressor]

Error Analysis & Discussion

To deal with the data challenge, I came up with this iterative learning approach, inspired by work on semi-supervised label propagation for low-resource languages [5]. This method has worked well in many classification tasks, such as sequence labeling, POS tagging [6] and word sense disambiguation, but I had not seen it applied to a regression task, which formed the motivation for using it here.

The major reason this model could not produce better results is the highly biased data distribution, as can be observed in Figure 3. The model was trained on 40 sentence pairs (the 61 pairs were split into 40 training and 21 test pairs) and learnt to score sentence pairs around ~0.8. Hence, in spite of adding more data, the model predicted all scores in the 0.80–0.85 range and did not improve. The model is expected to give better results if the data is well distributed and there are enough samples across the actual score range, which is [0, 1] in our case.

Another observation is that the more complex the model, the worse the performance. This indicates that the models overfit and are unable to generalize due to the very small amount of data. The Support Vector Regressor performs best, as it is neither as simple as a linear regressor nor as complex as a random forest regressor, and its superior results show that it is the best model for this data set and feature setting.

Future Work

Future work would include enlarging the annotated corpus for better learning of the models; the sample space should be well distributed, preferably following a Gaussian distribution. Another required piece of future work is an automated multilingual frame annotation parser, so that more annotated data can be obtained easily instead of spending many hours on manual annotation.

The model developed here is a multilingual machine translation evaluation model. While testing has only been done on English-Portuguese and English-German sentence pairs, it would be interesting to see the results on other language pairs. The immediate future work involves experimenting with other embeddings and comparing the metric with other existing machine translation metrics.

Acknowledgements

A huge vote of thanks goes to my mentors, Tiago Torrent, Ely Matos and Oliver Czulo, and to the entire team of FrameNet Brasil; without them this project would not have been possible. Be it data requirements or lengthy discussions on how to mitigate certain challenges, they were the most supportive mentors one could ask for. It was a great pleasure to work with such great minds, and I would like to thank Google for providing students with such a great platform and opportunity. This journey will indeed be a memorable one!

References

[1] C. F. Baker, C. J. Fillmore, and J. B. Lowe, “The Berkeley FrameNet project,” in Proceedings of the 17th International Conference on Computational Linguistics - Volume 1. Association for Computational Linguistics, 1998, pp. 86–90.

[2] TED Talk: “Do schools kill creativity?” by Sir Ken Robinson.

[3] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching word vectors with subword information,” 2016.

[4] Devlin, J., Chang, M.W., Lee, K. and Toutanova, K., 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

[5] Garrette, D., Mielens, J. and Baldridge, J., 2013, August. Real-world semi-supervised learning of POS-taggers for low-resource languages. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 583–592).

[6] W. Ding, “Weakly supervised part-of-speech tagging for Chinese using label propagation,” Master’s thesis, University of Texas at Austin, 2011.

[7] T. T. Torrent, L. Borin, and C. F. Baker, “International framenet workshop 2018,” 2018.
