JamPatoisNLI: A Jamaican Patois Natural Language Inference Dataset

Ruth-Ann Armstrong
6 min read · Dec 3, 2022


Paper by Ruth-Ann Armstrong, John Hewitt and Christopher Manning

Figure 1: Linguistic features relevant for textual entailment classification for Jamaican Patois and lexical overlap with English.

Though more than 7,000 languages are spoken globally, NLP research in recent years has focused primarily on around 20 high-resource languages. Meanwhile, thousands of low-resource languages — those that are less studied, resource-scarce, less computerized, less privileged, less commonly taught or low density — remain under-explored. Creole languages, which emerge as a result of contact between speakers of different vernaculars, belong to this class. Though they are spoken by millions of people across different regions of the world, they are vastly under-studied in NLP.

Figure 2: Map of the languages listed in Atlas of Pidgin and Creole Language Structures (APiCS).

Creoles are uniquely interesting because they often share high lexical overlap with high-resource languages while retaining distinct morphosyntactic features. For this reason, they provide an opportunity to study cross-lingual transfer between languages that are lexically close yet grammatically different.

As such, adding Creole datasets to the existing NLP ecosystem is beneficial both for advancing research related to cross-lingual transfer learning and for more accurately representing the breadth of languages spoken across the globe by introducing under-explored dialects to the research community.

The JamPatoisNLI dataset is one step towards this aim. Ours is the first natural language inference (NLI) dataset in Jamaican Patois, a Creole language spoken by over 3 million people on the island and across the diaspora. Jamaican Patois is an English-based Creole which emerged as a result of contact between enslaved African people forcibly brought to the Caribbean in the 17th century and British colonists.

For NLI, the input to the task is a pair of sentences: the premise and the hypothesis. The goal is to output a label — entailment, neutral or contradiction — to describe the relationship between the pair. The dataset consists of 650 NLI examples split across training (250), validation (200) and testing (200).
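The task format above can be sketched as a tiny data structure. This is an illustrative representation, not the dataset's actual schema, and the Patois sentence is only an example:

```python
# Map the three NLI labels to integer class ids (a common convention).
LABELS = {"entailment": 0, "neutral": 1, "contradiction": 2}

def make_example(premise, hypothesis, label):
    """Bundle a premise-hypothesis pair with its integer class id."""
    return {"premise": premise, "hypothesis": hypothesis, "label": LABELS[label]}

example = make_example(
    "Mi a nyam di food weh dem gi mi.",  # premise in Jamaican Patois (illustrative)
    "I am eating.",                      # hand-written hypothesis
    "entailment",
)
```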

Figure 3: Random sample selected from the 100 double annotated examples in the corpus, with their gold labels and validation labels (abbreviated E, N, C) by each of the annotators.

Since Jamaican Patois is low-resource and primarily a spoken language, there is a limited number of naturally occurring examples and limited availability of native speakers to annotate them. However, the language is regularly used for communication on social media and in literature. Around 97% of the premises are drawn from Twitter, and the remaining examples come from a cultural website and from poetry by Dr. Louise Bennett-Coverley and Shelley Sykes-Coley. Corresponding hypotheses were then hand-written by our first author, who is a native speaker. Lastly, a random sample of 100 sentence pairs was double annotated by fluent speakers of Jamaican Patois, with a Fleiss-kappa agreement of 88.9%.
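The validation check behind that agreement number can be sketched as a simple match rate between each annotator's labels and the gold labels. This is a minimal sketch of per-example agreement, not the exact Fleiss-kappa computation used in the paper:

```python
def agreement_rate(gold, annotator_labels):
    """Fraction of annotation decisions that match the gold label.

    `gold` is the list of gold labels; `annotator_labels` holds one
    list of labels per annotator, aligned with `gold`.
    """
    matches = sum(
        a == g
        for labels in annotator_labels
        for a, g in zip(labels, gold)
    )
    total = sum(len(labels) for labels in annotator_labels)
    return matches / total
```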

Jamaican Patois has high lexical overlap with English yet distinct morphosyntactic features.

Figure 4: In this example, strictly non-English vocabulary, highlighted in bold, accounts for less than one-third of the words in the sentence. JamPatoisNLI will therefore be useful for evaluating the efficacy of methods for linguistic transfer in scenarios where there is a high degree of overlap between the source and target languages.

Additionally, because it is a hybrid of the languages spoken by the two groups that came into contact, Jamaican Patois exists on a continuum that ranges from more to less dissimilar to English.

Figure 5: Different translations of ‘I’m eating the food that they gave me’ in Jamaican Patois along the Creole continuum

In our experiments, we explore the effectiveness of transfer from large English monolingual and multilingual pretrained models to JamPatoisNLI in the zero-shot and few-shot settings. We also compare cross-lingual transfer efficacy for JamPatoisNLI to efficacy for AmericasNLI (which consists of NLI examples in various Native American languages) and conduct qualitative experiments which leverage the relatedness between Jamaican Patois and English to better understand cross-lingual transfer.

While our work, along with previous work, shows that transfer from these models to low-resource languages unrelated to languages in their training data (as is the case for AmericasNLI) is not very effective, we would expect stronger results from transfer to Jamaican Patois, which is related to English, a language in the models' training data. Indeed, our experiments show considerably better results from few-shot learning on JamPatoisNLI than for such unrelated languages, and they help us begin to understand how the unique relationship between Creoles and their high-resource base languages affects cross-lingual transfer.

In our experiments, we finetune the following pretrained models:

  • Monolingual (English): cased and uncased BERT and RoBERTa
  • Multilingual: cased and uncased multilingual BERT and XLM-RoBERTa

Jamaican Patois is not in the training corpora for these models, but is related to English. In the high-resource language finetuning stage, we combine these pretrained models with a two-layer perceptron with ReLU activations which serves as the NLI classification head and use the English MNLI dataset for training. We experiment with frozen models where weights in the pretrained model are not updated during finetuning, and with unfrozen models where they are updated.
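The classification head described above is just a two-layer perceptron with a ReLU between the layers, mapping the encoder's pooled representation to three NLI logits. Here is a minimal forward-pass sketch in NumPy; the dimensions (768 for a BERT-base-style encoder, 256 hidden units) are illustrative assumptions, not the paper's exact hyperparameters:

```python
import numpy as np

def nli_head(pooled, W1, b1, W2, b2):
    """Two-layer perceptron with ReLU activation, mapping a pooled
    encoder representation to 3 NLI logits
    (entailment / neutral / contradiction)."""
    h = np.maximum(0.0, pooled @ W1 + b1)  # hidden layer + ReLU
    return h @ W2 + b2                     # logits over the 3 classes

rng = np.random.default_rng(0)
d, hdim = 768, 256                         # illustrative sizes
pooled = rng.standard_normal(d)            # stand-in for an encoder output
logits = nli_head(pooled,
                  rng.standard_normal((d, hdim)) * 0.02, np.zeros(hdim),
                  rng.standard_normal((hdim, 3)) * 0.02, np.zeros(3))
```

In the "frozen" setting only these head weights receive gradient updates; in the "unfrozen" setting the encoder weights are updated as well.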

In the low-resource language finetuning stage, all models are unfrozen, further finetuned on the JamPatoisNLI training set, and then evaluated on the validation and test sets. We also make comparisons to AmericasNLI by segmenting five languages from that dataset into similarly sized train-validation-test splits. The AmericasNLI languages are not in the pretrained models' training corpora, and are not highly related to English or to other languages in those corpora.
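The frozen/unfrozen distinction across the two stages amounts to selecting which parameter groups receive gradient updates. A minimal sketch, assuming the illustrative convention that encoder parameter names carry an `encoder.` prefix:

```python
def trainable_parameters(params, freeze_encoder):
    """Select which parameter groups are updated during finetuning.

    `params` maps parameter names to tensors. When `freeze_encoder`
    is True, only the classification-head parameters are returned;
    otherwise every parameter is trainable.
    """
    if freeze_encoder:
        return {n: p for n, p in params.items()
                if not n.startswith("encoder.")}
    return dict(params)
```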

On the full few-shot dataset (note that a few-shot triple consists of one example from each class), RoBERTa-based models (roberta-unfrozen: 76.50%) outperformed BERT-based models (bert-uncased-unfrozen: 66.17%). There were negligible differences between monolingual and multilingual pretrained models.
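The "triple" construction mentioned above can be sketched as follows: an n-shot training set is built from n triples, each containing one example per NLI class. This is an illustrative deterministic version; the actual sampling procedure may differ:

```python
from collections import defaultdict

def sample_fewshot_triples(examples, n_triples):
    """Build a few-shot training set of `n_triples` triples, each
    containing one example from each of the three NLI classes."""
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex["label"]].append(ex)
    classes = ("entailment", "neutral", "contradiction")
    return [[by_label[c][i] for c in classes] for i in range(n_triples)]
```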

Figure 6: Zero-shot and few-shot accuracies for different models evaluated on JamPatoisNLI averaged over three experiments with different seeds. The best monolingual and multilingual BERT-based and RoBERTa-based models were chosen based on results for the validation set.

The relatedness of Jamaican Patois to English boosts the effectiveness of cross-lingual transfer despite the languages' morphosyntactic differences, resulting in far higher accuracies than those achieved on the AmericasNLI languages, as shown below. One direction for future research is determining whether vocabulary overlap is the primary factor behind this boost, or whether a higher-order notion of similarity plays a larger role.

Figure 7: Left (among BERT-based models) — Plots for the best model (mbert-cased-unfrozen) on each language, and the best JamPatoisNLI model (bert-uncased-unfrozen). Right (among RoBERTa-based models) — Plots for the best AmericasNLI model (xlm-unfrozen) on each language, and the best JamPatoisNLI model (roberta-unfrozen). Experiments are averaged over three seeds and the best models were chosen based on results for the validation set.

Finally, we conducted qualitative experiments on the dataset in the following way. We first gathered 6 NLI pairs with incorrect predictions in the 83-shot setting. We then created a translation path for each pair by writing an English translation and valid intermediate Jamaican Patois translations in between. Next, we trained 3 models with the original training set and observed which step on the translation path led the models to switch to a correct prediction. The models began to predict correctly before complete translation to English for 5 of the 6 pairs. In the particular example below, the translation of the verb led to the correct prediction.
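The probing procedure above can be sketched as a scan along the translation path: step 0 is the original Patois pair, the last step is full English, and we record the first step at which a model's prediction flips to the gold label. The `predict` callable here is a stand-in for a finetuned model:

```python
def first_correct_step(path, gold, predict):
    """Scan a translation path (Patois -> ... -> English) of
    (premise, hypothesis) pairs and return the index of the first
    step where `predict` matches the gold label, or None if the
    model never becomes correct."""
    for i, (premise, hypothesis) in enumerate(path):
        if predict(premise, hypothesis) == gold:
            return i
    return None
```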

Figure 8: Sample from Jamaican Patois to English transition dataset. The final example is in English, and we present predictions made by three models finetuned with our Patois few-shot training dataset

There are many potential directions for future research using the JamPatoisNLI dataset, including novel methods for boosting cross-lingual transfer from source languages to Creoles, extensions of the dataset to other English-based Creoles, and language generation.

Cite:

@inproceedings{armstrong2022jampatoisnli,
author = {Armstrong, Ruth-Ann and Hewitt, John and Manning, Christopher D.},
booktitle = {Conference on Empirical Methods in Natural Language Processing (EMNLP)},
title = {JamPatoisNLI: A Jamaican Patois Natural Language Inference Dataset},
url = {https://nlp.stanford.edu/pubs/armstrong2022jampatoisnli.pdf},
year = {2022}
}


We hope that you enjoy experimenting with the dataset and that you share any interesting results! Feel free to reach out at ruthanna@stanford.edu or on Twitter.

