Using Hugging Face Models on Non-English Texts
How to use pre-trained English language models from Hugging Face on non-English texts
A pre-trained model is a saved machine learning model that was previously trained on a large dataset (e.g., all the articles in Wikipedia) and can later be used as a “program” that carries out a specific task (e.g., finding the sentiment of a text).
Hugging Face is a great resource for pre-trained language processing models. That said, most of the available models are trained on popular languages (English, Spanish, French, etc.). Luckily, many smaller languages do have pre-trained models available for the translation task. Here, I’m going to demonstrate how one can use the available models by:
- translating the input text to English,
- carrying out the specific task using a pre-trained English model,
- translating the result back to the original language.
I use Estonian (spoken natively by about 1.1 million people) as the input language and evaluate the practicality of this workflow on the following tasks:
- Sentiment analysis.
- Extractive question answering.
- Text generation.
- Named entity recognition.
- Summarization.
Translation
Translation is the task of converting a text from one language to another. It will be the first and the last step in each of our examples. One way to access Hugging Face models is through their Inference API, which lets you run inference (query a machine learning model) without installing or downloading any of the models locally. To get started, register with Hugging Face and get your API token from your profile.
Using the API involves:
- Selecting the model from the Model Hub and defining the endpoint
ENDPOINT = https://api-inference.huggingface.co/models/<MODEL_ID>.
- Defining the headers with your personal API token.
- Defining the input (mandatory) and the parameters (optional) of your query.
- Running the API request.
This is illustrated by the following end-to-end code example, where we: 1. define an API endpoint by copying a model name from the Model Hub; 2. set an API token, which you can find in your user settings after registering a Hugging Face account; 3. define a function that makes a POST request to the API; 4. define the input text to translate and run the API query with it; and 5. extract the results. NB: for the following translations, we only need to repeat steps 4 and 5 (defining a new input and extracting the results).
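The five steps above can be sketched roughly as follows. The model ID is an assumption: Helsinki-NLP/opus-mt-et-en is a real Estonian-to-English model on the Hub, but you should pick the opus-mt pair for your own language, and `<YOUR_API_TOKEN>` is a placeholder for your personal token.

```python
import requests

# Step 1: endpoint built from a model ID copied from the Model Hub.
# Helsinki-NLP/opus-mt-et-en is assumed here (Estonian -> English).
MODEL_ID = "Helsinki-NLP/opus-mt-et-en"
ENDPOINT = f"https://api-inference.huggingface.co/models/{MODEL_ID}"

# Step 2: your personal token from the Hugging Face profile settings.
HEADERS = {"Authorization": "Bearer <YOUR_API_TOKEN>"}

# Step 3: a small helper that POSTs the payload and parses the JSON reply.
def query(payload):
    response = requests.post(ENDPOINT, headers=HEADERS, json=payload)
    return response.json()

# Steps 4-5: define the input, run the query, and extract the translation.
# (Commented out here so the sketch runs without a real token.)
# output = query({"inputs": "Tere, maailm!"})
# translation = output[0]["translation_text"]
```

For each later translation we only change the `"inputs"` value and read `translation_text` from the response again.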
Sentiment analysis
Sentiment analysis is the task of classifying an input text as positive or negative. The number of possible classes depends on the specific pre-trained model: some models use only two classes (positive, negative) while others use three (positive, neutral, negative) or more.
For sentiment analysis, I use the Transformers library, which is another simple option, next to the Inference API, for accessing pre-trained models. Install the library with !pip install transformers in your notebook or pip install transformers in a terminal.
The easiest way to use a pre-trained model on a given task is the pipeline('name-of-the-task') function. The library downloads a pre-trained model for the specific task, and inference runs on your local machine (recall that if you don’t want to download the models, you can use the Inference API). Let’s see how it works on sentiment analysis:
- First, we create a classifier with the pipeline function and the name of the task. You can find the available tasks in the documentation.
- Next, we translate the input text from the original language to English.
- Finally, we run sentiment analysis with a single line of code and extract the results. The example below is a success: the translation is okay (“We wanted the best, but it turned out the way it always did.”) and we correctly classify the sentiment of the sentence as negative.
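A minimal sketch of these steps, assuming the pipeline falls back to its default English sentiment model (the exact labels can vary by model) and using the already translated example sentence:

```python
from transformers import pipeline

# Create the classifier; with no model argument, the pipeline uses a
# default English sentiment model.
classifier = pipeline("sentiment-analysis")

# The Estonian input, already translated to English (see the translation step).
text = "We wanted the best, but it turned out the way it always did."

# Run sentiment analysis and extract the result, e.g. {"label": ..., "score": ...}.
result = classifier(text)[0]
print(result["label"], round(result["score"], 3))
```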
Here’s another example, but this time the task failed due to an incorrect translation. The translator returned “The best argument for democracy is a five-minute conversation with the average voter”, but the correct translation would have been “The best argument against democracy …”. The possibility of a wrong translation is something to keep in mind when translating texts and using pre-trained English models.
Extractive question answering
Extractive question answering is the task of answering a question based on an input text. In the following example, I’ll give a general description of the unit that I work in and try to answer the question “what is this unit doing?”. Let’s walk through the code below:
- Create the question-answering pipeline (Line 2).
- Provide the context in the original language and translate it to English.
- Provide the question in the original language and translate it to English.
- Run the question and the context through the pipeline and extract the answer.
- Translate the answer back to the original language.
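In code, the middle steps might look like the sketch below. The context and question are illustrative stand-ins (the article’s originals describe the author’s unit in Estonian), and the pipeline’s default English question-answering model is assumed:

```python
from transformers import pipeline

# Create the question-answering pipeline (default English QA model assumed).
qa = pipeline("question-answering")

# Context and question, already translated from Estonian to English.
# These texts are stand-ins for the article's originals.
context = ("E-Lab is a unit that develops machine learning and language "
           "technology solutions for the public sector.")
question = "What is E-Lab doing?"

# Extract the answer span from the context.
answer = qa(question=question, context=context)
print(answer["answer"], round(answer["score"], 3))
```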
The output above reveals that the translation is not perfect, but it does the job. The answer to the question “What is E-Lab doing?” is also a bit rough, but the core meaning is correct.
Text Generation
Text generation (a.k.a. causal language modeling) is the task of predicting the next word(s) given the start of a sentence. In other words, this is the task where the machine learning model tries to be a writer 😄!
In the example below we:
- Create the text-generation pipeline.
- Define two sentence beginnings in Estonian and translate them to English.
- Use the pipeline to generate text from these beginnings, limiting the output to 50 words.
- Translate the generated text back to Estonian.
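These steps can be sketched as follows; the prompts are illustrative stand-ins for the article’s two Estonian beginnings (already translated), and the pipeline’s default model (GPT-2) is assumed:

```python
from transformers import pipeline

# Create the text-generation pipeline (defaults to GPT-2).
generator = pipeline("text-generation")

# Sentence beginnings, already translated from Estonian to English.
# These are stand-ins for the article's two prompts.
prompts = ["Estonia produces electricity from",
           "The weather in Tallinn is"]

for prompt in prompts:
    # Cap the output at roughly 50 tokens, as in the article.
    out = generator(prompt, max_length=50)
    print(out[0]["generated_text"])
```

Note that the generated continuation is random by default, so each run produces different text.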
As you can see from the resulting text above, our model generates a fair amount of nonsense like “Estonia produces electricity from natural gas-fired hydroelectric power plant” 😄. That said, the Est-to-Eng-to-Est workflow seems to work well and the translation quality in this example is really good.
Named Entity Recognition
Named Entity Recognition (NER) is the task of finding the names of persons, locations, and organizations in a text. In the example below, I’ll give two Estonian sentences as input and try to detect all the named entities in them. Let’s walk through the code below:
- Initialize a named entity recognition pipeline and define the reasonable names for the classes that our model outputs. Here, I create a dictionary using the class codes as keys and Estonian meanings as values. These will be used later.
- Define the input text.
- Translate the input to English.
- Define the function for NER: our function a) runs the NER pipeline on the input, b) replaces all the cryptic class names with readable names in Estonian (in a for loop), and c) groups together strings/tokens that belong to the same entity (you can try the function without grouping to see the raw output).
- Print all of the entities together with the class that they belong to.
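The dictionary and grouping function described above can be sketched like this. The Estonian class names are my assumption of what the article’s dictionary contains, and the mock tokens stand in for real `pipeline("ner")` output so the sketch runs without downloading a model:

```python
# Class codes mapped to readable Estonian names (assumed values):
# isik = person, organisatsioon = organization, asukoht = location, muu = misc.
CLASS_NAMES = {"PER": "isik", "ORG": "organisatsioon",
               "LOC": "asukoht", "MISC": "muu"}

def group_entities(tokens):
    """Replace class codes with readable names and merge tokens that
    belong to the same entity (sub-word pieces start with '##')."""
    entities, prev_index = [], None
    for tok in tokens:
        label = CLASS_NAMES[tok["entity"].split("-")[-1]]  # "I-PER" -> "isik"
        word = tok["word"]
        if word.startswith("##") and entities:
            entities[-1][0] += word[2:]            # glue sub-word pieces together
        elif entities and entities[-1][1] == label and tok["index"] == prev_index + 1:
            entities[-1][0] += " " + word          # same entity continues
        else:
            entities.append([word, label])         # a new entity starts
        prev_index = tok["index"]
    return [tuple(e) for e in entities]

# Mock output in the shape returned by pipeline("ner"); in the real workflow
# you would call group_entities(ner(translated_text)) instead.
mock_tokens = [
    {"entity": "I-ORG", "word": "Hu",      "index": 1},
    {"entity": "I-ORG", "word": "##gging", "index": 2},
    {"entity": "I-ORG", "word": "Face",    "index": 3},
    {"entity": "I-PER", "word": "Mark",    "index": 5},
]
grouped = group_entities(mock_tokens)
# -> [("Hugging Face", "organisatsioon"), ("Mark", "isik")]
```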
The results above show that the model was able to correctly detect all of the named entities and assign them to the corresponding class (organization or person).
Summarization
Summarization is the task of condensing a long text or document into a shorter one. In the example below, I use Google’s T5 model, which was trained on mixed input data (including documents from CNN and the Daily Mail). Again, we can follow the already familiar workflow:
- Initialize the Transformers pipeline with the “summarization” task.
- Provide an input sentence or document.
- Translate the input to English.
- Summarize the text using the pre-trained English model. You can provide the maximum and minimum length of the summary as arguments; here, I limited the length of the summary to 30 tokens.
- Translate back to input language.
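A sketch of these steps, assuming the lightweight "t5-small" variant of T5 (the article does not state which size was used) and an illustrative stand-in for the already translated input:

```python
from transformers import pipeline

# Create the summarization pipeline; "t5-small" is assumed here as a
# lighter variant of the T5 model the article mentions.
summarizer = pipeline("summarization", model="t5-small")

# Input already translated from Estonian to English (illustrative stand-in).
text = ("Estonia is a country in Northern Europe. It borders the Baltic Sea "
        "and the Gulf of Finland, and shares land borders with Latvia and "
        "Russia. Its capital and most populous city is Tallinn. Estonia is "
        "known for its digital public services and widespread internet access.")

# Limit the summary to between 10 and 30 tokens, then extract the text.
summary = summarizer(text, max_length=30, min_length=10)[0]["summary_text"]
print(summary)
```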
The summarizer above was able to capture the core ideas of the input very well, but the output has some grammatical problems. Thus, the model is best used by an aware reader to quickly capture the core meaning of long documents.
Summary
This article demonstrated the idea of using pre-trained English language models on non-English languages by making translation part of the workflow. The concept was tested on five different tasks:
- In sentiment analysis, the quality of the output is heavily dependent on the quality of the translation. Here, we tried to detect the sentiment of two Estonian sentences: one example was successful and the other failed due to a translation error that changed the meaning of the sentence.
- In question answering, we saw that almost nothing was lost in the translation for our example.
- Text generation also works, but it’s hard to see any “real” use cases for this task because the output is pure fiction.
- In named entity recognition, we didn’t lose much in translation, and the method is well suited for this task.
- In text summarization, the model can be used to capture the core ideas of longer texts, but not to generate grammatically correct text.
In short, the translate -> use pre-trained English models -> translate back workflow is a useful way to carry out various Natural Language Processing tasks in smaller, less widely spoken languages.