How to use your text data effectively!

Tanmay Thakur
Published in The HumAIn Blog
8 min read · Jun 20, 2022

Data is fast becoming the currency of the information age. The saying “You’re not more important than the data you produce” might be a little extreme, but it carries a significant amount of truth. Much of the data produced and stored across today’s many mediums is raw, unparsed and untagged, and it would take far too long for people to sit down and make it machine-learnable by hand.

We at Goalist face this problem on a daily basis: we crunch a lot of crawled and submitted data, including feedback, suggestions, submission information, descriptions and a great deal of correspondence that is mostly kept for record-keeping. Only about 10% of the data we collect is fed into the recommendation engines we run, because that portion has to be carefully put through a cycle of parsing, training and feedback. The data we consume and store is a great help to our AI teams, who work constantly to automate many of the tedious tasks that would otherwise consume a great number of man-hours.

The downstream natural language tasks we perform include summarization, translation and classification. Many of them sit in very specific domains, in Japanese, which narrows down the transfer learning techniques we can apply. We could simply pick a generic Japanese language model, but these do not work well for our specific use cases. Transformers are the standard for solving NLP tasks these days, and we are not going to reinvent the wheel.

In this article, I am going to show you how to domain adapt a generic Japanese language model, after which you can attach model heads to it and tackle one of the downstream tasks mentioned earlier. Much of what I describe here is the same set of techniques we use at Goalist when domain adapting our data. This article unwraps the first phase of our NLP MLOps pipeline.

Here is a brief of things we are going to cover:
1. Domain Adaptation: The technique in detail and why its use is effective.
2. Framework of Choice: What to use when going into practice.
3. Preprocessing: Making raw data machine learnable.
4. Modeling: Strategy to utilize when domain adapting.

Domain Adaptation

Domain adaptation is the process of fine-tuning a pre-trained model on a new dataset so that its predictions are better adapted to that data.

Domain adapting models built specifically for Japanese is a very good place for us to start fast prototyping the ideas we have in mind at Goalist. It gives us the freedom to try out ideas on incoming data which, as mentioned earlier, arrives as a stream and is largely new even to us. Time therefore becomes a very important factor when trying to fit data to a task.

Framework of Choice

We at Goalist use huggingface for most of our NLP MLOps needs. It is a genuinely useful framework for training, tokenizing, storage and inference, and it is easy to maintain! But the most important advantage huggingface gives us, and the reason it is our tool of the trade, is its ability to deal with unstructured data.

Another great benefit is that a person without domain-specific or linguistic knowledge of the target corpus can still train the model without worry, as long as the chosen tokenizer is able to encode the context of the corpus.

An example of unstructured data that comes in for ingestion

As can be seen from the image above, the data that comes in for ingestion is very unstructured: it arrives mostly in comma-delimited or JSON format and does not have consistent tags, because documents from different departments and use cases are merged together just before parsing.
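To make this concrete before moving on, here is a minimal sketch of how such mixed records could be flattened into a plain list of strings. The file names and the “text” field are hypothetical placeholders rather than our actual schema; all the later steps need is a list of raw text strings, which I will call raw_datas.

import csv
import json

raw_datas = []

# Hypothetical JSON-lines export: keep whichever field holds the free text
with open("ingest.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        if record.get("text"):
            raw_datas.append(record["text"].strip())

# Hypothetical comma-delimited export with a "text" column
with open("ingest.csv", encoding="utf-8", newline="") as f:
    for row in csv.DictReader(f):
        if row.get("text"):
            raw_datas.append(row["text"].strip())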

Preprocessing

The best practice is to use the Datasets library, also by huggingface, which has a handy function for streaming data directly from archived files; we use this a lot.
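As an illustration of that streaming path, here is a small sketch using datasets.load_dataset with streaming=True; the archive name archive.jsonl.gz is a placeholder, not one of our real files.

from datasets import load_dataset

# Stream records lazily from a (hypothetical) compressed JSON-lines archive,
# without loading the whole file into memory
streamed = load_dataset("json", data_files="archive.jsonl.gz", split="train", streaming=True)
for example in streamed:
    print(example["text"])
    break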

from datasets import Dataset

# Wrap the cleaned list of raw strings in a Dataset under a single "text" column
dataset_dict = { "text": raw_datas }
dataset = Dataset.from_dict(dataset_dict)

Here I take the texts, after a little cleaning, and put them into a Dataset object. I can now save it to disk for easy and fast loading later! The raw data can be as simple as a list of the strings the model needs to be trained on.
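Saving the prepared Dataset and reloading it later is a one-liner each way; a quick sketch, with the path ./cached_corpus chosen arbitrarily:

from datasets import load_from_disk

# Persist the prepared dataset, then reload it instantly in later runs
dataset.save_to_disk("./cached_corpus")
dataset = load_from_disk("./cached_corpus")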

The model of course cannot process these raw strings, so we need to tokenize them into input ids and attention masks. The model and tokenizer we will use for this task is nlp-waseda/roberta-base-japanese:

I chose this as a starting point for a generic Japanese model because it has a decent vocabulary size and has been trained on all of Japanese Wikipedia and the Japanese portion of the CC-100 dataset. We also needed a model that has been pre-trained with a language modeling objective, because such a model’s base can encode the context of the corpus it was trained on. It is these pre-training objectives that help the model learn better language representations of its training corpus.
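As a quick illustration of what tokenization produces, here is a sketch of encoding a single arbitrary sentence with the base tokenizer; the exact ids depend on the checkpoint’s vocabulary, and this checkpoint expects Japanese that has been pre-segmented with whitespace (e.g. by Juman++).

import transformers

tok = transformers.AutoTokenizer.from_pretrained("nlp-waseda/roberta-base-japanese")

# A single string becomes a dict with input ids and an attention mask
encoded = tok("求人 情報 を 要約 する")  # arbitrary, pre-segmented sample text
print(encoded["input_ids"])
print(encoded["attention_mask"])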

Now, we need to retrain the tokenizer on the new texts so that any missed tokens from this corpus are now included in the new tokenizer, like so:

import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("nlp-waseda/roberta-base-japanese")
org_len = len(tokenizer)
# Train a new tokenizer on our corpus, keeping the old configuration but allowing
# roughly 1,000 extra entries in the vocabulary
new_tokenizer = tokenizer.train_new_from_iterator(raw_datas, org_len + 1000)

I chose to increase the vocabulary size by a thousand because I checked the raw training corpus and counted the unknown tokens and their frequencies (a sketch of that check follows). Fair warning though: training a new tokenizer from an old one is only possible with huggingface’s fast tokenizer classes. For tokenizers that aren’t fast, we need to train a new tokenizer from scratch; here is a tutorial on how to do so:
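Coming back to the vocabulary-size choice: here is a minimal sketch of the kind of unknown-token check described above, not the exact script we use. Note that subword tokenizers rarely emit the unknown token, so these counts are only a rough signal.

from collections import Counter

# Count how often the base tokenizer falls back to its unknown token over the corpus
unk_id = tokenizer.unk_token_id
unk_freq = Counter()
for text in raw_datas:
    for token_id in tokenizer(text)["input_ids"]:
        if token_id == unk_id:
            unk_freq[text[:30]] += 1

print("texts containing unknown tokens:", len(unk_freq))
print("total unknown tokens:", sum(unk_freq.values()))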

Now that the tokenizer has been expanded, we need to resize the embedding layer so as to accommodate the new tokens of the augmented vocabulary.

# Load the base checkpoint and grow its embedding matrix to match the expanded vocabulary
model = transformers.AutoModelForMaskedLM.from_pretrained("nlp-waseda/roberta-base-japanese").to('cuda')
model.resize_token_embeddings(len(new_tokenizer))

At this juncture our model is ready for training, our tokenizer is ready to deal with any new context it sees in the training corpus, and the number of unknown tokens is reduced. We already have our dataset as a list of strings, and tokenization remains the only step before the fun part, i.e. training!

This is where the strength of the Datasets class comes into play: it has many methods that make it very easy to go from archived unstructured data stored locally to batch-encoded streams ready for training. One of its cooler aspects is the ability to keep indexes, much like traditional databases, the most important being semantic similarity indexes such as FAISS, while using very little memory. To read more, you can go to the huggingface documentation:
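As a small aside on those indexes, here is a sketch of attaching a FAISS index to a Dataset; the toy texts and the 8-dimensional random embeddings are placeholders purely for illustration, and faiss needs to be installed.

import numpy as np
from datasets import Dataset

# Hypothetical example: index a column of precomputed embeddings for similarity search
small = Dataset.from_dict({
    "text": ["求人A", "求人B", "求人C"],
    "embeddings": [np.random.rand(8).astype("float32") for _ in range(3)],
})
small.add_faiss_index(column="embeddings")
scores, hits = small.get_nearest_examples("embeddings", np.random.rand(8).astype("float32"), k=2)
print(hits["text"])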

One of its most useful methods is map, which is very similar to the map functions of MapReduce used in big-data systems, where data comes in as a stream and these functions act as transforms in an ETL pipeline. Dataset.map() works in much the same way, and some of the more powerful applications of Datasets come from using it. It lets you apply a processing function to each example in a dataset, either independently or in batches, which also makes the processing fast.

We will first tokenize, drop the text column because we no longer need it, and use the remaining ids for training. Then we will group the texts in each batch together, which removes some of the sparsity created by padding and avoids the loss of information caused by truncation. Chunking texts together is a good strategy for a language modeling task, because you want to encode as much information as possible into the model without it learning trivial representations. The last thing we do is split the dataset into train and test groups for training and validation. One added note: if you use batching in your map functions, the mapping function must be able to handle an array of examples, whereas with a vanilla map it handles a single example; I have used both below for reference.

new_tokenizer.pad_token = new_tokenizer.eos_token

block_size = 128  # chunk length; chosen here as an example, pick to suit your GPU and the model's max length

def tokenize(example):
    # Vanilla map: handles one example at a time
    return new_tokenizer(example['text'])

def group_texts(examples):
    # Batched map: concatenate all sequences in the batch, then re-split into block_size chunks
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

dataset = dataset.map(tokenize)
dataset = dataset.remove_columns('text')
dataset = dataset.map(group_texts, batched=True)
dataset = dataset.train_test_split(test_size=0.1)

Grouping texts in a batch

Modeling

Now, for a language modeling pre-training objective we must define a data collator, which helps us prepare texts for the model in a batched fashion. We will go with causal language modeling as our pre-training objective. Under causal language modeling, the idea is to predict the next token in a sequence, and the model is only allowed to consider the words that occur to its left when doing so. For more on the various language modeling pre-training objectives, you can look at this article:

Data collators are also a huge help during training, because they apply dynamic padding when preparing batches.

Dynamic Padding at a glance

As we can see, dynamic padding reduces the sparsity introduced by padding, in keeping with the overall theme of reducing information loss. Training speed also increases because the length of each batch fed into the model is smaller. More on dynamic padding can be found here:
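To see the effect concretely, here is a small sketch that runs a few variable-length sequences through the same kind of collator we define below; the toy ids are arbitrary. Each batch is padded only up to its own longest sequence, and the padded label positions are set to -100 so the loss ignores them.

import transformers

# Illustrative only: three sequences of different lengths
features = [
    {"input_ids": [5, 6, 7]},
    {"input_ids": [5, 6, 7, 8, 9]},
    {"input_ids": [5, 6]},
]
collator = transformers.DataCollatorForLanguageModeling(tokenizer=new_tokenizer, mlm=False)
batch = collator(features)
print(batch["input_ids"].shape)  # padded to the longest sequence in this batch: (3, 5)
print(batch["labels"][0])        # padding positions appear as -100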

Now for the easy part: we define the training arguments and call a single method to train. Here we see the power of the huggingface framework, which abstracts away setting up a training loop, an optimizer, checkpointing and much more!

# Collator for causal language modeling (mlm=False), with dynamic padding per batch
data_collator = transformers.DataCollatorForLanguageModeling(tokenizer=new_tokenizer, mlm=False)

training_args = transformers.TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=1,
)

trainer = transformers.Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    data_collator=data_collator,
)
trainer.train()

That’s all! Now just attach any downstream task-specific head to do NER, classification, summarization, translation, text generation, question answering and more. Fine-tuning for those tasks with very few examples, or even none (few-shot or zero-shot learning), becomes much easier now that our upstream model is domain adapted and has learned the language representations.
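As a hedged sketch of that hand-off, assuming we first save the Trainer’s checkpoint (to ./results/domain-adapted, a path chosen here purely for illustration), a downstream head can be attached by loading the adapted weights into a task-specific Auto class; the number of labels is of course task dependent.

# Save the domain-adapted backbone and the expanded tokenizer (illustrative path)
trainer.save_model("./results/domain-adapted")
new_tokenizer.save_pretrained("./results/domain-adapted")

# Attach a (randomly initialized) classification head on top of the adapted backbone,
# then fine-tune it on the downstream task's labelled examples
clf = transformers.AutoModelForSequenceClassification.from_pretrained(
    "./results/domain-adapted", num_labels=3
)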

At Goalist, we have multiple downstream models performing various tasks, all inheriting from a single large upstream model much like the one we just trained. The upstream models in our organisation get updated weekly as data comes in as a stream!
