Introducing KARAKURI LM

Yasuhisa Nakashima · KARAKURI Techblog · Feb 2, 2024

We would like to introduce our new language models: KARAKURI LM 70B v0.1, a pretrained model designed for the Japanese language, and KARAKURI LM 70B Chat v0.1, its counterpart fine-tuned for conversational use.

KARAKURI LM is a pretrained language model that builds upon Llama 2.
Our model enhances Llama 2’s capabilities by incorporating additional Japanese vocabulary and further pretraining on a mixture of Japanese and multilingual corpora.

KARAKURI LM Chat is a fine-tuned version of KARAKURI LM, trained on a mixture of publicly available and closed conversational datasets.
Although only about 2.5% of the tokens in these conversational datasets are Japanese, the model performs remarkably well: at the time of release, it achieved the highest score among Japanese open models on the MT-Bench-jp benchmark.

Both models are available on the Hugging Face Hub.
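
For reference, here is a minimal sketch of loading and querying the chat model with Hugging Face Transformers. The repository ID is an assumption based on the model name, and the chat template details may differ, so check the model card on the Hub before use.

```python
# Minimal sketch of loading the chat model with Hugging Face Transformers.
# The repository ID is an assumption based on the model name; check the
# KARAKURI organization page on the Hub for the exact identifier. A 70B
# model needs multiple GPUs (or quantization) to load.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "karakuri-ai/karakuri-lm-70b-chat-v0.1"  # assumed repository ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# The chat template shipped with the model may expect additional fields
# (e.g. SteerLM attribute values); this is the generic usage pattern.
messages = [{"role": "user", "content": "Introduce yourself in Japanese."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```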

In this blog post, we’ll be delving deeper into the technical aspects of KARAKURI LM.

Continual Pretraining

To create our training data, we used not only publicly accessible corpora such as mC4 and RedPajama, but also a Japanese corpus that we gathered in-house.
Merging these, we constructed a training dataset of roughly 100 billion tokens, about 20% of which are in languages other than Japanese.

We used Llama 2 70B as the initial weights and added Japanese vocabulary to its tokenizer.
Note that we filtered the additional vocabulary to exclude infrequently used kanji characters and emoji.
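
The vocabulary expansion itself follows standard mechanics. The sketch below is only an illustration, not our exact procedure: it uses the 7B checkpoint as a lightweight stand-in and a hypothetical token list, adds Japanese tokens to a Llama 2 tokenizer, and resizes the embedding matrix so the new rows can be learned during continual pretraining.

```python
# Illustrative sketch only (not the exact procedure used for KARAKURI LM):
# extend a Llama 2 tokenizer with extra Japanese tokens and resize the
# embedding matrix so the new rows can be learned during continual pretraining.
# The 7B checkpoint is used here as a lightweight stand-in for the 70B base.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Hypothetical candidate tokens, assumed to be already filtered to drop
# rare kanji and emoji (the filtering step itself is not shown).
candidate_tokens = ["こんにちは", "ありがとう", "について", "という"]
new_tokens = [t for t in candidate_tokens if t not in tokenizer.get_vocab()]

num_added = tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))  # new embedding rows start randomly initialized
print(f"Added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
```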

The model was then trained on 16 billion tokens of this data.
For the hyperparameters, we used the values specified in the Llama 2 paper.

The training was carried out on 32 Amazon EC2 trn1.32xlarge instances and took approximately three days.
We used neuronx-nemo-megatron as the distributed training library.

Fine-tuning

For our training data, we utilized the OASST2 and a conversational dataset that we created ourselves.
Our custom dataset comprises approximately 1,000 conversations.
We incorporated non-Japanese conversations from the OASST2 without translating them into Japanese.
This decision was informed by our past experience with models trained on translated text: training on machine-translated data can degrade the naturalness of the generated Japanese and, in turn, hurt performance.
The training data contains roughly 36M tokens, of which around 2.5% are Japanese tokens.

Given the limited number of Japanese tokens, training could result in catastrophic forgetting, potentially leading to a significant decline in Japanese performance.
To mitigate this, we implemented a continual learning approach.
Specifically, we took data from the continual-pretraining dataset that had not yet been used for training and performed multi-task learning, solving the pretraining task concurrently with fine-tuning.
We expect this to act as a regularizer that prevents the parameters from deviating too far from those obtained through continual pretraining.
We set the proportion of mixed-in pretraining data to approximately 20% of the total.
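
As a minimal sketch of this mixing (file names are placeholders, not our actual data), the two sources could be interleaved with the Hugging Face datasets library:

```python
# Sketch of mixing held-out pretraining data into the fine-tuning stream at a
# roughly 20% ratio as a regularizer. File names are placeholders only.
from datasets import interleave_datasets, load_dataset

chat_dataset = load_dataset("json", data_files="chat_sft.jsonl", split="train")
pretrain_dataset = load_dataset("json", data_files="pretrain_holdout.jsonl", split="train")

mixed = interleave_datasets(
    [chat_dataset, pretrain_dataset],
    probabilities=[0.8, 0.2],  # ~20% of examples come from the pretraining task
    seed=42,
    stopping_strategy="all_exhausted",
)
```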

For fine-tuning, we chose SteerLM rather than SFT or RLHF.
SteerLM is an alignment method proposed by NVIDIA that is simpler than RLHF.
It is trained on datasets annotated with attribute labels, and the training process is entirely offline, eliminating the need for online data generation or evaluation.
At generation time, the model's behavior can be controlled by modifying the values of these attributes.
Because SteerLM stays within the framework of supervised learning, it is easier to implement and trains more stably than reinforcement learning.
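
To make the idea concrete, the sketch below shows attribute-conditioned prompting in the spirit of SteerLM. The template loosely follows NVIDIA's NeMo SteerLM data format; the attribute names and the exact prompt format used by KARAKURI LM Chat are assumptions here, not the model's actual interface.

```python
# Illustrative sketch of SteerLM-style attribute conditioning. The template
# loosely follows NVIDIA's NeMo SteerLM data format; the attribute names and
# the exact prompt format used for KARAKURI LM Chat are assumptions here.
def build_steerlm_prompt(user_message: str, attributes: dict) -> str:
    # Attribute labels are serialized into the prompt, e.g. "helpfulness:4".
    # Changing these values at inference time steers the model's behavior.
    attr_str = ",".join(f"{name}:{value}" for name, value in attributes.items())
    return (
        "<extra_id_0>System\n\n"
        f"<extra_id_1>User\n{user_message}\n"
        f"<extra_id_1>Assistant\n<extra_id_2>{attr_str}\n"
    )

print(build_steerlm_prompt(
    "Plan a weekend trip to Kyoto.",
    {"helpfulness": 4, "correctness": 4, "toxicity": 0},
))
```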

We used a learning rate of 1e-5, which is an order of magnitude smaller than what was used during pretraining.
Additionally, we opted for a slightly larger batch size of 256 to ensure that both Japanese and English tokens are adequately represented in each batch.
By backpropagating on batches that contain sufficient tokens from both languages, we obtain gradients that incorporate information from both, which we hope facilitates cross-lingual transfer.
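
For reference, these hyperparameters summarized as a plain config sketch (the field names are illustrative and not tied to a specific trainer):

```python
# Summary of the fine-tuning hyperparameters mentioned above, written as a
# plain config dict; the field names are illustrative only and not tied to
# any particular training framework.
finetune_config = {
    "learning_rate": 1e-5,      # roughly 10x smaller than the pretraining LR
    "global_batch_size": 256,   # room for both Japanese and English tokens per batch
    "pretrain_mix_ratio": 0.2,  # ~20% pretraining examples mixed in (see above)
}
```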

We used Swallow 13B as the attribute prediction model for SteerLM, training it on OASST2 and HelpSteer.

The training was carried out on 2 trn1.32xlarge instances and took approximately 10 hours.
As with continual pretraining, we used neuronx-nemo-megatron as the distributed training library.

Evaluation

The model’s performance was assessed using MT-Bench and MT-Bench-jp.

MT-Bench is a benchmark developed by LMSYS, designed to evaluate a model’s multi-turn conversation capabilities.
It enables the evaluation of a model’s ability to follow instructions and maintain consistency in multi-turn conversations.

MT-Bench-jp is the Japanese version of MT-Bench, localized into Japanese by Stability AI, with the leaderboard hosted by W&B.

MT-Bench-jp

At the time of release, KARAKURI LM 70B Chat v0.1 achieved the highest performance among Japanese open models on the MT-Bench-jp benchmark.
The score for KARAKURI LM Chat v0.1 is based on our own experiments, while the scores for other models are cited from the Nejumi LLM Leaderboard Neo.

[Figure: MT-Bench-jp scores]

MT-Bench

It also achieves performance comparable to Llama 2 70B Chat on the original English MT-Bench benchmark.
The score for KARAKURI LM Chat v0.1 is based on our own experiments, while the score for Llama 2 70B Chat is calculated from the experimental results published by LMSYS.

[Figure: MT-Bench scores]

Acknowledgements

We gratefully acknowledge the support from AWS Japan through the AWS LLM Development Support Program.
