Unsupervised competing neural language model for word segmentation

Coupang’s natural language processing technique that segments Asian languages automatically without human intervention

Coupang Engineering Blog
Jul 29, 2022

By Hughes Yu and Bin Wei

This post is also available in Korean.

In natural language processing (NLP), a language model is a crucial component that learns to predict the probability of a word occurring in a given text. For the Coupang e-commerce app, we use language models to categorize our catalog, predict what the user is typing, return results for the search keywords that customers type in, and provide personalized product recommendations. It is important to provide accurate and relevant results when users search for products using product titles and aliases. We use a neural language model because it overcomes the shortcomings of statistical language models. However, modeling for e-commerce poses different challenges from modeling for use cases like machine translation and general web search. Most models need to learn to understand sentences or conversations covering many different topics, but for e-commerce, models only need to learn to correctly identify product titles, product types, brands, and solutions to a problem like “acne treatment.”

Word segmentation (also called word tokenization) is important for breaking down product titles, or the aliases that users type in, into a form that the machine can understand. However, the widely used approach of supervised word segmentation is not suitable for character-based languages like Korean, Chinese, and Japanese. This post explains how Coupang devised a fully automated segmentation method for Asian languages without the need for labelling.

Table of contents

  1. Supervised vs. unsupervised word segmentation
  2. System overview
  3. Example of running the system
  4. Embedding model
  5. Forward and backward language models
  6. Word segmentation algorithm
  7. Conclusion

Supervised vs. unsupervised word segmentation

Language models most commonly operate at the word level. A piece of input text is split whenever a delimiter, like a white space, is identified. Such a segmentation technique is usually based on supervised algorithms trained on manually segmented and labeled corpora of news articles. However, this is not suitable for our needs, since the context and grammar of words used in news articles are significantly different from those of the product titles or keywords that users enter in the search field of an e-commerce website.

Language models also struggle with many East Asian languages that have different rules and restrictions compared to English. For example, the Korean language has postpositions that suffix a noun or pronoun, which can make it difficult to segment text into accurate keywords. Chinese and Japanese have no conventional delimiters at all. In Korean, delimiters can be omitted in informal writing, especially between the noun strings that form a product title.

As Coupang’s main markets are currently located in Asia, improperly segmented words in Asian languages can hurt the accuracy of search results and lead to a poor user experience. This can result in lost opportunities and even a reduction in sales. Therefore, we adopted an unsupervised segmentation method that does not rely on a pre-segmented or labeled corpus. The algorithm learns and discovers structure in the data to predict the most likely character.

System overview

We built the word segmentation system for the Korean language, and it can be extended to other languages like Chinese and Japanese with minimal work. The overall structure of the system is shown below in Figure 1. The system contains an embedding model and two language models built on top of the embeddings. All these models are trained without human effort and rely on learned probability distributions over characters.

Figure 1. Overview of the system

1. Catalog: We use all the product titles in the Coupang catalog to train the character embedding model — that’s more than 1 trillion unlabeled titles.

2. Embedding model: The data is fed into the embedding model. The model receives a character and outputs the corresponding embedding vector. This is similar to Word2Vec, a popular pretrained embedding technique.

3. Forward and backward language models: The language models receive the embedding vector of a character together with the embedding vector sequence of its context, and output the probability of that character-context pair. The probabilities from the forward and backward language models are compared, and segmentation is carried out where necessary (see the sketch after this list).
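To make the data flow concrete, here is a minimal sketch of how the three components fit together. The function names, signatures, and the use of Python are our own illustration rather than Coupang’s actual code; each piece is described in detail in the sections below.

from typing import List, Sequence

def embed(char: str) -> Sequence[float]:
    """Embedding model: map a single character to its dense embedding vector."""
    ...

def forward_probability(char: str, left_context: str) -> float:
    """Forward language model: how likely is `char` to follow `left_context`?"""
    ...

def backward_probability(char: str, right_context: str) -> float:
    """Backward language model: how likely is `char` to precede `right_context`?"""
    ...

def segment_title(title: str) -> List[str]:
    """Compare the two probabilities at each position and cut where the backward model wins."""
    ...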

Example of running the system

Here is an example of how the segmentation system works.

Let’s take a product title “사생활보호필름” which means a privacy screen protector in Korean. It needs to be segmented into “사생활” (literally meaning privacy) and “보호필름” (literally meaning protection film).

First, we select “활”, located around the middle of the product title, and use the forward model to predict the probability of the character “활” following “사생.” Let’s say that we get 0.57. Then, the backward model gives 0.03 for the probability of “활” preceding “보호필름.” The system now knows that “활” belongs to the word “사생활.”

Then, “사생활” is fed to the forward language model to predict the probability of “보” following “사생활,” and “호필름” is fed to the backward language model to predict the probability of the character “보” preceding “호필름.” Since “보호필름” is a whole word, the latter probability is higher than the former. So, we apply segmentation to start a new word and get the segmented “사생활” and “보호필름” from “사생활보호필름”.
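In code, the decision for “활” boils down to a single comparison between the two probabilities. The numbers below are the illustrative values from the example above, not real model outputs.

# Decision for "활" in "사생활보호필름", using the example probabilities above.
p_forward = 0.57   # probability that "활" follows "사생" (forward model)
p_backward = 0.03  # probability that "활" precedes "보호필름" (backward model)

if p_backward > p_forward:
    print('"활" starts a new word')                     # would trigger a segmentation here
else:
    print('"활" stays with "사생" to form "사생활"')     # what happens in this example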

Embedding model

The embedding model maps individual characters to corresponding uniform-sized vectors. We need the embedding model because it maps characters that frequently appear next to each other to similar vectors, giving the language models higher precision than a random mapping would. And to enable the model to work in an unsupervised way without the need for labelling, we designed it to predict a character from its surrounding context.

First, let’s say that we have a character set D made up of all the characters available in a dictionary for a language.

D = c1, c2, c3, …, cn

We can denote a product title T as below, with qi being the index in D of the i-th character of the product title.

T = cq1, cq2, cq3, …, cqn

A specific character in D becomes a one-hot vector ci, with i being its index in the character set. The dimension of ci is the number of characters in D. The i-th element of ci is assigned the value one and the rest are assigned zeros.

The one-hot vectors of the context characters cqi-2, cqi-1, cqi+1, cqi+2, which carry the context information of cqi within the product title T, are summed and fed to the projection layer. The projection layer maps this summed one-hot vector to a dense, uniform-sized vector. Finally, a softmax function is applied to the dense vector to predict the one-hot vector of cqi. The dense vectors produced by the projection layer serve as the character embeddings.

Figure 2. Embedding model to find the one-hot vector of cqi
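Below is a minimal sketch of such a CBOW-style character embedding model in PyTorch. The class name, layer sizes, and the choice of PyTorch are our own assumptions for illustration; the post does not state which framework or hyperparameters Coupang uses.

import torch
import torch.nn as nn

class CharEmbeddingModel(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 128):
        super().__init__()
        # Projection layer: maps each context character index to a dense vector.
        self.projection = nn.Embedding(vocab_size, embed_dim)
        # Output layer + softmax predicts the one-hot vector of the center character cqi.
        self.output = nn.Linear(embed_dim, vocab_size)

    def forward(self, context_ids: torch.Tensor) -> torch.Tensor:
        # context_ids: (batch, 4) indices of cqi-2, cqi-1, cqi+1, cqi+2 in D.
        summed = self.projection(context_ids).sum(dim=1)        # (batch, embed_dim)
        return torch.log_softmax(self.output(summed), dim=-1)   # predicted distribution over D

# After training, model.projection.weight[i] serves as the embedding vector of character ci.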

Forward and backward language models

A language model is a sequential model that accepts a sequence of words (a sentence) as their corresponding vectors and predicts the vector of the word that comes next. We adapted the model to accept character sequences and predict the character before or after the input sequence. Based on the predicted character, we can then measure the probability of the actual character that follows or precedes the input sequence.

We use two character-level language models to examine text bidirectionally and increase the accuracy of the result. Both models are based on the Recurrent Neural Network (RNN). We chose the Gated Recurrent Unit (GRU) since GRU delivers good performance with fewer training parameters and consumes fewer resources than other RNN variants such as LSTM.

Our language models are trained on the same dataset with a window size of m. The forward model predicts the probability of the character following the preceding part of the sequence, and the backward model predicts the probability of the character preceding the following part. For the forward language model, the m characters before cqi, namely cqi-m, cqi-m+1, … , cqi-2, cqi-1, are first converted to embedding vectors by the embedding model and then fed to the language model as a vector sequence. The expected output of the model is the embedding vector of cqi, and the loss function of the model is the Euclidean distance between the cqi embedding vector and the prediction of the model. The backward model does the same with the m characters after cqi, namely cqi+1, cqi+2, … , cqi+m-1, cqi+m, as input, and outputs another prediction of cqi.

Figure 3. Operation of the forward and backward language models

The probability of the predicted next character (or the character before) in a sequence is measured as the exponential of the negative Euclidean distance between the prediction and the actual character embedding.

For example, suppose the actual embedding is cqi and the prediction of the model is cqi’. The context of cqi is con_i, and inputting con_i to the model outputs cqi’. Then the probability of cqi following con_i for the forward direction model can be calculated with the following formula.

exp(-‖cqi - cqi’‖)
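As a sketch, this probability can be computed directly from the two embedding vectors. The function below is our own illustration in PyTorch, mirroring the formula above.

import torch

def char_probability(predicted: torch.Tensor, actual: torch.Tensor) -> torch.Tensor:
    # Exponential of the negative Euclidean distance between prediction and actual embedding.
    return torch.exp(-torch.linalg.norm(predicted - actual, dim=-1))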

Word segmentation algorithm

The probabilities are then compared (and, in a way, compete, which gives the method its name). If the probability from the backward model is higher, the character is considered to belong to the following word, and segmentation is carried out. This process is repeated over the product title until all characters belong to a word. Here is pseudocode for the segmentation algorithm.

Input: product title T = cq0, cq1, cq2, …, cqn-1
Output: a word boundary sequence S of T

0  forward_start_pos = 0, reverse_start_pos = n-1
1  for i : 1 < i < n-1
2      if R(SUB_SEQ(i+1, reverse_start_pos)) - F(SUB_SEQ(forward_start_pos, i-1)) > seg_score_threshold
3          forward_start_pos = i
4          S ← i
5  end for
6  return S

The F in line 2 represents the forward model, and R represents the backward model. We go through all the characters in the sequence and compare the probabilities of the two models. The difference between the two probabilities, seg_score, is compared against a threshold, seg_score_threshold. Segmentation is applied when the probability of the backward model is meaningfully higher than that of the forward model. At the same time, we use the threshold to keep the number of segments below the expected level.

For example, if there are 30 characters in a product title and we know that the average word length is 3, then the number of segments will be less than 10.
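The same loop can be written out in Python. The sketch below follows the pseudocode above; the threshold value and the forward_prob/backward_prob callables are placeholders standing in for F and R, not Coupang’s actual implementation.

from typing import Callable, List

def segment(title: str,
            forward_prob: Callable[[str, str], float],
            backward_prob: Callable[[str, str], float],
            seg_score_threshold: float = 0.3) -> List[int]:
    # Returns the word boundary positions S of the product title T.
    boundaries: List[int] = []
    forward_start_pos = 0
    n = len(title)
    for i in range(1, n - 1):
        left = title[forward_start_pos:i]    # SUB_SEQ(forward_start_pos, i-1)
        right = title[i + 1:n]               # SUB_SEQ(i+1, reverse_start_pos)
        seg_score = backward_prob(title[i], right) - forward_prob(title[i], left)
        if seg_score > seg_score_threshold:
            # The backward model wins: character i starts a new word.
            forward_start_pos = i
            boundaries.append(i)
    return boundaries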

Conclusion

Word segmentation is one of the fundamental tasks in natural language processing. You can find information about other character-to-word segmentation techniques for Asian languages, but they require heavy human effort to label data. The solution described here needs only the raw data of product titles and works in full automation with zero labelling effort.

This post introduces only a small part of our work on building the knowledge graph of the catalog, which helps areas like search, recommendations, and advertising. We are also encountering more and more challenges around multi-language support as Coupang expands its services into international markets.

If you wish to fuel the next stage of growth in many of our global locations, check out the open positions.

Coupang Engineering is also on Twitter. Follow us for the latest updates!
