Introduction to NAVER Place AI Development Team

yunsang ju
네이버 플레이스 개발 블로그
13 min read · Sep 8, 2023


This is a link to the Korean version

Hello. My name is Yun Sang Ju, and I’m in charge of the AI development team at GLACE (Global Place) CIC (Company-In-Company), one of NAVER’s CICs. In this post, I’d like to introduce our AI development team, our mission, and how we apply AI technology to real services.

I will continue with the explanation following the table of contents below.

  • Services provided by NAVER GLACE CIC
  • The mission of NAVER GLACE CIC AI development team and AI products under development/operation
  • Key considerations for operating service AI products

Services Provided by NAVER GLACE CIC

“GLACE” is a combination of the words “Global” and “Place.” NAVER refers to the process of connecting offline services such as restaurants, guesthouses, and hair salons to online platforms as “Place.” The reason for adding “Global” is that our organization aspires to expand not only in Korea but also in Japan and other countries. The image below represents the direction pursued by NAVER GLACE CIC, and we are operating various O2O (Online to Offline) services in both Korea and Japan.

Korea

In Korea, we provide a variety of O2O services such as NAVER Place, Reservations, and MY PLACE. NAVER Place offers rich, detailed information on millions of businesses and other points of interest across Korea, including reviews from users who have visited those places. At the time of writing, NAVER Place has accumulated over 350 million reviews and continues to grow.

Additionally, we offer the SmartPlace tool, which allows Business Owners to easily manage their places. Through SmartPlace, Business Owners can manage their NAVER PLACE detailed pages and receive various statistical information that benefits their businesses.

Japan

In Japan, we operate a similar O2O service called LINE Place, tailored to Japanese users. One of LINE Place’s most well-received features is the menu review, which lets users write reviews at the level of individual menu items rather than only the restaurant as a whole.

The Mission of NAVER GLACE CIC AI Development Team and Products

The mission of the NAVER GLACE CIC AI development team is to use AI on the data accumulated in GLACE to assist our services and to extract TAGs that can be used for search.

We operate more than 15 different AI models for the numerous GLACE services in Korea and Japan, spanning two domains, Computer Vision (CV) and Natural Language Processing (NLP), to enhance service quality. Let me briefly explain the models we develop and operate, starting with the NLP tasks.

OCR Place Matching

OCR Place Matching is an embedding-based text retrieval model that uses OCR-parsed text as a query to accurately identify the place a user visited among the millions of places registered in GLACE. One of the services operated by NAVER GLACE is the receipt authentication feature: after visiting a place, a user photographs their paper receipt and uploads it, and the system verifies the visit so that the user can leave a review. Several ML models work together to provide this feature. First, when a user uploads a receipt image, NAVER CLOVA OCR extracts the text; the model then uses this text to accurately identify the place the uploader wants to review.

The above image illustrates the detailed process for locating the visited place from an uploaded receipt image. If the user uploads a receipt image with excessive glare or distortion, OCR recognition becomes challenging, potentially producing misrecognized text with typos. Furthermore, NAVER CLOVA OCR employs a parsing model that post-processes the extracted text with Named Entity Recognition (NER) to identify the necessary fields. If the parsing model misfires, however, the order of the elements of an address can be shuffled, as shown in the figure. Since each segment of the extracted text may contain noise, conventional search engines struggle to find the precise place. For this reason, we developed a model robust to typos that returns location search results with over 90% accuracy.
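As a rough illustration of why character-level matching tolerates OCR typos, here is a minimal sketch that retrieves the closest place name by cosine similarity over character bigrams. The function names, toy place list, and similarity scheme are all hypothetical; the production model uses learned text embeddings, not n-gram counts.

```python
from collections import Counter
import math

def char_ngrams(text, n=2):
    """Break text into overlapping character n-grams (typo-tolerant units)."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def match_place(ocr_query, registered_places):
    """Return the registered place most similar to the noisy OCR text."""
    q = char_ngrams(ocr_query)
    return max(registered_places, key=lambda p: cosine(q, char_ngrams(p)))

places = ["Gangnam Sushi House", "Gangnam Pasta Kitchen", "Hongdae Coffee Lab"]
# OCR output with typos introduced by glare/distortion on the receipt
print(match_place("Gangn4m Sush1 Huose", places))  # → Gangnam Sushi House
```

Because similarity is computed over many small character fragments, a few corrupted characters only remove a few shared bigrams instead of breaking an exact string match.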

We also apply embedding-based retrieval to other tasks. Next, let’s look at a menu search model that is likewise robust to typos.

Menu Matching

To effectively gather menu-specific reviews, the database of place-specific menus needs to be well organized. While collecting menu information from receipts, we faced the challenge of noisy OCR-parsed menu names. To address this, we developed a model that refines noisy menu text and adds new menu options to the DB, which has improved search accuracy for existing menus. Furthermore, we developed a sequence labeling model that classifies raw menu text into 9 different option types, such as Size, Price, and Temperature. This enables precise searching even for menus with varying options.
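The option-labeling step can be pictured with the toy tagger below. It uses hand-written rules purely for illustration; the actual system is a trained sequence labeling model, and the label names here are assumptions.

```python
import re

def tag_menu_tokens(raw_text):
    """Toy rule-based stand-in: assign an option label to each menu token.

    Hypothetical labels (the post mentions 9 option types, including
    Size, Price, and Temperature); real labels come from a trained model.
    """
    labels = []
    for token in raw_text.split():
        if re.fullmatch(r"[\d,]+원?|\$?[\d.]+", token):
            labels.append((token, "PRICE"))
        elif token.lower() in {"hot", "iced", "ice"}:
            labels.append((token, "TEMPERATURE"))
        elif token.upper() in {"S", "M", "L", "SMALL", "MEDIUM", "LARGE"}:
            labels.append((token, "SIZE"))
        else:
            labels.append((token, "NAME"))
    return labels

print(tag_menu_tokens("Iced Americano L 4,500원"))
# → [('Iced', 'TEMPERATURE'), ('Americano', 'NAME'), ('L', 'SIZE'), ('4,500원', 'PRICE')]
```

Once every token carries an option label, "Americano" in size L can be matched independently of its temperature or price variants.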

That concludes our explanation of NLP tasks, and now, let’s move on to CV tasks.

Receipt Classifier

During service, we have encountered quite a few users uploading fake receipts for rewards or advertising. To block such abnormal receipts in real time, we developed a receipt authentication model and applied it to the service. In this way, we aim to improve the user experience and service stability.

Food Classifier

We have developed a model that can accurately classify only food images from user review images. This model can identify over 300 types of foods from both Korea and Japan, enabling versatile applications within the service.

Image Scoring

We have developed a model that scores the quality of review images uploaded to the GLACE service. The model assigns each review image a score between 0 and 4, allowing us to prioritize high-quality review images in the NAVER Place and LINE Place “Discovery” tab. It predicts scores of 3 and 4 with remarkable accuracy, exceeding 99% on our test sets. As a result, users can easily find high-quality review images in the Discovery tab, leading to increased user satisfaction and service engagement.
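One way such scores could feed a ranking like the Discovery tab is sketched below; the records and the `min_score` cutoff are hypothetical, not our actual serving logic.

```python
# Hypothetical review records; in reality the score comes from
# a trained CV model that rates each image from 0 to 4.
reviews = [
    {"image": "img_001.jpg", "score": 4},
    {"image": "img_002.jpg", "score": 1},
    {"image": "img_003.jpg", "score": 3},
    {"image": "img_004.jpg", "score": 0},
]

def discovery_feed(reviews, min_score=3):
    """Keep only high-quality images and rank best-first."""
    kept = [r for r in reviews if r["score"] >= min_score]
    return [r["image"] for r in sorted(kept, key=lambda r: -r["score"])]

print(discovery_feed(reviews))  # → ['img_001.jpg', 'img_003.jpg']
```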

High-Quality Review Images

Low-Quality Review Images

Object Detection

NAVER Place has accumulated a significant number of reviews from various types of places, such as restaurants, hospitals, beauty salons, and more. However, cases where privacy rights issues arise or inappropriate review images are displayed on the public review tab can result in operational problems for the service. To address these concerns, we have developed an Object Detection model to filter out inappropriate review images, and have applied this feature to the service.

Atmosphere Classifier

You may have had experiences where the same food tastes different depending on the atmosphere of the restaurant. Each establishment has its own unique ambience, which is often captured in the user’s review images. Analyzing these review images to understand the ambience of the restaurant can greatly assist users in finding places with the desired atmosphere. Therefore, we have developed and implemented a model that can identify over 50 atmosphere features from review images.

PlaceLM

Let’s take a look at a recent project we’re focusing on, which is PlaceLM (GLACE-exclusive LLMs). We are developing GLACE-exclusive LLMs for the following reasons.

  • Currently, we are developing specialized Vision and NLP models for each extracted tag, and over 15 models are in operation for both Korea and Japan. As the number of models increases, operational costs also increase.
  • Using specialized models for each tag presents challenges in extracting complex semantic tags.
  • If we use a single model not only for extracting existing tags but also semantic tags, we can improve the quality of tags and reduce operational costs.

To utilize it in the actual service, it’s important to have low inference costs and the ability for the model to quickly adapt to changing data. The services operated by GLACE receive a significant amount of traffic, and there is a high volume of changing places on a daily basis. Therefore, it’s important to set appropriate model parameters to handle frequent updates of new places and process a high volume of traffic effectively.

To address this, we have developed GLACE-specific LLMs named PlaceLM. The model is relatively small, enabling frequent updates with the latest place information; it handles service traffic at lower operational cost and performs well within the Place domain. PlaceLM has 12.8B parameters, which is smaller than ChatGPT or HyperCLOVA X.

When developing LLMs, there are usually three steps, which can be compared to cooking steps for better understanding.

Pretrained Language Model

  • Just as a good broth forms the base of a dish, this step involves training the model on a large corpus of general knowledge.

Supervised Fine-Tuning

  • Much like using carefully selected organic ingredients to create a gourmet dish, this step uses a smaller, curated corpus to guide the model’s text generation.

Preference Optimization

  • Lastly, like adding seasoning to complete the flavor, this final step involves incorporating Human Preference into the model to prevent it from generating odd or undesirable text.

Evaluation Task Selection

To assess the results of model training, we have selected the following tasks.

Public Tasks

  • Our team operates models for Multiclass Classification, Sequence Labeling, and Summarization tasks, so performance on the following two tasks is crucial:
  • NSMC (NAVER Sentiment Movie Corpus: positive/negative sentiment analysis of movie reviews)
  • KorQuAD (Korean QA Benchmark)

GPT-4-based Evaluation

  • To assess the model’s overall generation capability, we have included the following evaluation.
  • G-EVAL: NLG Evaluation using GPT-4 with Better Human Alignment (Liu, Yang, et al. 2023). This approach evaluates the model’s overall generation performance through GPT-4.

Internal Task Evaluation

  • There are several tasks that PlaceLM aims to solve within the GLACE service. To compare its generation performance with GPT-4 on these tasks, we have included this evaluation task.
  • We convert internal tasks into prompts and compare the generation results of GPT-4 and PlaceLM using the ROUGE score.
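ROUGE-L, the metric used for this comparison, is based on the longest common subsequence (LCS) between two texts. Below is a minimal self-contained implementation; the example outputs are invented, not real model generations.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference):
    """ROUGE-L F1 between a candidate and a reference string."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

# Invented example outputs for the same prompt
gpt4_out = "the restaurant offers a cozy atmosphere and great pasta"
placelm_out = "the restaurant has a cozy atmosphere and great pasta"
print(round(rouge_l(placelm_out, gpt4_out), 3))  # → 0.889
```

A score near 1.0 means the two generations share most of their wording in the same order, which is how we quantify how closely PlaceLM tracks GPT-4 on internal tasks.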

Pretrained Language Models

Dataset

We conducted training primarily on NAVER data; the training details are as follows.

  • Total number of Tokens: 71.23B
  • Detailed ratio of Training Data

Training

The training was conducted using 100 A100/V100 GPUs, and each epoch took 14 days.

A total of 2 epochs of training have been completed.

Supervised Fine-Tuning

Dataset

We have built high-quality training data with instructions for the tasks we want to address within the Place domain.

  • Total number of documents: 622,426
  • Detailed ratio of Training Data

Training

We conducted training using LoRA and QLoRA, which are Parameter-Efficient Fine-Tuning (PEFT) methods.

We used 8 A100 GPUs for the training, and it took 2 days to complete.

Preference Optimization

The Preference Optimization training is currently in progress, and we have completed establishing the training data.

Datasets

For the instructions we want to address, we generated multiple responses from our internal model and then organized them into the “Chosen” and “Rejected” categories.

The training data details are as follows.

  • Total number of documents: 31,523
  • Detailed ratio of training data
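Conceptually, turning multiple responses per instruction into "Chosen"/"Rejected" training examples can be sketched as follows. The scores and texts here are made up; in practice the ranking comes from human or model preference judgments.

```python
def build_preference_pairs(instruction, scored_responses):
    """Pair every higher-ranked response with every lower-ranked one."""
    ranked = sorted(scored_responses, key=lambda r: -r["score"])
    pairs = []
    for i, better in enumerate(ranked):
        for worse in ranked[i + 1:]:
            if better["score"] > worse["score"]:
                pairs.append({
                    "prompt": instruction,
                    "chosen": better["text"],
                    "rejected": worse["text"],
                })
    return pairs

# Hypothetical responses generated by an internal model for one instruction
responses = [
    {"text": "A clear, polite summary of the place.", "score": 2},
    {"text": "An off-topic rambling answer.", "score": 0},
    {"text": "A decent but terse answer.", "score": 1},
]
pairs = build_preference_pairs("Summarize this place's reviews.", responses)
print(len(pairs))  # → 3
```

Each pair tells the preference-optimization stage which of two generations for the same prompt should be favored.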

Evaluation

Public Task Evaluation

The PlaceLM pretrained model achieved the highest performance among publicly available Korean LLMs, outperforming even some SFT models. Among all the models compared, our SFT model performed best.

Evaluation Based on GPT-4

Among the LLMs publicly available in Korea, Kullm’s performance is the best in this benchmark. However, despite the fact that the test prompts for this evaluation mostly contain general content not present in the PlaceLM training data, the model performance is comparable to Kullm. In conclusion, our model shows very high performance in public tasks like NSMC and KorQuAD compared to Kullm and demonstrates similar performance in GPT-4 based evaluations.

The graph is taken from https://github.com/nlpai-lab/KULLM. The metrics are Understandable, Naturalness, Maintains Context, Interesting, Uses Instructions, and Overall Quality.

Internal Task Evaluation

Even though it’s difficult to share detailed tasks, the highest Rouge-L score among internal tasks is 0.823, and PlaceLM generates responses similar to those of GPT-4 at a high rate.

Future Plans

We plan to add Japanese data to PlaceLM to address a variety of issues in both Korea and Japan. Additionally, we are making efforts to use PlaceLM as a Language Encoder to expand it into a Vision Language Model. We intend to utilize this model to replace or enhance the NLP and CV models operated by our team and extract more diverse TAGs.

Key Considerations for Operating Service AI Products

Training Data Version Control

As the number of models increased, we found it increasingly difficult to manage our training data, which made reusability and history management of training data essential. We version our training data with DVC (Data Version Control), which stores the actual binary data in separate storage and uses Git for version control. With DVC, data and model management become more convenient, allowing us to track model performance changes against data variations. It also facilitates collaboration across multiple models and team members and allows easy restoration of any version of the training data. In this way, we have enhanced project efficiency and stability while maximizing productivity in model development and management.

Connecting Training and Serving

When handling customer feedback, it is often necessary to examine the training output artifacts of deployed models. By using MLflow tracking and the model registry, we can efficiently track model versions and experiment data, enabling rapid responses when problems occur. It also allows precise reproduction of a model’s training process and verification of its results, helping us improve model performance.

Model Degradation

A service is like a living organism that evolves in real time with its users’ changing patterns, so the performance of trained models can degrade over time. To prevent this, we monitor model performance daily, statistically analyze model outputs, and address degradation before users notice it. If the distribution of model outputs moves beyond a certain threshold, we retrain the model on the latest data and redeploy it.

Below is an example distribution graph of the OCR Place Matching model.
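One simple way to quantify such a distribution shift is the Population Stability Index (PSI). The sketch below uses hypothetical binned distributions and a common rule-of-thumb threshold; it is not our actual monitoring pipeline.

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two binned output distributions."""
    score = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # avoid log(0) on empty bins
        score += (a - e) * math.log(a / e)
    return score

# Hypothetical binned match-score distributions (fractions summing to 1)
baseline = [0.10, 0.20, 0.40, 0.30]   # distribution at deployment time
today    = [0.25, 0.30, 0.30, 0.15]   # distribution observed in production

drift = psi(baseline, today)
needs_retraining = drift > 0.2  # common rule-of-thumb threshold
print(round(drift, 3), needs_retraining)  # → 0.311 True
```

When the drift score crosses the threshold, that is the signal to kick off retraining on the latest data.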

Limited GPU Resources

To minimize the idle time of high-cost GPU resources, we have set up multiple devices to efficiently operate in both training and serving environments.

Training

In the training environment, we have established a training cluster using Ray to manage GPU resources.

By ensuring that specific individuals do not monopolize GPU resources and that resources are utilized only during training, we have minimized idle time.

Serving

For serving models on CPUs instead of GPUs, we take a variety of steps.

  • The developed models undergo quantization and knowledge distillation to become lightweight while maintaining comparable performance
  • Various tunings are employed to optimize neural network inference on CPUs

The developed models go through the above processes before being deployed to production. Over 50% of the models our team operates are served on CPUs. For models that are difficult to serve on CPUs, HPA (Horizontal Pod Autoscaling) has been applied to maintain the minimum number of replicas.
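As a rough illustration of the quantization step, here is a minimal int8 weight quantization round trip. This is a pure-Python sketch of the idea, not production tooling; real deployments rely on framework-level quantization.

```python
def quantize_int8(weights):
    """Map float weights to int8 with a single symmetric scale factor."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]

# Hypothetical layer weights
weights = [0.517, -1.27, 0.031, 0.984, -0.4]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(q, round(max_err, 4))  # → [52, -127, 3, 98, -40] 0.004
```

Storing weights as int8 quarters the memory of float32 and enables faster integer arithmetic on CPUs, at the cost of a bounded rounding error (at most half a quantization step per weight).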

Enhanced Model Performance Testing and Seamless Deployment

We have models that are trained and deployed daily to maintain performance. To prevent deployment issues from affecting the entire service, we have established a stable and seamless automatic deployment process. For this purpose, we utilize PrePromotionAnalysis of Argo Rollouts, which ensures that the deployment proceeds only if the tests are approved. This approach ensures that only a stable model is deployed, minimizing service disruptions. Additionally, once the deployment is completed, we use PostPromotionAnalysis to measure the performance of the model and receive notifications through messenger programs. This allows us to easily assess whether the deployment was successful and to monitor any performance changes. As a result, we can quickly review deployment-related information and promptly address any issues that may arise.

This was a brief introduction to the NAVER GLACE AI development team.

The AI field, like many others, is changing rapidly. Our team aligns its research direction with these fast-evolving trends and develops AI products that benefit our services.

We welcome anyone interested in creating AI products that contribute to global O2O (Online to Offline) services. Please feel free to reach out and show your interest! 😃

References

https://github.com/nlpai-lab/KULLM

https://arxiv.org/abs/2204.14198

https://huggingface.co/blog/rlhf

https://arxiv.org/abs/2202.13959

https://arxiv.org/abs/2303.16634
