ML-powered restaurant review summarisation with cultural insights

Varun Ullanat
13 min read · Dec 14, 2023


This article was produced as part of the final project for Harvard’s AC215 Fall 2023 course.

Project GitHub Repo — https://github.com/hpiercehoffman/AC215_FlavorFusion/tree/main

Video — https://youtu.be/ZdaxhP82LCs

Team — FlavorFusion (Varun Ullanat, Hannah Pierce-Hoffman)

Table of Contents

  1. Introduction
  2. Data pipeline
  3. Model training and optimisation
  4. API and Frontend
  5. Deployment and Scaling
  6. Future work
  7. References

Introduction

Background and Motivation

FlavorFusion was born from a simple question. As an intercultural couple, we’ve often found that we have wildly different dining experiences at the same restaurant. A typical case is spice level — Varun (South Indian) thinks the food is incredibly bland, while Hannah (American) thinks it’s almost too spicy to eat. It’s hard to tell from restaurant reviews whether a given restaurant will be “American spicy” or “South Indian spicy”. However, we found that we tend to look for reviews by people from our own culture to get clues about what our own dining experience will be like. After all, food preferences are inherently linked to the cultural food someone grew up eating. Hence the question: What do people of ______ culture think of ______ restaurant?

Our app aims to help users from diverse cultural backgrounds find the perfect restaurant. By inferring cultural background from a reviewer’s first and last name, we create groups of reviewers* for each restaurant in the Google Local reviews dataset. We then use the PRIMERA multi-document summarization model to generate a different review summary for each cultural group.

*It’s important to note that our cultural classification by name is highly inexact, and only based on guesses made by a Python package! If we were to roll out this app in production, we would add much more detailed cultural classification, and clearly show the level of uncertainty inherent in guessing someone’s cultural background from metadata.

Short summary

The goal of this project is to use a generative NLP model to gain insight into cultural differences in restaurant experience. We first identified a dataset of over 600 million Google business user reviews from which our model generates summaries. We also identified a labelled dataset (with ground-truth summaries) comprising about 60,000 product reviews, in case we need to fine-tune our model. We used the PRIMERA multi-document summarization model as our base NLP model. To stratify reviews by culture, we use the ethnicolr Python package, which predicts ethnicity from a reviewer's first and last name. We also manually summarize about 60 groups of Google restaurant reviews so the model can be trained in a few-shot manner. We then use the final fine-tuned PRIMERA model to generate a short summary of the reviews in each group. At every step, we used popular MLOps tools and workflows to ensure successful deployment. We visualize our workflow in Figure 1.
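To make the name-based grouping step concrete, here is a minimal sketch of how ethnicolr can attach a predicted cultural label to each reviewer. The reviewer table, the column names, and the choice of the Wikipedia-trained pred_wiki_name model are illustrative assumptions rather than our exact pipeline code.

```python
import pandas as pd
from ethnicolr import pred_wiki_name  # name-based ethnicity classifier

# Hypothetical reviewer table; the real column names in our pipeline may differ.
reviewers = pd.DataFrame(
    [{"last": "Ullanat", "first": "Varun"},
     {"last": "Pierce-Hoffman", "first": "Hannah"}]
)

# Predict ethnicity from the last- and first-name columns (argument order as in
# the ethnicolr documentation). The result gains a predicted 'race' column plus
# per-class probabilities.
predicted = pred_wiki_name(reviewers, "last", "first")

# Group a restaurant's reviewers by predicted cultural background; each group's
# reviews are later summarized separately.
for label, group in predicted.groupby("race"):
    print(label, len(group))
```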

Figure 1. Workflow of our application

Data Pipeline

Datasets

We have two primary datasets, each of which will be used at a different stage of training.

  • The LSARS dataset contains 58,418 review-summary groups, which were generated from a large dataset of 153 million Chinese clothing reviews. The authors of this dataset grouped reviews by product, then used a form of text processing called aspect extraction to identify one summary review from each group. This dataset will be used for fine-tuning the PRIMERA model. We sourced this data from here.
  • The Google Local dataset contains 666 million reviews of 4.9 million U.S. businesses, representing 113 million individual users. This dataset was collected by scraping reviews from Google Maps. In our work, we will use a subset of reviews for businesses categorized as restaurants. This dataset will be used for few-shot fine-tuning (using a small number of examples with manually generated summaries). We sourced this data from here. For this project we limited ourselves to restaurants in Massachusetts.

The two raw datasets are downloaded and stored in respective Google Cloud Buckets.

Data preprocessing

To process the data into a format that can be used by our model training pipeline, we created a Docker container for each of the two datasets:

  1. preprocess_google container: This container reads 2 GB of Google Local data and filters out non-restaurant businesses and low-quality reviews. Its inputs are the file paths of the datasets downloaded from the Google Cloud bucket, a secrets file (containing service account credentials for Google Cloud), and the preprocessing parameters. Its output is a CSV file containing merged data for all businesses, where the columns are business name, location, and all reviews separated by a special token (‘|||||’). A minimal sketch of this filtering and merging step follows the list.
  2. preprocess_lsars container: This container is responsible for preprocessing and translating the LSARS dataset, which is in Chinese. We use the Google Cloud Translate API to translate the text to English. This API provides free translation for the first 500,000 tokens; after that, the price is $20 per 1 million tokens. We authenticate to the Cloud Translate API via our project’s service account. The inputs to this container are the input file paths (which must be mounted as Docker volumes), the preprocessing parameters, and the secrets file. Its output is a CSV file containing the translated data. Since the data is large, we translate it in chunks and produce a CSV file for each chunk. We maintain the train-test split from the original dataset.
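As a rough illustration of the first container, the sketch below filters a slice of Google Local data down to restaurants, drops very short reviews, and joins each business's remaining reviews with the ‘|||||’ separator. The file names, column names, and quality threshold are assumptions for illustration; the real container reads its inputs from the mounted bucket data and applies more thorough filtering.

```python
import pandas as pd

REVIEW_SEP = "|||||"  # special token separating reviews for one business

# Hypothetical file and column names for a Google Local slice.
businesses = pd.read_json("meta-Massachusetts.json", lines=True)
reviews = pd.read_json("review-Massachusetts.json", lines=True)

# Keep only businesses whose category list mentions restaurants.
is_restaurant = businesses["category"].astype(str).str.contains(
    "restaurant", case=False, na=False
)
restaurants = businesses.loc[is_restaurant, ["gmap_id", "name", "address"]]

# Drop empty or very short (low-quality) reviews, then merge reviews per business.
reviews = reviews.dropna(subset=["text"])
reviews = reviews[reviews["text"].str.len() > 20]
merged_reviews = (
    reviews.groupby("gmap_id")["text"]
    .apply(lambda texts: REVIEW_SEP.join(texts))
    .rename("all_reviews")
)

# One row per restaurant: name, location, and all reviews joined by the separator.
output = restaurants.merge(merged_reviews, on="gmap_id")
output.to_csv("google_restaurants_merged.csv", index=False)
```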

Data Labelling with Label Studio

We use Label Studio to manually summarize a set of reviews from the Google Local dataset. These manual summaries give us a small labelled set for few-shot fine-tuning and inference with the trained model, improving its performance. We used the Natural Language Processing > Text Summarization setup and connected to the dataset in Google Cloud so that output files are written directly to the bucket.

Data versioning

We use DVC as our data versioning pipeline. Our DVC configuration has one remote for each dataset. The remote for LSARS data is called lsars-data, and the remote for Google data is called google-data.
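Once a processed file is tracked, it can be read back through DVC's Python API. The sketch below is a minimal example under assumed paths; the file name is a placeholder and not the exact layout of our repository.

```python
import dvc.api

# Read a tracked CSV from the remote named "google-data".
# The path below is a placeholder for illustration only.
with dvc.api.open(
    "data/google_restaurants_merged.csv",
    repo="https://github.com/hpiercehoffman/AC215_FlavorFusion",
    remote="google-data",
) as f:
    print(f.readline())  # print the CSV header as a quick sanity check
```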

Model training and optimisation

Model summary

We used the PRIMERA model for multi-document summarization as our base language model. The implementation of this model is imported from Hugging Face with a PyTorch backend. The model has been pretrained and fine-tuned on the Multi-News dataset, whose task is to generate a single summary from a cluster of news articles about the same event. More information is provided in the model card here. We then fine-tune this model on the LSARS dataset and perform few-shot inference using the Google Local dataset.
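A minimal sketch of loading the pretrained checkpoint and summarizing a few reviews is shown below. It assumes the allenai/PRIMERA-multinews checkpoint on the Hugging Face Hub and PRIMERA's <doc-sep> document separator; the generation settings are illustrative, not the exact values used in our scripts.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# PRIMERA pretrained and fine-tuned on Multi-News (LED-based seq2seq model).
checkpoint = "allenai/PRIMERA-multinews"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# PRIMERA concatenates the input documents with a special separator token.
reviews = [
    "Great dosas, but the sambar was far too mild for me.",
    "Loved the food here; the spice level was perfect.",
]
inputs = tokenizer(
    "<doc-sep>".join(reviews),
    return_tensors="pt",
    truncation=True,
    max_length=4096,
)

with torch.no_grad():
    summary_ids = model.generate(**inputs, max_length=128, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```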

Training scripts and container

We created a ‘train_serverless’ container for training the PRIMERA model on the LSARS dataset in a serverless manner. The inputs are the 5,000 preprocessed and translated LSARS data points, along with the training parameters and a secrets file. We use a train-test split of 0.95-0.05. The output of this container is a file containing the trained model weights. The model is trained with a cross-entropy loss, and the summaries generated on the test set are evaluated using the ROUGE metric.
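Conceptually, the training script is a standard seq2seq fine-tuning loop: teacher-forced cross-entropy on (reviews, summary) pairs, followed by ROUGE evaluation on the held-out split. The sketch below is a simplified stand-in for the actual script; batching, checkpointing, and W&B logging are omitted, and the CSV file names and column names are assumptions.

```python
import evaluate
import pandas as pd
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "allenai/PRIMERA-multinews"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
optimizer = AdamW(model.parameters(), lr=3e-5)

# Assumed file and column names for the translated LSARS splits.
train_df = pd.read_csv("lsars_translated_train.csv")
test_df = pd.read_csv("lsars_translated_test.csv")

# Fine-tune with token-level cross-entropy (returned when labels are passed).
model.train()
for _, row in train_df.iterrows():
    enc = tokenizer(row["reviews"], return_tensors="pt", truncation=True, max_length=4096)
    labels = tokenizer(row["summary"], return_tensors="pt", truncation=True, max_length=256).input_ids
    loss = model(**enc, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Evaluate generated summaries on the held-out split with ROUGE.
rouge = evaluate.load("rouge")
model.eval()
preds, refs = [], []
for _, row in test_df.iterrows():
    enc = tokenizer(row["reviews"], return_tensors="pt", truncation=True, max_length=4096)
    with torch.no_grad():
        out = model.generate(**enc, max_length=256, num_beams=4)
    preds.append(tokenizer.decode(out[0], skip_special_tokens=True))
    refs.append(row["summary"])
print(rouge.compute(predictions=preds, references=refs))
```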

Experiment Tracking

We use Weights and Biases to track model performance, log evaluation metrics and save the final model.

Serverless Training Pipeline

We use the serverless training option on Vertex AI to run the training jobs. For our serverless training runs, we weren’t able to use the Vertex AI prebuilt containers as they don’t support PyTorch 2.0.1. Therefore, we used Google’s Artifact Registry to host a custom container for serverless training. When a serverless training run is triggered, our custom container is deployed, and the training script automatically runs with the provided input arguments. Using a custom container allows us to ensure that our PyTorch version and other package versions are exactly what our code needs.

To set up Artifact Registry, we first created an Artifact Registry repository within our project following this documentation. We then followed this documentation to tag and push our custom container to the repo we created. The container must be CUDA-capable and must have the training script as an entrypoint. Installing CUDA in a Docker container is a non-trivial task, so we used a prebuilt base container with PyTorch 2.0.1 and CUDA 11.8 pre-installed. We then added our files and required packages. The base container we used can be found here.

We use a script to submit our serverless training jobs. The script reads from a config file which specifies the following parameters:

  • Machine type (we use g2-standard-4)
  • GPU type (must be Nvidia L4)
  • Number of GPUs
  • URI of our custom container on Artifact Registry

In addition to the config file, the job submission script also contains the following parameters:

  • Unique job identifier (randomly generated)
  • Job name (can be changed to describe the job)
  • GCP region (we have L4 GPUs available in us-east1 and us-central1)

Finally, the job submission script allows you to provide arguments to the training script, which will be run when the serverless job begins. After the serverless job is submitted, you will be able to see it in the Custom Jobs console. The final model after training is saved on Weights and Biases as mentioned before.
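For reference, a roughly equivalent job submission using the Vertex AI Python SDK is sketched below. The project ID, staging bucket, container URI, and training-script arguments are placeholders; only the machine type, GPU type, and region mirror the configuration described above.

```python
from google.cloud import aiplatform

# Placeholders: substitute your own project, region, staging bucket, and image URI.
aiplatform.init(
    project="my-gcp-project",
    location="us-east1",
    staging_bucket="gs://my-staging-bucket",
)

job = aiplatform.CustomContainerTrainingJob(
    display_name="primera-lsars-finetune",
    container_uri="us-east1-docker.pkg.dev/my-gcp-project/my-repo/train_serverless:latest",
)

# Arguments are passed through to the training script that the container runs
# as its entrypoint.
job.run(
    args=["--epochs", "3", "--lr", "3e-5"],
    replica_count=1,
    machine_type="g2-standard-4",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
)
```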

Model optimisation

We chose pruning as our optimization technique. Pruning optimizes a trained model by removing weights that are close to zero. We wanted to use pruning to reduce the model size and speed up model inference. Pruning can also reduce overfitting in some cases, which would help our model generalize to new inputs during inference. To benchmark the results of pruning, we pruned 50% of weights in fully connected and output layers. We then calculated compressed model size, inference time, and accuracy metrics on an Nvidia T4 GPU and show the results in Figure 2.
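The sketch below shows the kind of pruning we describe, using PyTorch's built-in pruning utilities to zero out 50% of the smallest-magnitude weights. For simplicity it treats every nn.Linear module as a fully connected layer, which is an approximation of the exact layer selection in our experiments.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("allenai/PRIMERA-multinews")

# L1-unstructured pruning: zero the 50% of weights with the smallest absolute value.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        # Make the pruning permanent (remove the mask and re-parametrization).
        prune.remove(module, "weight")
```

Calling prune.remove bakes the sparsity into the weight tensors, so the pruned state dict can then be saved and compressed to measure the size reduction.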

Figure 2. Performance of our model on heldout set in zero-shot, finetuned and finetuned + pruned settings

Few-shot inference

To perform few-shot inference, we created a new script in the ‘train_serverless’ container. The inputs are the 60 preprocessed and labelled Google review data points, along with the training parameters and a secrets file. We use a train-test split of 0.95-0.05. The output is a file containing the trained model weights. The model is trained with a cross-entropy loss, and the summaries generated on the test set are evaluated using the ROUGE metric.

API and Frontend

Application Design

We created design documents showing the high-level architecture of our app. We created a solution architecture which shows the high-level strategy of our entire project, as well as a technical architecture which provides implementation details about how the different components of the project work together.

Figure 3. Solution architecture

Our solution architecture, as shown in Figure 3, depicts the flow of processes (tasks performed by developers and users), execution (code running in different parts of the project pipeline), and state (stored objects and artifacts). In this view of the project, we abstract away technical details.

Figure 4. Technical architecture

Our technical architecture as shown in Figure 4 provides a detailed view of the project structure, including components responsible for different actions and communication between these components.

Frontend

We implemented a prototype frontend app using HTML and JavaScript. The app shows a simple front page where the user can select a restaurant from a dropdown menu. The dropdown menu is populated based on data downloaded from our GCS bucket. The user can click the “Submit” button to generate a summary of reviews stratified by predicted cultural background. The frontend UI is shown in Figure 5.

Figure 5. UI of our application

We also include a Swagger API testing interface in the prototype frontend, so we can easily test our APIs. The entire architecture for this is captured in the ‘frontend-simple’ Docker container.

Backend API

We used FastAPI to create backend RESTful APIs which handle communication with the frontend. We implemented the following APIs:

  • /: Default API call. GET method which returns a welcome message.
  • /populate: GET method which downloads a subset of our data from a GCS bucket and extracts a list of restaurant names. Restaurant names are then used to populate a dropdown menu in the frontend.
  • /predict: POST method which runs model inference for a selected restaurant, generating a summary of reviews from the selected restaurant stratified by the estimated cultural background of reviewers.

In addition to testing our APIs by interacting with the frontend, we also used Swagger’s API testing interface to verify that all APIs are working correctly. We implemented this in a Docker container named ‘api-service’.
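A stripped-down version of the api-service endpoints is sketched below. The welcome message, the hard-coded restaurant list, and the inline placeholder for model inference stand in for the real GCS download and PRIMERA inference code.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    restaurant_name: str

@app.get("/")
def index():
    # Default API call: returns a welcome message.
    return {"message": "Welcome to FlavorFusion"}

@app.get("/populate")
def populate():
    # In the real service this downloads a subset of data from our GCS bucket
    # and extracts restaurant names; a fixed list is used here as a placeholder.
    return {"restaurants": ["Placeholder Diner", "Example Curry House"]}

@app.post("/predict")
def predict(request: PredictRequest):
    # Placeholder for model inference: group the selected restaurant's reviews
    # by predicted cultural background and summarize each group.
    summaries = {"group_A": "...", "group_B": "..."}
    return {"restaurant": request.restaurant_name, "summaries": summaries}
```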

Application screenshots

We provide screenshots of our application for completeness. Figure 5 already showed the starting page. Figure 6 shows the loading page displayed after the user selects a restaurant from the dropdown menu.

Figure 6. Loading page

In Figure 7 we see the final output of our application with predicted summaries stratified by estimated cultural background.

Figure 7. Results after model inference

Deployment and Scaling

Deployment using Ansible

We used Ansible to automate the process of deploying our app on a GCP Virtual Machine (VM) using a deployment Docker container. We followed these steps:

  1. We run the deployment container after setting up the secrets file, SSH authentication, and inventory.yml.
  2. We then build the frontend-simple and api-service Docker images and push them to Google Container Registry (GCR). We tag both images with the current timestamp so they can be linked together. Running this command for the first time may take ~30 minutes, since the api-service container is fairly large and takes a long time to push to GCR.
  3. We then create a GCP VM with a mounted persistent disk for storage. We allow HTTP traffic on port 80 and HTTPS traffic on port 443.
  4. We copy our secrets files (GCP credentials and WandB key) to the VM so they can be mounted to our Docker containers. We pull the containers from GCR and run them on port 3000 (frontend) and port 9000 (API service).
  5. We use Nginx as a reverse proxy to handle incoming HTTP traffic on port 80. Nginx also passes traffic between the frontend and the API service. To do this, we create a new ‘nginx’ Docker container.

After running these steps, our application is accessible in any browser via the external IP of the VM.

Scaling using Kubernetes

We deploy our app as a Kubernetes cluster to effectively handle concurrent requests from multiple clients. To create and launch a Kubernetes cluster, we ran the Ansible command within the deployment Docker container that does the following:

  • Create a Kubernetes cluster which autoscales between 1 and 2 nodes. Each node is an n2d-standard-2 VM with 30 GB of disk space.
  • Create a namespace for the cluster, and update the local Kubernetes configuration to recognize the newly created cluster.
  • Set up an Nginx ingress to manage external access to cluster services.
  • Create a persistent volume which can be used by the cluster pods for extra storage.
  • Create Kubernetes secrets to store our GCP credentials (used to access data in buckets) and our WandB key (used to download our trained model).
  • Deploy the frontend and API containers as services within each node of the cluster. Grant the API service access to the persistent volume as well as the two secrets.
  • Wait for the Nginx ingress service to come up. Output the Nginx IP address to the console so it can be used to access the cluster.

This playbook takes 5–10 minutes to run, since cluster creation takes a long time. Once the playbook is finished running, the deployed app can be accessed at http://<Nginx Ingress IP>.sslip.io.

Future work

  1. Automated Model Monitoring: We can implement automated monitoring systems to track the performance of our model over time by detecting model drift and retraining the model when the performance decreases.
  2. Feature Store Integration: We can explore integrating a feature store to manage and reuse features across different models. This can help in maintaining consistency in feature engineering steps and speed up the model development process.
  3. Data Quality Monitoring: Since our application heavily depends on user data, we can use robust data quality monitoring frameworks like WhyLabs.
  4. Hyperparameter Optimization: To improve our model, we can implement scalable hyperparameter optimization strategies, like Ray Tune or Optuna, to continually improve model performance.
  5. Model Explainability and Fairness Analysis: We understand that our model could be biased and so we need to implement tools and workflows for model explainability and fairness analysis.

References

  1. Xiao, Wen, Iz Beltagy, Giuseppe Carenini, and Arman Cohan. “PRIMERA: Pyramid-Based Masked Sentence Pre-Training for Multi-Document Summarization.” arXiv, March 16, 2022. http://arxiv.org/abs/2110.08499.
  2. Yan, An, Zhankui He, Jiacheng Li, Tianyang Zhang, and Julian McAuley. “Personalized Showcases: Generating Multi-Modal Explanations for Recommendations.” arXiv, April 6, 2023. http://arxiv.org/abs/2207.00422.
  3. Pan, Haojie, Rongqin Yang, Xin Zhou, Rui Wang, Deng Cai, and Xiaozhong Liu. “Large Scale Abstractive Multi-Review Summarization (LSARS) via Aspect Alignment.” In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2337–46. Virtual Event China: ACM, 2020. https://doi.org/10.1145/3397271.3401439.
  4. Chintalapati, Rajashekar, Suriyan Laohaprapanon, and Gaurav Sood. “Predicting Race and Ethnicity From the Sequence of Characters in a Name.” arXiv, July 7, 2023. http://arxiv.org/abs/1805.02109.
  5. Gu, Xiaotao, Yuning Mao, Jiawei Han, Jialu Liu, You Wu, Cong Yu, Daniel Finnie, Hongkun Yu, Jiaqi Zhai, and Nicholas Zukoski. “Generating representative headlines for news stories.” In Proceedings of The Web Conference 2020, pp. 1773–1784. 2020.
  6. Fabbri, Alexander R., Irene Li, Tianwei She, Suyi Li, and Dragomir R. Radev. “Multi-News: A Large-Scale Multi-Document Summarization Dataset and Abstractive Hierarchical Model.” arXiv, June 19, 2019. https://doi.org/10.48550/arXiv.1906.01749.
  7. Lewis, Mike, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. “BART: Denoising Sequence-to-Sequence Pre-Training for Natural Language Generation, Translation, and Comprehension.” arXiv, October 29, 2019. https://doi.org/10.48550/arXiv.1910.13461.
  8. Zhang, Jingqing, Yao Zhao, Mohammad Saleh, and Peter Liu. “PEGASUS: Pre-Training with Extracted Gap-Sentences for Abstractive Summarization.” In Proceedings of the 37th International Conference on Machine Learning, 11328–39. PMLR, 2020. https://proceedings.mlr.press/v119/zhang20ae.html.
  9. Beaton, Caroline. “Why You Can’t Really Trust Negative Online Reviews.” The New York Times, June 14, 2018, sec. Smarter Living. https://www.nytimes.com/2018/06/13/smarter-living/trust-negative-product-reviews.html.
