Predict on New Data Using an LSTM and Save It to CSV
I actually wrote this for myself so I wouldn’t forget how to do it, but it would be awesome if you’re having the same problem and looking for a solution!
So I have an LSTM model and a tokenizer that I trained previously. You can check it here. Next, I will make predictions on new data and save the results to a CSV file.
Data Source
We use the Google Play Store and App Store APIs as data sources, and the review data is collected in Google BigQuery. We have reviews from November 2020 to November 2021, with a total of 1,708,492 records.
If you want to know how we get the data source, kindly check this link below:
Use this code to get the data from Google BigQuery:
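The original query isn’t shown here, so below is a minimal sketch of pulling the reviews with the `google-cloud-bigquery` client. The project, dataset, table, and column names are assumptions; adjust them to your own setup.

```python
from datetime import date

# Hypothetical project.dataset.table; replace with your own.
TABLE = "my_project.app_reviews.reviews"

def build_review_query(start=date(2020, 11, 1), end=date(2021, 11, 30)):
    """Build the SQL that selects reviews in the collection window."""
    return (
        f"SELECT * FROM `{TABLE}` "
        f"WHERE DATE(created_date) BETWEEN '{start}' AND '{end}'"
    )

def fetch_reviews():
    """Run the query and return a pandas DataFrame (needs GCP credentials)."""
    from google.cloud import bigquery  # pip install google-cloud-bigquery
    client = bigquery.Client()
    return client.query(build_review_query()).to_dataframe()
```

`fetch_reviews()` only works with valid credentials and an existing table, so treat it as a template rather than copy-paste code.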
Preprocessing
- Remove duplicate data
Before we go into modeling, we remove duplicate data. Based on review_id (which is unique), we keep only the most recent review according to created_date.
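The deduplication step can be sketched with pandas; the toy dataframe and column names below follow the ones mentioned in the text.

```python
import pandas as pd

# Toy frame standing in for the BigQuery export; review_id is unique per review.
df = pd.DataFrame({
    "review_id": ["a1", "a1", "b2"],
    "review": ["ok", "ok (edited)", "great"],
    "created_date": ["2021-01-01", "2021-06-01", "2021-03-01"],
})

# Sort by created_date so the newest copy comes last,
# then keep that last copy for each review_id.
df = (df.sort_values("created_date")
        .drop_duplicates(subset="review_id", keep="last")
        .reset_index(drop=True))
```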
- Case folding
Lowercasing is used to prevent the machine from treating identical words as different ones. For example, “Shop” and “shop” are the same word, but the machine may perceive them differently because one is capitalized and the other is not.
- Remove punctuations
The punctuation here has no significant meaning, so it needs to be removed.
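The two cleaning steps above can be sketched with plain Python; the sample sentence is just an illustration.

```python
import string

def case_fold(text: str) -> str:
    # "Shop" and "shop" become the same token after lowercasing.
    return text.lower()

def remove_punctuation(text: str) -> str:
    # Strip every character listed in string.punctuation.
    return text.translate(str.maketrans("", "", string.punctuation))

cleaned = remove_punctuation(case_fold("Shop, please!!"))
```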
- Stemming
Stemming removes the suffixes of a word to reduce it to its root form. Stemming does not always produce a perfect root word, but it is efficient enough in practice. There are still very few libraries that can do stemming in Indonesian; one of the best known is Sastrawi.
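In practice you would use a dedicated Indonesian stemmer such as Sastrawi (`StemmerFactory().create_stemmer().stem(text)`); the toy suffix-stripper below only illustrates the idea and is not a real stemmer.

```python
# Toy illustration only: strip a few common Indonesian suffixes.
SUFFIXES = ("nya", "kan", "an", "i")

def toy_stem(word: str) -> str:
    for suffix in SUFFIXES:
        # Require at least 3 remaining characters so short words survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```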
- Remove stopwords
Stopwords are words that appear repeatedly but carry no special meaning, such as conjunctions. Leaving them in would surface meaningless topics.
- Remove documents that consist of only one word, because a single word does not carry a meaningful topic
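The stopword removal and one-word filter can be sketched as below. The stopword list here is a tiny illustrative subset; in practice you would use a full Indonesian stopword list.

```python
# Tiny illustrative Indonesian stopword list (assumption, not the full list).
STOPWORDS = {"dan", "yang", "di", "ke", "atau"}

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def keep_meaningful(docs):
    # Drop documents left with fewer than two tokens.
    return [d for d in docs if len(d) > 1]

docs = [["aplikasi", "dan", "bagus"], ["yang"]]
docs = keep_meaningful([remove_stopwords(d) for d in docs])
```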
- Formalization
Formalization is a step that converts words into a formal, easy-to-understand form. Here I use formalization to map brand-name variants to a single word that can represent the brand.
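A dictionary lookup is enough for this step. The mappings below are hypothetical examples, since the post does not show the actual brand map.

```python
# Hypothetical brand-normalization map; extend with your own variants.
FORMAL_MAP = {
    "gofood": "gojek",
    "grabfood": "grab",
    "shopeefood": "shopee",
}

def formalize(tokens):
    # Replace known variants, keep everything else unchanged.
    return [FORMAL_MAP.get(t, t) for t in tokens]
```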
Tokenize the Data
Before loading the model, we separate the tokens in each document with the following code.
from nltk.tokenize import RegexpTokenizer

tokenizer_data = RegexpTokenizer(r'\w+')
df_processed['value_tokenize'] = df_processed['review_formal_processed'].map(tokenizer_data.tokenize)
value = df_processed["value_tokenize"]
Load Model
Previously we saved the model and the tokenizer. The tokenizer needs to be loaded so that the new data is encoded with the same vocabulary, and thus the same input size, as during training.
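The loading code isn’t shown here, so below is a hedged sketch assuming the model was saved as HDF5 and the Keras tokenizer was pickled; both file names are assumptions.

```python
import pickle

def load_artifacts(model_path="lstm_model.h5", tokenizer_path="tokenizer.pickle"):
    """Load the trained LSTM and the tokenizer fitted during training."""
    from tensorflow.keras.models import load_model  # heavy import kept local
    model = load_model(model_path)
    with open(tokenizer_path, "rb") as f:
        tokenizer = pickle.load(f)
    return model, tokenizer
```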
Predict the Data
It’s time to predict the new data with the loaded model. The freshly tokenized data is encoded with the previously loaded tokenizer and then passed to the loaded model for prediction.
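That step can be sketched as below, assuming a Keras `Tokenizer` and an LSTM with a softmax output over the topics. The padding length (`maxlen=100`) is an assumption and must match whatever was used during training.

```python
import numpy as np

def predict_topics(token_lists, tokenizer, model, maxlen=100):
    """Encode token lists as padded integer sequences and predict a topic index."""
    from tensorflow.keras.preprocessing.sequence import pad_sequences
    sequences = tokenizer.texts_to_sequences(token_lists)
    padded = pad_sequences(sequences, maxlen=maxlen)
    probabilities = model.predict(padded)  # shape: (n_docs, n_topics)
    return np.argmax(probabilities, axis=1)
```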
The result:
Convert Output to Understandable Label
With output like that, it is hard to tell whether a document falls under topic “a” or topic “b”. Therefore the raw prediction output needs to be converted into labels that are easy to understand.
Here I have 4 topics and create a dictionary for those topics.
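The mapping looks like the sketch below. The four topic names are placeholders, since the post does not list the actual labels.

```python
import pandas as pd

# Placeholder topic names; replace with your actual four topic labels.
topic_map = {0: "topic_a", 1: "topic_b", 2: "topic_c", 3: "topic_d"}

# `predictions` stands in for the argmax output of the model.
predictions = pd.Series([2, 0, 3, 1])
topic = predictions.map(topic_map).rename("topic")
```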
Concat the Data
The resulting output is a pandas Series. We concatenate this series with the initial dataframe.
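The concat step can be sketched as follows; the toy dataframe and series stand in for the real processed data and predicted labels.

```python
import pandas as pd

# Toy stand-ins: the processed dataframe and the predicted topic series.
df_processed = pd.DataFrame({"review_id": ["a1", "b2"], "review": ["ok", "great"]})
topic = pd.Series(["topic_a", "topic_c"], name="topic")

# Column-wise concat; reset the indexes first so the rows line up.
result = pd.concat([df_processed.reset_index(drop=True),
                    topic.reset_index(drop=True)], axis=1)
```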
Save the Data
Save the data into CSV.
result.to_csv(
    'data_test_labelled_20220219.csv',
    header=["app_name", "app_id", "alt_app_id", "review_id", "user_name",
            "user_image", "review", "rating", "thumbs_up_count", "app_version",
            "created_date", "reply", "replied_at", "platform", "review_processed",
            "review_formal_processed", "value_tokenize", "topic"],
    chunksize=100000,
    index=False,
    mode='a',
)
Here is my Medium post, which is similar to this one:
Thank you for reading!