Predict on New Data Using LSTM and Save It to CSV

Katarina Nimas Kusumawati
3 min read · Feb 19, 2022


I actually wrote this for myself so I wouldn't forget how to do it, but it would be awesome if you are having the same problem and looking for a solution!
I have an LSTM model and a tokenizer that I trained previously. You can check it here. Now I will make predictions on new data and save them to a CSV file.

Data Source

We use the Google Play Store and App Store APIs as data sources; the review data is collected in Google BigQuery. We have reviews from November 2020 to November 2021, 1,708,492 records in total.

If you want to know how we got the data, kindly check the link below:

Use this code to get the data from Google BigQuery:
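A minimal sketch of such a query with the google-cloud-bigquery client; the project, dataset, and table names here are placeholders, so substitute your own.

from google.cloud import bigquery

# Hypothetical project ID; replace with your own
client = bigquery.Client(project="my-project")

# Hypothetical dataset and table names
query = """
    SELECT *
    FROM `my-project.reviews_dataset.app_reviews`
    WHERE created_date BETWEEN '2020-11-01' AND '2021-11-30'
"""

# Run the query and load the result into a pandas DataFrame
df = client.query(query).to_dataframe()
print(df.shape)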

Preprocessing

  • Remove duplicate data

Before we go into modeling, we remove duplicate data. We keep only the most recent review (by created_date) for each review_id, since review_id is unique. A consolidated code sketch covering all the preprocessing steps follows this list.

  • Case folding

Lowercasing prevents the machine from treating identical words as different ones. For example, "Shop" and "shop" are the same word, but the model may perceive them differently because one is capitalized and the other is not.

  • Remove punctuations

Punctuation carries no significant meaning here, so it needs to be removed.

  • Stemming

Stemming removes affixes from a word to reduce it to its root. It is not always a perfect way to recover the root word, but it is quite efficient. Very few libraries can stem Indonesian text; Sastrawi is among the best-known.

  • Remove stopwords

Stopwords are words that occur frequently and carry no special meaning, such as conjunctions. Leaving them in would surface meaningless topics.

  • Delete documents that consist of only one word, because they do not contain a meaningful topic
  • Formalization

Formalization converts a word into a formal, easy-to-understand form. Here I use formalization to map brand names to a canonical word that represents the brand.
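Below is a consolidated sketch of these preprocessing steps. It assumes a pandas DataFrame df with review_id, created_date, and review columns, uses the Sastrawi library for Indonesian stemming and stopword removal, and the formalization map is a small illustrative placeholder.

import string
import pandas as pd
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory

stemmer = StemmerFactory().create_stemmer()
stopword_remover = StopWordRemoverFactory().create_stop_word_remover()
formal_map = {"shoppee": "shopee"}  # hypothetical brand-name corrections

# 1. Remove duplicates: keep the most recent review per review_id
df = (df.sort_values("created_date")
        .drop_duplicates(subset="review_id", keep="last"))

def preprocess(text):
    text = text.lower()                                                # 2. case folding
    text = text.translate(str.maketrans("", "", string.punctuation))  # 3. remove punctuation
    text = stemmer.stem(text)                                          # 4. stemming
    text = stopword_remover.remove(text)                               # 5. remove stopwords
    # 7. formalization: replace known informal brand spellings
    words = [formal_map.get(w, w) for w in text.split()]
    return " ".join(words)

df["review_formal_processed"] = df["review"].map(preprocess)

# 6. Drop documents that consist of a single word
mask = df["review_formal_processed"].str.split().str.len() > 1
df_processed = df[mask].reset_index(drop=True)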

Tokenize the Data

Before loading the model, we separate the tokens in each document with the following code.

from nltk.tokenize import RegexpTokenizer

# Split each processed review into word tokens
tokenizer_data = RegexpTokenizer(r'\w+')
df_processed['value_tokenize'] = df_processed['review_formal_processed'].map(tokenizer_data.tokenize)
value = df_processed["value_tokenize"]

Load Model

Previously, we saved the model and the tokenizer. The tokenizer needs to be loaded so that the new data is encoded with the same vocabulary, and therefore the same input size, as during training.
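A minimal sketch of loading both artifacts, assuming the model was saved as an HDF5 file and the tokenizer was pickled (both filenames are assumptions; adjust to whatever you used when saving).

import pickle
from tensorflow.keras.models import load_model

# Load the trained LSTM model
model = load_model('lstm_model.h5')

# Load the Keras tokenizer fitted during training
with open('tokenizer.pickle', 'rb') as f:
    tokenizer = pickle.load(f)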

Predict the Data

It's time to predict on the new data with the loaded model. The freshly tokenized documents are converted to sequences with the previously loaded tokenizer, padded to the training length, and then passed to the loaded model for prediction.
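A sketch of that flow; MAX_LEN is an assumed constant that must match the padding length used during training.

from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 100  # assumption: same padding length as during training

# value holds the token lists produced in the tokenization step above
sequences = tokenizer.texts_to_sequences(value)
padded = pad_sequences(sequences, maxlen=MAX_LEN)

# Predict topic probabilities; shape: (n_documents, n_topics)
predictions = model.predict(padded)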

The result:

Convert Output to Understandable Label

The raw output is a matrix of topic probabilities, so it is difficult to tell whether a document falls into topic "a" or topic "b". Therefore, the prediction output needs to be converted into something easy to understand.
Here I have 4 topics and create a dictionary for them.
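A sketch of that mapping: take the argmax of each probability vector and look it up in the dictionary. The topic names below are placeholders for the four actual topics.

import numpy as np
import pandas as pd

# Placeholder names; replace with your actual topic labels
topic_dict = {0: "topic_a", 1: "topic_b", 2: "topic_c", 3: "topic_d"}

# Pick the most probable topic per document and map it to its label
topic = pd.Series(np.argmax(predictions, axis=1)).map(topic_dict)
topic.name = "topic"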

Concat the Data

The resulting output is a Series. This Series is concatenated with the initial dataframe.
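A sketch of the concatenation, assuming df_processed and the topic Series from the previous steps. reset_index keeps the row order aligned before concatenating.

import pandas as pd

# Attach the predicted topics as a new column on the original dataframe
result = pd.concat([df_processed.reset_index(drop=True), topic], axis=1)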

Save the Data

Save the data to a CSV file.

result.to_csv(
    'data_test_labelled_20220219.csv',
    header=["app_name", "app_id", "alt_app_id", "review_id", "user_name",
            "user_image", "review", "rating", "thumbs_up_count", "app_version",
            "created_date", "reply", "replied_at", "platform", "review_processed",
            "review_formal_processed", "value_tokenize", "topic"],
    chunksize=100000,
    index=False,
    mode='a'
)

Here is my Medium post, which is similar to this one:

Thank you for reading!
