Predict on New Data Using an LSTM and Save It to CSV
I actually wrote this for myself so I wouldn’t forget how to do it, but it would be awesome if you’re having the same problem and looking for a solution!
So I have an LSTM model and a tokenizer that I trained previously. You can check it here. Next, I will make predictions on new data and save the results to a CSV file.
Data Source
We use the Google Play Store and App Store APIs as data sources, and the review data is collected in Google BigQuery. We have reviews from November 2020 to November 2021, with a total of 1,708,492 records.
If you want to know how we get the data source, kindly check this link below:
Use this code to get the data from Google BigQuery:
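The original query isn’t shown here, so below is a minimal sketch of pulling the reviews with the `google-cloud-bigquery` client. The project, dataset, table, and column names are assumptions; adjust them to your own setup.

```python
from datetime import date

# Hypothetical project.dataset.table; replace with your own.
TABLE = "my_project.app_reviews.reviews"

def build_review_query(start=date(2020, 11, 1), end=date(2021, 11, 30)):
    """Build the SQL that selects reviews in the collection window."""
    return (
        f"SELECT * FROM `{TABLE}` "
        f"WHERE DATE(created_date) BETWEEN '{start}' AND '{end}'"
    )

def fetch_reviews():
    """Run the query and return a pandas DataFrame (needs GCP credentials)."""
    from google.cloud import bigquery  # pip install google-cloud-bigquery
    client = bigquery.Client()
    return client.query(build_review_query()).to_dataframe()
```

`fetch_reviews()` only works with valid credentials and an existing table, so treat it as a template rather than copy-paste code.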
Preprocessing
- Remove duplicate data
Before we go into modeling, we remove duplicate data. Based on review_id (which is unique), we keep only the most recent review according to created_date.
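The deduplication step can be sketched with pandas; the toy dataframe and column names below follow the ones mentioned in the text.

```python
import pandas as pd

# Toy frame standing in for the BigQuery export; review_id is unique per review.
df = pd.DataFrame({
    "review_id": ["a1", "a1", "b2"],
    "review": ["ok", "ok (edited)", "great"],
    "created_date": ["2021-01-01", "2021-06-01", "2021-03-01"],
})

# Sort by created_date so the newest copy comes last,
# then keep that last copy for each review_id.
df = (df.sort_values("created_date")
        .drop_duplicates(subset="review_id", keep="last")
        .reset_index(drop=True))
```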
- Case folding
Lowercasing is used to prevent the machine from treating identical words as different ones. For example, “Shop” and “shop” are the same word, but the machine may perceive them differently because one is capitalized and the other is not.
- Remove punctuations
The punctuation here has no significant meaning, so it needs to be removed.
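The two cleaning steps above can be sketched with plain Python; the sample sentence is just an illustration.

```python
import string

def case_fold(text: str) -> str:
    # "Shop" and "shop" become the same token after lowercasing.
    return text.lower()

def remove_punctuation(text: str) -> str:
    # Strip every character listed in string.punctuation.
    return text.translate(str.maketrans("", "", string.punctuation))

cleaned = remove_punctuation(case_fold("Shop, please!!"))
```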
- Stemming
Stemming removes the suffixes of a word to reduce it to its root form. Stemming does not always produce a perfect root word, but it is efficient enough in practice. There are still very few libraries that can do stemming in Indonesian; one of the best known is Sastrawi.
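In practice you would use a dedicated Indonesian stemmer such as Sastrawi (`StemmerFactory().create_stemmer().stem(text)`); the toy suffix-stripper below only illustrates the idea and is not a real stemmer.

```python
# Toy illustration only: strip a few common Indonesian suffixes.
SUFFIXES = ("nya", "kan", "an", "i")

def toy_stem(word: str) -> str:
    for suffix in SUFFIXES:
        # Require at least 3 remaining characters so short words survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```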
- Remove stopwords
Stopwords are words that appear repeatedly but carry no special meaning, such as conjunctions. Leaving them in would surface meaningless topics.
- Remove documents that consist of only one word, because a single word does not carry a meaningful topic
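The stopword removal and one-word filter can be sketched as below. The stopword list here is a tiny illustrative subset; in practice you would use a full Indonesian stopword list.

```python
# Tiny illustrative Indonesian stopword list (assumption, not the full list).
STOPWORDS = {"dan", "yang", "di", "ke", "atau"}

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def keep_meaningful(docs):
    # Drop documents left with fewer than two tokens.
    return [d for d in docs if len(d) > 1]

docs = [["aplikasi", "dan", "bagus"], ["yang"]]
docs = keep_meaningful([remove_stopwords(d) for d in docs])
```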
- Formalization
Formalization is a step that converts words into a formal, easy-to-understand form. Here I use formalization to map brand-name variants to a single word that can represent the brand.
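A dictionary lookup is enough for this step. The mappings below are hypothetical examples, since the post does not show the actual brand map.

```python
# Hypothetical brand-normalization map; extend with your own variants.
FORMAL_MAP = {
    "gofood": "gojek",
    "grabfood": "grab",
    "shopeefood": "shopee",
}

def formalize(tokens):
    # Replace known variants, keep everything else unchanged.
    return [FORMAL_MAP.get(t, t) for t in tokens]
```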
Tokenize the Data
Before loading the model, we separate the tokens in each document with the following code.
from nltk.tokenize import RegexpTokenizer

tokenizer_data = RegexpTokenizer(r'\w+')
df_processed['value_tokenize'] = df_processed['review_formal_processed'].map(tokenizer_data.tokenize)
value = df_processed["value_tokenize"]
Load Model
Previously we saved the model and the tokenizer. The tokenizer needs to be loaded so that the new data is encoded with the same vocabulary, and thus the same input size, as during training.
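The loading code isn’t shown here, so below is a hedged sketch assuming the model was saved as HDF5 and the Keras tokenizer was pickled; both file names are assumptions.

```python
import pickle

def load_artifacts(model_path="lstm_model.h5", tokenizer_path="tokenizer.pickle"):
    """Load the trained LSTM and the tokenizer fitted during training."""
    from tensorflow.keras.models import load_model  # heavy import kept local
    model = load_model(model_path)
    with open(tokenizer_path, "rb") as f:
        tokenizer = pickle.load(f)
    return model, tokenizer
```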
Predict the Data
It’s time to predict the new data with the loaded model. The freshly tokenized data is encoded with the previously loaded tokenizer and then passed to the loaded model for prediction.
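That step can be sketched as below, assuming a Keras `Tokenizer` and an LSTM with a softmax output over the topics. The padding length (`maxlen=100`) is an assumption and must match whatever was used during training.

```python
import numpy as np

def predict_topics(token_lists, tokenizer, model, maxlen=100):
    """Encode token lists as padded integer sequences and predict a topic index."""
    from tensorflow.keras.preprocessing.sequence import pad_sequences
    sequences = tokenizer.texts_to_sequences(token_lists)
    padded = pad_sequences(sequences, maxlen=maxlen)
    probabilities = model.predict(padded)  # shape: (n_docs, n_topics)
    return np.argmax(probabilities, axis=1)
```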
The result:
Convert Output to Understandable Label
With output like that, it is hard to tell whether a document falls under topic “a” or topic “b”. Therefore the raw prediction output needs to be converted into labels that are easy to understand.
Here I have 4 topics and create a dictionary for those topics.
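The mapping looks like the sketch below. The four topic names are placeholders, since the post does not list the actual labels.

```python
import pandas as pd

# Placeholder topic names; replace with your actual four topic labels.
topic_map = {0: "topic_a", 1: "topic_b", 2: "topic_c", 3: "topic_d"}

# `predictions` stands in for the argmax output of the model.
predictions = pd.Series([2, 0, 3, 1])
topic = predictions.map(topic_map).rename("topic")
```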
Concat the Data
The resulting output is a pandas Series. We concatenate this series with the initial dataframe.
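The concat step can be sketched as follows; the toy dataframe and series stand in for the real processed data and predicted labels.

```python
import pandas as pd

# Toy stand-ins: the processed dataframe and the predicted topic series.
df_processed = pd.DataFrame({"review_id": ["a1", "b2"], "review": ["ok", "great"]})
topic = pd.Series(["topic_a", "topic_c"], name="topic")

# Column-wise concat; reset the indexes first so the rows line up.
result = pd.concat([df_processed.reset_index(drop=True),
                    topic.reset_index(drop=True)], axis=1)
```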
Save the Data
Save the data into CSV.
result.to_csv(
    'data_test_labelled_20220219.csv',
    header=["app_name", "app_id", "alt_app_id", "review_id", "user_name",
            "user_image", "review", "rating", "thumbs_up_count", "app_version",
            "created_date", "reply", "replied_at", "platform", "review_processed",
            "review_formal_processed", "value_tokenize", "topic"],
    chunksize=100000,
    index=False,
    mode='a',
)
Here is my Medium post, which is similar to this one:
Thank you for reading!