Blibli Future Program Batch 5 — Data Track Phase 2: Data Engineering

About the project

Vincent Junitio Ungu
Feb 18, 2022

In this project, we were required to extract reviews (comments) from the Google Play Store and the Apple Store. With these reviews, we then built a dashboard and performed topic modeling using deep learning.

The reviews considered in this project come from the top 10 e-commerce apps listed by Selular.id: Tokopedia, Shopee, Bukalapak, Lazada, Blibli, Orami, Ralali, Bhinneka, JD.ID, and Zalora.

Data Engineering

First, we created a data pipeline using Airflow 1.10.12 and Python 3.6 to scrape the review data. We used Docker to install and run Airflow. The Dockerfile can be accessed here, while the docker-compose.yml can be accessed here. After creating the Dockerfile and docker-compose.yml files, we created three bash scripts: init.sh, run.sh, and destroy.sh.

init.sh is used to initialize Airflow within Docker.
run.sh is used to run Airflow.
destroy.sh is used to tear down Airflow, usually when the Dockerfile has been updated.

The flow to access Airflow is as follows.

Once Airflow is accessible, we can start creating the Directed Acyclic Graphs (DAGs). We organized our scraped data into four layers: the source layer, the staging layer, the data warehouse layer, and the data mart layer.

1. Source layer

This layer contains every review extracted from the Google Play Store and the Apple Store. In this layer, we kept the reviews exactly as they were extracted.

Scraping Google Play Store

To scrape the Google Play Store reviews, we used the Python library google_play_scraper. Its reviews_all() method extracts all reviews of an app from the Google Play Store, while the reviews() method extracts reviews under certain conditions. We created a custom operator, google_play_store_review_operator.py, to implement these methods.

The __extract_date() function keeps only the data of/up to a certain date, and the execute() function runs the data extraction in this operator. If the job is a data migration job (is_migrate equals True), the reviews_all() method is used. If the job is a data extraction job (is_migrate equals False), the reviews() method is used instead. The reviews() method works slightly differently from reviews_all(): notice the continuation_token that is passed when we call it. Google Play supports a maximum of 200 reviews per page, so the operator paginates and re-crawls 200 reviews at a time until the number of results reaches the requested count. We first sort the data by newest (with the Sort option) and extract 200 reviews; if the last review's date is still within the desired date, we scrape another 200 reviews, repeating until the last review's date falls outside the desired date. The continuation_token lets each call resume where the previous one stopped. When the scraping was complete, we saved the data in JSON format in our local directory, then uploaded the JSON files (ten files for ten applications) to Google Cloud Storage and loaded them into Google BigQuery. The DAG scripts can be accessed here for the migration job, and here for the extraction job.
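The original operator code is linked above; here is a minimal sketch of the same pagination logic using google_play_scraper. The app id, cutoff date, and function name are illustrative, not the actual operator code.

```python
from datetime import datetime

from google_play_scraper import Sort, reviews, reviews_all

APP_ID = "com.tokopedia.tkpd"   # illustrative package name
CUTOFF = datetime(2022, 2, 17)  # keep reviews on/after this date (assumed)


def extract_google_play(is_migrate: bool):
    if is_migrate:
        # Migration: pull every available review in one call.
        return reviews_all(APP_ID, lang="id", country="id", sort=Sort.NEWEST)

    # Daily extraction: 200 reviews per page, newest first; keep paginating
    # with continuation_token until the oldest review on the page is older
    # than the cutoff date.
    kept, token = [], None
    while True:
        batch, token = reviews(
            APP_ID,
            lang="id",
            country="id",
            sort=Sort.NEWEST,
            count=200,
            continuation_token=token,
        )
        if not batch:
            break
        kept.extend(r for r in batch if r["at"] >= CUTOFF)
        if batch[-1]["at"] < CUTOFF:
            break
    return kept


if __name__ == "__main__":
    print(len(extract_google_play(is_migrate=False)))
```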

Scraping Apple Store

To scrape the Apple Store reviews, we used the RSS feed. With the same approach as for the Google Play Store reviews, we extracted the reviews up to a certain date and kept that data as the migration data. However, this RSS feed exposes at most 500 reviews in total. Therefore, we extracted the reviews daily, under the assumption that an app's reviews will not exceed 500 per day. The daily data was obtained by keeping only the reviews whose date fell on the desired date. We created a custom operator, apple_store_review_operator.py, to implement the Apple Store extraction.

The __get_json() function reads the RSS feed. The __get_reviews() function extracts all the reviews from page 1 to page 10 (with a maximum of 50 reviews per page). The __extract_date() function keeps only the data of/up to a certain date, and the execute() function runs the data extraction in this operator. When the scraping was complete, we saved the data in JSON format in our local directory, then uploaded the JSON files (ten files for ten applications) to Google Cloud Storage and loaded them into Google BigQuery. The DAG scripts can be accessed here for the migration job, and here for the extraction job.
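The operator itself is linked above; here is a minimal sketch of the RSS-based extraction. The feed URL pattern, app id, and JSON field names are assumptions, not taken from the original operator.

```python
from datetime import date

import requests

APP_ID = "1001394201"        # illustrative Apple app id
COUNTRY = "id"
TARGET_DATE = date(2022, 2, 17)

# Assumed feed pattern: 10 pages x up to 50 reviews = at most 500 reviews.
RSS_URL = (
    "https://itunes.apple.com/{country}/rss/customerreviews/"
    "page={page}/id={app_id}/sortby=mostrecent/json"
)


def get_reviews():
    entries = []
    for page in range(1, 11):  # pages 1..10
        url = RSS_URL.format(country=COUNTRY, page=page, app_id=APP_ID)
        feed = requests.get(url, timeout=30).json()
        page_entries = feed.get("feed", {}).get("entry", [])
        if isinstance(page_entries, dict):  # a single review arrives as a dict
            page_entries = [page_entries]
        entries.extend(page_entries)
    return entries


def extract_date(entries, target=TARGET_DATE):
    # Keep only the reviews written on the desired date (daily extraction).
    kept = []
    for entry in entries:
        updated = entry.get("updated", {}).get("label", "")  # e.g. "2022-02-17T10:00:00-07:00"
        if updated[:10] == target.isoformat():
            kept.append(entry)
    return kept


if __name__ == "__main__":
    print(len(extract_date(get_reviews())))
```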

Since data migration loads the initial, historical data into the database, the data migration jobs are scheduled @once. The data extraction jobs run daily at 12 AM UTC (7 AM GMT+7).
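As an illustration, here is a minimal sketch of how the two schedules could be declared in Airflow 1.10; the DAG ids and default_args are hypothetical.

```python
from datetime import datetime, timedelta

from airflow import DAG

default_args = {
    "owner": "airflow",
    "start_date": datetime(2022, 1, 1),
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

# Migration job: loads the historical reviews once.
migration_dag = DAG(
    dag_id="google_play_review_migration",
    default_args=default_args,
    schedule_interval="@once",
    catchup=False,
)

# Extraction job: runs daily at 12 AM UTC (7 AM GMT+7).
extraction_dag = DAG(
    dag_id="google_play_review_extraction",
    default_args=default_args,
    schedule_interval="0 0 * * *",
    catchup=False,
)
```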

The Graph View (Airflow) to extract, upload, and load reviews from Google Play Store and Apple Store into BigQuery.

To sum up the Source layer, we created four DAGs and two custom operators to extract the reviews from the Google Play Store and the Apple Store.

2. Staging layer

In this layer, the freshly extracted data from the Source layer was combined and processed. We created three tables in BigQuery for this layer.

Tables in Staging layer (BigQuery)

The first table is called staging_apple. Here, we combined the ten Apple review files (one file per e-commerce app) into a single table. The timestamp at which a user wrote a review was in UTC-7, so we converted it to UTC (GMT+0) by adding 7 hours. We also added a new attribute, review_source, to tag the rows as Apple Store reviews, and a dag_execution_timestamp to record the execution time of this task.

The second table is called staging_google. Here, we combined the ten Google Play review files into a single table. The timestamps of these reviews were already in UTC, so we left them as they were. We added review_source and dag_execution_timestamp just as in staging_apple.
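A minimal sketch of the staging_apple transformation described above, written as a BigQuery query submitted from Python; the project, dataset, and column names are assumptions.

```python
from google.cloud import bigquery

STAGING_APPLE_SQL = """
CREATE OR REPLACE TABLE `my_project.staging.staging_apple` AS
SELECT
  review_id,
  app_name,
  content,
  rating,
  -- Apple review timestamps arrive in UTC-7, so add 7 hours to get UTC.
  TIMESTAMP_ADD(review_timestamp, INTERVAL 7 HOUR) AS review_timestamp,
  'apple_store' AS review_source,                  -- tag the origin of the rows
  CURRENT_TIMESTAMP() AS dag_execution_timestamp   -- when this task ran
FROM `my_project.source.apple_reviews`
"""

client = bigquery.Client(project="my_project")
client.query(STAGING_APPLE_SQL).result()  # block until the job finishes
```

staging_google is built the same way, minus the 7-hour shift.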

After both tables have been created, it is time to combine them into a single table, which we named staging_reviews.
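The actual combining query lives in the staging DAG; here is a minimal sketch of what it might look like, based on the walkthrough that follows (all project, dataset, and column names are assumptions).

```python
from google.cloud import bigquery

STAGING_REVIEWS_SQL = """
CREATE OR REPLACE TABLE `my_project.staging.staging_reviews` AS
WITH all_reviews AS (
  -- Align the attribute names between the two sources, then stack them.
  SELECT review_id, app_name, user_name, user_image, content, reply_content,
         rating, review_created_version, review_timestamp, review_source
  FROM `my_project.staging.staging_apple`
  UNION ALL
  SELECT review_id, app_name, user_name, user_image, content, reply_content,
         rating, review_created_version, review_timestamp, review_source
  FROM `my_project.staging.staging_google`
),
ranked_reviews AS (
  -- Rank each review_id's versions from newest (rn = 1) to oldest.
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY review_id ORDER BY review_timestamp DESC
    ) AS rn
  FROM all_reviews
)
SELECT
  r.review_id,
  r.app_name,
  r.user_name,
  r.user_image,
  r.content,
  r.reply_content,
  r.rating,
  r.review_created_version,
  r.review_source,
  c.created_timestamp,                       -- earliest version of the review
  r.review_timestamp AS updated_timestamp    -- latest version of the review
FROM ranked_reviews AS r
JOIN (
  SELECT review_id, MIN(review_timestamp) AS created_timestamp
  FROM all_reviews
  GROUP BY review_id
) AS c
USING (review_id)
WHERE r.rn = 1
"""

client = bigquery.Client(project="my_project")
client.query(STAGING_REVIEWS_SQL).result()
```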

Let's run through this query. We have two subqueries here, all_reviews and ranked_reviews. First, we matched the attribute names between Apple's and Google's tables, combined them with the UNION ALL operator, and saved the result as all_reviews. The second subquery, ranked_reviews, keeps all the reviews from all_reviews and ranks them within each review_id partition by review timestamp in descending order (the review with the latest timestamp ranks 1 within the same review_id). Finally, we selected the reviews with rank 1 (the latest) and joined them with the earliest review timestamp within each review_id (which denotes the date the review was created).

Just in case the explanation above is not clear enough, here's an example.
Suppose that I wrote a review of an application in the Google Play Store. The review_id was gp:001 with a review timestamp of 2022-02-01 01:00:00 UTC. The next day, I edited my review at 2022-02-02 02:00:00 UTC. The day after that, I edited my review again at 2022-02-03 03:00:00 UTC.

In my database, the records would look like this.
review_id, review_timestamp
gp:001, 2022-02-01 01:00:00 UTC (first)
gp:001, 2022-02-02 02:00:00 UTC
gp:001, 2022-02-03 03:00:00 UTC (latest)

After the staging, my review would look like this. (Note that I only show the date-related attributes, since these are the attributes processed to keep review_id unique.)
review_id, created_timestamp, updated_timestamp
gp:001, 2022-02-01 01:00:00 UTC, 2022-02-03 03:00:00 UTC

This staging layer runs daily, two hours after the source data has been extracted. To ensure that all the data has been extracted, we implemented an ExternalSensor in the DAG. The DAG file can be accessed here.
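A minimal sketch of that dependency, assuming Airflow's ExternalTaskSensor is the sensor meant here; the DAG ids and the two-hour offset are illustrative.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.sensors.external_task_sensor import ExternalTaskSensor

default_args = {"owner": "airflow", "start_date": datetime(2022, 1, 1)}

# The staging DAG runs at 02:00 UTC, two hours after the source DAGs (00:00 UTC).
dag = DAG(
    dag_id="staging_reviews",
    default_args=default_args,
    schedule_interval="0 2 * * *",
    catchup=False,
)

# Wait until the Google Play extraction DAG has finished its 00:00 UTC run.
wait_for_google = ExternalTaskSensor(
    task_id="wait_for_google_play_extraction",
    external_dag_id="google_play_review_extraction",
    external_task_id=None,               # wait for the whole DAG run
    execution_delta=timedelta(hours=2),  # that run is two hours earlier than ours
    timeout=60 * 60,
    dag=dag,
)
```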

The Graph View for the staging layer

At this point, there were no duplicate review_ids anymore. Thus, we continued to the next layer, the data warehouse layer.

3. Data warehouse layer

If you have ever heard of the Extract-Transform-Load (ETL) process, that is what we did in the staging layer. In this layer, we simply created the fact table and the dimension tables by cherry-picking the related attributes from the table in the staging layer.

Data warehouse schema

The fact table is the main table. It contains numerical information (facts) as well as keys that connect to the dimension tables. Since the fact table references the dimension keys, let's start by creating the dimension tables.

To create dim_user, we selected user_name and user_image from staging_reviews. Then, we created a surrogate key, user_key, by hashing the concatenation of user_name and user_image. Next, we have dim_app.

To create dim_app, we selected app_name and review_created_version from staging_reviews. We created a surrogate key, app_key, by hashing the concatenation of app_name and review_created_version. Then, we have dim_review.

To create dim_review, we selected review_id, content, reply_content, and review_source from staging_reviews. Since review_id is already unique, we simply used it as the primary key, review_key.
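A minimal sketch of dim_user with a hashed surrogate key; the hash function (MD5) and the project/dataset names are assumptions, since any deterministic hash would do.

```python
from google.cloud import bigquery

DIM_USER_SQL = """
CREATE OR REPLACE TABLE `my_project.dwh.dim_user` AS
SELECT DISTINCT
  -- Surrogate key: hash of the concatenated natural attributes.
  TO_HEX(MD5(CONCAT(IFNULL(user_name, ''), IFNULL(user_image, '')))) AS user_key,
  user_name,
  user_image
FROM `my_project.staging.staging_reviews`
"""

client = bigquery.Client(project="my_project")
client.query(DIM_USER_SQL).result()
```

dim_app follows the same pattern with app_name and review_created_version, while dim_review simply carries review_id over as review_key.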

Of the other two dimension tables, dim_time was created manually in Excel, while dim_date was created using the format prepared by the Kimball Group. We have now covered all the dimension tables and, of course, figured out the key for each of them. Now, let's create the fact table.

In this table, we joined all the dimension tables to the staging_reviews table and selected the numerical information as well as the key of each dimension table.
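A minimal sketch of that join; the numeric column (rating) and the date/time key formats are assumptions about what the staging table and the date/time dimensions hold.

```python
from google.cloud import bigquery

FACT_REVIEWS_SQL = """
CREATE OR REPLACE TABLE `my_project.dwh.fact_reviews` AS
SELECT
  u.user_key,
  a.app_key,
  r.review_key,
  DATE(s.created_timestamp) AS date_key,   -- joins to dim_date
  TIME(s.created_timestamp) AS time_key,   -- joins to dim_time
  s.created_timestamp,
  s.updated_timestamp,
  s.rating                                 -- numerical fact
FROM `my_project.staging.staging_reviews` AS s
JOIN `my_project.dwh.dim_user` AS u
  ON u.user_name = s.user_name AND u.user_image = s.user_image
JOIN `my_project.dwh.dim_app` AS a
  ON a.app_name = s.app_name
  AND a.review_created_version = s.review_created_version
JOIN `my_project.dwh.dim_review` AS r
  ON r.review_key = s.review_id
"""

client = bigquery.Client(project="my_project")
client.query(FACT_REVIEWS_SQL).result()
```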

This DAG runs daily after the staging layer has been successfully executed. We also implemented an ExternalSensor to ensure that the data was ready before creating the fact table and the dimension tables. The DAG can be accessed here.

The Graph View for the data warehouse layer

Having created the fact table and the dimension tables, we then extracted some attributes into data marts for dashboarding and topic modeling purposes.

4. Datamart layer

In this project, we created two data marts. The first data mart keeps all the reviews from the Google Play Store in 2021. The DAG can be accessed here.

We joined dim_user, dim_app, dim_date, and dim_time to the fact table and extracted some attributes for our dashboarding purposes. We filtered the data to reviews from Google whose creation date fell in 2021 (GMT+7). Since the date attribute was in UTC, we converted it to Asia/Jakarta (GMT+7), because we would be presenting this data in our own timezone. We also added an attribute called polarity to mark whether a review is positive, negative, or neutral, using a CASE WHEN expression. We stored this data in a table named datamart_all_reviews_google. This DAG runs daily, but we stopped the job at a certain boundary date.
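A minimal sketch of that data mart query, assuming a rating-based rule for polarity; the exact thresholds and tag values are not given in the post, so they are illustrative.

```python
from google.cloud import bigquery

DATAMART_GOOGLE_SQL = """
CREATE OR REPLACE TABLE `my_project.datamart.datamart_all_reviews_google` AS
SELECT
  a.app_name,
  u.user_name,
  f.rating,
  -- Present dates in our own timezone (GMT+7).
  DATE(f.created_timestamp, 'Asia/Jakarta') AS review_date,
  CASE
    WHEN f.rating >= 4 THEN 'positive'
    WHEN f.rating = 3 THEN 'neutral'
    ELSE 'negative'
  END AS polarity
FROM `my_project.dwh.fact_reviews` AS f
JOIN `my_project.dwh.dim_user`   AS u ON u.user_key   = f.user_key
JOIN `my_project.dwh.dim_app`    AS a ON a.app_key    = f.app_key
JOIN `my_project.dwh.dim_review` AS r ON r.review_key = f.review_key
WHERE r.review_source = 'google_play'  -- assumed tag value from the staging layer
  AND EXTRACT(YEAR FROM DATE(f.created_timestamp, 'Asia/Jakarta')) = 2021
"""

client = bigquery.Client(project="my_project")
client.query(DATAMART_GOOGLE_SQL).result()
```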

The Graph View to create datamart_all_reviews_google

The second data mart keeps all the reviews from 1 October 2020 until 1 October 2021 for topic modeling purposes. The DAG can be accessed here.

In this table, the main focus is the review content. Therefore, we extracted the content and some other attributes from the fact table and the dimension tables. We filtered the data to reviews whose updated date (in GMT+7) fell between 1 October 2020 and 1 October 2021. We stored this data in a table named datamart_oct20_oct21. This DAG runs daily, but we stopped the job at a certain boundary date.
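And a minimal sketch of the date-range filter for this second data mart, using the same assumed names as above.

```python
from google.cloud import bigquery

DATAMART_TOPIC_SQL = """
CREATE OR REPLACE TABLE `my_project.datamart.datamart_oct20_oct21` AS
SELECT
  a.app_name,
  r.content,
  f.rating,
  DATE(f.updated_timestamp, 'Asia/Jakarta') AS review_date
FROM `my_project.dwh.fact_reviews` AS f
JOIN `my_project.dwh.dim_app`    AS a ON a.app_key    = f.app_key
JOIN `my_project.dwh.dim_review` AS r ON r.review_key = f.review_key
WHERE DATE(f.updated_timestamp, 'Asia/Jakarta')
      BETWEEN DATE '2020-10-01' AND DATE '2021-10-01'
"""

client = bigquery.Client(project="my_project")
client.query(DATAMART_TOPIC_SQL).result()
```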

The Graph View to create datamart_oct20_oct21

That's the end of the data engineering part. Whew, it's a bit long, but we still have the data analysis and data science parts :) so let's go.
