Data Ingestion (Incremental Data Ingestion)

Oladayo
Published in CodeX
8 min read · Mar 15, 2023

Hi Everyone,

Welcome to the second part of the four-part series on data ingestion. You can read the first part here.

In this post, I will go over the incremental data ingestion approach and demonstrate it by ingesting data from a table in MySQL (the source system) into a table in BigQuery (the destination system) using an incremental data ingestion script (a Python script).

Photo by Aaron Jones on Unsplash

Incremental Data Ingestion

In an incremental data ingestion approach, any time the incremental data ingestion script runs, it reads only new data records that have been added to the specified table in a MySQL database since its previous run. The new data records are then appended to the specified table in a BigQuery dataset.

Using this approach requires:

  1. A timestamp column in the specified table in a MySQL database. The timestamp for each data record row should be the time the data record was added to the table.
  2. A table in BigQuery that stores the latest timestamp from the source system table on each run of the script. Every subsequent run of the script overwrites the previously stored value. The value stored in this table is used to read only the newly added (incremental) data records from the specified table in MySQL.
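To make the mechanism concrete, here is a minimal sketch of the incremental read, using in-memory pandas DataFrames to stand in for the MySQL table and the stored timestamp. The column names follow the demo tables; the values and variable names are illustrative assumptions, not the article's actual script.

```python
import pandas as pd

# In-memory stand-in for the orders table in MySQL
orders = pd.DataFrame({
    "orderId": [1, 2, 3],
    "orderValue": [10.5, 20.0, 7.25],
    "orderTimestamp": pd.to_datetime(["2023-03-01 10:00:00",
                                      "2023-03-02 11:30:00",
                                      "2023-03-03 09:15:00"]),
})

# Value that the latest_timestamp table in BigQuery held after the last run
latest_timestamp = pd.Timestamp("2023-03-02 11:30:00")

# Only rows added since the previous run are read
incremental = orders[orders["orderTimestamp"] > latest_timestamp]
print(incremental["orderId"].tolist())  # [3]
```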

Now, let’s demonstrate the incremental data ingestion in steps;

*Note: I have already dropped the database, tables and dataset we used in the full data ingestion demonstration.

  1. Let’s create a database named ‘demo’, select the database and create a table named ‘orders’ with three columns (orderId, orderValue and orderTimestamp) in the database in MySQL.
creating database demo and orders table inside the database

We created the database named demo, selected the database and created the orders table with three columns named orderId, orderValue and orderTimestamp in the database.

The orderId column is of the integer data type and cannot have a missing record. The first record starts from 1, and each new record after that auto-increments by 1. orderId is also the unique identifier (primary key) for the table.

The orderValue column is of the float data type. The column can take integer and decimal numerical values as records.

The orderTimestamp column is of the TIMESTAMP data type. By default, the value for each data record is the timestamp at which the data was added to the orders table. If a data record is updated, its orderTimestamp value changes to the timestamp of the update.

As we can see, when I query the table, it returns an empty set, meaning no records exist.

2. Let’s insert some random values into table orders.

inserting random values in the orders table and verifying it

We have inserted five records into the orders table. When I query the table this time, it returns the records in the table.

3. Let’s create a dataset named demo in BigQuery

creating dataset demo in BigQuery

  4. Let’s create a table named latest_timestamp in the demo dataset in BigQuery. The latest_timestamp table stores the latest orderTimestamp value from the orders table in MySQL.

creating latest_timestamp table in the demo dataset in BigQuery

5. Next is to run the incremental data ingestion (Python) script. The script includes an explanation of what each cell does.
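As a rough idea of what one run of the script does, here is an in-memory sketch. The real script would read from MySQL and write to BigQuery (for example via pandas-gbq), so the function name, wiring and sample values here are assumptions for illustration, not the author's exact code.

```python
import pandas as pd

def run_ingestion(source_df, dest_df, latest_timestamp):
    """One run: append rows newer than latest_timestamp to the destination
    and return the new timestamp to store back in the latest_timestamp table."""
    new_rows = source_df[source_df["orderTimestamp"] > latest_timestamp]
    dest_df = pd.concat([dest_df, new_rows], ignore_index=True)
    if not new_rows.empty:
        latest_timestamp = new_rows["orderTimestamp"].max()
    return dest_df, latest_timestamp

# First run: the destination is empty, so every source row is new
source = pd.DataFrame({
    "orderId": [1, 2, 3, 4, 5],
    "orderValue": [10.0, 20.0, 30.0, 40.0, 50.0],
    "orderTimestamp": pd.to_datetime([f"2023-03-0{d} 12:00:00" for d in range(1, 6)]),
})
dest = source.iloc[0:0].copy()              # empty table, same schema
dest, stored_ts = run_ingestion(source, dest, pd.Timestamp.min)
print(len(dest))  # 5 rows ingested
```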

6. Let’s check the demo dataset in BigQuery to see if we have a table named orders in it. If yes, let’s query the table.

orders table with the ingested data in BigQuery

As shown above, the orders table has been created and querying the table displays the ingested data records.

We have successfully ingested data from the orders table in MySQL to the orders table in BigQuery.

7. Let’s also check the latest_timestamp table to see if there is any data in it.

latest_timestamp table with the latest orderTimestamp data stored in it

Querying the table shows we have the latest orderTimestamp value stored in it.

Next, let’s see what happens when we make changes to the orders table in MySQL. Three SQL statements account for such changes: INSERT, UPDATE and DELETE.

Let’s demonstrate them one after the other.

  1. Let’s add new records to the orders table in MySQL.
updated orders table in MySQL with 2 new records

In the snapshot above, we inserted two new records with orderId 6 and orderId 7.

2. Run the incremental data ingestion script as before.

3. Let’s check to see what’s in the orders table in BigQuery

orders table with the ingested data (updated) in BigQuery

As shown above, the orders table in BigQuery has data records that exactly match the updated orders table in the source system.

In incremental data ingestion, new records (created by using INSERT syntax) are captured during the incremental data ingestion script run and the records are appended to the specified table in the destination system.
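A small in-memory sketch of why INSERTs are handled correctly: on the second run, only the rows stamped after the stored timestamp pass the filter and get appended. The table contents here are illustrative stand-ins for the demo tables.

```python
import pandas as pd

# Source table after two INSERTs (orderId 6 and 7); the destination holds
# the five rows from the first run.
source = pd.DataFrame({
    "orderId": [1, 2, 3, 4, 5, 6, 7],
    "orderTimestamp": pd.to_datetime([f"2023-03-0{d} 12:00:00" for d in range(1, 8)]),
})
dest = source.iloc[:5].copy()
stored_ts = dest["orderTimestamp"].max()   # value from the latest_timestamp table

# Only the newly inserted rows are read and appended
new_rows = source[source["orderTimestamp"] > stored_ts]
dest = pd.concat([dest, new_rows], ignore_index=True)
print(new_rows["orderId"].tolist())  # [6, 7]
```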

4. Let’s check to see what’s in the latest_timestamp table in BigQuery

latest_timestamp table with the updated latest orderTimestamp data stored in it

As shown above, the latest_timestamp table has the orderTimestamp value that matches the orderTimestamp for orderId 7 (the latest orderTimestamp) in the orders table in MySQL.

Let’s demonstrate when updates are made to existing data records in the orders table in MySQL.

  1. Let’s update two existing records in the orders table in MySQL.
updated orders table in MySQL with two updates made

In the snapshot above, I updated the orderValue for orderId 1 and orderId 5.

2. Run the incremental data ingestion script as before.

3. Let’s check to see what’s in the orders table in BigQuery.

updated orders table in BigQuery

In the snapshot above, the orders table in BigQuery doesn’t exactly match the orders table in MySQL.

We have duplicates for orderId 1 and orderId 5 which are the rows that were updated in the orders table in MySQL. I will talk about a fix later.
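The duplicates arise because updating a row refreshes its orderTimestamp past the stored value, so the append-only script picks the row up again as if it were new. A small in-memory illustration (the values are made up):

```python
import pandas as pd

# Destination after the first run, and a source table where orderId 1 is
# then UPDATEd (new orderValue, refreshed orderTimestamp).
dest = pd.DataFrame({"orderId": [1, 2], "orderValue": [10.0, 20.0],
                     "orderTimestamp": pd.to_datetime(["2023-03-01", "2023-03-02"])})
source = dest.copy()
stored_ts = dest["orderTimestamp"].max()   # from the latest_timestamp table

source.loc[source["orderId"] == 1, "orderValue"] = 15.0
source.loc[source["orderId"] == 1, "orderTimestamp"] = pd.Timestamp("2023-03-05")

# The updated row's timestamp is newer than stored_ts, so the append-only
# script treats it as a brand-new record.
new_rows = source[source["orderTimestamp"] > stored_ts]
dest = pd.concat([dest, new_rows], ignore_index=True)
print(dest["orderId"].tolist())  # [1, 2, 1]: orderId 1 is duplicated
```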

Let’s demonstrate when a delete is made to an existing data record in the orders table in MySQL.

  1. Let’s delete a record in the orders table in MySQL.
updated orders table in MySQL with 1 deletion made

In the snapshot above, I deleted the row record for orderId 3.

2. Run the incremental ingestion script as before.

3. Let’s check to see what’s in the orders table in BigQuery.

updated orders table in BigQuery

In the above snapshot, the orders table in BigQuery still has a row record for orderId 3. This means the delete operation carried out on the orders table in MySQL is not reflected in the orders table in BigQuery.

In summary, the INSERT operation works fine, the UPDATE operation creates duplicates, and the DELETE operation is not reflected in the destination table.

How does one fix this? New additions will be made to the incremental data ingestion script as follows:

  1. read the incremental data records in the orders table in MySQL into a DataFrame (orders_mysql_df).
  2. read the orders table in BigQuery into a DataFrame (orders_bq_df).
  3. read the full data records for the primary key column (orderId) in the orders table in MySQL into a Series (ordersId_mysql).

To fix the UPDATE operation causing duplicates in the destination table, one will have to:

  1. check if the primary key (orderId) data records of orders_mysql_df are in the primary key (orderId) data records of orders_bq_df.
  2. If yes, update the orders table in BigQuery for such primary key with the new data records in orders_mysql_df.
  3. If no, append such primary key data records in orders_mysql_df to the orders table in BigQuery.
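The update fix above can be sketched with pandas. The in-memory DataFrames stand in for the two tables, and the real script would push the result back to BigQuery, so treat this as an illustrative sketch rather than the author's exact code.

```python
import pandas as pd

orders_bq_df = pd.DataFrame({"orderId": [1, 2, 3],
                             "orderValue": [10.0, 20.0, 30.0]})
orders_mysql_df = pd.DataFrame({"orderId": [3, 4],
                                "orderValue": [35.0, 40.0]})  # incremental rows

# Which incremental primary keys already exist in the destination?
existing = orders_mysql_df["orderId"].isin(orders_bq_df["orderId"])

# Update matching primary keys in place ...
orders_bq_df = orders_bq_df.set_index("orderId")
orders_bq_df.update(orders_mysql_df[existing].set_index("orderId"))
orders_bq_df = orders_bq_df.reset_index()

# ... and append the rest
orders_bq_df = pd.concat([orders_bq_df, orders_mysql_df[~existing]],
                         ignore_index=True)
print(orders_bq_df)  # orderId 3 updated to 35.0, orderId 4 appended
```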

To fix the DELETE operation not reflecting in the destination table, one will have to:

  1. check to see if the primary key (orderId) data records of orders_bq_df are in ordersId_mysql.
  2. If yes, do nothing.
  3. If no, it implies that such primary key records have been deleted from the orders table in the source system. Therefore, one has to delete such primary key data records from the orders table in BigQuery.
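Likewise, the delete fix can be sketched as a key comparison; again, the DataFrame and Series are in-memory stand-ins for the BigQuery table and the full MySQL key list.

```python
import pandas as pd

# Destination table and the full primary-key list read from MySQL,
# where orderId 3 has just been deleted.
orders_bq_df = pd.DataFrame({"orderId": [1, 2, 3, 4],
                             "orderValue": [10.0, 20.0, 30.0, 40.0]})
ordersId_mysql = pd.Series([1, 2, 4])

# Keep only the rows whose primary key still exists in the source
still_exists = orders_bq_df["orderId"].isin(ordersId_mysql)
orders_bq_df = orders_bq_df[still_exists].reset_index(drop=True)
print(orders_bq_df["orderId"].tolist())  # [1, 2, 4]
```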

Making these changes leads to a modified incremental data ingestion script.

Now let’s see the modified script in action.

  1. I have truncated the orders tables in MySQL and BigQuery.
  2. Let’s insert the same values as before into the orders table in MySQL with new orderTimestamp values.
orders table in MySQL with the same values but different orderTimestamp as before

3. Run the modified incremental data ingestion script

4. Let’s check the data in the orders table in BigQuery

orders table in BigQuery

The orders table in BigQuery has exactly the same data as the orders table in MySQL. We have successfully ingested data from the orders table in MySQL to the orders table in BigQuery using the modified incremental data ingestion script.

  1. Now, let’s insert new data records, update existing data records and delete a data record in the orders table in MySQL
updated orders table in MySQL

As shown in the snapshot above, we inserted two new data records with orderId 6 and orderId 7, updated orderValue for orderId 1 and orderId 5 and deleted the data record for orderId 3.

2. Run the modified incremental data ingestion script

3. Let’s check the data in the orders table in BigQuery

updated orders table in BigQuery

The updated orders table in BigQuery has the same exact data as the updated orders table in MySQL.

There are no duplicate data records, and the row deleted from the orders table in MySQL is also absent from the orders table in BigQuery.

In summary,

Incremental data ingestion should perform better than full data ingestion, but that holds only for the INSERT operation.

When one has to deal with UPDATE and DELETE operations in incremental data ingestion, a whole new complexity is introduced: one has to match primary keys between the table in the source system and the table in the destination system, which in turn erodes the expected performance gain.

In the next part of this series, I will write about the third approach, change data capture.

Thank you for reading.

References

Data Pipeline Pocket Reference by James Densmore (Get it here)
