Data Ingestion (Full Data Ingestion)

Oladayo · Published in CodeX · Mar 12, 2023

Hi Everyone,

This post is the first of a four-part series on data ingestion. The series is based on my reading of the Data Pipelines Pocket Reference by James Densmore.

This first part will introduce data ingestion and demonstrate the full data ingestion approach by ingesting data from a source (a MySQL table) to a destination (a BigQuery table) using a full data ingestion script written in Python.

Photo by Quinten de Graaf on Unsplash

Introduction

Data Ingestion means moving data from a source (e.g. a table in MySQL or PostgreSQL) to a destination (e.g. a table in a data warehouse such as BigQuery or Redshift).

Data ingestion is an integral part of the data pipeline building process. It is the extract and load (EL) part of the extract, load, transform (ELT) pipeline design.

End users, such as the analytics team, can then transform the ingested data to answer business data needs.

Now, let’s talk about the full data ingestion approach.

Full Data Ingestion

In a full data ingestion approach, each time the full data ingestion script runs, it reads all the records in the specified table in the MySQL database and loads them into a table in the BigQuery dataset. If the same table already exists in the BigQuery dataset, the run replaces it.

Now, let’s demonstrate this in steps:

1. Let’s create a database named ‘demo’ in MySQL, select it, and create a table named ‘orders’ with two columns (orderId and orderValue).

creating database demo and orders table inside the database

We created the demo database, selected it, and created the orders table with two columns named orderId and orderValue.

The orderId column is of the integer data type and cannot have a missing (null) value. It auto-increments: the first record starts from 1, and each new record increases the previous value by 1. orderId is also the unique identifier (primary key) for the table.

The orderValue column is of the float data type, so it can take both integer and decimal numerical values.

As we can see, querying the table at this point returns an empty set, meaning no records exist yet.
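For readers following along without the screenshot, the commands behind this step look roughly like the ones below. This is a sketch, not the exact commands from the screenshot; the column definitions are inferred from the description above.

```sql
-- Create the demo database, switch to it, and create the orders table.
CREATE DATABASE demo;
USE demo;

CREATE TABLE orders (
    orderId INT NOT NULL AUTO_INCREMENT,  -- starts at 1, increments by 1
    orderValue FLOAT,                     -- accepts integer and decimal values
    PRIMARY KEY (orderId)                 -- unique identifier for the table
);

-- Confirm the table is empty.
SELECT * FROM orders;   -- Empty set
```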

2. Let’s insert some random values into the orders table.

inserting random values in the orders table and verifying it

We have inserted five records into the orders table. Querying the table this time returns those records.
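The insert step might look something like the following. Only the 900.5 value for orderId 1 is taken from the post (it gets updated later); the other four values are made up for illustration.

```sql
-- Insert five records; orderId is generated automatically.
INSERT INTO orders (orderValue)
VALUES (900.5), (120.0), (75.25), (430.0), (60.5);

-- Verify the inserted records.
SELECT * FROM orders;
```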

3. Let’s create a dataset named demo in BigQuery. A dataset is BigQuery’s equivalent of a MySQL database; tables are stored inside it.

dataset demo that was created in BigQuery

We have created a dataset named demo in BigQuery using the Google Cloud SDK Shell.
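Assuming the Google Cloud SDK is installed and authenticated against the right project, the dataset can be created from the SDK Shell with the bq command-line tool:

```bash
# Create a dataset named demo in the default project
bq mk demo
```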

4. Next is to run the full data ingestion script (a Python script). The script has an explanation of what each cell does.
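The post embeds the actual script; as a stand-in, here is a minimal sketch of what a full data ingestion script from MySQL to BigQuery can look like. It is not the author’s exact code: the host, user, password and project id are placeholders, and it assumes the pymysql, pandas, pyarrow and google-cloud-bigquery packages are installed, with BigQuery credentials already configured.

```python
# Minimal full data ingestion sketch: MySQL (demo.orders) -> BigQuery (demo.orders).
# Placeholder connection details; adjust for your own environment.
import pandas as pd
import pymysql
from google.cloud import bigquery

# --- Extraction: read every record in the orders table from MySQL ---
connection = pymysql.connect(
    host="localhost",
    user="your_mysql_user",          # placeholder
    password="your_mysql_password",  # placeholder
    database="demo",
)
df = pd.read_sql("SELECT orderId, orderValue FROM orders;", connection)
connection.close()

# --- Loading: load the records into BigQuery, replacing any existing table ---
client = bigquery.Client(project="your-gcp-project-id")  # placeholder
job_config = bigquery.LoadJobConfig(
    # WRITE_TRUNCATE = overwrite the destination table on every run (full ingestion)
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
load_job = client.load_table_from_dataframe(df, "demo.orders", job_config=job_config)
load_job.result()  # wait for the load job to complete

print(f"Loaded {load_job.output_rows} rows into demo.orders")
```

The WRITE_TRUNCATE write disposition is what makes this a full ingestion: every run replaces whatever was in the destination table before.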

5. Let’s check the demo dataset in BigQuery to see if we have a table called orders in it. If yes, let’s query the table.

orders table with the ingested data in BigQuery

As shown above, the orders table has been created, and querying it displays the ingested data records.
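For reference, the check can be done in the BigQuery console (or with the bq tool) using a standard SQL query like the one below; the project qualifier is omitted on the assumption that demo lives in the default project.

```sql
-- List the ingested records in the destination table
SELECT orderId, orderValue
FROM demo.orders
ORDER BY orderId;
```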

We have successfully ingested data from the orders table in MySQL to the orders table in BigQuery.

Next is to see what happens when we make changes to the orders table in MySQL. Three SQL statements account for such changes: INSERT, UPDATE and DELETE.

Let’s demonstrate all three at once.

1. Let’s add a new record, update an existing record and delete an existing record in the orders table in MySQL.

updated orders table in MySQL

In the snapshot, we inserted a new data record with orderId 6, updated the orderValue from 900.5 to 350.5 for orderId 1 and deleted the record for orderId 3.

Querying the table reflects all the changes we made.
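Those three changes might have been made with statements along these lines; the orderValue for the new record is made up, while the update from 900.5 to 350.5 and the deletion of orderId 3 come from the post.

```sql
-- Insert a new record (auto-assigned orderId 6), update orderId 1, delete orderId 3
INSERT INTO orders (orderValue) VALUES (275.0);
UPDATE orders SET orderValue = 350.5 WHERE orderId = 1;
DELETE FROM orders WHERE orderId = 3;

-- Verify the changes
SELECT * FROM orders;
```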

2. Run the full data ingestion script as before.

3. Let’s check to see what’s in the orders table in BigQuery.

orders table with the ingested data (updated) in BigQuery

As shown above, the orders table in BigQuery has data records that exactly match the updated orders table in the source system.

In full data ingestion, every change to the table(s) in the source system is captured when the full data ingestion script runs: the script stores a fresh copy of the table(s) in the destination system, overwriting the previously stored version(s).

One of the most common data quality issues is duplicate records.

In the full data ingestion approach, if there are duplicate records in the target table in the destination system, it could mean that:

  1. the duplicate records come from the source system table, or
  2. the load into the destination system table was configured to append to the existing table instead of replacing it.
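Case 2 corresponds to the write disposition of the load job. In the sketch above, WRITE_TRUNCATE makes each run replace the destination table; if it were set to WRITE_APPEND instead, every run would stack another full copy of the source table onto the destination table, producing duplicates.

```python
# Full ingestion: replace the destination table on every run.
job_config = bigquery.LoadJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE
)

# If this were WRITE_APPEND, each run would add all the extracted rows
# on top of the rows from previous runs, creating duplicates.
```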

In case 1, it’s best practice to implement data quality checks between the extraction and loading sections of the full ingestion Python script and to monitor those checks.
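As a minimal illustration (not the book’s implementation), a check like the one below could sit between the extraction and loading steps of the sketch above, failing the run if the extracted DataFrame contains duplicate primary keys.

```python
# Data quality check: fail fast if the extracted data contains duplicate orderIds.
duplicate_rows = int(df.duplicated(subset=["orderId"]).sum())
if duplicate_rows > 0:
    raise ValueError(
        f"Data quality check failed: {duplicate_rows} duplicate orderId value(s) "
        "found in the extracted data; aborting the load."
    )
```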

The full data ingestion approach is simple to implement, as we have seen, but it is less scalable when the source system tables change frequently.

In the next part of this series, I will write about the second approach, the incremental data ingestion approach.

Thank you for reading.

References

Data Pipelines Pocket Reference by James Densmore (O’Reilly Media)
