Data Engineering — An Introduction

rindangchi
Published in Nerd For Tech
5 min read · May 5, 2024

Understanding Data Engineering at a Glance

image source : https://k21academy.com/microsoft-azure/dp-203/data-engineering/

In recent times, the concepts of Data Engineering and the Data Engineer have gained significant popularity, alongside data analysis and data science. Many in the data management field argue that demand for data engineering roles now exceeds demand for data analysts and data scientists, and that the data engineering process plays a pivotal role in overall data management.

What is Data Engineering and Who is a Data Engineer?

Data Engineering is the process of researching, designing, and building data pipelines that transform raw data into high-quality, consistent, and reliable datasets suitable for analysis by end users.

A Data Engineer, on the other hand, is the person who executes the data engineering process. Below are common tasks of a data engineer:

  1. Acquire datasets and align them with business needs.
  2. Design and develop data pipelines to perform the Extract, Transform, and Load (ETL) process from various data sources.
  3. Implement and maintain databases to store data effectively.
  4. Collaborate with management to understand company and business objectives.
  5. Ensure data quality and integrity through data cleansing processes.
  6. Monitor and troubleshoot data pipeline issues.
  7. Collaborate with data analysts and data scientists to understand their needs and requirements for further analysis.
  8. Implement security measures to ensure the protection of sensitive data.

To perform these tasks effectively, data engineers need to acquire the following skills:

  1. Programming Skills : Python, Java, SQL.
  2. Data Warehousing : Amazon Redshift, Google BigQuery.
  3. Database Management : PostgreSQL, MySQL, MongoDB, Cassandra.
  4. ETL Tools : Apache Spark, Apache Airflow, Talend.
  5. Data Modelling : ER diagrams, database schema design.
  6. Big Data Technologies : Spark, Hadoop.
  7. Cloud Platforms : AWS, Google Cloud Platform, Azure.
  8. Version Control : Knowledge of version control systems such as Git.

In addition to these technical skills, data engineers should have several soft skills, such as communication, because they collaborate with various stakeholders, and an eagerness to learn, because technologies in the data field keep changing.

Data Pipeline

A data pipeline is a process that extracts data from multiple sources, cleanses it, and finally stores it in a target system, typically a database or a data warehouse.

image source : https://weld.app/dictionary/data-pipeline

The data pipeline process includes the following stages:

  1. Data Collection : The process of collecting data from various sources, such as databases, APIs, file systems (CSV, Excel, etc.), and events. The data may arrive in different formats.
  2. Data Cleaning : In this stage the data is cleaned; the cleaning process involves checking for missing values, correcting any errors, and transforming the data into a consistent and reliable format. The goal is to produce accurate and reliable data (a minimal sketch follows this list).
  3. Data Integration : Data from different sources is merged into one single dataset.
  4. Data Storage : The processed data is stored in the target system.
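To make the data cleaning stage more concrete, here is a minimal sketch using pandas. The dataset, column names, and quality issues are invented for illustration, so treat it as an outline of the idea rather than a production cleaning job.

```python
# A minimal data-cleaning sketch with pandas.
# The columns ("order_id", "amount", "country") are hypothetical.
import pandas as pd

# Hypothetical raw data with typical quality issues:
# a missing value, inconsistent labels, and a duplicate row.
raw = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "amount":   [100.0, None, None, 250.0],
    "country":  ["US", "usa", "usa", "DE"],
})

clean = (
    raw
    .drop_duplicates(subset="order_id")                   # remove duplicate records
    .assign(
        amount=lambda df: df["amount"].fillna(0.0),       # handle missing values
        country=lambda df: df["country"].str.upper()      # enforce a consistent format
                             .replace({"USA": "US"}),
    )
)

print(clean)
```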

These stages are commonly known as the ETL (Extract, Transform, Load) process: Extract corresponds to data collection, Transform to data cleaning and integration, and Load to data storage.

image source : google.com

Extract : data is pulled from multiple sources; it can be extracted from files, databases, APIs, web scraping, sensor data, or third parties.

Transform : the data is converted into a format that suits the target system, so that it has a consistent structure and definition. The transformation process typically includes data cleaning, data aggregation, data filtering, joining data, data validation, data mapping, and calculations.

Load : finally, the data is loaded into the target system, such as a data warehouse, an operational database, and so on.
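As a rough illustration of how these three steps fit together in code, the sketch below extracts records from a CSV file, transforms them with pandas, and loads the result into SQLite as a stand-in for a data warehouse. The file name, table name, and columns are assumptions made for this example.

```python
# A minimal ETL sketch: extract from a CSV file, transform with pandas,
# and load into SQLite as a stand-in for a warehouse. The file name,
# table name, and column names are hypothetical.
import sqlite3
import pandas as pd

def extract(path: str) -> pd.DataFrame:
    """Extract: read raw records from a source file."""
    return pd.read_csv(path)

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Transform: clean, filter, and aggregate into a consistent shape."""
    df = df.dropna(subset=["customer_id"])             # data cleaning
    df = df[df["amount"] > 0]                          # data filtering
    return (df.groupby("customer_id", as_index=False)  # data aggregation
              .agg(total_amount=("amount", "sum")))

def load(df: pd.DataFrame, conn: sqlite3.Connection) -> None:
    """Load: write the processed data into the target system."""
    df.to_sql("customer_sales", conn, if_exists="replace", index=False)

if __name__ == "__main__":
    conn = sqlite3.connect("warehouse.db")
    load(transform(extract("orders.csv")), conn)
    conn.close()
```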

ETL vs ELT

Besides ETL, there is also a popular approach called ELT, which is now widely used in industry. Here is the difference between ETL and ELT.

Extract Transform Load (ETL)

  1. Extract data from various sources → transform the data → load the data into the target system.
  2. Data transformation is conducted on a separate processing server before loading.
  3. Raw data is not sent to the data warehouse.
  4. Ideal for small datasets that do not require complex transformations.
image source : https://rivery.io/blog/etl-vs-elt/

Extract Load Transform (ELT)

  1. The transformation process occurs after the load process.
  2. Data transformation is performed inside the data warehouse.
  3. Raw data is sent to the data warehouse.
  4. Can handle large volumes of data.
image source : https://rivery.io/blog/etl-vs-elt/

ETL vs ELT process

image source : https://www.striim.com/blog/guide-to-data-pipelines/
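For contrast with the ETL sketch above, here is a minimal ELT-style sketch: the raw data is loaded into the warehouse first (SQLite again stands in for the warehouse), and the transformation then runs afterwards as SQL inside the database itself. The table and column names are assumptions for illustration.

```python
# A minimal ELT sketch: load raw data into the warehouse first,
# then transform it inside the database with SQL.
# Table and column names are hypothetical.
import sqlite3
import pandas as pd

conn = sqlite3.connect("warehouse.db")

# Load: send the raw, untransformed data straight into the warehouse.
raw = pd.DataFrame({
    "customer_id": [1, 1, 2, None],
    "amount":      [100.0, 50.0, -5.0, 80.0],
})
raw.to_sql("raw_orders", conn, if_exists="replace", index=False)

# Transform: cleaning and aggregation happen in the warehouse,
# expressed as SQL, after the load step.
conn.execute("DROP TABLE IF EXISTS customer_sales")
conn.execute("""
    CREATE TABLE customer_sales AS
    SELECT customer_id, SUM(amount) AS total_amount
    FROM raw_orders
    WHERE customer_id IS NOT NULL AND amount > 0
    GROUP BY customer_id
""")
conn.commit()
conn.close()
```

The design difference is visible in where the second half of the work runs: the ETL sketch transforms data in Python before it reaches the database, while this sketch pushes the transformation down into the warehouse engine after the raw load.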

Data Warehouse & Data Mart

Whether the ETL or the ELT process is used, the data ends up stored in a data warehouse or a data mart. So what is the difference between a data warehouse and a data mart?

A Data Warehouse is centralized storage used by end users to perform data analysis; it can contain raw data or processed data, and it typically consists of several databases.

A Data Mart is a subset of a data warehouse intended for a specific group of users, for example a data mart for the sales & marketing team, the human resources team, and so on.

Data warehouse and data mart illustration:

image source : https://panoply.io/data-warehouse-guide/data-mart-vs-data-warehouse/
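As a small illustration of this relationship, the sketch below keeps a company-wide table in the warehouse and exposes a sales-only view as a simple data mart. SQLite stands in for the warehouse, and the table, view, and column names are assumptions for the example.

```python
# A minimal sketch of a data mart on top of a warehouse table:
# the warehouse holds data for all departments, while the mart
# exposes only the slice the sales team needs. Names are hypothetical.
import sqlite3

conn = sqlite3.connect("warehouse.db")

# Warehouse table shared by the whole company.
conn.execute("DROP TABLE IF EXISTS transactions")
conn.execute("""
    CREATE TABLE transactions (
        id INTEGER, department TEXT, amount REAL, created_at TEXT
    )
""")

# Data mart: a view scoped to a single team, so sales analysts
# query only the subset of the warehouse relevant to them.
conn.execute("DROP VIEW IF EXISTS sales_mart")
conn.execute("""
    CREATE VIEW sales_mart AS
    SELECT id, amount, created_at
    FROM transactions
    WHERE department = 'sales'
""")
conn.commit()
conn.close()
```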

Here’s a brief summary of data engineering. I hope this article provides additional insights for readers, particularly those interested in learning about data engineering.
