Data Engineering Project Quickstart Guide — Part 1

How to successfully kick off as a data engineer

Andreas Kretz
Plumbers Of Data Science
3 min read · Aug 30, 2022


You want to start your very first data engineering project but don't know how? This quickstart guide will help you out! It leads you through all the necessary steps and helpful tools to successfully implement your project, no matter which platform you use.

In this first part, you will learn everything about the data platform blueprint and its five main parts. This will help you better understand how a data platform actually works.

The Data Science Platform Blueprint — Connect, Buffer, Process, Store and Visualize

Connect

The connect phase is the first phase, where the data comes in. This could be a hosted API gateway that clients send data to. This data then flows through your platform.

Here you will also often find data integration tools that access existing (external) systems such as data warehouses, relational SQL databases, or external APIs. They extract the data and feed it into your platform and pipelines, as in the sketch below.
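Here is a minimal sketch of what such an integration job could look like in Python. It pulls records from a hypothetical external REST API and posts them to a (likewise hypothetical) ingestion endpoint of your platform; both URLs are placeholders you would replace with your own:

```python
import requests

# Hypothetical endpoints -- replace with your own systems
SOURCE_API = "https://api.example.com/orders"
INGEST_ENDPOINT = "https://platform.example.com/ingest"

def ingest_once():
    # Extract: pull the latest records from the external system
    response = requests.get(SOURCE_API, timeout=10)
    response.raise_for_status()
    records = response.json()

    # Feed each record into the platform's ingestion endpoint
    for record in records:
        requests.post(INGEST_ENDPOINT, json=record, timeout=10).raise_for_status()

if __name__ == "__main__":
    ingest_once()
```

In practice you would schedule a job like this (for example with Airflow or cron) or replace it entirely with a managed integration tool.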

Buffer

In the middle of the blueprint you have the main processing and storage functions of your platform.

If the source in the connect phase is constantly pushing data in, you need some kind of buffer so you can process it dynamically. That's where message queues come in. They prevent what is called backpressure, where data comes in faster than your processing can handle.

Buffers are also very useful if you have multiple processes working with the same data. You feed the data into the queue once, and multiple consumers can work with it in parallel.

Message queues help you to manage the flow of data within your platform without having to work with files.
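To make this concrete, here is a small sketch using Apache Kafka as the message queue, via the kafka-python client. The broker address, topic name, and payload are assumptions for illustration; the point is that the producer writes the data once and consumers read it independently:

```python
import json
from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

TOPIC = "incoming-events"  # hypothetical topic name

# Producer side: the connect phase pushes data into the queue once ...
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(TOPIC, {"user_id": 42, "action": "click"})
producer.flush()

# Consumer side: ... and any number of consumers read it in parallel
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
    group_id="analytics",
)
for message in consumer:  # runs until you stop it
    print(message.value)
```

Consumers that subscribe with different group_id values each receive the full stream, which is how several processing jobs can work with the same data in parallel.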

Process

The processing framework is where things actually happen. This is where data is taken either from storage or from a message queue, processed and stored again.

It's arguably the most important part of the whole platform, because without a processing framework, nothing happens. You need a way to transform and analyze data, and that's what the processing framework is there for.

Within the processing framework, there are two different types of processing: stream processing and batch processing. Batch processing is where you find extract, transform, and load (ETL) jobs. After processing, the data goes into storage.
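As an example, here is a minimal batch ETL job sketched with pandas. The file paths and column names are assumptions for illustration: it extracts raw orders from a landing zone, transforms them into a daily revenue aggregate, and loads the result back into storage.

```python
import pandas as pd

# Extract: read raw data from a hypothetical landing zone
raw = pd.read_csv("landing/orders.csv")

# Transform: drop incomplete rows and aggregate revenue per day
raw = raw.dropna(subset=["amount"])
raw["day"] = pd.to_datetime(raw["order_date"]).dt.date
daily_revenue = (
    raw.groupby("day", as_index=False)["amount"]
       .sum()
       .rename(columns={"amount": "revenue"})
)

# Load: write the result back to storage for the next stage
# (writing Parquet requires pyarrow or fastparquet)
daily_revenue.to_parquet("warehouse/daily_revenue.parquet", index=False)
```

On a real platform, the same pattern scales up with frameworks like Apache Spark, while stream processing runs continuously on data from the buffer instead of on files.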

Store

The store phase is where you persist your data. Here you find all kinds of data stores: relational databases, NoSQL databases, data warehouses, and data lakes. You will see these types again in Part 2, organized into OLTP and OLAP data stores.

The stored data is typically used again by the processing framework, by a visualization tool, or by an API that serves data to a client.
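As a minimal illustration, here is the store phase sketched with SQLite from Python's standard library, standing in for whatever relational database, warehouse, or lake you actually use. The table and values are made up for the example:

```python
import sqlite3

# A minimal relational store -- SQLite stands in for your real database
conn = sqlite3.connect("platform.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS daily_revenue (day TEXT PRIMARY KEY, revenue REAL)"
)

# The processing framework writes its results ...
conn.execute(
    "INSERT OR REPLACE INTO daily_revenue (day, revenue) VALUES (?, ?)",
    ("2022-08-30", 1234.56),
)
conn.commit()

# ... and downstream consumers (processing, visualization, APIs) read them back
for day, revenue in conn.execute("SELECT day, revenue FROM daily_revenue"):
    print(day, revenue)
conn.close()
```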

Visualize

After storing your data, you need to visualize it. That's where web interfaces, business intelligence tools, and monitoring dashboards come in. This is the part a user actually works with to explore and visualize the data.
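To close the loop, here is a minimal visualization sketch with matplotlib that reads the results from the hypothetical SQLite store above and renders a simple dashboard-style chart:

```python
import sqlite3
import matplotlib.pyplot as plt

# Read the stored results (same hypothetical SQLite store as above)
conn = sqlite3.connect("platform.db")
rows = conn.execute("SELECT day, revenue FROM daily_revenue ORDER BY day").fetchall()
conn.close()

days = [row[0] for row in rows]
revenue = [row[1] for row in rows]

# A minimal dashboard-style chart
plt.bar(days, revenue)
plt.xlabel("Day")
plt.ylabel("Revenue")
plt.title("Daily Revenue")
plt.tight_layout()
plt.show()
```

A real platform would serve this through a BI tool or a web dashboard, but the flow is the same: read from the store, render for the user.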

Part 2 — Learn the two main types of data pipelines

So, start with the platform blueprint, get familiar with it and work your way through it. This way you can start and grow your very own data engineering project!

In the second part of my quickstart guide, I will explain the concepts of OLTP (online transactional processing) and OLAP (online analytical processing) in more detail, so you get a better understanding of the different use cases.

Learn Data Engineering

Also, I’ll give you an overview of which tools you can find in which part of the platform as well as a great selection of hands-on example projects, which you can all find in my academy.

You want to start right away? How about this exciting contact tracing project where you work with Elasticsearch: https://learndataengineering.com/p/contact-tracing-with-elasticsearch

More free Data Engineering content:

Are you looking for more information and content on Data Engineering? Then check out my other blog posts, videos and more on Medium, YouTube and LinkedIn!
