Data Engineering 101: Introduction to Data Engineering

Jacintasally
5 min read · Aug 21, 2022


A comprehensive guide to data engineering

Data engineering is the process of designing, building, and maintaining systems that collect, store, and process data. It is a branch of computer science and engineering that combines software engineering, database design and management, systems engineering, and network engineering. Data engineering is a critical part of data science, as it ensures that data is collected, stored, and processed in a way that is efficient, reliable, and scalable. Without data engineering, data science would not be possible.

[Figure: Data engineering visualization by Burtch Works, showing the relationship between data engineering and organisational data]

Why Data Engineering

By 2025, the International Data Corporation (IDC) projects that the digital data created worldwide will reach approximately 163 zettabytes². Data engineering is a critical function in today's data-driven world. Organizations of all types and sizes now rely on data to make decisions, and data engineering is the key to ensuring that this data is of the highest quality and is available when and where it is needed. As more data is generated, the demand for data engineers grows with it.

Data Engineer Tasks

In the course of their work, data engineers will perform the following functions:

  • Designing, building, and maintaining data processing systems
  • Developing and managing data warehouses
  • Mining and analyzing data
  • Working with big data to develop systems and tools that process and analyze large data sets

Data Engineer Salary

Glassdoor estimates the base pay of a data engineer to be approximately Ksh 200,000 per month in Kenya¹.

The Data Engineering Learning Path

The following are the core topics a data engineering enthusiast should master:

1. Programming- Programming is a fundamental skill for data engineers, as most data engineering tasks rely on writing scripts. Learning Python is highly encouraged because it is widely used across industries today. Python is easy for beginners to understand and has many modules and libraries for tasks including data wrangling, data analysis, machine learning, and deep learning. It is also a common choice for everyday scripting.
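As a taste of the data-wrangling work described above, here is a minimal sketch in pure Python that parses a small CSV of sales records and aggregates revenue per product. The data and column names are hypothetical, invented for illustration.

```python
import csv
import io

# Hypothetical sales data; in practice this would come from a file.
raw = """product,quantity,unit_price
widget,3,2.50
gadget,1,10.00
widget,2,2.50
"""

# Parse each row and accumulate revenue per product.
totals = {}
for row in csv.DictReader(io.StringIO(raw)):
    revenue = int(row["quantity"]) * float(row["unit_price"])
    totals[row["product"]] = totals.get(row["product"], 0.0) + revenue

print(totals)  # {'widget': 12.5, 'gadget': 10.0}
```

Even this tiny example exercises the core loop of data engineering: read raw records, cast types, and aggregate.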

2. Scripting and Automation- In data engineering, scripting and automation refer to automating the creation and maintenance of data pipelines. This can include tasks such as provisioning resources, configuring settings, and deploying code, as well as more complex tasks such as monitoring data flows and managing data quality. Learners should focus on the basics of scripting languages such as Ruby or Python and on automation tools such as Puppet. Additionally, one should learn how to integrate automation into the data engineering workflow and, finally, how to troubleshoot and debug automation scripts.
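One common automation pattern worth practising is retrying a flaky pipeline step. The sketch below is an assumption of how such a script might look, with no real infrastructure involved; `flaky_extract` is a stand-in for a task that fails transiently.

```python
import time

def run_with_retries(task, retries=3, delay=0.01):
    """Run a task, retrying on failure — a common automation pattern."""
    for attempt in range(1, retries + 1):
        try:
            return task()
        except Exception as exc:
            print(f"attempt {attempt} failed: {exc}")
            if attempt == retries:
                raise
            time.sleep(delay)

# A stand-in task that fails twice before succeeding.
calls = {"n": 0}
def flaky_extract():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient connection error")
    return "data loaded"

result = run_with_retries(flaky_extract)
```

Real schedulers (cron, Airflow, and the like) bake this retry logic in, but understanding it at this level helps when debugging automation scripts.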

3. Relational Databases and SQL- Relational databases and SQL are the fundamental technologies for storing and querying data. In order to learn how to effectively use these technologies, data engineering enthusiasts need to understand the following concepts:

  • The basics of relational databases, including how to structure data in tables and how to query data using the SQL language.
  • The basics of SQL, including how to select data, how to insert and update data, and how to use SQL functions and operators.
  • How to design efficient and effective database schemas, including how to normalize data and how to choose appropriate data types.
  • How to optimize SQL queries for performance, including how to use indexes and how to write efficient SQL code.
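The concepts above can be tried out without installing anything, using Python's built-in sqlite3 module. This is a minimal sketch; the `orders` table and its columns are hypothetical.

```python
import sqlite3

# An in-memory database: nothing touches disk.
conn = sqlite3.connect(":memory:")

# Structure data in a table with appropriate data types.
conn.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer TEXT NOT NULL,
        amount REAL NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 30.0), ("bob", 15.0), ("alice", 5.0)],
)

# An index speeds up lookups and grouping on the customer column.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")

# Select and aggregate with SQL functions and GROUP BY.
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(rows)  # [('alice', 35.0), ('bob', 15.0)]
```

The same SQL transfers almost unchanged to production systems such as PostgreSQL or MySQL.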

4. NoSQL Databases and MapReduce

There is a lot to learn about NoSQL databases and MapReduce in data engineering. However, here are some key things to focus on:

  • How NoSQL databases work and their key features.
  • How to design data models for NoSQL databases.
  • How to query NoSQL databases using MapReduce.
  • How to optimize MapReduce jobs for performance.
  • How to troubleshoot and debug MapReduce jobs.
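To build intuition for the MapReduce model before touching Hadoop, it can help to run the three phases in memory. The following is a small sketch of the classic word-count job: map emits key/value pairs, shuffle groups them by key, and reduce aggregates each group.

```python
from collections import defaultdict

def map_phase(doc):
    # Emit a (word, 1) pair for every word in the document.
    for word in doc.split():
        yield word, 1

def shuffle(pairs):
    # Group all values by key, as the framework's shuffle stage would.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Aggregate each key's values into a final count.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big pipelines", "big data"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 3, 'data': 2, 'pipelines': 1}
```

In a real cluster the map and reduce functions run in parallel across machines, but the programming model is exactly this.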

5. Data Analysis- There are a few key things to learn in data analysis when working in data engineering. Firstly, it is important to understand the basics of statistical analysis and how to use various tools to effectively analyze data. Secondly, it is also beneficial to learn how to effectively visualize data so that it can be easily interpreted. Finally, it is also important to be familiar with the different types of data that can be collected and stored in order to effectively engineer data solutions.
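The basics of statistical analysis mentioned above can be practised with Python's standard library alone. This sketch uses hypothetical daily request counts with one obvious outlier.

```python
import statistics

# Hypothetical daily request counts; 2000 is an outlier.
daily_requests = [120, 135, 128, 150, 2000, 140, 132]

mean = statistics.mean(daily_requests)
median = statistics.median(daily_requests)
stdev = statistics.stdev(daily_requests)

# The median resists the 2000 outlier far better than the mean,
# which is why both are worth reporting when summarizing data.
print(f"mean={mean:.1f}, median={median}, stdev={stdev:.1f}")
```

Seeing the mean pulled far above the median is a quick, concrete reminder of why a single summary statistic can mislead.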

6. Data Processing Techniques- There are a few key things to learn in Data Processing Techniques for data engineering:

  • Batch Processing: This is a process where data is processed in batches, typically on a schedule. This can be used to process large amounts of data efficiently.
  • Building Data Pipelines: This involves creating a system to efficiently move data from one place to another. This is often done using ETL (Extract, Transform, Load) tools, e.g., Hevo.
  • Debugging: This is the process of finding and fixing errors in data processing jobs, often within frameworks like Hadoop or Spark.
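The pipeline idea above can be sketched as three plain functions. This is only an illustration of the shape of an ETL job; the record fields are hypothetical, and `extract` returns hard-coded data where a real job would read from an API, file, or database.

```python
def extract():
    # Stand-in for reading raw records from a source system.
    return [{"name": " Alice ", "age": "34"}, {"name": "Bob", "age": "29"}]

def transform(records):
    # Clean whitespace and cast types before loading.
    return [{"name": r["name"].strip(), "age": int(r["age"])} for r in records]

def load(records, target):
    # Stand-in for writing to a warehouse table.
    target.extend(records)

warehouse = []
load(transform(extract()), warehouse)
print(warehouse)
```

Production ETL tools add scheduling, retries, and monitoring around exactly this extract → transform → load skeleton.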

7. Big Data- The most important thing is to learn how to effectively use the tools available to manage and process large data sets. The most popular tools for this purpose include Hadoop, HDFS, MapReduce, Spark, Hive, and Pig.

8. Workflows- There are a few key concepts that are important to learn in order to create efficient and effective data engineering workflows. These include understanding how to extract, transform, and load data (ETL), as well as how to create and use data pipelines. Additionally, it is important to have a solid understanding of data warehousing and how to optimize data storage and retrieval.
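Workflow tools typically model a pipeline as a directed acyclic graph (DAG) of tasks. A minimal sketch of that idea, using Python's standard-library graphlib (Python 3.9+) with hypothetical task names:

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"transform"},
    "load": {"validate"},
}

# A topological sort yields an execution order that respects
# every dependency — the core scheduling idea behind workflow tools.
order = list(TopologicalSorter(dag).static_order())
print(order)  # ['extract', 'transform', 'validate', 'load']
```

Orchestrators such as Airflow layer scheduling, retries, and monitoring on top of this same dependency-ordering idea.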

9. Infrastructure- In infrastructure, data engineering refers to the process of designing, building, and maintaining data infrastructure. This includes the data warehouse, data lakes, data marts, and data pipelines that are necessary to support data-driven applications and analytics. Data engineers are responsible for ensuring that data is accessible, reliable, and scalable. They work with data architects to design and build data infrastructure, and with data scientists to optimize and tune it for performance.

10. Cloud Computing- Cloud computing makes it easier for businesses to work with large amounts of data by letting them store it in the cloud, a network of computers that can be accessed from anywhere in the world. Key things to learn include how to use cloud-based data storage and processing services, and how to manage and monitor cloud-based data systems. Additionally, it is important to be familiar with the different types of cloud computing architectures and how they can support data engineering workloads.

Practice Makes Perfect

Remember that practice makes perfect as you undertake your journey to be a data engineer. At each step of the learning journey take on projects to build your skills and your portfolio. Be kind to yourself when you get stuck, everyone’s journey is different!

References

  1. Glassdoor. (2022). Salaries: Data Engineer in Nairobi, Kenya. https://www.glassdoor.com/Salaries/nairobi-data-engineer-salary-SRCH_IL.0,7_IM1085_KO8,21.htm
  2. IDC. (2020, February 25). IDC Predictions. Data Ideology. https://www.dataideology.com/data/by-2025-idc-predicts-that-the-total-amount-of-digital-data-created-worldwide-will-rise-to-163-zettabytes-ballooned-by-the-growing-number-of-devices-and-sensors/
