What is Data Engineering?

Pubudu Dewagama · Published in Tributary Data · Nov 22, 2023

A data engineer frequently works with many types of data, performing a wide range of operations using whichever scripting or programming languages are relevant to their organization.

Types of data

Structured: Structured data typically comes from table-based source systems, such as a relational database, or from flat files, such as a comma-separated values (CSV) file. The defining feature of a structured file is that its rows and columns stay consistently aligned throughout the file.
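A minimal sketch of what that alignment means in practice, assuming a hypothetical sales.csv file: every row supplies a value for every column, so the file maps directly onto a table.

```python
import csv

# Hypothetical structured file. sales.csv might look like:
#   order_id,amount,country
#   1001,19.99,US
#   1002,5.00,GB
with open("sales.csv", newline="") as f:
    for row in csv.DictReader(f):   # each row maps the same columns to values
        print(row["order_id"], row["amount"], row["country"])
```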

Semi-structured: Semi-structured data is data such as JavaScript Object Notation (JSON) files, which may require flattening before it can be loaded into a tabular store. In its native, nested form, this data does not fit neatly into a table structure.
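A minimal sketch of flattening, using pandas on a made-up order record: the nested "customer" object and variable-length "items" list have no obvious row-and-column shape until json_normalize spreads them out.

```python
import pandas as pd

# Hypothetical semi-structured record: nested object plus a list of items.
orders = [
    {
        "order_id": 1001,
        "customer": {"id": "C01", "name": "Alice"},
        "items": [{"sku": "A1", "qty": 2}, {"sku": "B2", "qty": 1}],
    }
]

# json_normalize explodes the "items" list into one row per item and
# repeats the order- and customer-level fields on each row.
flat = pd.json_normalize(
    orders,
    record_path="items",
    meta=["order_id", ["customer", "id"], ["customer", "name"]],
)
print(flat)
```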

Unstructured: Unstructured data does not conform to typical relational models; it is often stored as key-value pairs or free-form documents. Other common types of unstructured data include Portable Document Format (PDF) files, word-processor documents, and photographs.

Data operations

Some of the key activities you’ll undertake in Azure as a data engineer include data integration, data transformation, and data consolidation.

Data integration: Data integration is the process of creating connections between data sources and operational or analytical services, so that data can be accessed safely and reliably across systems. For instance, a business process may depend on data that is spread across several disparate systems, and a data engineer is needed to connect those systems so the necessary data can be extracted from each one.
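A minimal sketch of the idea, assuming two hypothetical sources (an in-memory SQLite database standing in for an operational system, and a frame standing in for a flat-file export): the engineer's job is to join them so one process sees a single view.

```python
import sqlite3
import pandas as pd

# Source 1: a relational database (hypothetical customers table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id TEXT, name TEXT)")
conn.execute("INSERT INTO customers VALUES ('C01', 'Alice')")
customers = pd.read_sql("SELECT * FROM customers", conn)

# Source 2: a flat-file export, represented here as an in-memory frame.
web_events = pd.DataFrame({"customer_id": ["C01", "C02"],
                           "page": ["/home", "/pricing"]})

# Join the two sources on a shared key so downstream steps see one view.
combined = web_events.merge(customers, on="customer_id", how="left")
print(combined)
conn.close()
```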

Data transformation: Operational data typically needs to be transformed into a structure and format suitable for analysis. The classic pattern is extract, transform, and load (ETL), although a growing variation is extract, load, and transform (ELT), in which data is ingested quickly into a data lake and then transformed there using "big data" processing techniques. Whichever method is employed, the end result is data that is ready to support subsequent analysis.
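A minimal sketch of a transformation step, with hypothetical input columns: raw operational values are parsed, typed, and standardized into an analysis-friendly shape. In ETL this logic runs before loading; in ELT the same logic runs inside the target platform after the raw data has landed.

```python
import pandas as pd

# Hypothetical raw extract: strings with inconsistent types and casing.
raw = pd.DataFrame({
    "order_ts": ["2023-11-01 09:15:00", "2023-11-01 14:30:00"],
    "amount": ["19.99", "5.00"],
    "country": ["us", "US"],
})

transformed = raw.assign(
    order_ts=pd.to_datetime(raw["order_ts"]),   # parse timestamps
    amount=raw["amount"].astype(float),         # fix numeric types
    country=raw["country"].str.upper(),         # standardize country codes
)
print(transformed.dtypes)
```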

Data consolidation: Data consolidation is the process of merging data extracted from multiple sources into a standardized structure, typically to support analytics and reporting. Commonly, data is extracted from operational systems, transformed, and loaded into analytical repositories such as data lakes or data warehouses.
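A minimal sketch of consolidation under made-up system names: extracts from two operational systems use different column names, so both are mapped onto one standard schema before being combined.

```python
import pandas as pd

# Hypothetical extracts from two operational systems with differing schemas.
crm = pd.DataFrame({"CustID": ["C01"], "FullName": ["Alice"]})
billing = pd.DataFrame({"customer_no": ["C02"], "name": ["Bob"]})

# Map both extracts onto one standardized structure.
crm_std = crm.rename(columns={"CustID": "customer_id",
                              "FullName": "customer_name"})
billing_std = billing.rename(columns={"customer_no": "customer_id",
                                      "name": "customer_name"})

# One consolidated table, ready to load into a lake or warehouse.
customers = pd.concat([crm_std, billing_std], ignore_index=True)
print(customers)
```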

Key ideas in data engineering

Data engineers should be familiar with a few fundamental concepts, because many of the tasks they must implement and support are built on them.

Operational and analytical data: Operational data is the transactional data that applications generate and store, typically in a relational or non-relational database. Analytical data is data optimized for analysis and reporting, often stored in a data warehouse or data lake.

A data engineer’s primary duties include designing, implementing, and managing solutions that integrate operational and analytical data sources, or that extract operational data from multiple systems, transform it into a format suitable for analysis, and load it into an analytical data store (commonly called ETL solutions).
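A minimal end-to-end ETL sketch under hypothetical names, with in-memory SQLite standing in for both the operational system and the analytical store: extract the raw rows, transform them to the grain the analytical model needs, and load the result.

```python
import sqlite3
import pandas as pd

# Hypothetical operational system with a small orders table.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (order_date TEXT, amount REAL)")
source.executemany("INSERT INTO orders VALUES (?, ?)",
                   [("2023-11-21", 19.99), ("2023-11-21", 5.00),
                    ("2023-11-22", 12.50)])

def extract(conn):
    return pd.read_sql("SELECT * FROM orders", conn)

def transform(orders):
    # Aggregate to the grain the analytical model needs: daily revenue.
    return orders.groupby("order_date", as_index=False)["amount"].sum()

def load(daily, conn):
    daily.to_sql("fact_daily_revenue", conn, if_exists="replace", index=False)

target = sqlite3.connect(":memory:")   # stands in for an analytical store
load(transform(extract(source)), target)
```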

Streaming data: Streaming data refers to perpetual data sources that generate values in real time, often relating to specific events. Social media feeds and Internet-of-Things (IoT) devices are common sources of streaming data.

Data engineers frequently need to implement solutions that capture real-time streams of data and ingest them into analytical data systems, often combining the real-time data with other application data that is processed in batches.
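A hedged sketch of stream capture, assuming the azure-eventhub Python SDK; the connection string and hub name are placeholders, and a real solution would write each event on to an analytical store rather than print it.

```python
from azure.eventhub import EventHubConsumerClient

def on_event(partition_context, event):
    # One message from an IoT- or social-media-style feed; in practice
    # this would be persisted to an analytical data system.
    print(event.body_as_str())

client = EventHubConsumerClient.from_connection_string(
    conn_str="<EVENT-HUBS-CONNECTION-STRING>",  # placeholder
    consumer_group="$Default",
    eventhub_name="<HUB-NAME>",                 # placeholder
)

with client:
    # starting_position="-1" reads the stream from the earliest event.
    client.receive(on_event=on_event, starting_position="-1")
```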

Data pipelines: Data pipelines are used to orchestrate operations that transfer and transform data. Pipelines are the primary tool data engineers use to build repeatable extract, transform, and load (ETL) solutions that can be triggered on a schedule or in response to events.
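A minimal sketch of the pipeline idea in plain Python, with hypothetical step functions: the pipeline is an ordered set of activities invoked as one repeatable unit, which a scheduler (cron, an Azure Data Factory trigger, and so on) would run on a timetable or when an event such as a file landing occurs.

```python
import datetime

# Hypothetical stand-ins for real pipeline activities.
def ingest():    print("copy raw files from the source")
def transform(): print("clean and reshape the data")
def load():      print("write results to the analytical store")

PIPELINE = [ingest, transform, load]

def run_pipeline():
    print(f"pipeline run started {datetime.datetime.now():%Y-%m-%d %H:%M}")
    for step in PIPELINE:   # steps run in order; a failure stops the run
        step()

# A scheduler or event trigger would call this repeatedly.
run_pipeline()
```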

Data Lakes: A data lake is a storage repository that holds large volumes of data in its raw, native format. Data lake stores are optimized for scaling to enormous volumes of data (terabytes or petabytes). The data may be structured, semi-structured, or unstructured, and it usually originates from a variety of heterogeneous sources.

The goal of a data lake is to keep everything in its unaltered, original state. This approach differs from a traditional data warehouse, which processes and transforms data as it is ingested.
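A minimal sketch of landing raw data in a lake, with a local folder standing in for a store such as Azure Data Lake Storage: the payload is appended unaltered, and a date-partitioned folder layout keeps huge volumes navigable.

```python
import json
import datetime
from pathlib import Path

# Hypothetical raw telemetry record; no cleaning or schema is applied.
event = {"device": "sensor-7", "reading": 21.4}

# Date-partitioned layout: lake/raw/telemetry/2023/11/22/events.jsonl
today = datetime.date.today()
path = Path("lake/raw/telemetry") / f"{today:%Y/%m/%d}" / "events.jsonl"
path.parent.mkdir(parents=True, exist_ok=True)

# Append the record exactly as it arrived, in its original state.
with path.open("a") as f:
    f.write(json.dumps(event) + "\n")
```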

Data Warehouse: A data warehouse is a central repository of integrated data from one or more disparate sources. Data warehouses hold both current and historical data in relational tables, organized in a format designed to maximize performance for analytical queries.

Data engineers are responsible for designing and implementing relational data warehouses, as well as managing the regular loading of data into the tables.
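A hedged sketch of a warehouse-style table and a recurring load, using SQLite as a stand-in for a relational warehouse; the table and column names below are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")  # stand-in for a real warehouse

# A fact table organized for analytical queries: keys point at
# dimensions, and the measure column is additive for aggregation.
conn.execute("""
    CREATE TABLE IF NOT EXISTS fact_sales (
        date_key     TEXT,     -- references a date dimension
        product_key  INTEGER,  -- references a product dimension
        amount       REAL      -- additive measure
    )
""")

# A recurring load appends each new batch of transformed rows.
rows = [("2023-11-22", 1, 19.99), ("2023-11-22", 2, 5.00)]
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)
conn.commit()
conn.close()
```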

The core Azure technologies used to implement data engineering workloads include:

Azure Synapse Analytics
Azure Data Lake Storage Gen2
Azure Stream Analytics
Azure Data Factory
Azure Databricks

Thank you !!!
