Leveraging Data Engineering for Data-driven Decision Making

Solomon N.
Published in CodeX
Sep 8, 2021 · 5 min read

Data engineering as a field has been gaining popularity in recent years, especially in large tech companies. It focuses on the harvesting and application of big data. In any field, an engineer designs and builds things. It is no different with data engineering: data engineers design and build too.

But unlike a civil engineer, for instance, who designs and builds physical infrastructure, a data engineer designs and builds data pipelines. These pipelines transport and transform data into a format that data scientists can understand. Data engineers use the pipelines they create to extract data from many different sources and load it into a single warehouse where data scientists can analyze it.

So, we can define data engineering as all the processes involved in designing and building pipelines that harvest, transport and transform big data to make it available to data scientists for analytics. Data engineers design, develop, build, test, and maintain large processing systems and databases. In most cases, they deal with unvalidated raw data with lots of errors, improving its quality, efficiency, and reliability using specific technologies and tools.

Ultimately, data engineering exists to provide consistent and organized data flows that enable data scientists’ data-driven work, including:

  • Exploratory data analysis
  • Machine learning (ML) model training

Data engineers create mechanisms and interfaces that support the access and flow of information. Data engineering revolves around:

  • System architecture
  • Database design and configuration
  • Programming
  • Sensor and interface configuration

The Force That Powers Data Science

Because of the nature of data engineering, data engineers don’t get much praise or recognition; they work behind the scenes, away from attention-grabbing tasks such as drawing incredible insights from big data or querying big data sources. Yet their work is central to the success of data scientists: they are the ones who build the data stores that make those incredible insights possible.

Data engineers design databases, infrastructure for data analytics, and data warehouses/marts/lakes. They formulate the queries that data scientists and other data users will use to get the information they need from data. Therefore, a competent data engineer must understand data structures, databases, cloud environments, and hardware infrastructure.

Because they deal with both structured and unstructured data, data engineers must clearly understand the various approaches to data application and architecture. The data engineer’s toolkit includes open-source frameworks for data ingestion and processing.

Big technology companies produce petabytes of data and have had to develop ways of extracting helpful insights from these colossal volumes. Before they can get any insights from the data, they first have to find ways to store it reliably and to process and query its inflows. The data infrastructure must therefore be reliable, distributed, and scalable.

Given the astronomical volume of data involved, the tasks data engineers have to complete are far from trivial. Data engineers and data scientists make up the data teams that have become central parts of the technical teams in modern tech companies.

Inadequacy of Excel Spreadsheet Analytics

Data engineering revolves around taking data from a source and storing it so that it can be analyzed. For small companies, tracking results in their CRM, Google Analytics, application database, and perhaps a few other tools might suffice. Their analytics data pipeline is relatively straightforward to manage, and Excel spreadsheet analytics is enough.

But over time (sometimes within months in fast-growing startups), the inadequacy of this pipeline becomes apparent. The company finds itself wanting more insights from its data, especially as it adds more data sources and fields to track. It becomes increasingly necessary to create an automated system that can aggregate all the data from the different sources and make it available for analysis.

Automating Data Collection and Transformation

This is where data engineering begins: a data engineer comes in and automates this process of data extraction, transformation, and loading, creating an ETL (extract, transform, load) pipeline. The ETL pipeline typically extracts data over an API connection, transforms it by removing errors, standardizing formats, mapping matching record types to each other, and validating the results, and finally loads it into a database like MySQL. The data engineer then schedules a script so that the process repeats every week or month.
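What such a pipeline looks like varies widely from company to company, but here is a minimal sketch in Python, assuming a hypothetical orders API and a local MySQL target; the endpoint URL, credentials, and column names are all placeholders.

```python
import pandas as pd
import requests
from sqlalchemy import create_engine

# Hypothetical source API and target database -- adjust to your own setup.
API_URL = "https://api.example.com/v1/orders"                  # placeholder endpoint
DB_URI = "mysql+pymysql://user:password@localhost/analytics"   # placeholder credentials


def extract() -> pd.DataFrame:
    """Pull raw records from the source API."""
    response = requests.get(API_URL, timeout=30)
    response.raise_for_status()
    return pd.DataFrame(response.json())


def transform(raw: pd.DataFrame) -> pd.DataFrame:
    """Clean and validate the raw records."""
    df = raw.drop_duplicates(subset="order_id")             # remove duplicate records
    df["created_at"] = pd.to_datetime(df["created_at"])     # standardize the date format
    df = df.dropna(subset=["order_id", "amount"])            # drop rows that fail validation
    return df


def load(df: pd.DataFrame) -> None:
    """Write the cleaned records into the analytics database."""
    engine = create_engine(DB_URI)
    df.to_sql("orders", engine, if_exists="append", index=False)


if __name__ == "__main__":
    # Run on a schedule (e.g. cron or an orchestrator like Airflow) weekly or monthly.
    load(transform(extract()))
```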

The team can now access the data using business intelligence (BI) tools that represent the information in maps, vertical and horizontal bars, pie charts, etc. With convenient access to insights, the idea of leveraging data becomes entrenched into the company culture and the journey to becoming a data-driven enterprise begins.
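As a rough illustration of the kind of summary a BI tool would surface, the same hypothetical orders table from the sketch above could be aggregated and charted in a few lines of Python; a real team would more likely do this interactively in a tool like Looker, Tableau, or Metabase.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sqlalchemy import create_engine

# Placeholder connection string -- the same hypothetical analytics database as above.
engine = create_engine("mysql+pymysql://user:password@localhost/analytics")

# Pull the cleaned orders and aggregate revenue by month.
orders = pd.read_sql("SELECT created_at, amount FROM orders", engine,
                     parse_dates=["created_at"])
orders["month"] = orders["created_at"].dt.to_period("M").astype(str)
monthly = orders.groupby("month", as_index=False)["amount"].sum()

# A simple bar chart of monthly revenue.
monthly.plot(kind="bar", x="month", y="amount", legend=False)
plt.ylabel("Revenue")
plt.title("Monthly revenue")
plt.tight_layout()
plt.show()
```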

The marketing team can now track the whole sales funnel from when a potential customer visits the company’s website to the moment they buy its product or service. On the other hand, the product team can explore customer behavior while the management team reviews high-level KPIs.

Creating a Data Warehouse

But this is often short-lived, because standard ETL pipelines rely on transactional databases like MySQL that are not optimized for processing complex queries or heavy analytics workloads. It becomes necessary to create a data warehouse where the data can be consolidated, in an organized manner, from the many different sources from which the company gathers it.

Creating a warehouse helps the company structure its data in ways that are meaningful for analytics. Such a warehouse is designed to handle even the most complex analytics queries. Unlike standing up a standard database, designing a warehouse involves several sit-downs and iterations with the company’s different teams before arriving at the best design.
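One common outcome of those design iterations is a star schema: a central fact table of events surrounded by dimension tables that describe them. The sketch below is purely illustrative; the connection string, table names, and columns are hypothetical, and a real warehouse would more likely live in BigQuery, Redshift, or Snowflake than in a local Postgres instance.

```python
from sqlalchemy import create_engine, text

# Placeholder connection string for a local warehouse-style database.
engine = create_engine("postgresql+psycopg2://user:password@localhost/warehouse")

# A minimal star schema: one fact table referencing two dimension tables.
ddl = """
CREATE TABLE IF NOT EXISTS dim_customer (
    customer_id  INTEGER PRIMARY KEY,
    name         TEXT,
    signup_date  DATE
);
CREATE TABLE IF NOT EXISTS dim_product (
    product_id   INTEGER PRIMARY KEY,
    name         TEXT,
    category     TEXT
);
CREATE TABLE IF NOT EXISTS fact_sales (
    sale_id      INTEGER PRIMARY KEY,
    customer_id  INTEGER REFERENCES dim_customer (customer_id),
    product_id   INTEGER REFERENCES dim_product (product_id),
    sale_date    DATE,
    amount       NUMERIC(12, 2)
);
"""

# Execute each CREATE TABLE statement inside one transaction.
with engine.begin() as conn:
    for statement in ddl.strip().split(";"):
        if statement.strip():
            conn.execute(text(statement))
```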

Python’s Open-source Data Frameworks

In many companies today, Python data frameworks are the go-to way of creating data warehouses that connect to the different data sources, including BigQuery, Amazon S3, Snowflake, Amazon Redshift, Oracle, etc. In recent years, data scientists have shifted from using MATLAB in their research and analytics endeavors, instead preferring Python because of its excellent math libraries. By leveraging Python’s open-source data frameworks, companies can jumpstart their data engineering journey and become data-driven enterprises.
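For instance, with the google-cloud-bigquery client library (plus pandas installed for the DataFrame conversion), an analyst can pull warehouse data straight into Python; the project, dataset, and table names below are placeholders, and credentials are assumed to be configured in the environment.

```python
from google.cloud import bigquery

# Assumes GOOGLE_APPLICATION_CREDENTIALS (or equivalent) is configured;
# the project, dataset, and table names are hypothetical.
client = bigquery.Client(project="my-analytics-project")

query = """
    SELECT product_id, SUM(amount) AS revenue
    FROM `my-analytics-project.warehouse.fact_sales`
    GROUP BY product_id
    ORDER BY revenue DESC
    LIMIT 10
"""

# Run the query in BigQuery and load the result into a pandas DataFrame.
top_products = client.query(query).to_dataframe()
print(top_products)
```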
