Attempted Designing a Data Architecture

Joana Owusu-Appiah
Women in Technology
6 min read · Jul 4, 2023
designed in draw.io

The selling point of data analytics to me was data visualization. The fact that I could arrange graphs and charts in a single frame to tell a story fascinated me. I got a data engineering internship role just when I was settling into regularly designing reports and dashboards. My first task was to develop a data architecture for the company’s operations.

Data architecture? Wait, what?!

I spent several days freaking out and several hours making sense of all the information I gathered from experts and Google searches. Two things stood out from all the resources: I needed a good understanding of the company’s operations and data requirements, and a clear picture of its vision for becoming data-driven.

In this post, I will take you on my journey from gathering information to ultimately mimicking the data architecture templates I found online. Be my guest!

1. What is data architecture?
2. What makes up a data architecture?

Data Architecture

A company’s data architecture is a blueprint for managing its flow of data. The layout describes how data is managed from collection, through transformation and distribution, to consumption. A good design depends on the business, its vision, and its need to be data-driven, and should therefore support both its short-term and long-term goals.

There is no one-size-fits-all data architecture; however, the following components run through most designs:

Data collection

This includes systems, methods, and applications from which data originates within a company. Data can be collected through forms, surveys, questionnaires, and observation, among other means. Data obtained at this stage are usually prone to errors and require cleaning to make them useful.

The data can be collected into Excel sheets, CSV files, or SQL databases, and these collection points become data sources.
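To make this concrete, here is a minimal sketch in Python (pandas) of what reading from such collection points could look like. The file names, database, and table are made up purely for illustration (reading the Excel sheet also assumes the openpyxl package is installed).

```python
import sqlite3

import pandas as pd

# Hypothetical collection points; the file and table names are made up for illustration.
survey_df = pd.read_csv("survey_responses.csv")      # e.g. exported form/survey results
sales_df = pd.read_excel("sales_records.xlsx")       # e.g. an operations Excel sheet

with sqlite3.connect("operations.db") as conn:       # e.g. a transactional SQL database
    customers_df = pd.read_sql("SELECT * FROM customers", conn)

# Each of these DataFrames is now one data source feeding the rest of the architecture.
print(survey_df.shape, sales_df.shape, customers_df.shape)
```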

Data integration

Data integration involves all the processes needed to combine data from all collection points into a unified, centralized, and consistent format. At this point, the data is refined from the sources and transformed into a form that can be used by other professionals in the data food chain.

Data is integrated through an ETL or ELT process. ETL stands for Extract, Transform, and Load. It moves data from the source into a centralized destination storage point, which could be either a Data Lake or a Data Warehouse. (Read on for the difference between the two and for when ELT is used instead.)

  1. Extraction (E) involves moving data from the data source into a landing zone/schema. The landing zone is a temporary storage area where the data waits to be transformed.
  2. Transformation (T) readies the data for analysis. It includes cleaning and manipulating the data into a form fit for analysis.

Data cleaning activities include changing data formats (e.g., dates), deduplication (removing duplicate entries), and derivation (creating new columns from existing ones, e.g., generating a profit column from the sales and cost price columns).

NB: Transformation happens within the landing zone.

Data modeling also happens at this stage. Data modeling is the term for establishing connections (relationships) between datasets. For example, if separate datasets exist for customer transactions and customer biodata but both have a customer ID column, we can connect the two tables by joining them on that ID column.

data modeling creates connections between datasets such that we can access information across them. Picture from Power BI Sales sample data.

Essentially, during transformation, business rules are applied to the data to make them ideal sources for further analysis.

  3. Loading (L) finally moves the data into the destination store, the data warehouse. (A minimal sketch of all three steps follows below.)
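Here is a small sketch of the whole ETL flow in Python (pandas), pulling together the cleaning, derivation, and modeling described above. The column names (order_date, sales, cost, customer_id), the file paths, and the SQLite "warehouse" are all assumptions made for illustration, not the actual setup from my internship.

```python
import sqlite3

import pandas as pd

# --- Extract: pull raw data from the (made-up) sources into the landing zone ---
transactions = pd.read_csv("landing/customer_transactions.csv")
biodata = pd.read_csv("landing/customer_biodata.csv")

# --- Transform: clean the data and apply the business rules described above ---
transactions["order_date"] = pd.to_datetime(transactions["order_date"])  # fix date format
transactions = transactions.drop_duplicates()                            # deduplication
transactions["profit"] = transactions["sales"] - transactions["cost"]    # derivation

# Data modeling: join the two datasets on their shared customer ID column
modelled = transactions.merge(biodata, on="customer_id", how="left")

# --- Load: write the analysis-ready table into the destination store ---
with sqlite3.connect("warehouse.db") as conn:
    modelled.to_sql("fact_transactions", conn, if_exists="replace", index=False)
```

In a real pipeline the load step would target a proper warehouse (BigQuery, Redshift, Snowflake, and the like) rather than a local SQLite file, but the shape of the flow stays the same.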

Have you wondered where all these processes happen and who moves the data from one phase to the other?

Several technologies exist for scheduling and automating the ETL/ELT process, such as Apache Airflow, Luigi (a Python package), and Apache NiFi. The engineer schedules the process, and it is executed at those time intervals.
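As a rough illustration, here is what a daily schedule could look like as an Airflow DAG, assuming a recent Apache Airflow 2.x installation and its TaskFlow API. The task bodies are placeholders, not a real pipeline.

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2023, 7, 1), catchup=False)
def etl_pipeline():
    @task
    def extract() -> str:
        # Pull raw data from the sources into the landing zone (placeholder).
        return "landing/raw"

    @task
    def transform(landing_path: str) -> str:
        # Clean, deduplicate, and model the data (placeholder).
        return "staging/transformed"

    @task
    def load(staged_path: str) -> None:
        # Write the transformed data into the warehouse (placeholder).
        print(f"Loaded {staged_path} into the warehouse")

    load(transform(extract()))


etl_pipeline()
```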

Differences between a Data Lake and Data Warehouse

A Data Warehouse stores structured data, hence the need for transformation before loading. A Data Lake, on the other hand, stores all forms of data, from structured to semi-structured and unstructured data.

Examples of the kind of data a data lake holds include video streams, social media sentiment, real-time sensor readings, images, and so on.

Data lake integration goes through the ELT (Extract, Load, and Transform) route: the raw data is loaded first and awaits transformation as and when a need for it arises.
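A tiny sketch of the ELT idea, again with made-up paths and records: raw events are dumped into the lake as they arrive, and the transformation is written later, only when someone actually needs a report.

```python
import json
import pathlib

import pandas as pd

# --- Extract + Load: land raw, unprocessed records straight into the lake ---
lake = pathlib.Path("data_lake/raw_events")   # illustrative lake location
lake.mkdir(parents=True, exist_ok=True)
events = [{"sensor_id": 1, "reading": 21.4, "ts": "2023-07-04T10:00:00"}]  # sample record
(lake / "events_2023-07-04.json").write_text(json.dumps(events))

# --- Transform: written later, only when a use case for the data shows up ---
def build_sensor_report() -> pd.DataFrame:
    raw = json.loads((lake / "events_2023-07-04.json").read_text())
    df = pd.DataFrame(raw)
    df["ts"] = pd.to_datetime(df["ts"])
    return df
```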

A blend of the two is ideal for companies that generate all forms of data and have no specified use for all of them in the interim.

Governance

Data is now stored in our warehouse or lake. The data contains sensitive information, and several people within the company may need certain aspects of the data to execute their job roles.

Imagine calling a bank’s customer care center and asking questions about your bank account. The customer care personnel might not have access to your account details, because their role only requires them to guide you through specified bank processes, not to look into your account.

Data governance allows the sharing of information from the centralized repository to different stakeholders within a company, bearing in mind customer privacy, national standards and policies, rules, ethics, and compliance measures.

Data governance is an umbrella term that characterizes data access, data ownership, and traceability. It mainly seeks to establish accountability and authority in all aspects of data usage and storage.
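To picture what this looks like in practice, here is a hedged sketch of role-based access in Python. The roles, column names, and permission map are hypothetical; real governance is usually enforced in the warehouse or through a dedicated access-control layer rather than in application code like this.

```python
import pandas as pd

# Hypothetical role-to-column mapping: which parts of the customer data each role may see.
PERMISSIONS = {
    "customer_care": ["customer_id", "name", "preferred_branch"],
    "data_analyst": ["customer_id", "age_group", "region", "profit"],
    "data_engineer": ["*"],  # full access, e.g. for pipeline maintenance
}


def authorised_view(df: pd.DataFrame, role: str) -> pd.DataFrame:
    """Return only the columns the given role is allowed to consume."""
    allowed = PERMISSIONS.get(role, [])
    if "*" in allowed:
        return df
    return df[[col for col in allowed if col in df.columns]]
```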

Check out my post on data ethics if you want more information on data governance.

Data Consumption

Data engineers, data scientists, company executives, investors, and data analysts may all have different use cases for the same batch of collected data. Data consumption is the phase where all stakeholders, once granted access, can ‘tap’ into the data reserves for their needs. Structured data can be analyzed with business intelligence tools, while machine learning models can be built on the vast data collected within the data lake. This is the point where the data users get to ‘consume’ the data.

Specialized reports and dashboards can be prepared for the executives and investors, answering the very questions that drove them to establish a data architecture in the first place.
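For example, an analyst could pull a summary straight from the warehouse for a dashboard. This sketch assumes the hypothetical fact_transactions table from the ETL example above exists in a local SQLite file.

```python
import sqlite3

import pandas as pd

# Query the (hypothetical) warehouse table loaded earlier and summarise it for a dashboard.
with sqlite3.connect("warehouse.db") as conn:
    monthly_profit = pd.read_sql(
        """
        SELECT strftime('%Y-%m', order_date) AS month,
               SUM(profit)                   AS total_profit
        FROM fact_transactions
        GROUP BY month
        ORDER BY month
        """,
        conn,
    )

print(monthly_profit)  # a BI tool or chart can consume this summary directly
```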

Bonus!

Data analysis can be descriptive, diagnostic, predictive, or prescriptive.

i. Descriptive: explores historical data, highlighting certain business happenings in the past.

ii. Diagnostic: answers the “why” questions after detecting anomalies and inconsistencies within the dataset. This could be achieved by exploring other datasets and finding correlations between the parameters under scrutiny.

iii. Predictive: gives a projection of a future occurrence. The analyst proposes actionable insights that can be implemented to prepare for that future.

iv. Prescriptive: shows the company how it can leverage predicted outcomes.

A data architecture shows how data will be managed from entry to exit within a company. I designed the image below in Draw.io :D (If you have not noticed, I love to design images in Canva, especially when I cannot find images that carry the sentiments I want to communicate. Draw.io was an additional tool my friend introduced me to. It made the architecture look professional. I added the colors because the plain version was too monotonous).

high-level architecture designed to simplify the entire concept!

Resources

  1. People (Data Engineers, Data Analysts)
  2. IBM — What is a Data Architecture
  3. Airflow Alternatives
  4. Data Architecture 101 for your business (YouTube)
  5. Describes the data warehouse structure
  6. Amazon data warehouse notes
