Build a data architecture from 0

When theory meets best practices

Published in Ring Capital, Dec 7, 2021

Data lake, on-premise infrastructure, BI engineer, and so on: yes, there is a lot of intimidating data jargon. Nothing new under the sun, you might say. Data has been in the headlines for a few years now, to the point that it is tempting to believe we have fully mastered it.

Although the subject is well known, most companies still struggle to lay the foundations of a data architecture that leads to efficient business use. Let’s talk about this with Andy Barakat, Lead Data Engineer at Phenix. Drawing on his experience as a Data Analyst and Scientist at Stuart, Andy gives us valuable keys to harness data and make it a real business lever.

What is the data?

On paper, data is really cool. Once the wow effect is over, we still have to ask ourselves some questions before setting up a data architecture:

What is the source of the data? Do you have your own DB or do you use APIs to collect the data?
What is the use of it?
How can it be accessed?
What are the properties of the data? Raw or transformed data? Structured or unstructured data? Static or real-time data?

Are you still with us? Good.

To begin, you really have to start from the need. In fact, the data architecture must be seen as a product in its own right. Who are the users of this architecture? You, internally? Your customers? Which KPIs do you want to derive from the data? What are the business issues behind it?

If these questions seem abstract, think again. Choosing live versus historical data has a consequence not only on the choice of tools, but also on the business!

How to store data?

Once you have identified your data sources, the question of storage appears. You have several options:

store and manage your data in the cloud;
use your own servers (on-premise).

At this point, you will tell us that there is not much new. However, this is a really important step, especially for governance and sovereignty issues.

By relying on the cloud, you can:

avoid managing the data storage infrastructure;
scale without altering your own infrastructure;
count on very high data availability (rarely a literal 100%) and an automatically maintained system;
connect to different data streams #interoperability

With an on-premise infrastructure, you have total control, which means more custom features, more transparency on costs, and also more confidentiality!

Who does what in a Data team?

I might as well warn you: it is the data flow that defines the roles in the team.

In a nutshell:

1/ The collection, storage and complex transformation of data are usually handled by a Data Engineer, in order to give the data meaning for the business. This is clearly the most important step, because collecting “clean” data influences the whole chain of data use.

Phenix collects the unsold goods of supermarkets, EAN by EAN (the EAN is the product identifier code). From a business point of view, this raw information is meaningless. Data Engineers need to cross-reference it with databases of store items, and call on open-source databases (like OpenFoodFacts) or paid databases, to enrich it for the business (product categorization: store department, ecological score, composition, etc.).
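To make the enrichment step concrete, here is a minimal sketch of what "crossing" unsold goods with a product database could look like. Everything in it is an assumption made for illustration: the EAN values are made up, `CATALOG` is a hypothetical stand-in for a store item database or an open database, and `enrich` is not Phenix's actual code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EnrichedItem:
    ean: str
    quantity: int
    name: Optional[str]
    category: Optional[str]


# Hypothetical product catalog keyed by EAN (made-up codes and products).
CATALOG = {
    "0000000000017": {"name": "Hazelnut spread", "category": "Grocery"},
    "0000000000024": {"name": "Spring water 1.5L", "category": "Beverages"},
}


def enrich(unsold_rows, catalog):
    """Join each unsold row with the catalog on its EAN.

    Rows whose EAN is unknown are kept, with name and category set to None,
    so that nothing collected is silently dropped.
    """
    enriched = []
    for row in unsold_rows:
        product = catalog.get(row["ean"], {})
        enriched.append(
            EnrichedItem(
                ean=row["ean"],
                quantity=row["quantity"],
                name=product.get("name"),
                category=product.get("category"),
            )
        )
    return enriched


# Made-up unsold goods as they might be collected, EAN by EAN.
unsold = [
    {"ean": "0000000000017", "quantity": 4},
    {"ean": "0000000000099", "quantity": 1},  # not in the catalog
]
items = enrich(unsold, CATALOG)
```

Keeping the unknown EANs rather than discarding them is a design choice: it lets the team see which products still lack a match and need another data source.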

2/ The modeling of data into business insights is done by a Data Analyst.

At Phenix, this means providing better analysis to clients to help them reduce food waste, by identifying which types of products to give away or sell, and at what discount.

3/ Finally, the optimization of the data. This is where Machine Learning and AI come into play, and with them the Data Scientist!

Hiring time!

When starting a data approach, is it necessary to recruit the whole team or can a versatile profile suffice?

It all depends on the seniority of the person! However, be careful: it is difficult to have one person implement the entire data architecture. Depending on the profile, a CTO may play the role of Data Engineer, or a Data Analyst may be experienced enough to do so. As you can see, there are no real rules!

Focus on the Data Engineer: the core of the Data Galaxy

In broad terms, a Data Engineer analyzes all data streams and:

> builds the data pipelines to centralize the data;

> finds solutions to store it efficiently;

> provides the tools that allow data analysts and scientists to query and explore the data;

> implements processes for GDPR compliance.

Depending on the organization, a Data Engineer may also carry the title Product, ML or Platform Engineer. In short, these are Data Engineers too, but more specialized. Platform Engineers specialize in:

facilitating the connection between different tools such as CRM, ERP and proprietary infrastructures;
putting the data science algorithms into production.

There is also a hybrid profile that can be a real plus: the BI Engineer. Focused on modeling data, the BI Engineer also creates tables that make the data digestible from a business point of view, hence the BI, for Business Intelligence.

What is data architecture?

Let’s get to the heart of the matter! A data architecture allows you to:

> collect and aggregate all the data used;

> analyze and exploit this data;

> easily interface with third-party tools that require access to data;

> facilitate sharing between different actors.

The data path

Before arriving in a data lake, the data is extracted from its sources. At this stage, it may or may not already be transformed, so the lake holds layers of both raw and transformed data.

The data’s home is called the data warehouse. It is roughly the final destination before use.
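The path from raw layer to warehouse can be sketched in a few lines. This is only an illustration under stated assumptions: the JSON records are made up, the `unsold` table and its columns are hypothetical, and an in-memory SQLite database stands in for a real data warehouse.

```python
import json
import sqlite3

# Raw layer: made-up records kept exactly as they were extracted
# (note that quantities are still strings at this point).
raw_layer = [
    '{"ean": "0000000000017", "qty": "4", "store": "A"}',
    '{"ean": "0000000000017", "qty": "2", "store": "B"}',
    '{"ean": "0000000000024", "qty": "1", "store": "A"}',
]


def transform(raw_lines):
    """Parse each raw JSON line and cast its fields to proper types."""
    for line in raw_lines:
        record = json.loads(line)
        yield record["ean"], int(record["qty"]), record["store"]


# An in-memory SQLite database stands in for the data warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE unsold (ean TEXT, quantity INTEGER, store TEXT)"
)
warehouse.executemany(
    "INSERT INTO unsold VALUES (?, ?, ?)", transform(raw_layer)
)
warehouse.commit()

# Once loaded, the data is ready to be queried by analysts.
totals = warehouse.execute(
    "SELECT ean, SUM(quantity) FROM unsold GROUP BY ean ORDER BY ean"
).fetchall()
```

The raw layer is left untouched by the transformation, which is what makes it possible to replay or change the transformation later without extracting the data again.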

One interesting practice, reverse ETL, is really worth exploring. It lets you send the transformed data back to the source tools. For example, at Phenix, the churn score is injected into Intercom to drive communications with customers. With a high-performance stack, we can therefore put the data back to work!

As you will have understood, data is not only a tech story. To activate it from a business point of view, the whole team must be involved!
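A reverse ETL step can be sketched roughly as follows. This is a hedged illustration, not Phenix's implementation and not the Intercom API: `reverse_etl`, the payload field names and the customer IDs are all hypothetical, and the `send` callable is a placeholder for whatever client the target tool actually provides.

```python
def reverse_etl(rows, send):
    """Push each (customer_id, churn_score) row to an external tool.

    `send` is any callable that accepts one payload dictionary. In a real
    setup it would wrap the target tool's client; here it is a placeholder.
    Returns the number of payloads pushed.
    """
    pushed = 0
    for customer_id, churn_score in rows:
        # Map warehouse columns to the attribute names the tool expects
        # (these names are assumptions made for the example).
        send({"customer_id": customer_id, "churn_score": churn_score})
        pushed += 1
    return pushed


# Made-up churn scores, as they might be read from the warehouse.
scores = [("c1", 0.91), ("c2", 0.15), ("c3", 0.72)]

# A plain list plays the role of the external tool for this sketch.
outbox = []
count = reverse_etl(scores, outbox.append)
```

Passing `send` as a parameter keeps the sync logic independent of the destination, so the same function could feed a CRM, a support tool or a test list.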

And you, what is your data organization?