The Data Engineering Chicken & Egg Problem

Hussein Danish
Published in SSENSE-TECH
Aug 22, 2019

How To Navigate The Data Lake Roadmap

DILBERT © Scott Adams. Used by permission of ANDREWS MCMEEL SYNDICATION. All rights reserved.

Stop me if you’ve heard this before: an engineer tries to gather requirements in order to start a task, but the stakeholder wants to know what the product will be able to do before giving any requirements. This story is perfectly captured by the above Dilbert comic. The comic focuses on the software industry, but the same dynamic applies to the data ecosystem, where data itself is the product being delivered and data engineers are the ones building the pipelines to support it.

In this context, a typical conversation might go something like this: a data engineer discusses the ongoing data lake efforts with a stakeholder and tries to gather requirements for a data-driven feature; the stakeholder replies by asking what data they can explore; the data engineer responds that any data can be obtained, it’s just a matter of figuring out what specifically is needed.

This type of miscommunication highlights a few problems faced by data engineers. These will vary depending on your own organization and its level of maturity in data and analytics.

Data and analytics maturity levels (table published by Gartner)

When you are launching a data lake initiative, there is a lot of groundwork to be done to ensure that data pipelines can be automated, tested, and deployed into a production environment. Those pipelines then need to scale to handle the data generated by your system’s architecture. At SSENSE, we have an event-based microservice architecture, so data pipelines need to handle spikes in website traffic and the ensuing surge of events they generate. In parallel, there needs to be a clear direction as to which kinds of data pipelines to prioritize.
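To make this concrete, here is a minimal sketch in Python of the kind of batching a pipeline consumer might use to absorb traffic spikes. The event stream and the `load_batch` sink are hypothetical stand-ins for illustration, not a description of our actual pipelines.

```python
import time
from typing import Callable, Iterable

def consume(events: Iterable[dict], load_batch: Callable[[list], None],
            max_batch_size: int = 500, max_wait_seconds: float = 5.0) -> None:
    """Buffer incoming events into batches so downstream storage sees a
    steady write rate even when upstream traffic spikes."""
    batch: list = []
    last_flush = time.monotonic()
    for event in events:
        batch.append(event)
        full = len(batch) >= max_batch_size
        stale = time.monotonic() - last_flush >= max_wait_seconds
        if full or stale:
            load_batch(batch)  # hypothetical sink, e.g. a bulk insert
            batch = []
            last_flush = time.monotonic()
    if batch:  # flush whatever remains when the stream ends
        load_batch(batch)
```

During a traffic spike the batches simply fill up faster, so the sink is called at roughly the same cadence instead of once per event.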

One of the main goals of the data engineering discipline is to support data science and its initiatives. As I’ve previously discussed, one of the main roles of a data engineer is to produce clean, consistent, and easily accessible data for the relevant stakeholders, namely data scientists. Moreover, the data science workflow involves the key step of asking the right questions. Therefore, it would seem only natural that once the right questions are asked, a specific need for data will arise, and with that, a clearer understanding of what specific data is required.

And so, perhaps the engineer is right after all: the chicken comes before the egg! Yet… there is also a perspective from which the egg comes first. Organizations in the early stages of data lakes and similar data initiatives must be aware that their data sources are not all equally valuable. Any data that can be directly translated into a dollar amount must take precedence. In a retail environment, this means any data relating to the goods being sold is paramount, whether it concerns orders, shipments, receipts, or the inventory itself.

In situations such as those described above, it’s less about asking the right questions and more about having the minimum viable data in order to even be in a position to ask these questions. This step is akin to a data scientist performing some exploratory data analysis prior to formulating a hypothesis. So perhaps the stakeholder who is asking for data is right after all, and the egg does come first.
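In this light, “minimum viable data” might mean just enough rows landed somewhere queryable to run a first profiling pass. The sketch below shows what such a pass could look like in Python with pandas; the file name and dataset are hypothetical.

```python
import pandas as pd

# First look at a newly landed dataset: size, completeness, basic stats.
orders = pd.read_csv("orders_sample.csv")   # hypothetical extract
print(orders.shape)                         # how much data do we have?
print(orders.isna().mean().sort_values())   # which fields are reliably populated?
print(orders.describe(include="all"))       # ranges, cardinality, obvious outliers
```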

Of course, the reality of the situation will vary greatly depending on which phase of building out a data ecosystem your organization finds itself in. As a data engineer, the requirements might not always be clear at inception. However, if your data pipelines are built with good software engineering practices, high test coverage, and a focus on user accessibility, handling changes in requirements becomes a lot easier.
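As a sketch of what those practices buy you, consider keeping transformation logic in pure functions. The example below uses made-up field names; because the function does no I/O, it is trivial to unit test, and a changed requirement becomes a changed test plus a small edit.

```python
from datetime import datetime

def transform_order(raw: dict) -> dict:
    """Normalize a raw order event into the schema exposed to analysts.
    Pure function: no I/O, so it can be tested in isolation."""
    return {
        "order_id": str(raw["id"]),
        "total_cents": int(round(float(raw["total"]) * 100)),
        "placed_at": datetime.fromisoformat(raw["created_at"]).isoformat(),
    }

def test_transform_order():
    raw = {"id": 42, "total": "19.99", "created_at": "2019-08-22T10:00:00"}
    assert transform_order(raw) == {
        "order_id": "42",
        "total_cents": 1999,
        "placed_at": "2019-08-22T10:00:00",
    }

if __name__ == "__main__":
    test_transform_order()
```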

It’s very important to give your end-users exposure to the data as soon as possible because you need their feedback in order to make sure your product is having its intended impact. If you wait too long, you may accumulate crippling technical debt, and that could ultimately lead to a failed project. On the other hand, by having your stakeholders involved on a use case basis, you can incrementally build trust and produce value by delivering relevant and high-quality data.

Editorial reviews by Deanna Chow, Liela Touré & Prateek Sanyal

Want to work with us? Click here to see all open positions at SSENSE!

