Data Engineering Headwinds

Published in Adfolks · 4 min read · Sep 15, 2021

A few things that hold back data platform modernization.

Data engineering is one of the most invisible yet important layers in the software stack. Software engineers around the globe have developed great pieces of software for data engineering purposes, and different companies have different data software needs. Technology has now advanced to the point where you can build a data platform in minutes (thanks to the cloud vendors). Still, various problems hold back the development of the data stack.

“Modern tools come with modern problems”

TL;DR
Stagnating factors are:
* Use-case definitions: No proper use cases defined for the data.
* Data interfaces: No well-defined interfaces for extracting data.
* Data invisibility: Lack of a global data catalog and governance.
* Tools: No streamlined set of tools.

Some of the key problems that enterprises across multiple industries face are:

Note: The items below may or may not be slowing down your current data engineering development, but these are my common findings.

Use cases

Yes, you read that right: use cases! Many enterprises lack an in-house use-case development team. It is critical that each company has its own data analytics or data science department that builds analytical and data science use cases for its own business needs. This team should work closely with both the business and data engineering teams. The business team sets the priority of each use case and maps the use cases to the different channels of the business.
There should also be tight collaboration with the data engineering team, so that it can enhance the decision support systems to serve the use cases.
Another important aspect of use cases is revisiting them from time to time. Business data may grow unpredictably, so revisiting and enhancing use cases can be very rewarding. Enhancements may include converting batch use cases to real-time or near-real-time ones.
Machine learning is becoming the norm in every industry as a way to increase efficiency and improve processes. So start defining small ML use cases that may help you increase internal process efficiency or serve your customers better.
The main factor that blocks use-case development is the lack of data visibility for the analytics/DS team, so we need a powerful organization-wide data catalog.

Data Interfaces

Every company runs many different processes and applications. These applications may be legacy (old) or modern. If your applications are built on a modern stack, you will not face much of a problem extracting the data. But if you have 10- to 20-year-old software, extracting the data from those systems is a bit of a hassle.
The hassle of extraction is inevitable, but it should be dealt with once, in a reusable way, rather than repeated in every pipeline. In other words, the extraction of data from a legacy system should be abstracted behind a service layer (if it does not already have one). These services act as data plugs: you extract the data by calling the services you need. This avoids rewriting the same extraction steps again and again in data engineering ETL pipelines.
Data plugs can be built in multiple ways, such as push and pull mechanisms. A push-based data plug triggers when an event occurs and sends the data to the destination processing systems. This further reduces the time and effort of writing ETL pipelines.
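To make the idea of a data plug a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `DataPlug` class, its `pull`, `subscribe`, and `emit` methods, and the `read_legacy_orders` stand-in are hypothetical names, not part of any real library or of a specific system described above.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]


class DataPlug:
    """Hypothetical wrapper that hides how a legacy system is read."""

    def __init__(self, name: str, fetch: Callable[[], Iterable[Record]]) -> None:
        self.name = name
        self._fetch = fetch  # legacy-specific extraction logic lives here
        self._subscribers: List[Callable[[Record], None]] = []

    def pull(self) -> List[Record]:
        """Pull mode: the consumer asks for the current records."""
        return list(self._fetch())

    def subscribe(self, handler: Callable[[Record], None]) -> None:
        """Register a downstream handler for push mode."""
        self._subscribers.append(handler)

    def emit(self, record: Record) -> None:
        """Push mode: called when an event occurs in the source system."""
        for handler in self._subscribers:
            handler(record)


def read_legacy_orders() -> Iterable[Record]:
    # Stand-in for a real legacy extraction (file dump, old database, etc.).
    return [{"order_id": 1, "amount": 120.0}, {"order_id": 2, "amount": 75.5}]


orders_plug = DataPlug("orders", read_legacy_orders)

# Pull: a batch pipeline reads everything in one call.
batch = orders_plug.pull()

# Push: a consumer receives each new event as it happens.
received: List[Record] = []
orders_plug.subscribe(received.append)
orders_plug.emit({"order_id": 3, "amount": 42.0})
```

The point of the sketch is that pipelines only ever talk to `pull` or `subscribe`; the messy, system-specific part stays inside the function handed to the plug, so it is written once.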
One bonus benefit of abstracting the extraction layer with APIs is that it gives more control over access management through API management tools.
All these abstracted data plugs should have descriptions and data profiling done, so that consumers know the health of the data.
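As a rough illustration of what "data profiling" could mean for a data plug, the sketch below computes two simple health indicators per field: the share of missing values and the number of distinct values. The `profile` function and the sample records are hypothetical, and a real profile would likely cover much more (types, ranges, freshness).

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[object]]


def profile(records: List[Record]) -> Dict[str, Dict[str, float]]:
    """Return the null ratio and distinct count for every field seen."""
    fields = sorted({key for record in records for key in record})
    total = len(records)
    report: Dict[str, Dict[str, float]] = {}
    for field in fields:
        # A field missing from a record is treated the same as None.
        values = [record.get(field) for record in records]
        non_null = [v for v in values if v is not None]
        report[field] = {
            "null_ratio": (total - len(non_null)) / total if total else 0.0,
            # Assumes values are hashable (numbers, strings, and so on).
            "distinct": len(set(non_null)),
        }
    return report


sample = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": None},
    {"customer_id": 2, "email": "b@example.com"},
    {"customer_id": 3},
]
report = profile(sample)
```

Publishing a small report like this next to each data plug's description lets a consumer judge at a glance whether a field is complete enough for their use case.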
As the organization grows, these data plugs can form an internal data marketplace, so that inter-departmental use cases can be fulfilled smoothly [fewer Zoom calls *wink*].

Data Invisibility

Currently, many organizations have no visibility into their data lakes, data warehouses, and data stores.
A central data cataloging system is required, one that gives visibility not just to a single vertical but to the entire organization.
Having a better cataloging system would enhance the use-case definition process.
On top of the data catalog, a proper data policy should be in place.
Finally, an organization-level data catalog helps achieve better data unification, such as unique customer identification across the organization.
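A central catalog can start out very small. The sketch below is an in-memory toy, assuming hypothetical `DatasetEntry` and `DataCatalog` classes and made-up dataset names and locations. It only shows the core idea: every vertical registers its datasets in one place, and anyone can search across all of them.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DatasetEntry:
    name: str
    owner: str        # owning team or vertical
    location: str     # where the data lives (made-up URIs below)
    description: str
    tags: List[str] = field(default_factory=list)


class DataCatalog:
    """Toy organization-wide catalog kept in memory."""

    def __init__(self) -> None:
        self._entries: Dict[str, DatasetEntry] = {}

    def register(self, entry: DatasetEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> List[str]:
        """Return names of datasets whose name, description, or tags match."""
        needle = keyword.lower()
        return sorted(
            e.name
            for e in self._entries.values()
            if needle in e.name.lower()
            or needle in e.description.lower()
            or any(needle == t.lower() for t in e.tags)
        )


catalog = DataCatalog()
catalog.register(DatasetEntry(
    "crm_customers", "sales", "warehouse://crm/customers",
    "Customer master records from the CRM", ["customer", "pii"],
))
catalog.register(DatasetEntry(
    "web_orders", "ecommerce", "lake://orders/web",
    "Orders placed by each customer on the website", ["orders"],
))
catalog.register(DatasetEntry(
    "plant_sensors", "operations", "lake://iot/sensors",
    "Sensor readings from plant equipment", ["iot"],
))

# An analyst looking for customer data finds datasets from two verticals.
matches = catalog.search("customer")
```

Here a search for "customer" surfaces datasets owned by two different teams, which is the first step toward the kind of customer unification mentioned above.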

Tools

The main problem with tooling is that there are so many tools. Not only that, there are too many great tools, and so many options cause confusion and block data integration between systems.
Define tool practices in your organization for each purpose, for example by mapping tools to the data extraction type, such as batch, stream, APIs, etc.
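One lightweight way to write such a practice down is a single lookup that every team shares. The sketch below assumes a hypothetical `APPROVED_TOOLS` mapping with placeholder tool names; substitute whatever your organization actually standardizes on.

```python
from enum import Enum


class ExtractionType(Enum):
    BATCH = "batch"
    STREAM = "stream"
    API = "api"


# Placeholder tool names, not real products.
APPROVED_TOOLS = {
    ExtractionType.BATCH: "batch-tool",
    ExtractionType.STREAM: "stream-tool",
    ExtractionType.API: "api-tool",
}


def tool_for(extraction_type: str) -> str:
    """Return the organization's approved tool for an extraction type."""
    try:
        return APPROVED_TOOLS[ExtractionType(extraction_type.lower())]
    except ValueError:
        raise ValueError(
            f"No approved tool for extraction type: {extraction_type!r}"
        ) from None


chosen = tool_for("Stream")
```

Keeping the mapping in one place means a new pipeline starts from an agreed default instead of a fresh tool evaluation.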
This would also help streamline training and skills development for your people.

Conclusion

What I have mentioned above are a few of the problems that are caused by technology and needs. But there are more factors beyond the technical drivers, such as organizational policies, structure, etc. So a large organization should have a proper foundation for its software development stack that embraces change.