The evolution of working with Data

Celman Elden D. Sudaria
Published in ATCP Spark · 7 min read · Jan 10, 2021
Hieroglyphs (Photo by Lady Escabia from Pexels)

We have been working with data since the first humans jotted down the number of people in their tribe on a cave wall, or recorded how many day-and-night cycles passed before a full moon reappeared. Fast forward a bit, and ancient empires began to record the volume of their harvests and the taxes they collected from their subjects and provinces.

In the exercise of recording facts, i.e. data, humans saw patterns, such as the recurring number of days between full moons, and they used this cyclical pattern as their (lunar) calendar to determine the best time to harvest their crops before a flood came.

You could say that humans were doing “some form of analytics” a very, very long time ago, and in a way, you would be correct.

Analytics 1.0 — Business Intelligence
Over the years, humans got more sophisticated… we built computers that can do very fast calculations and, at the same time, store vast amounts of data in databases. In these databases, we devised structures that allow us to store data effectively and read it efficiently. This gave rise to the concept & construct of the data warehouse (DW), the operational data store (ODS) and the data mart.

With a data warehouse, we are able to aggregate and analyze structured data (sales, profits, etc.) across different dimensions. These analyses and aggregations need to be communicated to different audiences, and reports or dashboards are the common ways to do so. This gave rise to Business Intelligence (BI). BI is very effective and allows enterprises & individuals to know their current & historical performance by helping them answer the questions: “what happened?” and “what is happening?”.
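The core BI operation described above, rolling a measure up along a chosen dimension, can be sketched in a few lines of plain Python. The fact records and field names here are hypothetical illustrations, not from any real warehouse:

```python
from collections import defaultdict

# Hypothetical fact records, as they might sit in a warehouse fact table.
sales = [
    {"region": "North", "quarter": "Q1", "amount": 1200.0},
    {"region": "North", "quarter": "Q2", "amount": 950.0},
    {"region": "South", "quarter": "Q1", "amount": 800.0},
    {"region": "South", "quarter": "Q2", "amount": 1100.0},
]

def aggregate(rows, dimension):
    """Roll up the 'amount' measure along one dimension, BI-style."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[dimension]] += row["amount"]
    return dict(totals)

print(aggregate(sales, "region"))   # answers "what happened?" by region
print(aggregate(sales, "quarter"))  # the same facts, sliced another way
```

The same facts can be sliced by any dimension, which is exactly what a BI dashboard does when a user pivots a report.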

Business Intelligence is still and will continue to be a very good approach to analyzing structured data.

Analytics 2.0 — Big Data
As the Internet grew, new digital services such as social media, the Internet of Things & the sharing economy emerged alongside the proliferation of mobile and wearable devices, and this combination of services & technologies generated vast amounts of data. Businesses realized the value of their own data, and even more so when it is enriched with data external to their organization, so they too started to store & analyze data. This fed into the challenge of harnessing Big Data and gave rise to the concept & construct of the data lake.

With a data lake, data can be stored in its raw format for exploration and advanced analysis by business analysts and data scientists. Different use-case-based schemas can be built on top of data from different sources. The BI tools leveraged in Analytics 1.0 remain very effective in Analytics 2.0, enhanced & complemented by data exploration tools and analytics web applications that can use predictive models available via APIs. Complex event processing (CEP) solutions can also enable interesting new services and use cases for streaming data.

Analytics 3.0 — Any Data Anytime, Anywhere
Fast forward to today, and most enterprises have embraced the value of data for growing and differentiating their organizations, but some have been disillusioned by failed data lake implementations. Business users are often unable to use the data in the data lake: they do not know what data is in it, they cannot search for the data they need, and even when they can, they do not know which data is correct to use.

In recent years (i.e. from 2018 to 2021), several approaches, technologies and architectures have been introduced to address these challenges and to maximize the opportunities offered by Cloud-based platforms and solutions. One of these is the concept and construct of a data hub.

In a data hub, data from different sources across the organization is integrated and ‘harmonized’ into a hub. Additional components for metadata capture & management, data discovery (e.g. a data catalog) & indexing, and advanced analytics are then added so that business users and data scientists can search for the data, information or reports they need to do their work.
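The data-discovery piece of a data hub can be sketched as a searchable catalog of dataset metadata. Everything below — the dataset names, owners and tags — is a hypothetical illustration, not a real catalog API:

```python
# A minimal catalog: one metadata entry per harmonized dataset in the hub.
catalog = [
    {"name": "sales_harmonized", "owner": "finance", "tags": ["sales", "revenue", "curated"]},
    {"name": "web_clickstream_raw", "owner": "marketing", "tags": ["web", "events", "raw"]},
    {"name": "customer_master", "owner": "crm", "tags": ["customer", "curated"]},
]

def search_catalog(keyword):
    """Return datasets whose name or tags mention the keyword."""
    keyword = keyword.lower()
    return [
        entry["name"]
        for entry in catalog
        if keyword in entry["name"].lower() or keyword in entry["tags"]
    ]

print(search_catalog("curated"))  # business users find vetted data first
```

Real data catalogs add lineage, access control and full-text indexing on top, but the essential service is the same: letting users find the right data before they query it.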

The diagram below shows a sample of the common building blocks of a data hub.

Generic building blocks of a data hub

With a data hub, business users, data analysts and data science practitioners are better enabled to do their jobs effectively and efficiently: data is “discoverable”, queries run faster because of indexing, and there is an environment where data science practitioners can explore and “experiment” with the data.

Another architecture pattern that has been gaining popularity recently is the data lakehouse. According to Databricks, “A data lakehouse is a new, open data management paradigm that combines the capabilities of data lakes and data warehouses, enabling BI and ML on all data.”

In many of the Cloud data platform projects I have worked on, this data lakehouse architecture pattern (i.e. a data lake paired with a data warehouse) keeps recurring. We use the data lake to load data in its native form, without any changes, and then, for relevant and applicable data (usually structured, enterprise data), we create a schema in the data warehouse depending on the use case. This is the concept of schema-on-read applied: structure is imposed only when the data is read for a particular purpose, not when it lands in the lake.
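Schema-on-read can be illustrated with a minimal sketch: raw records land in the lake as-is, and a use-case schema (field selection and type casting) is applied only at read time. The record layout and field names here are invented for illustration:

```python
import json

# Raw records land in the lake exactly as produced — no schema enforced on write.
# Note the second record even carries an extra field the first one lacks.
raw_landing_zone = [
    '{"id": "1", "amount": "19.99", "ts": "2021-01-10"}',
    '{"id": "2", "amount": "5.50", "ts": "2021-01-11", "promo": "NEWYEAR"}',
]

def read_with_schema(raw_lines):
    """Apply a use-case schema only at read time (schema-on-read):
    keep just the fields this use case needs and cast their types."""
    for line in raw_lines:
        record = json.loads(line)
        yield {"id": int(record["id"]), "amount": float(record["amount"])}

for row in read_with_schema(raw_landing_zone):
    print(row)
```

Because the lake never rejects a record for not matching a schema, different use cases can read the same raw data through different schemas later.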

Generic sample of data flow in a data lakehouse pattern

Then there is data mesh and its four principles: (1) domain-oriented decentralized data ownership and architecture, (2) data as a product, (3) self-serve data infrastructure as a platform, and (4) federated computational governance. While other data architectures advocate centralizing data in one form or another, data mesh recommends, among other things, decentralizing it.

Data mesh builds upon other data platform architectures and addresses their limitation of requiring data to be centralized. While it promotes domain-oriented decentralized data, it balances that with a core/foundational data infrastructure as a platform. More importantly, data mesh puts ownership and responsibility with the right people in an organization, or at least tries to. To read and discover more about data mesh, you can click this link.

In a way, in the “age of any data anytime, anywhere”, we now have several data platform architectures and paradigms to consider and leverage; and as architects, we need to be able to determine the solution or architecture that best fits the requirements or use cases.

The Future Looks Promising…
We have indeed come a long way from writing data on cave walls, and there is still more to learn from the data the world generates every second. This will not change, as we all continue to generate data in this digital or post-digital world.

As the generation and availability of data continue to grow, data sharing use cases, requirements and workloads will increase. In this context, a network of data sources and data platforms will emerge that needs to be secure and that allows any user to search for the data and information they need to do their work.

At the same time, there is already promising research into, and applications of, quantum computing, which you can read about via this link to MIT News. Some quantum computing services are now available, and these will further revolutionize and evolve the way we work with data.

Cool stuff! To me, these are all great opportunities for data practitioners!

Disclaimer: All views expressed in this story are my own and do not represent the opinions or viewpoints of any entity or organization that I have been, am now, or will be affiliated with.

This story has been published for information and illustrative purposes only and is not intended to serve as advice of any nature whatsoever. The information contained and the references made in this story are given in good faith; neither my employer nor any of its directors, agents or employees gives any warranty of accuracy (whether expressed or implied) or accepts any liability as a result of reliance upon the information, including (but not limited to) any advice, statement or opinion contained in this story.

This story also contains certain information available in the public domain, created and maintained by private and public organizations. I do not control nor guarantee the accuracy, relevance, timeliness or completeness of such information. This story constitutes a view as of the date of publication and is subject to change.

This story makes only descriptive reference to trademarks that may be owned by others. The use of such trademarks herein is not an assertion of ownership of such trademarks by me or my employer, nor is any claim made to them; it is not intended to represent or imply the existence of an association between me and the lawful owners of such trademarks.


A Data Architect with over 20 years of experience in Data Architecture, Data Management & Data Engineering. https://ph.linkedin.com/in/celmaneldendsudaria