The Deluge — Perpetual Data

The Lowest Fruit On The Tree

Published in Data Based, Mar 15, 2024

Thirty-five years into a new age, personal printing presses transmitting data in near real time have created a deluge of potential information. The data is crude. It is raw. It is messy, poor on hygiene, weak on format, and lacking in structure, to put it kindly. The sources are speculative, subjective, and disparate. It is a “Sea of Bullshit”. Be careful not to drown.

The printing press led to the scientific revolution, but not overnight. What The Deluge will bring is entirely a matter of speculation, but the challenges and opportunities it presents are already here. Let’s begin by working through a few of them.

The Perpetual Data Landscape

The World Wide Web is an endless landscape of data with fairly open access provided by the Internet. It is a multi-faceted landscape with an equally endless number of entities broadcasting streams of data. It is also an ecosystem, where technologies like search engines and social media actively ingest and rebroadcast that data. Once a land of mostly human activity, it is now a realm of bots, GAI, and even talking toasters (sorry, the Internet of Things has been slow to get its feet under it).

Also within this ecosystem is a group of data collectors. It started with web scrapers, which led to bots. Some were renamed crawlers and spiders. Some were considered friendly. Open source and API technology led more and more companies to sell data access: you make a few dollars, control the data flow, and combat a good portion of the crawler volume. It is now easier than ever to collect endless data.

Challenges Of Leveraging Perpetual Data

This landscape is easier and cheaper to access, but it comes with new difficulties. If your team is not cognizant of these challenges, you will pay back the low costs in time, volume, and rework (which means more time and volume). It takes skill to harvest even the lowest-hanging fruit.

Source

The data that you ETL from your transactional systems can be messy, irregular, and challenging. You are about to amplify that 10X, if not more (some of you work in some challenging systems). Each data source is going to have its own challenges, if not multiple sets of them. Those challenges follow from the purpose of each platform you visit or mine on the web. Many of them are multi-purpose platforms that have zero incentive NOT to commingle their data.

Analyzing the source of your data is essential to using it well. You are going to have to compare and contrast the purpose and constraints of the source with your own. I hope someone prepared you to do that first ;) Source matters first, but there is more.

Perspective

Each source has its own purpose and perspective; again, it may have several. Most internet platforms (today’s most likely sources of robust data) are ecosystems for individuals pursuing their own ends. They may be motivated by financial incentives, power & influence, simple vanity, or chaotic insanity. People are batsh!t, just saying.

You will need to analyze the perspective of various data streams, fields, and flows. Source will help, sometimes. It is also at this point in the process that you face a new decision: is this source, with this perspective, aligned closely enough with my own purpose to leverage? You will also need to take into account your constraints and the maturity of your data initiative.

In My Experience…

Order matters. Many a data source had to wait weeks or months before we could ingest it in a way that benefited our understanding and decision-making. We will talk more about this later.

Structure

Full disclosure — if the data you found is in the right structure for you to leverage it easily, you are either dead wrong or incredibly lucky. Do yourself a favor and think it through a second time and with a second set of eyes. Structuring this data is almost always difficult. That is why it is the last step — if the source and purpose are weak, don’t waste your time here. It will get expensive in a hurry.
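
The ordering above (source first, then perspective, and structure only if both hold up) can be sketched as a simple gate. This is a minimal illustration, not a method from this article: the `Candidate` class, the 0-to-1 fit scores, and the 0.6 threshold are all hypothetical stand-ins for whatever judgment your team actually applies.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A candidate data source (hypothetical fields, scored by your team)."""
    name: str
    source_fit: float       # 0..1: do the source's purpose and constraints match yours?
    perspective_fit: float  # 0..1: is its perspective aligned with your purpose?


def next_step(candidate: Candidate, threshold: float = 0.6) -> str:
    """Decide what to do with a source, checking the cheap questions first."""
    if candidate.source_fit < threshold:
        return "skip: weak source"
    if candidate.perspective_fit < threshold:
        return "defer: perspective not aligned yet"
    # Only now is the expensive structuring work worth paying for.
    return "structure"


if __name__ == "__main__":
    for c in (
        Candidate("review site", 0.8, 0.7),
        Candidate("anonymous forum", 0.3, 0.9),
        Candidate("vendor API", 0.9, 0.4),
    ):
        print(c.name, "->", next_step(c))
```

The point of the gate is cost: the two early checks are cheap conversations, while the final branch is where the engineering bill starts.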

NLP, tokenization, and even some off-the-shelf AI (think ChatGPT4) could be very helpful here. A great Data Engineer doubly so! A dedicated Data Architect? Now you're fighting with advantage! This is a level of innovation, technical skill, and artistry that is not required otherwise. Fortunately, the core data team (the DEs and the DA) is rarely bespoke, so centralization will help you scale quickly. Hire well.
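
To make the tokenization idea concrete, here is a minimal sketch that turns one messy, free-text post into a small structured record using only the Python standard library. The token pattern, the `structure` function, and the output fields are assumptions chosen for illustration; a real pipeline would more likely lean on a proper NLP library and a schema your Data Architect designed.

```python
import re
from collections import Counter

# Assumed token pattern: words, with an optional leading '#' or '@'
# and an optional apostrophe part (e.g. "don't").
TOKEN_RE = re.compile(r"[#@]?\w+(?:'\w+)?")


def tokenize(text: str) -> list[str]:
    """Split raw text into lowercase tokens, dropping punctuation."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


def structure(raw: str) -> dict:
    """Turn one raw post into a record with a few queryable fields."""
    tokens = tokenize(raw)
    return {
        "hashtags": [t for t in tokens if t.startswith("#")],
        "mentions": [t for t in tokens if t.startswith("@")],
        "word_counts": Counter(t for t in tokens if t[0] not in "#@"),
    }


if __name__ == "__main__":
    post = "Loving the new #Toaster from @AcmeCorp!! the toaster TALKS"
    record = structure(post)
    print(record["hashtags"], record["mentions"])
    print(record["word_counts"].most_common(3))
```

Even this toy version shows where the cost comes from: every field is a decision about what the raw text means, and each source will need its own set of decisions.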

All of this was just a teaser though… let’s go deeper into The Deluge. Follow me on Substack.

Written by Decision-First AI