The Data Lake — Solving For Symptoms, Not Problems

Craig Yamato
Published in FermiHDI
Mar 31, 2023

Data Lakes, as an analogy, are aptly named. If you build out that idea, a Data Lake is a man-made recreational reservoir for fishermen and boaters, around which developers have built homes and casinos. We only interact with its surface, and the town around it exists only to support the visitors to the lake.

Seemingly lost to time is why the reservoir, the Data Lake, and the industry built around and atop it were created. Simply put, it was to address the cost of building sufficiently performant databases. A Data Lake’s technical idea is to act as a giant buffer: it defers the task of processing all that data to query time (basically when you read the data, which is the hard part), a later step that is application specific and distributes the processing load. Even though databases have added new features such as time-series (TSDB) or graph support and data organizational methods like document, wide column, or search, we have yet to meaningfully address the issue that created the need for a Data Lake, let alone account for the rate of data growth, so the Data Lake has become the new core of data systems.

The Data Lake concept has had a significant impact on how we build data systems and the tools that use the data they provide. One of the largest of these impacts is where data at rest (stored data) is augmented and manipulated (transformed). Modern applications, including microservice-based ones, no longer rely on the database for data manipulation, such as joining different tables together; instead, they do this themselves and use multiple engines as part of their secret sauce. In fact, managing these types of services is what is marketed as a Lake House. This approach effectively separates a Data Store’s persistence function from its Data Engine functions, making the latter an on-demand, scalable function and reducing the performance requirement by scoping it to a single application.
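To make that pattern concrete, here is a minimal sketch in Python, assuming a hypothetical lake layout and schema (the paths, the `interface_id` join key, and reading `s3://` URLs through pandas, which requires s3fs, are all assumptions): the application pulls raw files from the lake and performs the join itself, in its own memory.

```python
import pandas as pd

# Persistence function: the lake just hands back raw files.
flows = pd.read_parquet("s3://example-lake/flows/2023-03/")
interfaces = pd.read_parquet("s3://example-lake/interfaces/")

# Data Engine function: the join runs on demand in this application's
# memory rather than inside a database.
enriched = flows.merge(interfaces, on="interface_id", how="left")
```

Each application that needs the enriched view repeats this work for itself, which is exactly the duplication discussed next.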

The drawback is the potential for massive duplication, as every application must now take on the extremely heavy process of marshaling data files into highly structured data in application memory, where it can be processed. For highly targeted queries that only access small, known portions of the dataset, this is a fair trade-off, but for stored data that originated as a stream, the problem is even larger. Compounded over time, these massive datasets are measured in Terabytes, Petabytes, or even larger increments, which is why we refer to them as Hyperscale Datasets. Such datasets are often subjected to deep scans that query large portions of, if not the whole, dataset at a time. Common examples of such queries are building AI training datasets and ML baselines, discovering anomalies and micro patterns, data exploration, and almost every step in building a production streaming process or data-driven automation. Such scale shoves even the Data Lake concept back into the same problem we had with databases: the cost of the performance needed to work with the data in a reasonable time frame is unserviceable, even when buffered by the Data Lake.
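As a sketch of what a deep scan looks like under this model, again with assumed paths and a hypothetical `bytes` column, here is a whole-dataset aggregate using pyarrow’s dataset API; every file must be opened, decompressed, and decoded inside the querying process.

```python
import pyarrow.compute as pc
import pyarrow.dataset as ds

# Opening the dataset is cheap; nothing has been read yet.
dataset = ds.dataset("s3://example-lake/flows/", format="parquet")

# The query touches every record, so every file in the lake is fetched
# and marshaled back into structured batches right here.
total = 0
for batch in dataset.to_batches(columns=["bytes"]):
    total += pc.sum(batch.column(0)).as_py() or 0

print(f"Total bytes across the entire dataset: {total:,}")
```

At Terabyte or Petabyte scale, that loop is where the cost lives, and every application that needs a full-dataset answer runs its own copy of it.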

Solving this problem seems simple if we ignore a few inconvenient facts and apply the idea of “store only the data you need.” I am sure you have heard that marketing phrase a million times. Why the statement is nonsensical is a topic for a whole other series of posts, but suffice it to say that it is impossible to know what you will need; you can only know what you think you need at the time. The result is that the value of what you save extends no further than how you intended to use it when you saved it. Even so, this idea has spawned whole new industries of tools to, well, to be frank, reduce and condense the data we store, mostly by trashing 98% of it. Moreover, it has generated such a complex toolchain, both before and after the Data Lake, that the latest hot topic in Data Systems is tools to manage and automate the tools in the data pipeline.

To demonstrate how this is implemented and its effect, let’s look at a network telemetry application, such as observability or cybersecurity, built on this architecture. Even small corporate networks produce a lot of data: a few thousand Metrics (such as interface in and out bytes) every five minutes or so, parsed Logs at roughly a hundred times that rate, and IP Flow records at a hundred thousand times that rate.
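A quick back-of-the-envelope calculation, using illustrative numbers rather than measurements, shows where those ratios lead:

```python
# Illustrative rates only; real networks vary widely.
metrics_per_5min = 5_000                      # "a few thousand Metrics"
logs_per_5min = metrics_per_5min * 100        # ~500,000 parsed log lines
flows_per_5min = metrics_per_5min * 100_000   # ~500,000,000 flow records

intervals_per_day = 24 * 60 // 5              # 288 five-minute intervals
print(f"~{flows_per_5min * intervals_per_day:,} flow records per day")
# -> ~144,000,000,000 flow records per day
```

Even at a tenth of these assumed rates, the flow feed alone lands squarely in Hyperscale territory.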

Focusing on the largest dataset, IP Flow: most sources provide already-sampled data, but it is almost universally resampled on ingest, further reducing the dataset. This reduced data is often stripped down further, eliminating identification fields and metric values. In many cases, each flow record is reduced to the identification fields of source and destination ASN, IP, port number, and protocol, with metrics of packets and bytes. Losing context like the source device and interface creates records so indistinguishable from one another that they must be “deduplicated,” reducing the dataset yet again. And then only aggregated micro-batches, for example one-minute averages, are actually saved to the Data Lake.
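A hedged sketch of that reduction pipeline, assuming a made-up pandas flow schema with exactly the fields listed above (a datetime `timestamp` column plus the identification and metric fields):

```python
import pandas as pd

def reduce_flows(raw: pd.DataFrame) -> pd.DataFrame:
    # Resample on ingest (1-in-10 here), shrinking the already-sampled feed.
    sampled = raw.sample(frac=0.1)

    # Strip the record down to the surviving identification and metric
    # fields; source device, interface, and other context are dropped.
    keep = ["timestamp", "src_asn", "dst_asn", "src_ip", "dst_ip",
            "src_port", "dst_port", "protocol", "packets", "bytes"]
    stripped = sampled[keep]

    # With that context gone, many records are indistinguishable. The
    # groupby "deduplicates" them and emits the one-minute averages that
    # are all that actually reaches the Data Lake.
    keys = ["src_asn", "dst_asn", "src_ip", "dst_ip",
            "src_port", "dst_port", "protocol"]
    return (stripped
            .groupby([pd.Grouper(key="timestamp", freq="1min")] + keys)
            [["packets", "bytes"]]
            .mean()
            .reset_index())
```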

Still being too much data for a query-processing microservice to query cost-effectively in a reasonable time frame, most systems create one or more sub-datasets at a higher temporal aggregation. For example, monthly, daily, and hourly averages of Packets and Bytes by IP and Protocol make much smaller datasets that are quicker to query. This allows the system to quickly give users or automation tools a high-level view that they can drill down from, successively targeting smaller chunks of the Hyperscale Dataset.
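Continuing with the same assumed schema, those rollup sub-datasets might be derived from the one-minute data with something like this:

```python
import pandas as pd

def build_rollups(minute_data: pd.DataFrame) -> dict[str, pd.DataFrame]:
    keys = ["src_ip", "dst_ip", "protocol"]
    rollups = {}
    # One progressively smaller dataset per aggregation level.
    for label, freq in [("hourly", "1h"), ("daily", "1D"), ("monthly", "MS")]:
        rollups[label] = (minute_data
                          .groupby([pd.Grouper(key="timestamp", freq=freq)]
                                   + keys)
                          [["packets", "bytes"]]
                          .mean()
                          .reset_index())
    return rollups
```

A dashboard queries the monthly table first, then drills into daily, hourly, and finally the one-minute data for ever-narrower slices.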

Some of the drawbacks to this approach might be fairly obvious. The rollups and the “full” data are only statistical overviews; both large and small shifts in the data can be under- or over-represented depending on the scale and duration of the event. For example, a one-minute 400% spike in a metric would be an unnoticeable bump in an hourly average and not even a rounding error in a daily average or higher. You could rely on alarming, but where would you get the baseline, or even a simple threshold, from? The reduced Hyperscale Dataset might give us an idea of the traffic in our network or transiting it, and maybe even where it entered or exited, but with the loss of key identification fields and metrics, any understanding of where the traffic is in the network or what is driving it is lost. Most importantly, this only allows users and applications to work within the frame of a dashboard. Attempts to build new baselines, find new threats and slow-and-low anomalies, explore correlations, and perform ad-hoc analysis and analytics are locked to both the scope and performance of the Data System holding the Hyperscale Data.
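The spike example checks out with a few lines of arithmetic (the baseline value is made up):

```python
baseline = 100                       # steady per-minute metric value
minutes = [baseline] * 60
minutes[30] = baseline * 5           # one minute at 400% above baseline

hourly_avg = sum(minutes) / len(minutes)
print(f"hourly average: {hourly_avg:.2f}")  # 106.67 -- a ~7% bump

daily_avg = (baseline * 1439 + baseline * 5) / 1440
print(f"daily average: {daily_avg:.2f}")    # 100.28 -- rounding-error range
```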

What is really required is a Data System that can cost-effectively reach the performance levels required to query trillions or more records in acceptable, actionable time frames. This would allow us not only to keep more of the data received but to ask any question at any scope or granularity of that data, ensuring the data you need is there and accessible when you need it. At FermiHDI, we believe the answer lies in a Data Store that looks like a Data Lake but handles the data like a network device. But that is a topic for another time.
