[Edge Computing Series #1] The Rise of Micro Data Lakes

Micro data lakes are revolutionizing data management by offering a more agile and scalable approach to storing and analyzing data, empowering organizations to efficiently handle varying data volumes and quickly extract valuable insights.

Machbase · May 3, 2024

📝Table of Contents
● Edge computing overview
● Edge Computing vs Cloud Computing
● What’s missing from edge computing
● Data is the main concern of cloud providers
● Introducing an international contributor
● When the “data monster of death” emerges at the edge
● How to breathe a real “data breath of life” into the edge: the micro data lake
● But, this doesn’t make sense, does it?

Edge computing overview

Edge computing is a technological approach in which data processing is performed on local computing equipment close to where the data is generated, away from a centralized data center. It plays an important role in analyzing and processing data at the point of generation, especially for Internet of Things (IoT) devices and industrial machines. Cisco earlier introduced a similar scheme under the name fog computing to address the shortcomings of cloud computing; these approaches have since been consolidated under the name edge computing, and related technologies and companies have been emerging since the late 2010s. Schematically, it can be described as shown in the figure below.

Edge vs. cloud computing / Source: Samsung Semiconductor (link)

In other words, it boils down to whether the data is being processed by a cloud server or a computer at the edge.

Edge Computing vs Cloud Computing

So, let’s compare the two technologies and their pros and cons, as shown below.

As the comparison shows, the two technologies compensate for each other’s weaknesses, and in market terms cloud computing is still far more widely adopted.

What’s missing from edge computing

You might think this is because both terms focus on the word ‘compute’ and define the cloud/edge boundary by where data is processed (or analyzed). But if you think about it, one important thing goes unmentioned.

The missing piece is the data itself. When we talk about cloud computing, aggregating all the data on a server and then analyzing that massive, consolidated data set is simply taken for granted. When we talk about edge computing, we emphasize that data processing happens at the edge, but we rarely talk about where that data is stored and how it is used afterwards. Perhaps the unstated assumption is that real-time computing happens at the edge while the data itself naturally ends up in data lakes in the cloud.

Data is the main concern of cloud providers

A data lake is, conceptually, a place where data is collected, consolidated, and prepared for analysis. And the people who are most serious about this data are the cloud providers.

These cloud providers have a very simple underlying model: their business only works in perpetuity if all the data on the planet flows into their cloud. (Remember, getting data into the cloud costs you nothing, but they charge you for everything you take back out.)

Yet if you look at Amazon’s, Google’s, and Microsoft’s edge services one by one, there is hardly any illustration of, or approach to, a model where data is primarily stored at the “edge” rather than in the cloud (and why would they suggest such a model?).

Sure, they use the term “edge computing” to keep up with global technology trends and talk as if they are at the forefront of it, but look closely and their “edge computing” is not about storing and processing data at the edge. By focusing on the “compute” part of the term, they really mean processing at the edge while sending all of that data to the cloud.

A typical example is Amazon’s “AWS for the Edge” page (link).

Amazon’s edge computing

All the important terms of edge computing are there, but when you reach the part about data at the edge, there is a little about security and nothing about the actual volume of the data or how it is handled; it is all just “compute”. Of course, if the amount of data is modest enough to be handled in the cloud and the customer is willing to pay for it, it is a win-win. But if more data comes from the edge than expected, the fleet of edge equipment grows, and the volume of data sent to the cloud exceeds the plan, whose pockets end up paying for it, and whose business model ends up the happiest?

Introducing an international contributor

On LinkedIn, Derek Mak is something of an evangelist, posting on a wide variety of topics. Take a look at this provocative post from 2020.

Micro data lake proposal post / Source: https://www.linkedin.com/pulse/rise-micro-data-lakes-edge-derek-mak/

The article covers many topics from many angles, but what struck me most were three points: that “data is crude oil”, that you can only discover what is possible with data once you have actually collected it, and that we need “micro data lakes” that put data at the center of everything, even at the edge, to reduce costs and enable independent data processing.

I have read a great deal about edge computing on the web, but the idea that we need data lakes at the edge sounds like a crazy one, doesn’t it?

When the “data monster of death” emerges at the edge

Time passed, and it is now 2024. I don’t know whether, back in 2020 when that article was written, the author was advocating “micro data lakes” with a real-world data explosion at the edge in mind. But I suspect this kind of insight comes from thinking through edge computing, imagining the situations that will inevitably arise, and working out solutions for them in advance.

The data monster of death from edge devices

Let’s talk about a recent case of a company that cannot be named due to an NDA. The company is part of a conglomerate that produces a wide range of high-tech equipment and materials, and of course it produces products of varying types and grades every year.

The production process is continuous, and at each stage dedicated sensors collect more than a thousand samples per second to monitor the state of the equipment, with hundreds of sensors per unit.

The task of the person in charge is to develop an AI module that analyzes this huge volume of data in real time to detect anomalies and prevent quality or production problems before they occur.

The problem he faces looked at first like a dinosaur the size of his palm, but it turned out to be a Tyrannosaurus.

Let’s make a quick list of the issues.

1️⃣Data collection
● We currently store the collected data in CSV files.
● But we are generating hundreds of multi-GB files a day; so far we have accumulated more than a thousand.

2️⃣Extracting data
● We need to extract data for AI training.
● But that means manipulating hundreds of files with an interpreted language like Python, which takes a very long time (a sketch of this workflow follows the list below).

3️⃣Processing data
● Creating AI training data involves manipulating the data and producing and storing multiple CSV files arranged in a specific pattern.
● But after training we realize the data is not quite right, so we repeat the whole process.

4️⃣Visualizing data
● We want to look at the data we plan to train on.
● But even for a quick glimpse, we have to read and write hundreds of gigabytes of CSV files to collect, extract, and process whatever we want to see.

5️⃣Cloud (server) data transfer and integration/management costs
● Since we don’t have a server for the actual AI training, we have to transfer this data to cloud storage. The transfer itself is painful, and verifying that everything arrived intact is nearly impossible when thousands of files must be moved one at a time.
● As a result, the amount of data in the cloud data lake has grown enormous and is still growing.
● Finally, the cost of storage and management makes us wonder whether it is worth it.
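
To make the extraction pain concrete, here is a minimal sketch of the kind of CSV-scanning script this workflow forces on you. The directory layout, column names, and sensor IDs are illustrative assumptions, not the customer’s actual schema.

```python
# Hypothetical sketch of the CSV-based workflow above; file layout, column
# names, and the sensor/time window are illustrative assumptions.
import glob
import pandas as pd

def extract_sensor_window(csv_dir: str, sensor_id: str, start: str, end: str) -> pd.DataFrame:
    """Pull one sensor's data for one time window by scanning every CSV file."""
    frames = []
    for path in sorted(glob.glob(f"{csv_dir}/*.csv")):
        # Each multi-GB file must be read (here, in chunks) even if only a
        # handful of its rows fall inside the requested window.
        for chunk in pd.read_csv(path, chunksize=1_000_000, parse_dates=["ts"]):
            mask = (chunk["sensor_id"] == sensor_id) & chunk["ts"].between(start, end)
            if mask.any():
                frames.append(chunk.loc[mask])
    return pd.concat(frames) if frames else pd.DataFrame()

# Usage: hours of disk I/O for a question an indexed store could answer in seconds.
# window = extract_sensor_window("/data/csv", "S-017", "2024-03-01", "2024-03-02")
```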

This raises the question of whether or not a cloud-based, data-centric model is realistic.

The problem is even more staggering when analyzed quantitatively in terms of data volume.
● Sensor sampling rate: 1,000 samples per second per sensor
● Number of sensors per machine: 40
● Number of machines on a line: 150
● Number of production lines in a factory: 32
● Number of production sites worldwide: 4

So, assuming you collect 3 months of data per line and 10 bytes of data per event, how much data do you need to manage? (Assuming you have collected enough data for AI training)

3 months of data storage per device = 1,000 (Hz) x 40 (sensors) x 60 (seconds) x 60 (minutes) x 24 (hours) x 90 (days) ≈ 311 billion events.

Multiply this by the 150 devices on a line and you get the total amount of sensor data from a single line, all of it destined for a single cloud server: roughly 46.7 trillion sensor data points, which at 10 bytes per point is about 467 terabytes of raw CSV storage. And that is just the raw data; add any processing or indexing and you are looking at a “data monster” like you have never seen before. What would the infrastructure and management cost of holding this data in the cloud be?
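
For anyone who wants to reproduce the arithmetic, here is a minimal sketch (decimal terabytes; the 10 bytes per event is the assumption stated above):

```python
# Back-of-envelope check of the raw data volume for one production line.
HZ = 1_000            # samples per second per sensor
SENSORS = 40          # sensors per machine
MACHINES = 150        # machines per line
DAYS = 90             # retention window (3 months)
BYTES_PER_EVENT = 10  # assumed raw size per event

events_per_machine = HZ * SENSORS * 60 * 60 * 24 * DAYS
events_per_line = events_per_machine * MACHINES
raw_tb_per_line = events_per_line * BYTES_PER_EVENT / 1e12

print(f"{events_per_machine:,} events per machine")       # 311,040,000,000 (~311 billion)
print(f"{events_per_line:,} events per line")             # 46,656,000,000,000 (~46.7 trillion)
print(f"{raw_tb_per_line:,.0f} TB of raw data per line")   # ~467 TB before any indexing
```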

How to breathe a real “data breath of life” into the edge: the micro data lake

So Derek Mak’s concept of “micro data lakes” makes sense today.

In other words, all data generated at the edge is stored and managed in data lakes built at the edge, and only the truly important data is sent to the cloud: alarms, events, fault log data, metadata, and so on. And if computation also stays at the edge, as the term “edge computing” originally intended, doesn’t the whole picture finally fit together?
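
As a rough illustration of that split (all names, structures, and the alarm rule below are hypothetical), an edge process might look something like this: every raw reading lands in the local micro data lake, and only the exceptional events are queued for the cloud.

```python
# Sketch of the edge/cloud split: raw data stays at the edge, and only
# alarms/events/metadata travel upstream. Everything here is illustrative.
local_store = []   # stand-in for the edge micro data lake
cloud_queue = []   # stand-in for the uplink to the central cloud

ALARM_THRESHOLD = 95.0  # hypothetical alarm condition

def on_reading(ts: int, sensor: str, value: float) -> None:
    # 1. Every raw reading is kept locally, so nothing is lost.
    local_store.append((ts, sensor, value))
    # 2. Only the truly important data crosses the network.
    if value > ALARM_THRESHOLD:
        cloud_queue.append({"type": "alarm", "ts": ts, "sensor": sensor, "value": value})
```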

Breathe life into your data

Looking at the customer example above from an edge perspective, the raw data per machine for three months comes to about 3.1 TB (40,000 events per second x 10 bytes x 90 days), so with even modest compression a micro data lake of around 2–3 TB of storage per machine gives us some breathing room.

Let’s poetically describe this as “breathing life into data” because it opens up time and space for data to come alive. I think it makes the following things possible.

1️⃣ We can now store data in real time.
● Tens of thousands of data points per second are no longer lost: the micro data lake stores all of them and indexes them in real time.
2️⃣ We can keep all past events and history at the edge.
● If a problem occurs, we can identify its cause within the relevant timeframe, because the micro data lake holds all the data.
3️⃣ We can extract data in real time.
● We can pull exactly the slice we want out of 100 billion records, because the data is already indexed (a minimal sketch follows this list).
4️⃣ The cost of maintaining and managing data drops dramatically.
● We only pay for the initial edge storage device; after that, maintenance cost tends toward zero, aside from device failures.
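
The “already indexed” part of point 3 is the key difference from the CSV workflow. SQLite is used here purely as a stand-in for an edge time-series store, and the table and column names are assumptions.

```python
# Illustrative only: SQLite as a stand-in for an edge time-series store,
# showing why an index turns extraction into a cheap range lookup.
import sqlite3

db = sqlite3.connect("edge.db")  # local file on the edge device
db.execute("CREATE TABLE IF NOT EXISTS readings (ts INTEGER, sensor TEXT, value REAL)")
# The (sensor, ts) index is what replaces "scan hundreds of CSV files".
db.execute("CREATE INDEX IF NOT EXISTS ix_sensor_ts ON readings (sensor, ts)")

def window(sensor: str, t0: int, t1: int):
    """Pull one sensor's data for one time window without touching the rest of the store."""
    return db.execute(
        "SELECT ts, value FROM readings WHERE sensor = ? AND ts BETWEEN ? AND ? ORDER BY ts",
        (sensor, t0, t1),
    ).fetchall()
```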

But, this doesn’t make sense, does it?

That’s right. So far, I have described a “micro data lake” for “edge computing” with plenty of rhetoric and assumptions, but I have not really addressed whether it is technically feasible, because everyone knows the technical limitations, such as the following:

1️⃣Does the edge have the computing power to build a data lake? How do you install an enterprise-grade data solution on edge servers?
2️⃣How do you manage hundreds of billions of data records in real time on a small edge device? Simple CSV storage is not the answer; it isn’t searchable.
3️⃣We are talking about ingesting 40,000 records per second while still extracting data in real time to enable edge computing. Is this possible? (A back-of-envelope check follows this list.)
4️⃣Data compression is also required; otherwise the storage requirement is too large. Can the data be accessed while it stays compressed?
5️⃣How will we visualize the data? If we are managing all the data at the edge, shouldn’t we be able to show something there? We need daily, weekly, and monthly trends and statistics visualized over 300 billion data points.
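
Here is that back-of-envelope check on questions 3 and 4; the 5:1 compression ratio is an assumption for illustration, not a measured figure.

```python
# Per-machine ingest and storage arithmetic behind the questions above.
INGEST_RATE = 40_000     # records per second per machine (1,000 Hz x 40 sensors)
BYTES_PER_EVENT = 10     # assumed raw size per event
DAYS = 90                # retention window
COMPRESSION_RATIO = 5    # assumed, for illustration only

raw_tb = INGEST_RATE * BYTES_PER_EVENT * 86_400 * DAYS / 1e12
print(f"raw data per machine over {DAYS} days: {raw_tb:.2f} TB")                     # ~3.11 TB
print(f"at {COMPRESSION_RATIO}:1 compression: {raw_tb / COMPRESSION_RATIO:.2f} TB")  # ~0.62 TB
```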

The world of edge computing

“Micro data lakes”

This may seem like a futuristic technology that’s difficult to implement.

In any case, now that we have identified a market need that you might think is ridiculous, let’s talk about how to implement it in the next installment of this series.

🧭Homepage 🚀Machbase Neo 📍Github 🗣️LinkedIn 🎬Youtube

📧Email

Machbase stands as the world’s fastest timeseries database, offering an ideal solution for diverse environments. Whether it’s edge devices with limited resources or clusters processing massive amounts of data, Machbase excels in scalability to meet the demands of any scenario.

