[Edge Computing Series #1] The Rise of Micro Data Lakes
Micro data lakes are revolutionizing data management by offering a more agile and scalable approach to storing and analyzing data, empowering organizations to efficiently handle varying data volumes and quickly extract valuable insights.
Table of Contents
- Edge computing overview
- Edge Computing vs Cloud Computing
- What's missing from edge computing
- Data is the main concern of cloud providers
- Introducing international contributors
- When "data monsters of death" emerge at the edge
- How to breathe a real "data breath of life" into the edge: Micro data lake
- But, this doesn't make sense, does it?
Edge computing overview
Edge computing is a technological approach in which data processing is performed on local computing equipment, away from a centralized data center. This approach plays an important role in analyzing and processing data at the point of generation, especially for Internet of Things (IoT) devices and industrial machines. Cisco introduced a similar scheme under the name fog computing to address the shortcomings of cloud computing; more recently these ideas have been consolidated under the name edge computing, and various technologies and companies have emerged since the end of the 2010s. Schematically, it can be described as shown in the figure below.
In other words, it boils down to whether the data is being processed by a cloud server or a computer at the edge.
Edge Computing vs Cloud Computing
So, let's take a look at the differences between these two technologies and compare their pros and cons, as shown below.
As shown above, the two technologies compensate for each other's weaknesses, and in terms of the market, cloud computing is still far more popular.
What's missing from edge computing
You might think this is because both focus on the word "compute" and define the cloud/edge boundary by where data is processed (or analyzed). But if you think about it, there's one important thing that goes unmentioned.
When we talk about cloud computing, aggregating all the data on a server and then analyzing the massive aggregated dataset is the order of the day. When we talk about edge computing, we emphasize that data processing happens at the edge, but we rarely talk about the storage and subsequent use of that data. Perhaps that is because of an implicit assumption: real-time computing happens at the edge, but the data naturally ends up in data lakes in the cloud.
Data is the main concern of cloud providers
A data lake is a place where data is collected, consolidated, and prepared for analysis. And the people who are really serious about this data are the cloud providers.
These cloud providers have a very basic model: their business only works in perpetuity if all the data on the planet is fed into their cloud realm. (Remember, it costs them nothing to get your data into the cloud, but they charge you everything to get it out.)
By the way, if you look at Amazon's, Google's, and Microsoft's edge services one by one, there is hardly any illustration of, or approach to, a model where data is primarily stored at the "edge" rather than in the cloud (why would they suggest such a model?).
Sure, they use the term "edge computing" to keep up with global technology trends and talk as if they are at the forefront of it, but if you look closely, their "edge computing" is not about storing and processing data at the edge. By focusing on the "compute" part of the term, they are really talking about processing at the edge while sending all that data to the cloud.
A typical example is Amazon's "AWS for the Edge" page.
All the important terms of edge computing are there, but when you get to the part about data at the edge, there is a little bit about security and nothing about the actual size of the data or how it is handled… just "compute". Of course, if the amount of data is modest enough to be handled in the cloud, and the customer is willing to pay for it, it is a win-win. But if more data comes from the edge than expected, the amount of edge equipment is large, and the volume of data going to the cloud exceeds expectations, whose pockets pay for it, and whose model comes out happiest?
Introducing international contributors
On LinkedIn, Derek Mak is an evangelist of sorts, posting on a variety of topics. Take a look at this provocative post of his from 2020.
The post covers a lot of ground from many perspectives, but what struck me most were these points: "data is crude oil"; only once you collect data can you discover the possibilities of processing it; and we need "micro data lakes" that put data at the center of everything, even at the edge, to reduce costs and enable independent data processing.
I've read a lot about edge computing on the web, but the idea that we need data lakes at the edge is kind of a crazy one, isn't it?
When "data monsters of death" emerge at the edge
Time passed, and now it is 2024. I don't know whether, back in 2020 when he wrote that post, he was advocating "micro data lakes" in anticipation of a real-world data explosion at the edge. I suspect this kind of insight comes from thinking hard about edge computing, about the situations that will inevitably arise, and about solutions for those situations.
Let's talk about a recent case of a company that cannot be named due to an NDA. The company is part of a conglomerate that produces a wide range of high-tech equipment and materials, and of course it produces products of different types and quality every year.
The production process is continuous, and at each stage, specific sensors collect more than a thousand samples of data per second to determine the status of the equipment, with hundreds of sensors per unit.
The task of the person in charge is to develop an AI module that analyses this huge amount of data in real-time to detect anomalies and prevent quality or production problems in advance.
The problem he faces looks at first like a small dinosaur the size of his palm, but it turns out to be a Tyrannosaurus.
Let's make a quick list of the issues.
1️⃣ Data collection
- We currently store the collected data in CSV files.
- But we are getting hundreds of multi-GB files a day. So far we have over a thousand.
2️⃣ Extracting data
- We need to extract data for AI training.
- However, that means manipulating hundreds of files, for a long time, in an interpreted language like Python.
3️⃣ Processing data
- Creating AI training data involves manipulating the data, then creating and storing multiple CSV files with different data in a specific pattern.
- However, after training, we realise the data is not right, so we repeat the whole process.
4️⃣ Visualizing data
- We want to see the data we are about to train on.
- But just to get a glimpse, we have to read and write hundreds of gigabytes of CSV files to collect, extract, and process the data we want.
5️⃣ Cloud (server) data transfer and integration/management costs
- Since we don't have an on-site server for the actual AI training, we have to transfer this data to cloud storage. The transfer itself is difficult, and it is almost impossible to verify that everything arrived intact: thousands of files have to be transferred one at a time.
- As a result, the amount of data in the cloud data lake has grown enormously and is still growing.
- Finally, the cost of storage and management makes us wonder if it is worth it.
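To make the extraction pain concrete, here is a minimal sketch of the CSV workflow described above (the file layout and column names are hypothetical, not the customer's actual schema). Because CSV has no index, pulling out even a five-minute window means reading every file end to end:

```python
import glob
import os
import tempfile

import pandas as pd

def extract_window(csv_dir, start, end):
    """Collect rows in [start, end) by scanning every CSV file.
    CSV has no index, so every byte of every file must be read."""
    frames = []
    for path in sorted(glob.glob(os.path.join(csv_dir, "*.csv"))):
        df = pd.read_csv(path, parse_dates=["ts"])  # full-file read each time
        hit = df[(df["ts"] >= start) & (df["ts"] < end)]
        if not hit.empty:
            frames.append(hit)
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

# Tiny demo with two throwaway files standing in for "hundreds of multi-GB files".
tmp = tempfile.mkdtemp()
for day in ("2024-01-01", "2024-01-02"):
    ts = pd.date_range(day, periods=1000, freq="s")
    pd.DataFrame({"ts": ts, "value": range(1000)}).to_csv(
        os.path.join(tmp, f"{day}.csv"), index=False)

window = extract_window(tmp, pd.Timestamp("2024-01-01 00:05:00"),
                        pd.Timestamp("2024-01-01 00:10:00"))
print(len(window))  # 300 rows, found only after reading both files in full
```

At hundreds of multi-GB files, this per-query full scan is exactly what makes steps 2 through 4 above take "a long time".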
This raises the question of whether or not a cloud-based, data-centric model is realistic.
The problem is even more staggering when analyzed quantitatively in terms of data volume.
- Sensor sampling rate: 1,000 samples per second
- Number of sensors per machine: 40
- Number of machines on a line: 150
- Number of production lines in a factory: 32
- Number of production sites worldwide: 4
So, assuming you collect 3 months of data per line and 10 bytes of data per event, how much data do you need to manage? (Assuming you have collected enough data for AI training)
3 months of data storage per machine = 1000 (Hz) x 40 (sensors) x 60 (seconds) x 60 (minutes) x 24 (hours) x 90 (days) = 311 billion events.
Multiply this by the number of machines, 150, and you get the total amount of sensor data from a single line, all managed by a single cloud server: about 46.7 trillion sensor data points, which, at 10 bytes each, comes to around 466 terabytes of raw CSV storage. And that is just the raw data; once you process or index it, you are looking at a "data monster" like you have never seen before. What would be the infrastructure and management costs of holding this data space in the cloud?
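A quick back-of-envelope script to check those figures, using only the numbers given above:

```python
# Back-of-envelope check of the figures above.
HZ = 1000                 # samples per second per sensor
SENSORS = 40              # sensors per machine
MACHINES = 150            # machines per line
BYTES_PER_EVENT = 10
SECONDS_90_DAYS = 60 * 60 * 24 * 90

events_per_machine = HZ * SENSORS * SECONDS_90_DAYS
events_per_line = events_per_machine * MACHINES
raw_bytes = events_per_line * BYTES_PER_EVENT
bytes_per_machine = events_per_machine * BYTES_PER_EVENT

print(f"{events_per_machine:,} events per machine")      # 311,040,000,000
print(f"{events_per_line:,} events per line")            # 46,656,000,000,000
print(f"{raw_bytes / 1e12:.0f} TB raw per line")         # ~467 TB
print(f"{bytes_per_machine / 1e12:.1f} TB raw per machine")
```

Note that even a single machine accumulates roughly 3.1 TB of raw events over the 3 months, before any processing or indexing.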
How to breathe a real "data breath of life" into the edge: Micro data lake
So Derek Mak's concept of "micro data lakes" makes sense today.
In other words, all data generated at the edge is managed by building data lakes at the edge, and only the really important data is sent to the cloud: alarms, events, fault log data, metadata, and so on. And if we keep computation at the edge, "edge computing" as the original authors intended, doesn't this feel like coming full circle?
Looking at the customer example above from an edge perspective, if we can get a micro data lake right with around 2-3 TB of storage per machine, there is some breathing room.
Let's poetically describe this as "breathing life into data", because it opens up time and space for data to come alive. I think it makes the following things possible.
1️⃣ We can now store data in real-time.
- Tens of thousands of data points per second are no longer lost, because the micro data lake stores all of them and indexes them in real-time.
2️⃣ We can keep all past events and history at the edge.
- If a problem occurs, we can identify its cause within that timeframe, because the micro data lake holds all the data.
3️⃣ We can extract data in real-time.
- We can extract exactly the slice we want from 100 billion records, because the micro data lake has already indexed the data.
4️⃣ The cost of maintaining and managing data drops dramatically.
- You only pay for the initial edge storage device; after that, maintenance costs tend toward zero, barring device failures.
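The article does not name a specific engine, so purely as an illustration of point 3️⃣, here is a sketch using SQLite (an embedded database that runs on most edge hardware) standing in for a micro data lake. The point is what an index built at ingest time buys you: a range extraction touches only the matching rows instead of rescanning every file.

```python
import sqlite3

# Minimal sketch of real-time extraction, using SQLite as a stand-in
# for an embedded edge data store (the actual engine is not specified here).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sensor (ts INTEGER, sensor_id INTEGER, value REAL)")
con.execute("CREATE INDEX idx_ts ON sensor (ts)")  # index built at ingest time

# Ingest a small stand-in for the 40,000 events/second stream.
rows = [(t, t % 40, float(t)) for t in range(100_000)]
con.executemany("INSERT INTO sensor VALUES (?, ?, ?)", rows)

# Range extraction uses the ts index: only matching rows are touched,
# instead of re-reading hundreds of CSV files end to end.
hits = con.execute(
    "SELECT COUNT(*) FROM sensor WHERE ts BETWEEN 5000 AND 5999").fetchone()[0]
print(hits)  # 1000
```

Whether SQLite itself could sustain the volumes described above is a separate question; the sketch only shows the indexed-extraction idea.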
But, this doesn't make sense, does it?
That's right. So far, I've described a "micro data lake" for "edge computing" with a lot of rhetoric and assumptions, but I haven't really addressed whether it is technically feasible, because everyone knows the technical limitations, such as:
1️⃣ Does the edge have the computing power to build a data lake? How do you install an enterprise data solution on edge servers?
2️⃣ How do you manage hundreds of billions of data records in real-time on a small edge device? Simple CSV storage is not the answer; it isn't searchable.
3️⃣ We are talking about 40,000 data records per second, with real-time data extraction, to enable edge computing. Is this possible?
4️⃣ Data compression is also required; the raw storage requirement is too large. Can't we access the data in a compressed state?
5️⃣ How will we visualize the data? If we're managing all the data at the edge, shouldn't we be able to show something there? We need daily, weekly, and monthly trends and statistics visualizations over 300 billion pieces of data.
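On question 4️⃣, one common approach (an assumption for illustration, not a description of any particular product) is to compress data in time-ordered chunks and keep a tiny min/max index per chunk, so a query decompresses only the chunks that overlap the requested range:

```python
import zlib

# Sketch: chunk the stream, compress each chunk, and keep a small
# (min_ts, max_ts) index so queries skip non-matching chunks entirely.
CHUNK = 10_000

# Fake, repetitive sensor readings as "ts,sensor,value" lines.
lines = [f"{t},{t % 40},{t * 0.5}" for t in range(100_000)]

chunks = []  # list of (min_ts, max_ts, compressed_bytes)
for i in range(0, len(lines), CHUNK):
    part = lines[i:i + CHUNK]
    blob = zlib.compress("\n".join(part).encode())
    chunks.append((i, i + len(part) - 1, blob))

raw_size = sum(len(line) + 1 for line in lines)
packed_size = sum(len(b) for _, _, b in chunks)
print(f"compressed to {100 * packed_size / raw_size:.0f}% of raw size")

def query(lo, hi):
    """Return matching lines, decompressing only chunks overlapping [lo, hi]."""
    out = []
    for min_ts, max_ts, blob in chunks:
        if max_ts < lo or min_ts > hi:
            continue  # skipped without decompression
        for line in zlib.decompress(blob).decode().split("\n"):
            ts = int(line.split(",", 1)[0])
            if lo <= ts <= hi:
                out.append(line)
    return out

hits = query(25_000, 25_999)
print(len(hits))  # 1000 rows; only 1 of the 10 chunks was decompressed
```

Columnar formats used in practice work on the same principle, with per-block statistics deciding which compressed blocks a query must open.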
"Micro data lakes"
This may seem like a futuristic technology thatâs difficult to implement.
In any case, now that we've identified a market need that you might think is ridiculous, let's talk about how to implement it in the next part of this series.
Homepage | Machbase Neo | Github | LinkedIn | Youtube | Email
Machbase stands as the world's fastest time-series database, offering an ideal solution for diverse environments. Whether it's edge devices with limited resources or clusters processing massive amounts of data, Machbase excels in scalability to meet the demands of any scenario.