Rapid reads — Optimizing Edge Computing: Data Preprocessing and Pipelining

This article is the latest in the series on edge computing for ML. If you need more context, please refer to the previous articles here.

Now that we have seen the techniques of model optimization, let's have a look at the second most important step of this process. The one that started it all: data.

Why data preprocessing?

And how is it different from data preprocessing in ordinary ML pipelines? The answer comes down to how much data can flow through the edge device's memory. Because the device is short on memory, preprocessing has to optimize the data not only for structure but also for memory footprint. This is why the concept of ‘summarization’ keeps coming up in edge computing: pass only the amount of data needed to get the desired results, and nothing more.
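To make that concrete, here is a minimal sketch of on-device summarization. The temperature readings and the summarize helper are made up purely for illustration, not tied to any particular library: instead of shipping every raw reading upstream, the device sends a handful of summary statistics.

```python
import statistics

def summarize(readings):
    """Collapse a batch of raw sensor readings into a few summary
    statistics, so only the summary needs to leave the device."""
    return {
        "count": len(readings),
        "mean": round(statistics.fmean(readings), 2),
        "min": min(readings),
        "max": max(readings),
    }

# Simulated temperature stream: 50 raw floats in, 4 numbers out.
raw = [20.1, 20.3, 20.2, 35.7, 20.4] * 10
print(summarize(raw))
```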

There are plenty of techniques for data preprocessing and feature engineering, and they don't differ much from traditional ML preprocessing methods.

But there are some pipelining processes that make a big impact.

Stream processing frameworks

Stream processing frameworks like Apache Kafka Streams, Apache Flink, and Apache Storm are pivotal for handling continuous data streams in real-time. They offer scalability, fault tolerance, low latency, event time processing, and state management, making them ideal for real-time analytics, fraud detection, IoT, clickstream analysis, and log analysis applications. These frameworks enable organizations to process, analyze, and derive insights from streaming data, facilitating proactive decision-making and enhancing operational efficiency.
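Kafka Streams and Flink handle all of this at scale, with fault tolerance and distributed state. To show just the core idea without any framework, here is a toy sketch in plain Python of event-time tumbling windows with per-key state; the simulated stream, the window size, and the counting logic are all assumptions made for illustration.

```python
from collections import defaultdict

def sensor_stream():
    """Simulated stream of (event_time, sensor_id, value) tuples.
    Event time is carried in the data itself, not read from the wall clock."""
    for i in range(20):
        yield (i * 0.2, f"sensor-{i % 3}", float(i))

def tumbling_window_counts(stream, window_seconds=1.0):
    """Group events into event-time windows and keep per-key state
    (a running count per sensor), emitting one summary per window."""
    state = defaultdict(int)                    # keyed state
    window_start = None
    for event_time, sensor_id, _value in stream:
        if window_start is None:
            window_start = event_time
        if event_time - window_start >= window_seconds:
            yield (window_start, dict(state))   # close the finished window
            state.clear()
            window_start = event_time
        state[sensor_id] += 1
    if state:
        yield (window_start, dict(state))       # flush the final window

for window_start, counts in tumbling_window_counts(sensor_stream()):
    print(f"window starting at {window_start:.1f}s -> {counts}")
```

In a real deployment you would hand this windowing and state management to one of the frameworks above rather than reimplementing it by hand.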

Lightweight data serialization formats

Lightweight data serialization formats such as Protocol Buffers, Apache Avro, MessagePack, CBOR, and FlatBuffers provide efficient mechanisms for serializing data into compact binary representations. These formats offer benefits like efficiency, interoperability, schema evolution, and performance, making them well-suited for network communication, distributed systems, and high-performance applications. By reducing message sizes and network bandwidth, lightweight serialization formats optimize data transmission and storage, facilitating seamless data exchange in heterogeneous environments.
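As a rough illustration, the snippet below compares the size of the same payload serialized as JSON and as MessagePack. It assumes the third-party msgpack package is installed (pip install msgpack), and the payload and resulting byte counts are made up for demonstration.

```python
import json
import msgpack  # third-party: pip install msgpack

# A typical small edge payload: a batch of sensor readings.
payload = {
    "device_id": "edge-node-7",
    "readings": [{"t": i, "temp": 20.0 + i * 0.1} for i in range(10)],
}

as_json = json.dumps(payload).encode("utf-8")
as_msgpack = msgpack.packb(payload)

print(f"JSON:        {len(as_json)} bytes")
print(f"MessagePack: {len(as_msgpack)} bytes")

# Round-trip back to Python objects to confirm nothing was lost.
assert msgpack.unpackb(as_msgpack) == payload
```

Protocol Buffers, Avro, CBOR, and FlatBuffers follow the same principle, with different trade-offs around schemas, schema evolution, and zero-copy access.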

Data compression techniques

Data compression techniques play a crucial role in reducing the size of data for storage or transmission, leading to lower storage costs, faster data transmission, and improved system performance. Lossless compression algorithms like DEFLATE and LZ77, lossy compression techniques like JPEG and MP3, dictionary-based compression, run-length encoding (RLE), and Huffman coding are commonly used for file compression, network compression, database compression, multimedia compression, and backup compression applications. These techniques enable organizations to efficiently store, transmit, and process large volumes of data, optimizing resource utilization and enhancing overall system efficiency.
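For a quick feel of lossless compression, here is a short sketch using Python's built-in zlib module (a DEFLATE implementation). The telemetry records are synthetic, and the ratio you see depends entirely on how repetitive your data is.

```python
import json
import zlib

# Repetitive telemetry compresses well; random data would not.
records = [{"sensor": f"sensor-{i % 3}", "status": "OK", "value": 21.5}
           for i in range(200)]
raw = json.dumps(records).encode("utf-8")

compressed = zlib.compress(raw, level=6)    # DEFLATE
restored = zlib.decompress(compressed)

print(f"raw:        {len(raw)} bytes")
print(f"compressed: {len(compressed)} bytes "
      f"({len(compressed) / len(raw):.1%} of original)")
assert restored == raw                      # lossless round trip
```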

Of course, this is not everything. I do invite you to explore more techniques for preprocessing data and optimizing data pipelines for edge computing and machine learning.

In the following, and probably the last, article we will wrap up this rapid reads series with deployment strategies, tools and frameworks, and some case studies to conclude with. Stay positive! Keep the edge computing going! Cheers.
