AnyLog: Taming the Complexities of Unified IoT Data Processing
--
Written by: Faisal Nawab — Assistant Professor at UC Irvine
The proliferation of Internet of Things (IoT) devices and the huge amounts of data that they are projected to produce pose unprecedented challenges for data management systems. A key question is how we can process and utilize this data. Over the last decade, Big Data management enabled us to answer this question for cloud applications. However, these solutions do not provide a complete answer for IoT data. IoT data processing involves many complexities and requirements that often lead to contradictory design decisions.
The first complexity is that IoT data often must be processed at the edge of the network. The reasons range from real-time processing requirements to the need to optimize edge-to-Internet bandwidth costs and performance. This requirement alone rules out cloud solutions and, with them, the ability to rely on powerful compute nodes.
Design requirement 1: the IoT data infrastructure needs to be deployed at the edge
The second complexity is that IoT data is produced in many different locations across vast geographical areas. To utilize data from geographically dispersed devices, the data processing infrastructure itself needs to be geographically distributed, so that it retains the real-time and bandwidth-efficiency benefits of design requirement 1.
Design requirement 2: the IoT data infrastructure needs to be geographically distributed
The third complexity is that IoT data is diverse. An application may utilize data from IoT devices with different features, data structures, and schemas. The ability to utilize the data from diverse — but related — IoT devices will provide an opportunity to generate better insights and analytics.
Design requirement 3: the IoT data infrastructure needs to enable integrating data from diverse IoT devices
Deploying big data workloads in the cloud or a data center requires IT expertise. This expertise is essential because big data brings a multiplicity of challenges involving high availability, scaling, security, recovery, removal of unneeded data, and many more. Companies deploying edge solutions may find that, rather than supporting one centralized data center, they need to support thousands of small distributed data centers near the edge. In this type of setup, the IT expertise needed may exceed the IT resources available.
Design requirement 4: the infrastructure deployed at the edge needs to be self managed
Until recently, designing a system that meets these four design requirements was infeasible with existing Big Data technologies and current edge computing models. The first two design requirements call for control over an infrastructure of powerful compute nodes that are distributed close to IoT devices across large geographic regions. This type of infrastructure is not available today, and current cloud and edge computing technologies do not provide the foundations needed for it. Cloud and edge computing still require ownership of the infrastructure within a single control domain or a few of them. However, a massive, geographically distributed infrastructure at this scale is not feasible under that ownership model.
As for the third design requirement, current options include data integration solutions that aim to find similarities and connections between data from different sources. These solutions, however, remain limited and often require extensive manual intervention or complex data processing.
The fourth design requirement is not supported because current solutions are not self-managed. One reason it is hard to develop a self-managed system at the edge is that there is no unified way of managing data. Data management at the edge is based, in most cases, on proprietary projects: different types of data are treated differently, which creates not only silos of data but also setups that require significant proprietary and non-uniform knowledge. Self-management, on the other hand, requires a unified process across all the functionalities deployed.
AnyLog — A Decentralized IoT Network
AnyLog is a startup that aims to address the complexities above and provide a unified data infrastructure for IoT data processing. At AnyLog, it was observed that what stands between us and the four design requirements above is the limits of the underlying technologies used for data processing, whether in the cloud or at the edge. However, this is now changing with the advent of blockchain technologies. AnyLog shows that by utilizing blockchain technologies, we can finally provide a solution that combines the four design requirements above.
The AnyLog Network is a collection of nodes (a node can be a small device or gateway and up to the largest server) with the AnyLog software. The software supports p2p communication between the nodes of the network as well as read and write access to a shared metadata layer. The metadata layer is represented on a blockchain.
What is special about blockchain systems is that they combine decentralized coordination (through smart contracts) and compensation (through a cryptocurrency) in a unified framework. This provides the opportunity to build a data processing infrastructure that brings together decentralized compute nodes that are able to process data as a single machine.
For AnyLog, decentralization provides the ability to process data where the data resides, using independent nodes that can work autonomously. At the same time, the nodes share the metadata (which is published as a set of policies on the blockchain) and can exchange messages, allowing them to operate in sync and to share and query any needed data or state in real time.
The fact that a large centralized database is replaced with many small distributed databases at the edge provides significant advantages:
First, queries are processed concurrently on multiple nodes (per query, AnyLog identifies the nodes with relevant data and deploys a MapReduce type of process). With this approach, performance tuning is automated: as the data grows, it is distributed across more nodes, so more nodes participate in each query and each node holds less data.
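The MapReduce type of process described above can be illustrated with a minimal sketch. It is not AnyLog's implementation: the names TABLE_LOCATIONS, NODE_DATA, partial_aggregate and distributed_average are hypothetical, the metadata layer and the nodes are simulated with in-memory dictionaries, and threads stand in for network calls. Each node computes a partial (sum, count) over its local rows, and the partial results are then merged into one average.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the shared metadata: table name -> nodes holding its data.
TABLE_LOCATIONS = {"temperature": ["node_a", "node_b"], "pressure": ["node_c"]}

# Hypothetical per-node storage: node -> table -> locally stored values.
NODE_DATA = {
    "node_a": {"temperature": [20.0, 22.0]},
    "node_b": {"temperature": [24.0]},
    "node_c": {"pressure": [101.3]},
}


def partial_aggregate(node, table):
    """Map step, evaluated per node: return (sum, count) over the local rows."""
    rows = NODE_DATA[node].get(table, [])
    return sum(rows), len(rows)


def distributed_average(table):
    """Find the nodes holding the table, query them concurrently, merge the partials."""
    nodes = TABLE_LOCATIONS.get(table, [])
    if not nodes:
        return None  # no node has published data for this table
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(lambda n: partial_aggregate(n, table), nodes))
    # Reduce step: combine the per-node partial results.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None
```

Because only the (sum, count) pairs travel between nodes, the raw rows stay where they were produced, which matches the in-place processing the article describes.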
Second, it is much simpler to automate high availability (HA) for a small database than for a database supporting complex big data operations. AnyLog leverages the fact that a small database is simple to replicate: when an edge database is updated, one or more mirrored databases are updated concurrently. With this approach, if a node fails, one of the mirrors takes over without downtime, and a background process either recovers the failed node or creates a new copy of the data.
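A minimal sketch of the mirroring idea follows, under stated assumptions: EdgeDatabase and MirroredStore are hypothetical names, replicas are plain in-memory lists, and the mirrors are updated one after another rather than concurrently as in the text. It shows a write reaching every live replica, a read failing over to a mirror, and a recovery step that copies data back to the failed node.

```python
class EdgeDatabase:
    """Hypothetical small database living on one edge node."""

    def __init__(self, name):
        self.name = name
        self.rows = []
        self.alive = True


class MirroredStore:
    """Keeps a primary and its mirrors in step and fails over on reads."""

    def __init__(self, primary, mirrors):
        self.replicas = [primary] + list(mirrors)

    def write(self, row):
        # Apply the update to every replica that is currently reachable.
        for replica in self.replicas:
            if replica.alive:
                replica.rows.append(row)

    def read_all(self):
        # Serve the read from the first live replica (primary first, then mirrors).
        for replica in self.replicas:
            if replica.alive:
                return replica.name, list(replica.rows)
        raise RuntimeError("no live replica available")

    def recover(self, failed):
        # Background-style recovery: copy the data from any live replica.
        source = next((r for r in self.replicas if r.alive), None)
        if source is None:
            raise RuntimeError("no live replica to recover from")
        failed.rows = list(source.rows)
        failed.alive = True
```

For example, after marking the primary as failed (primary.alive = False), read_all() returns the mirror's copy, and recover(primary) brings the primary back in line with it.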
Unified schema and data view
AnyLog unifies the treatment of the data and provides a unified data view. For the user or application, data is treated as if it is organized in a single unified relational database. However, physically, the data remains in-place, distributed at the edge of the network. This is done using virtualization, allowing users and applications to be serviced by a unified data plane.
When AnyLog nodes receive data, they identify its structure. This structure determines a schema, and with the identified schema the node proceeds with one of two options. If the schema already exists (it is published on the blockchain), the node hosts the data using that schema. If the schema is new (not available on the blockchain), the node publishes it on the blockchain so that any other node that needs to store the same type of data will use the same schema. This approach automatically creates a unified metadata layer across all the nodes of the network.
From here, a user or application can be offered the list of tables (using a lookup of the blockchain data) and, for each table, can view its columns and data types. This allows queries to be formulated in exactly the same way as queries to a centralized database. With this approach, a query to any table can be handed to any node in the network.
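The two options above can be sketched as follows. This is an illustration only, not AnyLog's code: MetadataLedger is a hypothetical in-memory stand-in for the blockchain-backed metadata layer, and infer_schema, schema_id and ingest are assumed helper names. The schema is inferred from a record's field names and Python types, and its identifier is a hash of the canonical form, so two nodes that see the same structure arrive at the same identifier.

```python
import hashlib
import json


class MetadataLedger:
    """Hypothetical in-memory stand-in for the shared, published metadata."""

    def __init__(self):
        self.schemas = {}  # schema id -> {column name: type name}

    def lookup(self, sid):
        return self.schemas.get(sid)

    def publish(self, sid, schema):
        self.schemas[sid] = schema


def infer_schema(record):
    """Derive a schema from a record: field name -> name of its Python type."""
    return {key: type(value).__name__ for key, value in sorted(record.items())}


def schema_id(schema):
    """Stable identifier computed from the schema's canonical JSON form."""
    canonical = json.dumps(schema, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def ingest(ledger, local_store, record):
    """Store a record locally, publishing its schema only if it is new."""
    schema = infer_schema(record)
    sid = schema_id(schema)
    is_new = ledger.lookup(sid) is None
    if is_new:
        ledger.publish(sid, schema)  # option 2: the schema was not published yet
    # In both cases the node hosts the data under the shared schema.
    local_store.setdefault(sid, []).append(record)
    return sid, is_new
```

Two nodes that ingest records with the same fields, even in a different field order, end up sharing a single published schema while each keeps its own rows locally.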
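As a rough illustration of the lookup described above, the sketch below lists tables and columns from a snapshot of published metadata and builds a query string against them. PUBLISHED_TABLES, its table and column names, and the helpers list_tables, describe and build_select are all hypothetical; the actual metadata format and query interface are not specified in this article.

```python
# Hypothetical snapshot of the published metadata: table -> {column: data type}.
PUBLISHED_TABLES = {
    "temperature": {"device_id": "str", "ts": "int", "celsius": "float"},
    "pressure": {"device_id": "str", "ts": "int", "kpa": "float"},
}


def list_tables(metadata):
    """Return the table names known to the metadata layer."""
    return sorted(metadata)


def describe(metadata, table):
    """Return (column, data type) pairs for one table."""
    return list(metadata[table].items())


def build_select(metadata, table, columns):
    """Check the names against the metadata, then build a plain SQL string."""
    if table not in metadata:
        raise KeyError(f"unknown table: {table}")
    unknown = [c for c in columns if c not in metadata[table]]
    if unknown:
        raise ValueError(f"unknown columns: {unknown}")
    return f"SELECT {', '.join(columns)} FROM {table}"
```

The resulting string looks the same as a query written for a centralized database, which is the point of the unified view: the caller does not need to know which nodes hold the rows.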
Self managed processes
One value of the AnyLog approach is that data is treated in a unified way. Regardless of the type of data, the same processes apply, leading to a published schema and a MapReduce-type query process. This uniformity makes self-management feasible: as explained above, HA is automated, and so are processes such as data partitioning, data distribution, and data removal.
The ability to decentralize high-performance and self-managed computation with the ability to utilize diverse IoT devices and data has the potential to open new opportunities and applications. These applications would create richer experiences due to the performance characteristics and unification of diverse IoT devices. This approach gets us one step closer to the future of IoT where the silos between IoT applications are broken.
For more information, read the research paper: http://cidrdb.org/cidr2020/papers/p9-abadi-cidr20.pdf