Modern enterprises are realizing that their data is pure gold, but like the gold buried deep in the earth, data is not worth much if it cannot be accessed. Unlike gold, where scarcity increases value, quantity and completeness make data more valuable. Today’s challenge for data is to safely store vast quantities, to make it readily accessible and then keep track of it.
At the same time, global enterprises are dealing more and more with naturally geo-distributed datasets. This is the results of a number of trends:
- Enterprises operate multiple data centers across the globe to improve user-perceived latency and ensure resilience and regulatory compliance
- Enterprises tend to split their data processing between on-premise and cloud infrastructure
- Enterprises are starting to use multiple cloud providers to increase reliability and/or decrease costs.
Zenko was developed to help users federate the different object and file storage platforms that they use, both private and public. Querying and analyzing the geo-distributed data gathered across these platforms is a common need. Examples of such workloads include querying media assets/website content/e-commerce products to serve recommendations based on user interaction profiles or triggering the gathering, aggregation and analysis of system logs when a system-fault or performance degradation event is detected.
It is well known that indexing is critical for providing rapid responses to complex queries over large datasets. In a geo-distributed, and possibly heterogeneous platform, such as that provided by Zenko’s federated view of multiple storage systems, the ideal systemarchitecture for providing search is not necessarily clear. It has been shown that different approaches are better depending on the goals that are to be met here. As an example, consider a relatively simple case of three geographically distant sites with a federated view and a global metadata query capacity centralized on one site.
- If centralized indexes are generated, the query latencies will be the lowest possible, since all queries occur centrally without generating network traffic. This approach aggregates indexes from all sites on the site where the queries occur. However, keeping indexes up-to-date in this scenario requires cross-site data transfer, and can thus reduce the timeliness of query responses or lead to high cross-site bandwidth usage in the case of write-heavy workloads.
- Alternatively, if indexes are stored locally on each site, the queries will be slower as each query has to be broadcast to all sites, but with higher likelihood of being up to date and accurate, since indexes are updated locally, close the dataset.
These obvious differences are accompanied by more complex notions of network bandwidth consumption, local storage and compute resource constraints, as well as concerns regarding the cost of resources, and in the case of many public cloud platforms, the fact that data ingress and egress fees are very asymmetric. Additionally, there is an often-overlooked aspect of distributed indexing and query engine design: the proper placement of query engine state (indexes, caches etc.) as well as indexing and querying tasks across the geo-distributed system architecture and relative to the data and sources of queries. Recent work on similar problems has shown that there can be no single perfect query engine architecture in terms of state and computation placement; different placement policies are more beneficial depending on the service level objectives of the application as well as the characteristics of the workload. For example, queries for performing recommendations or serving advertisement have strict latency requirements but can tolerate lower accuracy (thus moving indexes close to query clients at the expense of increases staleness is preferable), while other types of workloads may be willing to trade latency for higher accuracy.
The rapidly evolving nature of modern distributed systems and usage suggests that the idealquery engine solution should be flexible and composable to provide the optimal architecturefor each use case. With these concepts in mind the Proteus framework is under development within the framework of the Lightkone EU project. A progress report was presented at PAPOC. The Proteus framework introduces the notion of query processing units (QPUs) which are composable sub-elements of a more traditional index-based query-architecture. A QPU can perform a basic indexing or query processing task — a query processing microservice. Each unit can be stateless or maintain internal state in memory, disk or a database, and it exposes an interface for providing a service and an interface for receiving events and notifications. The QPU approach fits well with modern microservices-based distributed architectures. The QPUs, which are composable elements offer the following basic functionalities:
- Iteration and update notification mechanisms over multiple data storage systems
- Query caching
- Query dispatching
- Query federation
- Stream operators (eg. filtering)
By using a stream-based micro-services abstraction, QPUs can be composed to create a complete and flexible query processingsolution that can be deployed automatically to provide a variety of query optimizations. The QPU based architecture decouples the query engine from the data storage.
To better see how indexing is optimized, an implementation example could be a QPU in which the QPU state is materialized in a B-tree index, the service interface is for performing index lookups, and the event interface allows the QPU to receive notifications for updates to the dataset that keep the index up-to-date. Different index structures can be implemented keeping the same interfaces, for example, if queries can benefit from the use of a cache, another implementation of the QPU can play the role of a cache. In this cache QPUthe internal state is a cache data structure, the service interface remains the same, and the event interface can be used to receive notifications for updates in the dataset and invalidate the affected cache entries. The basic notion is to always use the same abstraction, with different implementations. The last and most powerful property of the QPU abstraction is that one QPU can use the service interface of other QPUs in order to acquire partial results which it can then process and use to provide its service. These partial results are returned to the cache QPU through its event interface. In the current example the cache QPU uses an index QPU to respond in the case of a cache miss.
Another use of this approach if for scaling or distribution. For instance, if a dataset has grown and a single B-tree index in no longer efficient, the index can be partitioned and stored in separate QPUs. If one index partition receives more traffic that the others it can be replicated and load balanced. A dispatcher QPU receives queries thought its service interface, then issues sub-queries to QPUs holding the index partitions by querying their service interface. For each index partition a sub-query is just a normal query which can be respond to by looking up the index. The dispatcher QPU receives the sub-query responses through its event interface, combines them and returns the complete response. Its state consists of metadata about the index partitioning scheme which it uses to accurately divide a given query into sub-queries.
QPUs are decoupled from the infrastructure on which they run. A query engine composed of multiple QPUs, can be deployed in a number of ways; different placement strategies make different trade-offs and are suitable for different scenarios.
As a practical example, consider the following real-world scenario:
An international manufacturing company is designing a new product with tight deadlines and engineers and designers are working in 3 geographically distant sites. The components of the system are modified frequently throughout the work day with a large volume of data being generated and updated. The project requires close collaboration; working with stale information is very counterproductive.
Validation and Quality Control Phase:
As the process matures, changes become more and more infrequent, but the numbers of queries increase significantly as automatic systems take over the work to analyze the structural integrity, optimize the manufacturing costs and verify the conformity of the manufactured components. At this point, caching query indexes for the automatic analysis systems becomes important to speed up the effort.
Product Maintenance Phase:
Once the product is mature and in full production or at the end of its production lifetime, no further changes are made, and queries are only performed in the event of an incident with a delivered product.
The Proteus architecture allows the indexing architecture to evolve over the lifecycle of the project in order to meet the specific needs of the different phases, without requiring the project data to be migrated between sites and by gradually changing the QPU layout to be best suited for the project phases.
During the design phase all sites maintain local indexes of their rapidly changing data. Global queries can be performed by passing them through dispatch QPUs which query all distant indexes and collate the results. This method restricts the cross-site traffic to queries only and guarantees the most up to date data for all queries.
Validation and Quality Control Phase:
During the validation phase the remote sites send all index updates to the central site which maintains a global index and activates a cache to accommodate high numbers of queries. The indexes can be updated when changes are made, but this requires some cross-site update traffic. Since modifications are infrequent compared to query rates, this approach is more efficient for this project phase. The system can migrate gradually from one stage to the other by continuing to dispatch queries to remote sites until the central index is up to date. At this point the queries can be directed to the central index and the local indexes can be decommissioned.
Product Maintenance Phase:
In this final phase, the cache which was used in the previous phase can be deactivated to save resources. If higher cache rates are needed at any specific time, a cache can be reactivated. If no more updates occur, the query system can be limited to an index that is maintained on the central site, depending on the data retention requirements.
By Brad King and Dimitrios Vasilas, Scality