Compliant Query Processing in Geo-distributed Environments

Kaustubh Beedkar
The Agora Technology Blog
7 min read · Jun 18, 2021

Kaustubh Beedkar is a Senior Research Associate in the Database Systems and Information Management (DIMA) group at the Technische Universität Berlin (TU Berlin).

The goal of Geo-distributed query processing is to analyze data that spans geographical or institutional borders. Supporting Geo-distributed query processing in a unified manner is crucial for transnational companies, which house several databases and IT infrastructures across the globe.

Geo-distributed query processing enables transnational companies to manage day-to-day operations, including back-office tasks, and to make data-driven decisions. It also enables several applications, such as those pertaining to global eCommerce, federated analytics, and joint R&D initiatives among institutions.

There are several distributed query processing frameworks that allow the processing of Geo-distributed data. In general, such frameworks provide a unified query interface and transparently translate a user-specified query into a so-called query execution plan (QEP). A QEP specifies, among other things, where (i.e., which site) each plan operator will be executed. For example, a two-way join query over data sources from Asia, Europe, and North America may perform the first join in Europe and the second one in Asia.

In Geo-distributed computing environments, query execution implicitly moves data (i.e., intermediate query results) between compute sites.

Cross-border Dataflow Constraints

Distributed query processing in Geo-distributed environments has recently been challenged by cross-border dataflow constraints. Dataflow constraints arise from regulations or policies that control the movement of data across geographical (or institutional) borders. Such constraints may pertain to personal data (e.g., personally identifiable information), business data (e.g., sales and finance), or data considered “important” to a country or organization. As an example, processing data generated by autonomous cars in three different geographies, such as Europe, North America, and Asia, may face different dataflow constraints: there may be legal requirements that only aggregated data may be shipped out of Europe, no data at all may be shipped out of Asia, and only a part of the data may be shipped from North America to Europe. While research on distributed query processing has shed light on several performance aspects, it is also crucial to consider compliance aspects in Geo-distributed environments.

In particular, when generating and executing QEPs, distributed query processing frameworks must ensure that data movement complies with data regulations. We refer to this kind of distributed query processing as Compliant Geo-distributed Query Processing.

Compliant Geo-distributed Query Processing

Let us look at an example that illustrates compliant query processing. Imagine a transnational company named CarCo that manufactures cars, is headquartered in Europe, has customers in North America, and has a manufacturing unit with suppliers in Asia.

Following the first quarter, the operations team in CarCo wants to analyze its financial data by integrating it with the sales data from North America as well as with the data from its suppliers in Asia. The following query (CarCo_Query) captures such an analysis.

SELECT C.name, SUM(O.totprice), SUM(S.quantity)
FROM Customer AS C, Orders AS O, Supply AS S
WHERE C.custkey = O.custkey AND O.ordkey = S.ordkey
GROUP BY C.name
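Ignoring distribution for a moment, the query can be tried end-to-end on toy data. The sketch below uses Python's sqlite3 module on a single in-memory database; the schemas and all table contents are made up for illustration and are not CarCo's actual data:

```python
import sqlite3

# Toy single-site illustration of CarCo_Query; all values are made up.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Customer(custkey INTEGER, name TEXT);
CREATE TABLE Orders(ordkey INTEGER, custkey INTEGER, totprice REAL);
CREATE TABLE Supply(ordkey INTEGER, quantity INTEGER);
INSERT INTO Customer VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO Orders VALUES (10, 1, 100.0), (11, 1, 50.0), (12, 2, 75.0);
INSERT INTO Supply VALUES (10, 5), (11, 3), (12, 7);
""")
rows = con.execute("""
    SELECT C.name, SUM(O.totprice), SUM(S.quantity)
    FROM Customer AS C, Orders AS O, Supply AS S
    WHERE C.custkey = O.custkey AND O.ordkey = S.ordkey
    GROUP BY C.name
""").fetchall()
print(sorted(rows))  # [('Alice', 150.0, 8), ('Bob', 75.0, 7)]
```

In a Geo-distributed setting the three tables would of course live at different sites, and the joins and aggregations would be split across them according to the QEP.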

Furthermore, based on recent studies on data movement regulations, consider the following dataflow policies:
P1. The policy in North America allows customer data to be shipped outside only after suppressing the customers’ account balance information.
P2. The policy in Europe allows only aggregated orders data to be shipped to Asia, and it does not allow order prices to be disclosed to North America.
P3. Lastly, the policy in Asia allows only aggregated supply data (order quantities and extended prices) to be shipped to Europe.

Now, consider the two QEPs shown below. Here, the SHIP operator describes the point where intermediate results are communicated between two sites.

Query Execution Plan Examples for CarCo_Query

The plan on the left is non-compliant. Its SHIP operators violate dataflow policy P1 in North America, as executing this plan would ship the Customer table without suppressing the account balance. The plan also violates policy P2 in Europe, as it ships Orders information to Asia without aggregating it. In contrast, the QEP on the right is compliant: it performs both join operations in Europe and appropriately masks data before shipping the Customer and Supply data to Europe. Observe that masking via projection suppresses the customers’ account balance information, and masking via aggregation suppresses the individual supply quantities, as required by the policies.
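The two masking operations of the compliant plan can be mimicked in a few lines. The sketch below is a hypothetical, single-process illustration with made-up field names and values: projection drops the account balance before Customer data leaves North America, and aggregation collapses individual Supply rows before they leave Asia:

```python
# North America: mask via projection, i.e., drop acctbal before SHIP.
customers = [
    {"custkey": 1, "name": "Alice", "acctbal": 1200.0},
    {"custkey": 2, "name": "Bob", "acctbal": 300.0},
]
shipped_customers = [
    {k: v for k, v in row.items() if k != "acctbal"} for row in customers
]

# Asia: mask via aggregation, i.e., ship only per-order quantity sums.
supply = [
    {"ordkey": 10, "quantity": 5},
    {"ordkey": 10, "quantity": 3},
    {"ordkey": 12, "quantity": 7},
]
totals = {}
for row in supply:
    totals[row["ordkey"]] = totals.get(row["ordkey"], 0) + row["quantity"]
shipped_supply = [{"ordkey": k, "sum_quantity": v} for k, v in totals.items()]

print(shipped_customers)  # no acctbal field remains
print(shipped_supply)     # one aggregated row per order
```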

Including compliance aspects in distributed query processing frameworks as first-class citizens entails two research challenges:

  1. Policy specification: We need an easy (thus declarative) way to specify dataflow constraints. Doing so is not trivial, as constraints pertain to different types of data as well as to how the data is processed. For example, data movement constraints may apply to an entire dataset, parts of it, or even to information derived from it. All of this may vary depending on where the data is being transferred.
  2. Compliance-based query optimization: We need efficient ways to process queries in a manner that is compliant with dataflow regulations. In contrast to cost-based query optimization techniques, which focus solely on performance aspects, we must also account for compliance.

A Compliant Geo-distributed Query Processing Framework

In our SIGMOD paper, we presented our initial work on compliant Geo-distributed query processing. We focused on relational data and SQL queries and proposed a framework for compliant query processing in Geo-distributed environments.

Overview of Compliant Query Processing Framework

Policy Specification: We have developed a policy expression language, which allows users (e.g., a data officer) to express their data movement constraints in a simple and effective way. Policy expressions are SQL-like statements that specify what data can be shipped to which other locations. For example, consider the customer information above, which is stored in a database in North America. The expression “ship name, mktseg, region from Customer C to Europe where mktseg='commercial'” specifies which customer information (i.e., which rows and columns of the Customer table) can be shipped from North America to Europe. We refer to such expressions as basic expressions. Our basic expressions can support a wide range of dataflow constraints. To support policies that allow movement of aggregated information, users can use aggregate expressions. For example, the aggregate expression “ship acctbal as aggregates sum, avg from Customer C to * group by mktseg, region” allows shipping of account balance information only when it is aggregated.
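As an illustration, a basic expression could be represented in memory roughly as follows. This is a hypothetical sketch, not the paper's actual implementation: the `BasicExpression` class and the `ship_allowed` check are assumptions, and the row-level predicate is stored but, for brevity, not evaluated here:

```python
from dataclasses import dataclass

@dataclass
class BasicExpression:
    # Hypothetical in-memory form of a parsed basic expression.
    columns: set          # columns allowed to be shipped
    source: str           # source relation
    destination: str      # target location ("*" means anywhere)
    predicate: str = ""   # row-level restriction (not evaluated here)

# "ship name, mktseg, region from Customer C to Europe
#  where mktseg='commercial'"
policy = BasicExpression(
    columns={"name", "mktseg", "region"},
    source="Customer",
    destination="Europe",
    predicate="mktseg = 'commercial'",
)

def ship_allowed(expr, relation, columns, destination):
    """Check whether shipping the given columns of `relation` to
    `destination` is covered by this basic expression."""
    return (
        relation == expr.source
        and expr.destination in ("*", destination)
        and set(columns) <= expr.columns
    )

print(ship_allowed(policy, "Customer", ["name", "region"], "Europe"))  # True
print(ship_allowed(policy, "Customer", ["acctbal"], "Europe"))         # False
```

In the actual framework the row-level predicate would additionally be applied as a filter (or checked for containment) before any rows leave the source site.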

Our policy expressions can specify a wide range of policies, which can be adhered to by masking via relational operations and without affecting query semantics.

Our framework stores these expressions in a so-called policy catalog. To generate compliant QEPs during query processing, we have developed a compliance-based query optimizer that considers these policies along with various performance aspects (e.g., inter-site bandwidth). When enumerating alternative execution plans, it uses a lightweight validation mechanism, which we refer to as the policy evaluator, to check whether a QEP is compliant.

Compliance-based Query Optimization: We follow a two-step optimization process. In the first step, which we refer to as plan annotation, we generate an execution plan where each plan operator is annotated with a set of sites that denote where the operator can legally be executed. We use the Volcano optimizer generator to generate the plan annotator. During this annotation step, the optimizer enumerates the search space of plans in a top-down fashion and derives operator annotations bottom-up. The key idea here is to treat Geo-locations as “interesting properties” that describe (i) where an operator can legally be executed and (ii) where its output can legally be shipped. During enumeration, the optimizer derives these properties via a set of annotation rules and the policy evaluator. These properties also allow us to define a compliance-based optimization goal, along with other physical properties desired by the user query.

Once the plan annotator finds an annotated plan, we move to the second step, in which we finalize the location of each plan operator using a dynamic-programming-based approach.
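The bottom-up derivation of legal sites can be sketched in a much-simplified form. Everything below is illustrative, not the paper's actual annotation rules: the boolean `ship_ok` map stands in for the policy evaluator, and the rule "an operator may run at a site if every child's data may legally reach that site" is a crude stand-in for the real rule set:

```python
ALL_SITES = {"Europe", "NorthAmerica", "Asia"}

# Crude stand-in for the policy evaluator: may (suitably masked) data
# move from src to dst? Unlisted pairs are treated as disallowed.
ship_ok = {
    ("NorthAmerica", "Europe"): True,  # after projecting out acctbal (P1)
    ("Asia", "Europe"): True,          # after aggregating quantities (P3)
}

def legal_sites(child_annotations):
    """Derive an operator's annotation from its children's: each child's
    data must already be at the site or be allowed to ship there."""
    sites = set()
    for site in ALL_SITES:
        ok = all(
            site in child
            or any(ship_ok.get((src, site), False) for src in child)
            for child in child_annotations
        )
        if ok:
            sites.add(site)
    return sites

# Leaves carry their home sites; a join over Customer (North America)
# and Supply (Asia) can then only run in Europe:
print(legal_sites([{"NorthAmerica"}, {"Asia"}]))  # {'Europe'}
```

This mirrors the compliant plan above, where both joins end up in Europe; the real optimizer derives such annotations while enumerating plans top-down with Volcano.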

So how effective and efficient is the framework?

We experimentally evaluated our framework using a Geo-distributed adaptation of the well-known TPC-H benchmark. We considered several ad-hoc queries as well as standard TPC-H queries. We also implemented a policy expression generator, which allows generating random policies of different types.

In an experiment, where we considered ad-hoc queries and different types of expressions, we found that our approach always produced a compliant plan. In contrast, a traditional cost-based approach led to non-compliant plans in almost 50% of the cases.

Our framework also incurs a low optimization overhead. In comparison to a traditional cost-based optimizer, the additional optimization overhead for most queries was always less than a few hundred milliseconds. Our results also indicate good scalability with respect to the number of policy expressions: the optimization time increases with the number of policy expressions, but the increase is linearly proportional to the number of expressions that actually affect a query’s search space.

Overall, our approach is highly effective in producing compliant QEPs and incurs an acceptable optimization overhead.

Outlook

Nowadays, more and more applications require executing analytics over data spread across multiple Geo-distributed sites. Supporting compliance with the different data movement policies to which these sites may be subject is crucial. Our research investigates how to make Geo-distributed query processing compliant with dataflow policies. Our initial work towards a compliant query processing framework focuses on relational data and SQL queries. The next steps of the project include supporting complex data masking operations and general-purpose dataflow programs.

Reference

Kaustubh Beedkar, Jorge-Arnulfo Quiané-Ruiz, Volker Markl. Compliant Geo-distributed Query Processing. In ACM SIGMOD International Conference on Management of Data (SIGMOD), 2021.
