Module Chaining and Row Level Enforcement

Sima Nadler
Published in fybrik · Jun 28, 2022

Co-authored by Doron Chen and Sima Nadler

This article explores complex data paths in Fybrik. Fybrik is a Kubernetes-based control plane that makes it easier to use and manage data by automating governance and other non-functional aspects of data usage.

We highlight several capabilities supported by Fybrik:

  • Fybrik module chaining — passing data between FybrikModules without writing to storage
  • Implementation of data transformations using the Pandas dataframe library
  • Fybrik support for row-level enforcement of policies, i.e., filtering out specific rows of data based on data governance rules

1. Module Chaining

In the previous blog we explained how we leveraged Airbyte connectors for a new Fybrik module. This Airbyte module gives Fybrik applications access to a vast variety of data sources. However, it does not support any data transformations. Here, we explain module chaining and how it enables more complex data paths.

Module chaining occurs when data flows through at least two FybrikModules before reaching a workload. Sometimes one module is not enough, such as when no FybrikModule can simultaneously give you access to a certain data source and provide you with the transformations required by the governance policy.

The Fybrik manager code was designed to support module chaining. We demonstrated this scenario on a Fybrik cluster where two FybrikModules were installed: the Airbyte module and the arrow-flight-module. The Airbyte module provides access to multiple data sources, including Google Sheets, while the arrow-flight-module provides multiple data transformations. Our application required access to both Google Sheets (provided by the Airbyte module) and a data transformation (provided by the arrow-flight-module).

Building a Data Path

Let’s go behind the scenes and look at how the Fybrik manager constructs a data path for our application. How does it know that it can chain the Airbyte module and the arrow-flight-module? We explain this in detail below, but the short answer is that the arrow-flight-module can read from services that support the fybrik-arrow-flight interface, and the Airbyte module is such a service.

The Airbyte module definition lists its “read” capabilities, which tell us that it can respond either to arrow-flight requests or to REST requests (on separate ports).

Excerpt from the Airbyte module definition:

capabilities:
  - capability: read
    scope: workload
    api:
      connection:
        name: fybrik-arrow-flight
        fybrik-arrow-flight:
          hostname: "{{.Release.Name}}.{{.Release.Namespace}}"
          port: "80"
          scheme: grpc
    supportedInterfaces:
      - source:
          protocol: google-sheets
          dataformat: csv

The arrow-flight-module definition lists the interfaces that the arrow-flight-module supports, which include arrow-flight. Therefore, the Fybrik manager concludes that these two modules can be chained.

Excerpt from the arrow-flight-module definition:

supportedInterfaces:
  - source:
      protocol: s3
      dataformat: parquet
  - source:
      protocol: s3
      dataformat: csv
  - source:
      protocol: fybrik-arrow-flight
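To make the matching concrete, here is a simplified sketch of the rule at work (our illustration, not the actual manager code): a module whose exposed API connection type appears among another module's supported source interfaces can feed that module.

def can_chain(upstream_api: str, downstream_sources: list) -> bool:
    # A module can feed another if its exposed API connection type is
    # listed among the downstream module's supported source protocols
    return upstream_api in downstream_sources

# The Airbyte module exposes fybrik-arrow-flight (first excerpt), and the
# arrow-flight-module lists fybrik-arrow-flight as a supported source
assert can_chain("fybrik-arrow-flight", ["s3", "fybrik-arrow-flight"])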

Our FybrikApplication indicates that the application requires a google-sheets dataset, to be consumed via the fybrik-arrow-flight interface.

apiVersion: app.fybrik.io/v1alpha1
kind: FybrikApplication
metadata:
  name: my-app
  labels:
    app: my-app
spec:
  selector:
    workloadSelector:
      matchLabels:
        app: my-app
  appInfo:
    intent: Fraud Detection
  data:
    - dataSetID: "fybrik-airbyte-sample/google-sheets-asset"
      requirements:
        interface:
          protocol: fybrik-arrow-flight

The Fybrik manager also ‘knows’ that the governance policy, defined as an OPA policy, requires a data transformation operation that the arrow-flight-module provides (the AgeFilterAction transformation).

OPA policy (Rego file):

package dataapi.authz

rule[{"action": {"name": "AgeFilterAction", "columns": column_names, "options": {"age": 20}}, "policy": description}] {
  description := "filter out rows based on age"
  input.action.actionType == "read"
  input.resource.metadata.tags.finance
  column_names := [input.resource.metadata.columns[i].name | input.resource.metadata.columns[i].tags.PII]
  count(column_names) > 0
}
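To illustrate when this rule fires, here is a hypothetical input document (the column names are made up for illustration) and the action the rule would emit, expressed as Python dictionaries:

opa_input = {
    "action": {"actionType": "read"},
    "resource": {
        "metadata": {
            "tags": {"finance": True},
            "columns": [
                {"name": "date_of_birth", "tags": {"PII": True}},
                {"name": "postal_code"},
            ],
        }
    },
}

# For this input, the rule above emits (schematically):
emitted_action = {
    "name": "AgeFilterAction",
    "columns": ["date_of_birth"],
    "options": {"age": 20},
}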

Because the Airbyte module supports the google-sheets protocol, and the two modules can be chained, the Fybrik manager constructs the following data path:

[Figure: the resulting data path — the Airbyte module reads the Google Sheets dataset and passes it to the arrow-flight-module, which applies the transformation and serves the application]

Whether it consists of a single module or multiple modules, the data path constructed by the Fybrik manager is transparent to the application. The application only knows the virtual endpoint returned by the manager, which, in this case, is that of the arrow-flight-module.
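For illustration, a workload might read from the virtual endpoint with the PyArrow Flight client along these lines (the endpoint hostname below is a placeholder; the request format follows the arrow-flight-module samples):

import json
import pyarrow.flight as fl

# The endpoint hostname comes from the FybrikApplication status; placeholder here
client = fl.connect("grpc://my-virtual-endpoint:80")

# Request the dataset by its Fybrik asset ID
request = {"asset": "fybrik-airbyte-sample/google-sheets-asset"}
info = client.get_flight_info(fl.FlightDescriptor.for_command(json.dumps(request)))

# Stream the (already policy-filtered) data into a Pandas dataframe
df = client.do_get(info.endpoints[0].ticket).read_pandas()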

An explanation of how to implement a use case involving module chaining can be found here.

In conclusion, module chaining enables the Fybrik manager to mix and match FybrikModules when constructing data paths.

2. Transformations and Row Level Enforcement

Until recently, Fybrik read modules demonstrated only column-level transformations, which redacted or removed columns containing personally identifiable information (PII) or matching other policy-defined criteria. Below we explain how we added new transformations, specifically transformations that filter out certain rows based on some criterion.

After writing the Airbyte module, which leveraged existing tools (Airbyte connectors) to enhance the Fybrik read capabilities, we wanted to do the same for data transformations. After exploring different options, we settled on using the Pandas dataframe library for data transformations. Pandas was a particularly attractive option, as it did not require us to write a new FybrikModule. Because Pandas is already integrated with the Python Arrow library, all we had to do was slightly modify the existing arrow-flight-module.

We defined a new type of transformation called PandasAction, from which all Pandas transformations inherit. PandasAction itself inherits from Action, the arrow-flight-module’s base class for callable actions that transform record batches. Pandas transformations use PyArrow functions to convert Arrow record batches to Pandas dataframes and back. To implement a new transformation, one simply implements a __dftransform__ method that accepts a dataframe and returns a transformed dataframe.
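Schematically, the bridge looks roughly like this (a sketch under our simplifying assumptions, not the module’s exact code):

import pyarrow as pa

class Action:  # stand-in for the arrow-flight-module's Action base class
    pass

class PandasAction(Action):
    def __call__(self, records: pa.RecordBatch) -> pa.RecordBatch:
        # Convert the Arrow record batch to a Pandas dataframe, let the
        # subclass transform it, and convert the result back to Arrow
        df = records.to_pandas()
        return pa.RecordBatch.from_pandas(self.__dftransform__(df),
                                          preserve_index=False)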

Example of a __dftransform__ method for the FilterAction transformation:

def __dftransform__(self, df: pd.DataFrame) -> pd.DataFrame:
    # Apply the policy-supplied Pandas query string, if any
    if self.query:
        return df.query(self.query)
    return df
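The filtering itself relies on the Pandas query syntax; for example, a query string of "age >= 20" keeps only the rows satisfying that expression:

import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob"], "age": [17, 34]})
print(df.query("age >= 20"))  # keeps only Bob's row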

One of the transformations we implemented was the AgeFilter transformation, which examines the date-of-birth value in each dataset entry and filters out all rows whose computed age falls below a certain threshold. We created a dataset filled with fake demographic data, and used the AgeFilter transformation to remove all entries associated with minors.
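A minimal sketch of such an age filter, assuming a hypothetical date_of_birth column and a threshold stored in self.age (the real module’s field names may differ):

import pandas as pd

class AgeFilterAction(PandasAction):
    age = 18  # threshold in years; supplied by the governance policy in practice

    def __dftransform__(self, df: pd.DataFrame) -> pd.DataFrame:
        # Keep only rows born on or before the cutoff date, i.e. rows whose
        # computed age is at least self.age ("date_of_birth" is an assumption)
        cutoff = pd.Timestamp.today() - pd.DateOffset(years=self.age)
        return df[pd.to_datetime(df["date_of_birth"]) <= cutoff]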

Summary

This blog highlights Fybrik’s ability to support a wide variety of data paths, including data paths that can only be constructed by chaining multiple FybrikModules.

Please feel free to download Fybrik and try it out! What other types of modules would be of interest? Contact us if you have questions or new ideas.
