Governing the writing of data with Fybrik

Revital Eres
Published in fybrik
Nov 22, 2022

Co-authored with Mohammad Nassar

When it comes to writing data, organizations are faced with several challenges:

  • Choosing the best data store in terms of cost, latency/distance from data source, storage compliance, and more.
  • Enforcing governance and business rules when writing data, such as prohibiting writing to non-approved external stores.

Today, organizations typically address these challenges manually. The end user asks the IT specialist, who manages the organization’s resources, to choose the store and allocate space for future data. Once that is done, the IT specialist provides the user with the relevant storage details (e.g., endpoint and write credentials).

The governance officer, who is responsible for enforcing privacy and business rules, then needs to verify that the written data is compliant with the organization’s governance rules. The officer does so by guiding the IT people regarding the approved storage and responding to every write request individually; there is no automatic way to enforce these rules.

In this blog, we show how Fybrik can solve these challenges automatically, freeing organizations from handling them manually on every write request.

What is Fybrik and how does it work?

Fybrik is an open-source project that simplifies the use of data for applications, business analysts, and data scientists using a policy-based approach. It does so by using a control plane that automates and orchestrates data governance and infrastructure optimization.

Fybrik can automatically enforce data governance rules. This ability was described in a previous post about ING’s experience with Fybrik which demonstrated reading data.

In addition, Fybrik can choose a data store approved by enterprise policies and governance rules. It can also choose the storage and allocate space, taking cost/performance optimization constraints into account. Finally, Fybrik manages authentication to the selected storage account without exposing credentials to the end user.

In this post, we demonstrate the above capabilities, using the following two scenarios:

  • Fybrik enforces rules that enterprise data should not be written to Google Sheets.
  • Fybrik chooses an S3 LocalStack as the storage for writing and allocates a bucket in the storage.

The actual writing in these scenarios is done by a FybrikModule, a Fybrik service deployed by the Fybrik orchestrator that can be included in the data planes. The FybrikModule resource describes the service’s capabilities and how to deploy it.

In these scenarios, two Fybrik modules with write capabilities are used:

The Fybrik arrow-flight module is a server that supports uploading data via the arrow-flight protocol and writing it to an S3-compatible destination (i.e., one offered by any of several S3 vendors, such as LocalStack, an AWS bucket, etc.).

Similarly, the airbyte module is an arrow-flight server that leverages Airbyte connectors to write data to multiple destination types, such as Google Sheets.

Use case scenarios

In the organization connected with our scenarios, Serena is a governance officer who ensures that the data in the data lake is used appropriately. Eva is a data scientist who is building a fraud-detection machine learning model to identify customers whose behavior may be fraudulent. Tim is the IT specialist who is responsible for the organization’s infrastructure.

Before running Fybrik, each stakeholder can define his or her needs in advance:

  • Serena can define governance rules via a policy engine tool such as OPA.
  • Tim can deploy the relevant Fybrik custom resources, such as FybrikModules and FybrikStorageAccounts. The FybrikStorageAccount resources declare the data stores available in the organization. In addition, Tim can define rules in the policy engine for leveraging the infrastructure; Fybrik takes these rules into consideration when choosing the storage.
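For illustration, a FybrikStorageAccount that Tim might apply to declare a LocalStack S3 store could look roughly like the following. This is a sketch: the API version, resource names, secret name, endpoint, and region are assumptions based on the Fybrik samples, so consult the Fybrik documentation for the exact schema.

```yaml
# Hypothetical FybrikStorageAccount declaring a LocalStack S3 endpoint.
# All names, the region, and the API version are illustrative assumptions.
apiVersion: app.fybrik.io/v1beta1
kind: FybrikStorageAccount
metadata:
  name: localstack-account
  namespace: fybrik-system
spec:
  id: localstack-account
  secretRef: localstack-credentials   # Kubernetes secret holding the S3 access keys
  endpoint: "http://localstack.fybrik-system.svc.cluster.local:4566"
  regions:
    - theshire
```

The secret referenced by secretRef keeps the credentials with IT, so end users like Eva never see them.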

Scenario 1: Fybrik enforces policies

Eva wants to save the analytic results from her Jupyter Notebook to her personal Google Sheets page using Fybrik. (Although Eva could write to her Google Sheets page directly, for the purpose of this post we assume she wants to do so through Fybrik.)

Fybrik supports the writing of data onto Google Sheets through the airbyte-module. Therefore, if Eva creates a data asset for writing that contains a tag indicating the connection type as ‘Google Sheets’, then she can save her analytic results to her Google Sheets page.

However, if Serena has defined governance rules with a list of approved external data stores (as in the Rego file below), and if `Google Sheets` is not on the list, then Fybrik will automatically block the request to construct a data plane:

OPA Policy (Rego file)

# Rules to allow writing to approved external stores
package dataapi.authz

# The list of approved external stores:
approved_external_store := {"cos", "aws", "localstack"}

is_approved_external_store {
    # A store is approved if it appears in the `approved_external_store` set.
    approved_external_store[lower(input.resource.metadata.tags.connection_type)]
}

# Allow writing to approved external stores
rule[{}] {
    input.action.actionType == "write"
    is_approved_external_store
}
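To make the policy’s effect concrete, the same check can be sketched in plain Python. This is only a simulation of the Rego logic above, not part of Fybrik itself; the request dictionaries mirror the `input` document the policy receives.

```python
# Simulation of the Rego policy: a write request is allowed only if its
# connection_type tag matches an approved external store (case-insensitive).
APPROVED_EXTERNAL_STORES = {"cos", "aws", "localstack"}

def is_write_allowed(request: dict) -> bool:
    # The rule only fires for write actions; everything else is denied by default.
    if request["action"]["actionType"] != "write":
        return False
    connection_type = request["resource"]["metadata"]["tags"]["connection_type"]
    return connection_type.lower() in APPROVED_EXTERNAL_STORES

# Eva's request to write to Google Sheets is blocked...
sheets_request = {
    "action": {"actionType": "write"},
    "resource": {"metadata": {"tags": {"connection_type": "Google Sheets"}}},
}
print(is_write_allowed(sheets_request))  # False: "google sheets" is not approved

# ...while a request targeting LocalStack is allowed.
localstack_request = {
    "action": {"actionType": "write"},
    "resource": {"metadata": {"tags": {"connection_type": "LocalStack"}}},
}
print(is_write_allowed(localstack_request))  # True
```

Because Rego rules deny by default, a request whose store is not on the list simply produces no allow decision, and Fybrik refuses to construct the data plane.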

Scenario 2: Writing new data

After consulting with Tim, the IT specialist, Eva decides to re-apply Fybrik, but this time she lets Fybrik choose and allocate the data store. By specifying isNewDataSet: true in the Fybrik application, the Fybrik orchestrator will do the following tasks automatically:

  • Choose the storage for the allocation based on the Fybrik Storage accounts applied by Tim that comply with Serena’s governance policies.
  • Allocate the actual storage.
  • Deploy an arrow-flight-module service that is configured with the details of the allocated storage.
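The FybrikApplication Eva applies for this flow might look roughly like the sketch below. The dataSetID, labels, intent, and API version are illustrative assumptions, and the exact placement of the isNewDataSet field may differ across Fybrik releases; see the Fybrik documentation for the authoritative schema.

```yaml
# Hypothetical FybrikApplication requesting storage for a new dataset.
# Names, labels, and the API version are illustrative assumptions.
apiVersion: app.fybrik.io/v1beta1
kind: FybrikApplication
metadata:
  name: eva-notebook
spec:
  selector:
    workloadSelector:
      matchLabels:
        app: eva-notebook
  appInfo:
    intent: Fraud Detection
  data:
    - dataSetID: "fybrik-notebook-sample/new-results"
      flow: write
      requirements:
        flowParams:
          isNewDataSet: true      # let Fybrik choose and allocate the storage
        interface:
          protocol: fybrik-arrow-flight
```

Note that the spec contains no endpoint, bucket, or credentials; those are resolved by the orchestrator from the FybrikStorageAccounts and surfaced only as a service endpoint in the application’s status.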

In addition, Fybrik can register the asset in the data catalog. Alternatively, modern data catalogs (such as Open Metadata) can discover and register new data automatically.

In summary, Eva can apply Fybrik without providing any details about the desired storage and without any knowledge of the dataset credentials. Once Fybrik’s status is ready, it contains the arrow-flight service endpoint, allowing Eva to write her dataset by accessing the deployed service.

For more details about this scenario, see the Fybrik quick-start notebook sample where it’s described in more depth.

To summarize, Fybrik’s main contribution to writing data is its capability to choose and allocate the data store, enforce compliance with enterprise policies and governance rules, and hide credentials from end users.

We welcome your feedback and contributions, and would love to hear from you through the discussions and issues on the project’s GitHub. Please feel free to install Fybrik and try out the Fybrik write flow on your own, and let us know what you think!

Thanks to Shlomit Koyfman
