Fybrik Modules: How to Leverage External Projects

Published in fybrik · Apr 12, 2022

Co-Authored by Doron Chen and Sima Nadler

Introduction

Anyone interested in using Fybrik to protect the data in their organization will find this article useful. We show the flexibility of Fybrik and how assets developed for other projects can be used in data planes generated by Fybrik. Read on for a demonstration of how to enrich the options available for handling the non-functional aspects of using data.

Specifically, we will show how we leverage Airbyte connectors as Fybrik modules to access data from a wide variety of data source types.

What is Fybrik?

Fybrik simplifies the use of data by data scientists, business analysts, and applications by using a policy-based approach to realize the non-functional requirements. It does this with a control plane that automates and orchestrates the data governance and infrastructure optimization steps that are handled manually today. The Fybrik introduction provides more details, and you can also read the case study about ING’s experience with Fybrik.

A Fybrik Module is a service that can be included in data planes. It describes the capabilities provided by the service and how to deploy or configure it.

What we’re trying to do

Getting and using data is a complex and often manual process. It often requires input from a data governance officer regarding regulatory and enterprise policy issues, alongside help from an IT administrator to prepare and provide the data. Fybrik orchestrates and automates this process. It constructs the data plane for a dataset based on: the workload context, data governance policies, IT config policies, and the modules available for use. The richer the library of modules, the more value Fybrik provides.

Developing these capabilities from scratch is time consuming and labor intensive. So, we wanted to see if we could leverage existing open source capabilities and use them as Fybrik modules. In this blog we describe how to leverage an Airbyte connector as a FybrikModule, illustrated in the following diagram.

Fybrik Architecture Showing Use of Airbyte Module

Why We Started with Airbyte

The key capabilities of FybrikModules are the reading and/or writing of data, as well as data transformations. For this reason, we examined both closed- and open-source projects that contained a rich set of connectivity to a range of data sources. One source for these kinds of connectors is integration projects such as Apache Camel, Singer, and Airbyte.

We decided that Airbyte would be a good starting point because it is open-source, has a large number of connectors, has a simple interface, and running Airbyte connectors requires no prerequisites other than Docker.

What is Airbyte?

Airbyte is a data integration tool that focuses on extracting and loading data. It is a relatively new open-source project with significant adoption.

The Airbyte project is conceptually composed of two parts: a platform and connectors. The platform provides all the services required to configure and run data movement operations, such as the UI, job scheduling, logging, and alerting. The connectors push/pull data to/from sources and destinations. Our focus is on the connectors.

Airbyte has a vast catalog of connectors that support hundreds of data sources and data destinations. These Airbyte connectors run in Docker containers and are built in accordance with the Airbyte specification.

Creating a FybrikModule

As it turns out, it is not all that difficult to write a FybrikModule. One can start from an existing FybrikModule, such as the arrow-flight-module.

For our Airbyte ‘read’ module, we do not need the Airbyte platform; we only use Airbyte connectors for data sources. As a result, all we need is Docker.

Following are the steps to create a ‘read’ FybrikModule:

  1. Write a server that accepts a configuration and responds to client requests for datasets (a minimal sketch follows this list).
  2. Package the server as a Helm chart.
  3. Write the FybrikModule yaml, which points to the newly created Helm chart.
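
To make step 1 concrete, here is a minimal sketch of such a server. Note that this is an illustration, not the module’s actual code: we use a plain-HTTP endpoint and a hard-coded configuration for brevity, and the run_connector_read helper is hypothetical (one possible Docker-based implementation is sketched in the next section).

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical configuration: which Airbyte connector image to run, the
# connection details it needs, and which streams to read. In Fybrik this
# arrives via the module's configuration rather than being hard-coded.
CONFIG = {
    "connector": "airbyte/source-postgres",
    "connection": {"host": "db.example.com", "port": 5432, "database": "demo"},
    "catalog": {"streams": []},
}

def run_connector_read(image, config, catalog):
    # Placeholder: see the Docker-based sketch in the next section.
    return []

class ReadHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Fetch the dataset via the configured Airbyte connector and
        # return the records as newline-delimited JSON.
        records = run_connector_read(
            CONFIG["connector"], CONFIG["connection"], CONFIG["catalog"])
        body = "\n".join(json.dumps(r) for r in records).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), ReadHandler).serve_forever()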

In working on step 1, we originally wrote the server to give us access to datasets from Google Sheets. But as we explain later, at a certain point we realized that our code, without modification, supports multiple data sources.

Because the service we are running needs to run Docker within a k8s pod, we based our deployment on a dind (Docker in Docker) container image. We wrote our server code in Python, using the Python Docker package.
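
The following sketch shows one way the run_connector_read helper from the previous sketch could be implemented with the Python Docker package. This is a simplification of what our module does, and the mount paths are illustrative. Per the Airbyte specification, a source connector’s read command takes a config file and a catalog file and streams JSON messages on stdout.

import docker
import json
import os
import tempfile

def run_connector_read(image, config, catalog):
    client = docker.from_env()
    # Write the connector's config and catalog to a directory that is
    # mounted into the connector container, per the Airbyte specification.
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "config.json"), "w") as f:
            json.dump(config, f)
        with open(os.path.join(tmp, "catalog.json"), "w") as f:
            json.dump(catalog, f)
        # Airbyte connectors implement spec/check/discover/read commands;
        # 'read' writes a stream of JSON messages to stdout, which
        # containers.run() returns as the container logs.
        output = client.containers.run(
            image,
            "read --config /cfg/config.json --catalog /cfg/catalog.json",
            volumes={tmp: {"bind": "/cfg", "mode": "ro"}},
            remove=True,
        )
    # Keep only the RECORD messages, which carry the actual data rows.
    records = []
    for line in output.decode().splitlines():
        msg = json.loads(line)
        if msg.get("type") == "RECORD":
            records.append(msg["record"]["data"])
    return records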

The Fybrik taxonomy provides a mechanism for all the components to interact using a common dialect. We customized the Fybrik taxonomy, based on the instructions in the Fybrik documentation. To better explain this, consider the following snippet from the yaml file defining our new FybrikModule:

capabilities:
  - capability: read
    scope: workload
    api:
      connection:
        name: http
        http:
          hostname: "{{ .Release.Name }}.{{ .Release.Namespace }}"
          port: "79"
          scheme: grpc
    supportedInterfaces:
      - source:
          protocol: google-sheets
          dataformat: csv

Neither the ‘http’ connection type nor the ‘google-sheets’ interface protocol was previously defined in the default Fybrik taxonomy. Therefore, we had to create a customized taxonomy file incorporating the new connection and interface types before running the helm command to install Fybrik.
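
For illustration only, a taxonomy layer adding such values might look roughly like the following. The exact layer format is described in the Fybrik taxonomy documentation, so treat this as a conceptual sketch rather than a working file.

# Conceptual sketch of a taxonomy layer (not the exact Fybrik schema):
# it extends the sets of allowed connection and interface values.
definitions:
  Connection:
    properties:
      name:
        enum:
          - http          # new connection type
  Interface:
    properties:
      protocol:
        enum:
          - google-sheets # new interface protocol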

Why our module works for multiple data sources

As mentioned earlier, midway through writing the Airbyte module for Google Sheets sources, we realized that the module could work as-is for other data sources. One reason is that the module does not need to obtain the connectors in advance: whenever Docker attempts to run a container from a container image, it checks whether the image is present locally and, if not, fetches it.

The other reason our code works for multiple data sources is the Airbyte interface: although the connection information required to connect to different data sources varies dramatically, the responses sent back from the connectors all have the same format.
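
For example, regardless of which source a connector reads from, each data row arrives as an Airbyte RECORD message along these lines (simplified):

{"type": "RECORD",
 "record": {"stream": "users",
            "data": {"id": 1, "name": "Alice"},
            "emitted_at": 1649764800000}}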

In other words, the two things that distinguish one dataset from another are: the name of the Airbyte connector needed to fetch it, and the connection information to the source. Both of these are part of the configuration and not part of the code.
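
As an illustration, the configurations for a Google Sheets dataset and a PostgreSQL dataset might differ only in the connector image and the connection fields (the field names here are indicative of the respective connectors’ specs, not verbatim):

# Google Sheets dataset (illustrative)
connector: airbyte/source-google-sheets
connection:
  spreadsheet_id: "<sheet-id>"
  credentials: { ... }

# PostgreSQL dataset (illustrative)
connector: airbyte/source-postgres
connection:
  host: db.example.com
  port: 5432
  database: demo
  username: reader
  password: "<secret>"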

We verified that our code works not only for Google Sheets, but also for reading from PostgreSQL and MySQL, and for retrieving US census data.

Limitations

One limitation of our code is performance. There is significant overhead involved in running Docker containers. This is especially true the first time a connector runs, when Docker must fetch the connector Docker image. Another limitation is that our code reads entire datasets — it does not currently support incremental retrieval of datasets.

Summary

Our experiments with Airbyte proved to be a successful example of how anyone can easily leverage external projects as Fybrik modules to read and write from many varied data stores. By adding the Airbyte module, we gave data planes generated by the Fybrik controller the ability to read data via the vast library of Airbyte connectors. Similarly, one could add the ability to write using Airbyte connectors.

To replicate our experiment, please see the instructions in our module’s GitHub repository.
