Create secure and compliant Kubeflow pipelines with Fybrik

Sima Nadler
Published in fybrik
Sep 28, 2022 · 3 min read

Are you a data scientist looking for an easy way to use data? Does the thought of dealing with data store endpoints, credentials, data governance regulations, enterprise data policies, and data security give you a headache? If so, Kubeflow Pipelines combined with Fybrik is just for you!

Kubeflow Pipelines enables data scientists and other developers to orchestrate tasks with a workflow engine while continuing to work in Python, thanks to the Kubeflow Pipelines SDK. So if you are used to manually loading data, checking its quality, then training your model, testing it, and storing the results, life can be simplified. You define a pipeline once to automate the process, then reuse it whenever you want, on whatever data you want.
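To make this concrete, here is a minimal sketch using the Kubeflow Pipelines v1 SDK. The component bodies and data source are placeholders standing in for your own load and train logic.

```python
# Minimal Kubeflow Pipelines (v1 SDK) sketch; component bodies are placeholders.
import kfp
from kfp import dsl
from kfp.components import create_component_from_func

def load_data(source: str) -> str:
    # Placeholder: fetch the data set and return a reference to it.
    return source

def train_model(data: str) -> str:
    # Placeholder: train on the data and return a reference to the model.
    return "model-ref"

load_data_op = create_component_from_func(load_data)
train_model_op = create_component_from_func(train_model)

@dsl.pipeline(name="housing-prices", description="Load data, then train a model.")
def housing_pipeline(source: str = "s3://my-bucket/train.csv"):
    data_task = load_data_op(source)
    train_model_op(data_task.output)

if __name__ == "__main__":
    # Compile once, then upload and rerun the pipeline on whatever data you want.
    kfp.compiler.Compiler().compile(housing_pipeline, "housing_pipeline.yaml")
```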

Cool! However, the fact that you can automate the flow with Kubeflow Pipelines doesn’t solve the data governance challenges. You are still left to deal with finding the data, getting access to it, getting permission from a data governance officer to use it (or a copy of it), and pre-processing it for your needs; only once all of that is done can you actually start your real work.

We have created a Kubeflow Pipelines component that interfaces with Fybrik and transparently handles these issues for you. Data owners register the raw data sets in a data catalog. You choose the data of interest (training and testing data) and pass their catalog IDs to the Fybrik Kubeflow Pipelines component, get-data-endpoints, which returns virtual endpoints for three data sets (training, testing, results). The training and testing data sets are the inputs, while the third is the location for writing the output. Fybrik automatically allocates the storage for the output, saving you from having to deal with that task as well.
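In a pipeline, this looks roughly like the sketch below. The component file path and input names are illustrative assumptions on my part; check the component definition shipped with the Fybrik sample for the exact interface.

```python
# Sketch of wiring the get-data-endpoints component into a pipeline.
# The component path and input names below are illustrative; the real
# interface is in the component definition shipped with the Fybrik sample.
from kfp import dsl
from kfp.components import load_component_from_file

get_data_endpoints_op = load_component_from_file("get_data_endpoints/component.yaml")

@dsl.pipeline(name="housing-prices-with-fybrik")
def housing_pipeline(train_catalog_id: str, test_catalog_id: str):
    endpoints = get_data_endpoints_op(
        train_dataset_id=train_catalog_id,  # catalog ID of the training data
        test_dataset_id=test_catalog_id,    # catalog ID of the testing data
    )
    # endpoints.outputs then holds the virtual endpoints for the training,
    # testing, and results data sets, ready to feed into downstream steps.
```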

When your workflow reads the training and testing data via the virtual endpoints, Fybrik automatically enforces, behind the scenes, the data governance rules that the data governance officer defined for the enterprise in a governance engine. You also do not need to provide any credentials!
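For instance, if the virtual endpoint is served by Fybrik’s arrow-flight-module, a read inside a pipeline step could look like the following sketch; the endpoint address and ticket contents are illustrative.

```python
# Sketch: reading a data set through a Fybrik virtual endpoint.
# Assumes the endpoint speaks Apache Arrow Flight (as with Fybrik's
# arrow-flight-module); the address and ticket format are illustrative.
import json
import pyarrow.flight as flight

def read_dataset(endpoint: str, asset_id: str):
    client = flight.FlightClient(endpoint)  # e.g. "grpc://host:port"
    ticket = flight.Ticket(json.dumps({"asset": asset_id}).encode())
    reader = client.do_get(ticket)          # no credentials handled by your code
    return reader.read_all()                # governed data as an Arrow table

train_table = read_dataset("grpc://fybrik-endpoint:80", "training-catalog-id")
```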

When your workflow writes the results, Fybrik automatically finds the available storage options, writes the results to storage based on governance and enterprise preferences (defined as IT config policies by an admin), registers them in the data catalog, and returns the new catalog ID to you.
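Continuing under the same assumptions, a write through the results endpoint could look like this sketch; the descriptor contents and table are illustrative, and the storage allocation and catalog registration happen on the Fybrik side.

```python
# Sketch: writing results through the Fybrik-allocated results endpoint.
# Again assumes Arrow Flight; the descriptor contents and table are illustrative.
import json
import pyarrow as pa
import pyarrow.flight as flight

def write_results(endpoint: str, asset_id: str, table: pa.Table) -> None:
    client = flight.FlightClient(endpoint)
    descriptor = flight.FlightDescriptor.for_command(
        json.dumps({"asset": asset_id}).encode()
    )
    writer, _ = client.do_put(descriptor, table.schema)
    writer.write_table(table)  # Fybrik takes care of storage and catalog registration
    writer.close()

results = pa.table({"house_id": [1, 2], "predicted_price": [350000.0, 420000.0]})
write_results("grpc://fybrik-results-endpoint:80", "results-catalog-id", results)
```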

A sample workflow that estimates housing prices demonstrates how to use the Fybrik Kubeflow Pipeline get-data-endpoints component. The architecture and a demo are described in detail here.

Kubeflow Pipeline Leveraging Fybrik

Please feel free to try out the component in one of your pipelines or run our sample on your own, and let us know what you think!

We are happy to discuss via the #kubeflow-pipelines Slack channel (please tag me @Sima Nadler) or via Fybrik discussions.

Sima Nadler
IBM Research. Expert in privacy & hybrid cloud data protection. Opinions expressed are my own.