How ING and IBM are collaborating to manage enterprise data across multiple clouds

Published in

fybrik

Jun 22, 2021

Co-authored with Cong Chen and Mike Nicpan, ING

IBM and ING have been working together on how Fybrik can accelerate the use of data without sacrificing security or usability in real-life use cases.

Fybrik (previously called Mesh for Data) is an infrastructure-level solution that makes life much easier for data scientists, business analysts, and other users of enterprise data. For ING, the challenge addressed by this pilot was how to become a data-driven company that provides better customer insights while meeting increasing regulatory requirements. What made it so hard? In ING’s current data lake landscape, data ingestion and cross-border data exchange were not efficient or smart enough to make full use of the data. With the many vital rules and regulations in place to protect the privacy of sensitive data, many businesses have to jump through numerous hoops before they can use data to improve their customer experience, provide better results, or even share it with their analysts in another country. Normally, data users have to confer with a legal expert or a data governance officer to understand whether they are allowed to use the data, and check with IT specialists to allocate storage and prepare the data correctly. In this blog, we describe a real-world banking use case from ING and how Fybrik from IBM Research enables the use of data across borders in a secure, automated, and compliant manner.

With Fybrik, business analysts or data scientists simply fill in the context in which they want to use the data, and all the governance enforcement and data preparation takes place automatically behind the scenes. Life is much easier for governance officers and IT specialists as well. They can provide their guidance as rules at the enterprise level, rather than on a case-by-case basis, and Fybrik ensures the rules are enforced.

A previous blog described the vision for Fybrik (previously Mesh for Data), its motivation, main concepts, and the principles upon which it is built.

The solution to sharing sensitive data while protecting privacy

In today’s dynamic financial business environment, data is one of the most valuable assets. It lets banks stay competitive while retaining the trust of their customers. Given the need to protect personal data, it is crucial that ING maintain the highest level of protection for its customers’ privacy. As a multinational business, ING processes a large volume of personal data for different local and global purposes. When data is transferred across borders, sometimes over long distances, it becomes a challenge to control, secure, and govern personal data assets in an efficient and compliant way. This will be especially true in the near future, when more data is stored, generated, and used on public clouds. Given the complex data lake that ING has today, the company was looking for a solution that would simplify the landscape by consistently controlling data access, centrally enforcing enterprise data policies, and intelligently moving data for the different data users.

IBM recognized this same challenge and its relevance across industries. Together IBM and ING are designing and developing the Fybrik solution, initiated by IBM Research, to meet ING’s data challenges. ING believes that the future of data management on clouds depends on enabling consistent data governance and making data available to users no matter where it resides.

Follow Yusuf, Serena, and Eva to understand how Fybrik works

To demonstrate Fybrik, let’s follow two user journeys with three main actors. Yusuf is the owner of data residing in one of the bank’s operational systems. He wants to make his data available to business analysts and data scientists via ING’s data lake. Serena is a governance officer responsible for ensuring that the data in ING’s data lake is used appropriately. Eva is a data scientist who is building a fraud detection machine learning model to identify customers whose behavior may be fraudulent.

Journey 1 — Ingest

The data that Yusuf would like to contribute to ING’s data lake contains confidential information about Turkish residents. Both Turkish law and the European data protection laws dictate how this data must be treated. These guidelines have been codified in the policy manager by Serena, the data governance officer. Using Fybrik, Yusuf specifies the data he would like to ingest into the data lake. In addition to the endpoint information about where the data resides and the credentials for accessing it, he also provides metadata indicating that the data set contains confidential information about Turkish residents. He submits the ingest request, and Fybrik automatically handles what used to be a tedious manual process. Fybrik confers with the policy manager and is informed that Yusuf’s data must be stored in the Turkish instance of ING’s data lake. It allocates a bucket in Turkey in which to store the data, copies the data there, and then registers the data in ING’s enterprise data catalog.

Here’s the magic. Yusuf did not need to know or confer with anyone about the different laws governing where his data should be stored. He did not need to contact an infrastructure operator to allocate storage in the data lake, and he did not need to register the asset in the data catalog. It was all done for him behind the scenes.
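The ingest journey above boils down to three steps: a policy decision, a storage allocation, and a catalog registration. The following is a minimal Python sketch of that flow and is not Fybrik’s API. The class names, the tag string, the bucket naming scheme, and the default region are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class IngestRequest:
    """What the data owner supplies (all field names are hypothetical)."""
    asset_name: str
    source_endpoint: str
    tags: set


@dataclass
class Catalog:
    """Stand-in for an enterprise data catalog."""
    entries: dict = field(default_factory=dict)

    def register(self, asset_name, location):
        self.entries[asset_name] = location


# Illustrative rules only: metadata tag -> region where the data must be stored.
RESIDENCY_RULES = {"turkish-resident-confidential": "turkey"}
DEFAULT_REGION = "netherlands"  # assumption made for this sketch


def decide_region(tags):
    """Return the storage region required by the first matching rule."""
    for tag in sorted(tags):
        if tag in RESIDENCY_RULES:
            return RESIDENCY_RULES[tag]
    return DEFAULT_REGION


def ingest(request, catalog):
    """Policy decision, then storage allocation, then catalog registration."""
    region = decide_region(request.tags)
    bucket = f"s3://datalake-{region}/{request.asset_name}"  # hypothetical naming
    # A real system would copy the data from request.source_endpoint here.
    catalog.register(request.asset_name, bucket)
    return bucket
```

The point of the sketch is that the data owner only supplies the request. The residency rule lives in one central place, so changing it does not require touching any ingest request.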

Journey 2 — Data Usage

Eva is building a fraud detection machine learning model and has found the data set provided by Yusuf in the data catalog. She resides in the Netherlands and the data is in Turkey. The policies defined for ING by Serena indicate that Turkish data may be used for fraud detection outside of Turkey only if the confidential data is masked. Eva, however, is not a governance expert and is not aware of this. Using Fybrik, Eva specifies the data set, indicates that she wants to use the data for fraud detection, and states that she would like to consume the data in Apache Arrow Flight format via her Jupyter Notebook. After she submits her request, the Fybrik control plane prepares the environment and the data required. It confers with the policy manager, which indicates that the confidential data in the data set must be masked. Fybrik invokes an implicit copy module, which masks the necessary data before it is sent from Turkey and stores a temporary copy of the masked data in the Netherlands. In the Netherlands, Fybrik deploys an Arrow Flight service, which is used to access the data. Fybrik then provides Eva with a virtual endpoint, which she uses from within her Jupyter Notebook to read the data.
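To make the policy decision in this journey concrete, here is a minimal Python sketch of a context-based rule followed by masking. It is not Fybrik’s or the policy manager’s API. The function names, the purpose strings, and the masking placeholder are hypothetical. The sketch also assumes that same-region access is allowed and that any other cross-border purpose is denied, neither of which is stated in the use case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessContext:
    """The context a data user declares when requesting data."""
    purpose: str
    data_region: str
    consumer_region: str


def required_action(ctx):
    """Return 'allow', 'mask', or 'deny' for the given context."""
    if ctx.consumer_region == ctx.data_region:
        return "allow"  # assumption: no restriction inside the region
    if ctx.purpose == "fraud-detection":
        return "mask"  # cross-border fraud detection requires masking
    return "deny"  # assumption: other cross-border purposes are refused


def mask_rows(rows, confidential_columns):
    """Return new rows with confidential values replaced by a placeholder."""
    return [
        {k: ("XXXXX" if k in confidential_columns else v) for k, v in row.items()}
        for row in rows
    ]


def prepare(rows, confidential_columns, ctx):
    """Apply the policy decision before any data leaves its region."""
    action = required_action(ctx)
    if action == "deny":
        raise PermissionError(f"purpose {ctx.purpose!r} is not permitted")
    if action == "mask":
        return mask_rows(rows, confidential_columns)
    return rows
```

Masking returns new rows rather than modifying the originals, which mirrors the idea that the source data in Turkey stays untouched and only the copy is transformed.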

Eva does not have to know that a temporary copy of the data was made, and she does not need credentials to access the data because Fybrik manages access to the data for her. All this makes Eva’s job easier while providing a higher level of data security. She uses the data for fraud detection, and when she is done, she deletes the notebook and the Fybrik environment. Once this is done, Fybrik automatically deletes the temporary copy of the data.
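The lifecycle of the temporary copy, created on demand and removed when the environment is torn down, can be illustrated with a Python context manager. This is only an analogy using local files and the standard library. It does not show how Fybrik’s data movement operator works, and the function name is hypothetical.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def temporary_copy(source: Path):
    """Yield a temporary copy of `source` and remove it when the block exits."""
    workdir = Path(tempfile.mkdtemp(prefix="implicit-copy-"))
    try:
        copy = workdir / source.name
        shutil.copy2(source, copy)
        yield copy
    finally:
        # Cleanup runs even if the consumer raises an exception.
        shutil.rmtree(workdir, ignore_errors=True)
```

The consumer works with the copy inside a `with temporary_copy(path) as copy:` block and never has to remember to delete it. The original file is left in place.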

The result? A process that would have taken many weeks was accomplished automatically with the click of a button. This use case demonstrates important aspects of the data fabric: location-independent data access, self-service, and integrated governance and security.

Technical Approach — Behind the scenes

Fybrik is an open source, infrastructure-level platform that can be easily customized and extended. In the collaboration with ING, IBM’s Watson Knowledge Catalog was used as both the data catalog and the policy manager. Eva read the data via the Apache Arrow Flight server provided as an example with Fybrik. During ingest, the data was copied using IBM’s DataStage ETL engine. The Fybrik module for invoking the ETL was contributed by ING. The temporary copy made when Eva accessed the data was created by the open source data movement operator.

A description of the Fybrik architecture can be seen here.

In a recent talk at Think 2021, we discussed the details of the use cases and the validation.

Think 2021 Session

If you are interested in piloting Fybrik as was done with ING, or in contributing to the open source project, please feel free to reach out in GitHub Discussions or send an email to sima@il.ibm.com or ronenkat@il.ibm.com. For ING-related questions, please contact cong.chen@ing.com or mike.nicpan@ing.com.


Written by Sima Nadler