IBM and ING Optimize Data Usage Across Clusters

Sima Nadler
Published in fybrik · Mar 20, 2023

Co-Authored with Maryna Strelchuk, Shlomit Koyfman and Ziv Nevo

Introduction

Fybrik works behind the scenes making the lives of data scientists, business analysts, and other users of enterprise data much easier. It automates the enforcement of data governance regulations and generates optimal data planes, removing the need for the many manual steps and processes required today.

In a previous blog we demonstrated multi-cluster governance enforcement flows. In this blog we highlight new work done jointly by IBM and ING, a global financial institution, on optimizing data planes based on IT constraints in addition to governance policies.

Business Challenge

While keeping the enterprise safe, secure and compliant is one of the top priorities, there are also major challenges regarding how best to leverage the IT infrastructure available for a particular workload and its data needs.

Most global financial institutions, including ING, have a complex data landscape that often covers multiple geographies. Consequently, many vital processes to keep the bank safe and compliant, such as Know Your Customer (KYC), rely on using data located in different corners of the world.

These processes carry out customer due diligence checks, screen customer transactions, monitor transactions and report suspicious activities. However, moving massive amounts of data is expensive and time consuming. We looked for a solution that makes the best use of our infrastructure’s capabilities while minimizing costs where possible. To that end, ING and IBM Research partnered to add automatic IT optimization to Fybrik.

Technical Challenges

Different workloads have different needs. For example, each reads and writes data in different formats via different access protocols.

Optimizing the way a given workload reads and writes data is dependent on many parameters, such as:

  • Format and protocol requested by the workload
  • Geography/country in which the workload is running
  • Data governance regulations that dictate who/when/for what/where the specific data sets are allowed to be used, copied, and/or written
  • Clusters/clouds available for use
  • Storage available for writing data and making explicit or implicit (cache) copies of data
  • Compliance certifications of the infrastructure — clusters, storage, etc.
  • Costs associated with using the infrastructure — processing, network, storage
  • Performance of the infrastructure — e.g., latency between clusters/storage
  • Services/capabilities available for inclusion in the data plane

All of these parameters must be taken into account when determining the data plane capabilities and where its components should be deployed.

Technical Solution

Fybrik originally determined (1) which capabilities are required in a given data plane, (2) in which cluster/cloud each service contained in the data plane should run, and (3) which specific implementation of each service should be chosen, taking into account data governance policy decisions.

Hard as this is, an even harder problem is to come up with a compliant data plane that is also optimal in terms of infrastructure costs and performance. For example, reading data from a distant location has a significant impact on the workload’s run time, while moving data to another location adds storage costs.

Let’s dive into what we did. To start with, we introduced a JSON object that represents infrastructure attributes, such as storage costs, latency between storage/clusters, etc. In the initial implementation the values are entered manually, but we envision them being measured and updated in Fybrik dynamically by other tools.
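
As a concrete illustration, the sketch below shows the kind of information such an object can carry. The field names and numbers are assumptions made for this example, not the exact Fybrik schema:

    # Illustrative infrastructure attributes (assumed field names and values,
    # not the exact Fybrik schema). Each entry ties a measured value to a piece
    # of infrastructure and names the metric used to interpret it.
    infrastructure_attributes = {
        "infrastructure": [
            {"attribute": "storage-cost", "instance": "storage-account-uk",
             "value": 30, "metric": "cost-per-tb-month"},
            {"attribute": "storage-cost", "instance": "storage-account-nl",
             "value": 90, "metric": "cost-per-tb-month"},
            {"attribute": "distance", "instance": "nl-to-australia",
             "value": 16500, "metric": "distance-km"},
            {"attribute": "distance", "instance": "nl-to-uk",
             "value": 360, "metric": "distance-km"},
        ]
    }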

Infrastructure metrics, which are associated with the infrastructure attributes, have a scale that makes it possible to compare different metrics and estimate their impact on the workload.
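
For example, normalizing each raw value against its metric’s scale puts storage cost and distance on a common footing. A minimal sketch, with assumed scale bounds:

    # Each metric carries a scale; normalizing raw values against it lets the
    # optimizer compare otherwise incomparable quantities (cost vs. distance).
    # The scale bounds here are assumptions for illustration.
    metrics = {
        "cost-per-tb-month": {"scale": {"min": 0, "max": 200}},
        "distance-km": {"scale": {"min": 0, "max": 20000}},
    }

    def normalized(value: float, metric_name: str) -> float:
        """Map a raw attribute value onto its metric's [0, 1] scale."""
        scale = metrics[metric_name]["scale"]
        return (value - scale["min"]) / (scale["max"] - scale["min"])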

In the next step, we defined a new type of policy dubbed IT config policies. Via these policies, ING’s IT administrators can define the enterprise’s priorities and preferences for how IT infrastructure should be utilized, based on the organization’s needs and the workload characteristics.

In the demo use case, three types of optimization policies were defined (sketched in code after the list):

  1. Minimization of storage costs for a development workload
  2. Minimization of read latency for a high priority production workload
  3. Balance between storage costs and latency for less prioritized production workloads
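
In Fybrik these policies are evaluated by a policy engine; the minimal Python sketch below only illustrates the decision logic of the three demo policies, with assumed workload fields ("stage", "priority") and goal names:

    # Plain-Python illustration of the three demo IT config policies.
    # The workload fields and goal names are assumptions for illustration;
    # they are not Fybrik's actual policy schema.
    def optimization_goals(workload: dict) -> list:
        if workload["stage"] == "development":
            return [("storage-cost", "minimize")]     # policy 1
        if workload["priority"] == "high":
            return [("distance", "minimize")]         # policy 2
        return [("storage-cost", "minimize"),         # policy 3: balance both
                ("distance", "minimize")]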

To the Fybrik control plane we added a new component called the Optimizer. It takes the metadata associated with the workload, the requested data, data governance decisions and IT config policy decisions and automatically generates the optimal data plane. The Optimizer translates the problem into a Constraint Satisfaction Problem and submits it to a CSP engine that identifies the components that provide the optimal solution. From there the relevant data plane is automatically deployed, with each component running on its designated cluster and data copied/moved to the best storage when necessary.
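
As a toy stand-in for the Optimizer (not Fybrik’s actual solver or data model), the sketch below enumerates candidate storage placements, filters out those that violate governance constraints, and picks the one minimizing a weighted objective. The candidates and their normalized [0, 1] scores are assumptions for illustration:

    # Toy stand-in for the Optimizer: enumerate candidate placements, drop
    # non-compliant ones, and return the best-scoring compliant option.
    # "australia" here means leaving the data in place (no copy, no extra cost).
    candidates = [
        {"storage": "australia", "cost": 0.0, "distance": 1.0, "compliant": True},
        {"storage": "uk", "cost": 0.15, "distance": 0.02, "compliant": True},
        {"storage": "netherlands", "cost": 0.45, "distance": 0.0, "compliant": True},
        {"storage": "romania", "cost": 0.10, "distance": 0.10, "compliant": False},
    ]

    def best_plan(candidates, weights):
        """Return the compliant candidate minimizing the weighted objective."""
        compliant = [c for c in candidates if c["compliant"]]
        return min(compliant,
                   key=lambda c: weights["cost"] * c["cost"]
                               + weights["distance"] * c["distance"])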

Pilot Use Case

In the previous Fybrik pilot we focused on exploring how governance enforcement flows could work in a multi-cluster environment. In the pilot described in this blog, we additionally focused on Fybrik’s ability to optimize the use of infrastructure capabilities and minimize costs based on the organization’s requirements.

We implemented a scenario associated with the Know Your Customer use case. The workload that monitors for suspicious transactions ran in the Netherlands and required data from Australia, which is of course physically very far from the Netherlands.

The pilot demonstrated three scenarios. In all three, governance policies were in place stipulating that (a) personal information of Australian residents cannot leave Australia without being obfuscated, (b) information about minors cannot be stored in Romania, and (c) financial data cannot be stored in Turkey.

Scenarios

1) A data scientist developing the Know Your Customer machine learning model. Since this is a development workload, ING did not want to incur unnecessary storage costs. In this case the data remained in Australia, resulting in higher latency for the workload but no additional storage costs.

2) A production workload of medium priority, in which minimizing distance (associated with latency) was relatively important, but so was minimizing storage cost where possible. In this case a temporary copy of the data was made in the UK, which is much closer to the Netherlands and where storage was cheaper than the other alternatives in Europe.

3) A high priority production workload where minimizing distance was of utmost importance. For this scenario a temporary copy of the data was made in the Netherlands, ensuring the shortest distance despite the high storage costs.

The trade-offs between cost and distance were determined by the priority level of the workload, and of course all governance requirements were taken into account.
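
Continuing the toy model from the Technical Solution section, the three scenarios amount to different weightings of the same objective; plausible (assumed) weights reproduce the three outcomes:

    # Assumed per-scenario weights, applied to the toy best_plan() sketch above.
    print(best_plan(candidates, {"cost": 1.0, "distance": 0.0})["storage"])  # development -> australia
    print(best_plan(candidates, {"cost": 0.5, "distance": 0.5})["storage"])  # medium priority -> uk
    print(best_plan(candidates, {"cost": 0.0, "distance": 1.0})["storage"])  # high priority -> netherlands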

Summary

Fybrik now addresses both the governance requirements and the IT infrastructure preferences associated with reading and writing data.

Examples of the resulting business benefits are:

  • Enterprise control over IT costs, while ensuring high priority workloads get the resources they need to function optimally
  • Reduction of storage costs, since Fybrik manages the creation and clean up of temporary copies of data when they are required
  • Reduction of enterprise dark data, since storage allocation and cataloging of data is done automatically by Fybrik when data is written
  • Increased security, since Fybrik handles credentials to the data set instead of providing them to the users.

Feel free to try Fybrik out! It’s all open source and available at https://fybrik.io!
