Performance Tuning Tips for IBM Match 360
This blog on performance tuning of IBM Match 360 on IBM Cloud Pak for Data has been authored with collective inputs from the IBM Financial Crimes and Insights (FCI) team and is based on their experience with the product.
Special thanks to the reviewers of the series: Abhishek Seth, Product Architect, Master Data Management Engineering, Master Inventor, Data & AI; Rishi S Balaji, Application Architect, Data & AI; and Srinivasan Muthuswamy, Senior Technical Staff Member, IBM India Software Labs.
Introduction
IBM Match 360 with Watson on IBM Cloud Pak for Data (hereafter referred to as IBM Match 360) seamlessly consolidates data from disparate sources to establish a single, trusted, 360-degree view of your customers. IBM Match 360 includes cloud-native, machine learning-assisted, self-service analytics and matching tools that deliver business insights.
IBM Match 360's microservices architecture provides APIs for several operations. When IBM Match 360 is installed, it comes with a set of default configurations that are good enough for sample runs of the APIs with small data volumes. However, for enterprise applications that involve millions of party and watchlist records, the product needs to be tuned for performance. Without this tuning, operations such as data load, derive, watchlist matching, and entity resolution perform extremely slowly and can take several days to complete. This blog provides the details of these performance tuning configurations.
The product versions on which these configurations were tested are:
- IBM Match 360 version: 1.1.188
- IBM Cloud Pak for Data version: 4.0.6
- OpenShift version: 4.8.47
This blog assumes the reader has a basic understanding of how to use IBM Match 360.
Identifying poor performance
- Data load jobs are taking too long. For example, loading 1 million records takes more than 20 minutes.
- Derive jobs keep failing or keep running for several hours.
- Matching operation never completes.
Performance tuning configurations
The configurations described below are applied by updating the mdm-cr.
- Fine tune bulk load: The bulk load API (/mdm/v1/bulk_load) provides a way to load data into IBM Match 360. Internally, a Spark job performs the load. By default, this job is configured to use 4 executors. To improve bulk load performance, the number of executors can be increased to 8 or more, based on the resources available in the cluster or node. This configuration is typically needed when processing data volumes of around 30 million records.
The IBM Match 360 custom resource (mdm-cr) can be edited from the OpenShift console as shown below:
- Log in to the OpenShift console.
- From the left navigation menu, expand Administration.
- Select CustomResourceDefinitions.
- Search for MasterDataManagement and click MasterDataManagement.
- Select the Instances tab and click mdm-cr.
- Select the YAML tab and add the configuration shown below under spec, aligned with common-services.
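The snippet below is a minimal sketch of such an edit. The exact key names under spec (for example bulkload, spark_configuration, and num_executors) vary by release and are assumptions here; verify them against the fields already present in your mdm-cr YAML before applying the change.

```yaml
# Minimal sketch only: key names under spec are illustrative assumptions and
# must be verified against the existing mdm-cr YAML for your Match 360 release.
spec:
  # ...existing entries such as common-services remain unchanged...
  bulkload:
    spark_configuration:
      num_executors: 8      # default is 4; increase based on available cluster resources
      executor_cores: 4     # illustrative value
      executor_memory: 8g   # illustrative value
```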
This increases the number of Spark executors used by the data load job from 4 to 8.
- Fine tune the derive process: When data is loaded, IBM Match 360 standardizes it through a process called derive. The derive process uses FoundationDB and Elasticsearch: it reads the data to be processed from FoundationDB and writes the processed data to Elasticsearch. To improve the performance of the derive process, allocate more resources and storage space to the FoundationDB and Elasticsearch pods.
As mentioned earlier, the mdm-cr needs to be updated once again; a sample configuration is shown below.
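The sketch below illustrates the idea, assuming the mdm-cr exposes per-service resource sections for FoundationDB and Elasticsearch. The section names, replica counts, and resource values are illustrative assumptions; align them with the entries already present in your mdm-cr and with the capacity of your cluster.

```yaml
# Minimal sketch only: section names, replica counts, and resource values are
# illustrative assumptions; align them with your actual mdm-cr and cluster capacity.
spec:
  foundationdb:
    resources:
      requests:
        cpu: "2"
        memory: 8Gi
      limits:
        cpu: "4"
        memory: 16Gi
    storage_size: 100Gi     # illustrative; size according to your data volume
  elasticsearch:
    replicas: 3             # illustrative
    resources:
      requests:
        cpu: "2"
        memory: 8Gi
      limits:
        cpu: "4"
        memory: 16Gi
    storage_size: 200Gi     # illustrative
```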
- Fine tune matching: The IBM Match 360 matching process involves assessing the suspect candidate pool to determine how closely each suspect party matches the incoming, or source, party.
As with the previous processes, the mdm-cr needs to be updated once again; a sample configuration is shown below.
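The sketch below illustrates the idea, assuming the mdm-cr exposes a section for the matching engine. The section name and resource values are assumptions and should be checked against your mdm-cr before editing.

```yaml
# Minimal sketch only: the matching section name and resource values are
# illustrative assumptions; verify them against your mdm-cr before applying.
spec:
  matching:
    replicas: 2             # illustrative; scale out for large candidate pools
    resources:
      requests:
        cpu: "4"
        memory: 16Gi
      limits:
        cpu: "8"
        memory: 32Gi
```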
Note:
Please note that the performance tuning numbers provided in this blog are based on the data volume and the cluster used to perform the operations. Based on your data, response time requirements, and cluster size, these configurations have to be adjusted for optimum performance.
Conclusion
This blog presented performance tuning tips for some of the commonly used IBM Match 360 processes.
With these configurations, a bulk load of 30 million records completed in about 8 hours, and the derive process for the same volume took about 1.5 hours. The matching job performance improved about 5 times.