Maximizing Data Analytics Performance with IBM Data Virtualization Series: Best Practices and Optimization Techniques
By Jun Liu and Changsheng Liu
The adoption of hybrid and multi-cloud data platforms in modern IT architecture has led to new challenges in using enterprise data. With data dispersed across various cloud platforms and IT systems, data silos have become increasingly common. However, business needs remain urgent, and data consumers require faster ways of accessing and using data in BI/AI applications.
To address these challenges, Data Fabric provides an abstraction layer for sharing and using data across hybrid and multi-cloud landscapes. Data virtualization is a key component of Data Fabric and enables easy access to data silos without requiring specialized skills. However, accessing virtualized data efficiently can be challenging when network latency and data source capabilities vary. To optimize virtualized data access, particularly for queries in complex environments, several key techniques should be applied.
Writing efficient queries to reduce overhead
In modern data-driven businesses, efficient query writing is crucial to reduce overhead and optimize performance. This applies not only to traditional database systems but also to data virtualization systems. To achieve optimal performance, it is important to follow some fundamental SQL optimization rules.
- It is recommended to select only the necessary information while querying the database. This means limiting the number of columns to be retrieved and avoiding unnecessary joins. It is also important to use appropriate data types for each column to reduce memory usage and improve query performance.
- It is advisable to avoid the use of SQL temporary tables unless they are absolutely necessary. Temporary tables can significantly impact query performance, especially when dealing with large data volumes.
- Filter predicates should be used wisely to reduce the number of rows scanned during query execution. This can be achieved by applying filtering conditions as early as possible in the query execution plan.
- When querying data from multiple data sources, it is recommended to avoid cross-source table joins as much as possible. If joins are unavoidable, it is better to use highly selective columns in the join predicates.
- Using optimizer hints or guidelines can help optimize query execution plans. However, these should be used with caution and only when necessary, as they can sometimes lead to suboptimal performance.
- Keeping data encoding consistent across different data sources can help avoid unnecessary codepage conversions, which can significantly impact query performance.
By following these SQL optimization rules, it is possible to achieve better query performance in data virtualization systems and other database systems, reducing overhead and improving overall efficiency.
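As an illustration, here is a minimal sketch applying the first few rules, assuming two hypothetical virtual tables, ORDERS and CUSTOMERS:

```sql
-- Inefficient: retrieves every column and forces the remote
-- source to ship far more data than the report needs.
SELECT *
FROM ORDERS O
JOIN CUSTOMERS C ON O.CUST_ID = C.CUST_ID;

-- Better: project only the needed columns, apply filter predicates
-- as early as possible, and join on a highly selective key so fewer
-- rows cross the network.
SELECT O.ORDER_ID, O.ORDER_DATE, C.CUST_NAME
FROM ORDERS O
JOIN CUSTOMERS C ON O.CUST_ID = C.CUST_ID
WHERE O.ORDER_DATE >= '2023-01-01'
  AND O.STATUS = 'SHIPPED';
```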
Design a reasonable database schema to accelerate remote queries
A well-designed database schema can not only improve the efficiency of local queries, but it can also have a positive impact on remote queries. When a virtual table is created, it is designed to map to a specific relational table within the data source. In order for this mapping to work correctly, the data types of the virtual tables must match those of their corresponding columns in the data source table. This ensures that the data is properly stored and retrieved, minimizing the chance of errors occurring during data transfers.
When optimizing virtual tables for remote queries, it is important to consider the length of string columns. Keeping string lengths as small as possible, while still meeting business requirements, can significantly improve query performance. Similarly, when numeric columns that are compared or joined share the same data type, implicit type conversions are avoided and queries execute more quickly.
Because data virtualization acts as an application client of the data source, indexes that align with the query workload can be built on the data source. The query optimizer can then use these indexes to generate a better query plan, which leads to faster query execution.
Constraints can also be applied to virtualized tables to limit a column’s valid values within a specified range. The query optimizer can then use these constraints to select faster access paths, further improving query performance. Overall, designing a database schema that is optimized for remote queries can have a significant impact on the efficiency of data transfers and the performance of queries executed on that data.
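As a hedged sketch of these two ideas, assuming a hypothetical nickname ORDERS_NN mapped to a remote table (the DDL follows Db2 federation conventions and may vary by version):

```sql
-- Tighten an over-long local string type so it still covers the
-- business data but reduces per-row memory and transfer cost.
ALTER NICKNAME ORDERS_NN
  ALTER COLUMN STATUS LOCAL TYPE VARCHAR(16);

-- Add an informational check constraint: it is not enforced locally,
-- but the optimizer can use it to choose a faster access path.
ALTER NICKNAME ORDERS_NN
  ADD CONSTRAINT ORDERS_QTY_RANGE CHECK (QUANTITY BETWEEN 1 AND 1000)
  NOT ENFORCED
  ENABLE QUERY OPTIMIZATION;
```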
Collect statistics to improve the cost model estimation
One way to improve the efficiency of queries executed on virtual tables is to collect statistics on those tables. By analyzing these statistics, the optimizer can better estimate the cost of various access plans and ultimately determine the optimal access plan for a given query. Db2 federation technology is the basis for data virtualization, and this technology includes several query optimization techniques that can facilitate virtualized data access.
Collecting statistics on virtual tables can provide the optimizer with valuable information about the size of the table, the distribution of its values, and the filtering factors that may apply. These statistics can help the optimizer determine the most efficient way to execute a query on that table. For this reason, it is essential to collect virtual table statistics in order to optimize query performance.
There are two primary methods for collecting statistics on virtual tables: remote catalog access and remote data collection. Remote catalog access involves directly accessing the catalog tables of the remote data source to obtain statistics about the virtual table. This method can be particularly efficient if the remote data source has routines in place for collecting statistics. On the other hand, remote data collection involves collecting statistics on the virtual table itself, using the same methods as would be used for a non-virtual table. However, remote data collection can be time-consuming, especially when the virtual table contains a large volume of data. In such cases, it may be necessary to use sampling technology to improve the collection performance. This involves collecting statistics on a representative subset of the data rather than the entire virtual table.
Ultimately, the choice between remote catalog access and remote data collection, and whether to use sampling technology, will depend on the specifics of the virtual table and the data source being accessed. Regardless of the method chosen, collecting statistics is an important step in optimizing query performance in data virtualization environments. By providing the optimizer with detailed information about the virtual table, these statistics can help ensure that queries execute as quickly and efficiently as possible.
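In Db2-based data virtualization, nickname statistics can be gathered with the SYSPROC.NNSTAT procedure. A minimal sketch, assuming a hypothetical nickname MYSCHEMA.ORDERS_NN on server MYSERVER (verify the exact parameter list against your Db2 version):

```sql
-- Collect statistics for the hypothetical nickname MYSCHEMA.ORDERS_NN.
-- METHOD selects the collection strategy described above:
--   0 = let the system decide, 1 = read the remote catalog,
--   2 = query the remote data directly.
CALL SYSPROC.NNSTAT(
  'MYSERVER',         -- server that the nickname belongs to
  'MYSCHEMA',         -- nickname schema
  'ORDERS_NN',        -- nickname name
  NULL,               -- attribute schema (NULL = not restricted)
  NULL,               -- attribute name  (NULL = all columns)
  0,                  -- method
  '/tmp/nnstat.log',  -- log file for the collection run
  ?);                 -- output: trace information
```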
Facilitate query and predicate pushdown
Data virtualization uses customizable parameters to enable query and predicate pushdown, as well as to describe the functional configuration and operation rules of the data source. Most of these parameters are automatically set by the system, but some can be set dynamically to adapt to the data source’s functional behavior.
One such parameter is COLLATING_SEQUENCE, which is used to indicate the character sorting rules of the data source relative to the DV database. It affects the pushdown of simple filter conditions for character comparison, join conditions, and SQL operations including ORDER-BY, GROUP-BY, DISTINCT, and UNION. Its valid values are:
- Y: the character sorting rules of the data source and the DV database are consistent
- I: the data source is case-insensitive
- N: the character collation of the data source and the DV database are inconsistent
How to set: the collation of the DV database is UTF-8 binary. If the collation of the data source is consistent with the DV database, set it to Y; if the data source is case-insensitive (for example, 'a' = 'A'), set it to I; otherwise, set it to N. In practice, the general character sorting of many data sources is compatible with the DV database, so Y is the most common setting. An understanding of the character set collation of each data source helps to set this parameter accurately.
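For example, assuming a hypothetical server object MYSERVER whose collation matches the DV database (use ADD instead of SET if the option has never been set on the server):

```sql
-- The remote collation matches the DV database, so character filter
-- and join predicates, plus ORDER-BY / GROUP-BY / DISTINCT / UNION,
-- remain eligible for pushdown.
ALTER SERVER MYSERVER OPTIONS (SET COLLATING_SEQUENCE 'Y');

-- Verify the current server options from the catalog.
SELECT * FROM SYSCAT.SERVEROPTIONS WHERE SERVERNAME = 'MYSERVER';
```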
The second parameter is VARCHAR_NO_TRAILING_BLANKS, which indicates that the VARCHAR data in the data source contains no trailing spaces. If the data truly has no trailing spaces, then even when the data source is ordinary Db2 (which is insensitive to trailing spaces), simple filter conditions on VARCHAR comparisons, join conditions, and SQL operations such as ORDER-BY, GROUP-BY, and DISTINCT can be pushed down.
Its valid values are:
- Y: the VARCHAR data in the data source does not contain trailing spaces
- N: the VARCHAR data in the data source contains trailing spaces
How to set: set it to Y only if the VARCHAR data in the data source is known to carry no trailing spaces; otherwise set it to N.
The third parameter is SAME_STR_COMP_SEMANTICS, which indicates whether CHAR/VARCHAR data in the data source is compared under the same rules as in the DV database, including sensitivity to trailing spaces and the semantics of NULL comparisons.
Its valid values are:
- Y: the character comparison rules of the data source are consistent with the DV database
- N: the character comparison rules of the data source are inconsistent with the DV database
How to set: set it to Y if the data source's string comparison semantics match the DV database; otherwise set it to N.
It’s important to note that customizable parameters can vary across different data sources, and understanding the character set and collation behavior of each data source can help set these parameters accurately. The three parameters above apply to all data sources and should be set according to the actual behavior of each source, as the sketch below shows.
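Continuing the hypothetical MYSERVER example, the remaining two options are set the same way:

```sql
-- The source VARCHAR data carries no trailing blanks, so VARCHAR
-- filter and join predicates remain eligible for pushdown.
ALTER SERVER MYSERVER OPTIONS (ADD VARCHAR_NO_TRAILING_BLANKS 'Y');

-- The source compares CHAR/VARCHAR values with the same semantics
-- as the DV database (trailing spaces, NULL handling).
ALTER SERVER MYSERVER OPTIONS (ADD SAME_STR_COMP_SEMANTICS 'Y');
```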
Design proper data caches
Designing effective data caches is a critical aspect of optimizing data analytics in virtualized environments that involve multiple data sources located both on-premises and in the cloud, where network latency can be uncertain. In such scenarios, data caching allows users to create a local store for frequently accessed data, such as basic tables and SQL query results, which can significantly improve query performance.
The key benefit of data caching is that it reduces the need to repeatedly access remote data sources, which can be slow and resource-intensive, and instead provides faster access to the locally stored data. Caching is particularly useful when the data sources are relatively stable and the real-time requirements of the queries are not too demanding. It also helps when connections to remote data sources are undesirable, for example during peak business hours: the cache can maintain query performance without interruption.
To design effective data caches, users can set the cache refresh cycle to match their business needs. By regularly refreshing the cache, users can ensure that the cached data remains up-to-date and relevant. Users can also monitor the execution plan of the query to check whether the cache is being used effectively or not.
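In Db2-based data virtualization, one common way to build such a cache is a materialized query table (MQT) over a nickname. A minimal sketch, assuming the hypothetical nickname ORDERS_NN and an environment that permits refresh-deferred MQTs over nicknames:

```sql
-- Create a local cache for the hot subset of the remote data
-- exposed through the hypothetical nickname ORDERS_NN.
CREATE TABLE ORDERS_CACHE AS
  (SELECT ORDER_ID, ORDER_DATE, STATUS, AMOUNT
   FROM ORDERS_NN
   WHERE ORDER_DATE >= '2023-01-01')
  DATA INITIALLY DEFERRED
  REFRESH DEFERRED
  ENABLE QUERY OPTIMIZATION
  MAINTAINED BY SYSTEM;

-- Populate the cache now; schedule this on the refresh cycle the
-- business requires so the cached data stays current.
REFRESH TABLE ORDERS_CACHE;
```

The query execution plan (for example, via Db2's explain facility) can then be checked to confirm that matching queries are routed to ORDERS_CACHE rather than the remote source.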
Overall, designing proper data caches is a crucial step towards optimizing data analytics in virtualized environments with multiple data sources. By reducing the need to access remote data sources repeatedly, caching can significantly improve query performance and allow for more efficient data analysis.
Summary
This article is the first in the series. It discusses the challenges of accessing and using data across hybrid and multi-cloud landscapes and how Data Fabric provides an abstraction layer for sharing data. It then outlines techniques for optimizing data access: efficient query writing, reasonable database schema design, collecting statistics to improve cost model estimation, facilitating query and predicate pushdown, and designing proper data caches. By following these techniques, it is possible to achieve better query performance in data virtualization systems and other database systems, reducing overhead and improving overall efficiency.