Boosting Databricks ODBC Driver Performance

Patrick Pichler
Creative Data
Oct 25, 2022

The top three aspects to consider to get the most out of the Databricks ODBC driver when interacting with Delta Lake tables


Introduction

Low-latency queries have always been an important topic when it comes to data platforms, but in the last few years the subject has noticeably regained momentum. This is due to the growing amount of data, together with the overwhelming number of cloud data platform offerings built to cope with it.

While most of those new platforms certainly reduce query response times over large data volumes, the technology client applications use to interact with them is still mostly dominated by standards that are more than 25 years old, such as JDBC and ODBC, originally designed without distributed systems, object stores or columnar (in-memory) storage formats in mind. In their original form, these standards require expensive conversion and serialization steps: data is exported from a system-specific columnar format into the JDBC/ODBC row format and then possibly converted back again into the client’s columnar format¹.

The good news is that progress has lately been made on new standard interfaces such as Apache Arrow Flight SQL, which exchange data much more efficiently by leveraging the latest technologies. However, while we wait for them to become commonly used, some data platform providers have also introduced improvements around the existing interfaces. Databricks, for instance, has improved the ODBC interface exposed by the Spark Thrift server, especially to speed up the extraction of large data volumes. In this article, we will take a look at how these improvements work and what to bear in mind to actually take advantage of them.

1. ODBC Driver Setup and Config

First and foremost, make sure you always use the latest Databricks ODBC driver version available, since it is constantly being improved and updated. In this context, the configuration of the driver can also make quite a difference, especially when playing around with the Advanced Driver Configuration Options. For instance, if you deal with long STRING columns, keep an eye on the DefaultStringColumnLength setting; otherwise you risk string truncation and hence very poor performance (try to avoid long string columns in general). Another interesting setting to change, depending on your network, is MaxBytesPerFetchRequest, which specifies the maximum number of bytes to retrieve from the server for every FetchResults API call when Apache Arrow is used for serialization (more on this in the next section). As a broad orientation, the following are some of the default configuration settings used by Power BI to connect to Databricks via ODBC:

Driver = "Simba Spark ODBC Driver",
Host = ValidatedHost,
HTTPPath = httpPath,
Port = 443,
ThriftTransport = 2,
SparkServerType = 3,
SSL = 1,
UseNativeQuery = 0,
UserAgentEntry = "PowerBI",
UseSystemTrustStore = 1,
RowsFetchedPerBlock = 200000,
LCaseSspKeyName = 0,
ApplySSPWithQueries = 0,
DefaultStringColumnLength = 65535,
DecimalColumnScale = 10,
UseUnicodeSqlCharacterTypes = 1
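
To make these settings concrete, here is a minimal sketch of a DSN-less connection from Python via pyodbc that applies some of them. The host, HTTP path and access token are placeholders, token authentication (AuthMech=3 with UID=token) is assumed, and the MaxBytesPerFetchRequest value is purely illustrative, something you would tune to your own network rather than a recommendation:

# Minimal sketch: DSN-less ODBC connection to Databricks using pyodbc.
# Host, HTTP path and token are placeholders; MaxBytesPerFetchRequest is
# an illustrative value, not a recommendation.
import pyodbc

conn_str = (
    "Driver=Simba Spark ODBC Driver;"
    "Host=<workspace-host>.azuredatabricks.net;"  # placeholder workspace host
    "Port=443;"
    "HTTPPath=<http-path>;"                       # placeholder cluster/warehouse HTTP path
    "SSL=1;"
    "ThriftTransport=2;"
    "SparkServerType=3;"
    "AuthMech=3;"                                 # token-based authentication (assumed)
    "UID=token;"
    "PWD=<personal-access-token>;"                # placeholder personal access token
    "DefaultStringColumnLength=65535;"
    "MaxBytesPerFetchRequest=104857600;"          # e.g. 100 MB per FetchResults call
)

conn = pyodbc.connect(conn_str, autocommit=True)
cursor = conn.cursor()
cursor.execute("SELECT 1 AS smoke_test")          # simple smoke-test query
print(cursor.fetchone())
cursor.close()

The same connection object is reused in the sketches further down in this article.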

2. Cloud Fetch and Apache Arrow

As already mentioned, extracting/fetching the results is the most expensive part of querying large data volumes. With ODBC in its original form, data serving happens single-threaded, and serialization/deserialization steps add further overhead along the way. This is where Cloud Fetch and Apache Arrow come to the rescue: data is fetched in parallel via HTTP directly from cloud storage through a set of URLs provided by the Databricks endpoint (e.g. Shared Access Signatures in Azure, presigned URLs in AWS). This way, data serving benefits from parallelism, while Apache Arrow provides the data in an optimized serialization format, leading to accelerated data interchange with very low overhead. For the ODBC interface to use it, you just need to make sure you have installed at least ODBC driver version 2.6.17 and are running Databricks Runtime 8.3 or above. If this is the case, everything should be set up automatically. Just be careful with complex data types such as MapType, ArrayType or StructType, since Apache Arrow does not support these types at the time of this writing.
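
One hedged workaround for the complex-type limitation is to serialize such columns to JSON strings inside the query itself, so the result set only contains Arrow-friendly scalar types. The sketch below reuses the pyodbc connection from section 1; the table and column names are hypothetical:

# Sketch: flatten a complex (ArrayType/StructType/MapType) column into a JSON string
# with Spark SQL's to_json() so the result set only contains scalar types.
# Table and column names are hypothetical.
cursor = conn.cursor()
cursor.execute("""
    SELECT order_id,
           to_json(line_items) AS line_items_json  -- complex column returned as STRING
    FROM   my_catalog.my_schema.orders
""")
rows = cursor.fetchall()
print(f"fetched {len(rows)} rows")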

If you still have doubts about whether Cloud Fetch is really used, you can check the last element, mapPartitionsInternal, of the query execution DAG of the corresponding job; it should contain CloudStoreCollector:

[Image by author: query execution DAG with CloudStoreCollector in the mapPartitionsInternal step]

You can also check the driver’s Log4j output for events such as HybridCloudStoreResultHandler and ArrowResultHandler. This is even more specific, since Cloud Fetch is ultimately only used for query results larger than 1 MB; the DAG above would still show CloudStoreCollector even though the query returns just 5,248 bytes across 10 rows, as you can see in the next image:

[Image by author: query details showing a result of only 5,248 bytes across 10 rows]
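
If you deliver the driver logs to DBFS (or download them), a quick way to scan them for these events is sketched below; the log path is a placeholder for wherever your cluster ships its Log4j output:

# Sketch: scan a downloaded driver Log4j file for the result-handler events above.
# The path is a placeholder for your own cluster log delivery location.
from pathlib import Path

log_text = Path("/dbfs/cluster-logs/<cluster-id>/driver/log4j-active.log").read_text()
for marker in ("HybridCloudStoreResultHandler", "ArrowResultHandler"):
    count = sum(marker in line for line in log_text.splitlines())
    print(f"{marker}: {count} occurrence(s)")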

3. Photon and Disk Caching

Another very easy way to speed up queries without changing a single line of code is to choose instance types that support Photon, the new Spark execution engine written in C++, together with accelerated disk caching. Even though this is not directly related to the ODBC driver, you will immediately notice impressive performance improvements from these two proprietary Databricks features, especially when the data is already cached on the nodes’ local SSD volumes in a fast intermediate data format.

If activated, this caching mechanism works automatically: a local copy is created whenever a Parquet data file (including files of Delta Lake tables) has to be fetched from remote storage. However, you can also actively “warm up” the cache by using the CACHE SELECT command. In this case, all data files are loaded onto the local machines and any subsequent changes are synchronized automatically. To activate the cache, the following settings need to be configured:

spark.databricks.io.cache.enabled true
spark.databricks.io.cache.maxDiskUsage 50g
spark.databricks.io.cache.maxMetaDataCache 1g
spark.databricks.io.cache.compression.enabled false
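
The sizing settings above are typically placed in the cluster’s Spark configuration; enabling the cache and warming it up can also be done from a notebook, as in the minimal PySpark sketch below, where spark is the pre-defined SparkSession and the table name is a placeholder:

# Minimal sketch: toggle disk caching at runtime and pre-warm it for a Delta table,
# so the first ODBC queries already read from the nodes' local SSDs.
# The table name is a placeholder.
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# Load the table's data files into the local cache; later changes are synced automatically.
spark.sql("CACHE SELECT * FROM my_catalog.my_schema.my_table")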

Probably the most important setting here is maxDiskUsage, which specifies the disk space reserved per node for caching; you can watch the current usage in the Storage section of the Spark Web UI:

[Image by author: Storage section of the Spark Web UI showing current disk cache usage]

To check whether your query is supported by Photon, just run the EXPLAIN command, which returns the execution plan. Towards the end of it, you can find a section called “Photon Explanation”. In this example, Photon does not kick in since the query is simply too small, but there could be other reasons:

[Image by author: EXPLAIN output with the Photon Explanation section]
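
The same check also works over the ODBC connection itself. The sketch below reuses the pyodbc connection from section 1 and simply searches the returned plan text for the “Photon Explanation” section; the table name is hypothetical:

# Sketch: fetch the execution plan over ODBC and print the Photon Explanation part.
# Table name is hypothetical.
cursor = conn.cursor()
cursor.execute("EXPLAIN SELECT count(*) FROM my_catalog.my_schema.my_table")
plan = "\n".join(str(row[0]) for row in cursor.fetchall())
marker = "Photon Explanation"
print(plan[plan.find(marker):] if marker in plan else "No Photon Explanation section found")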

Conclusion

Of course, on the physical data layer there are lots of other Databricks optimization techniques and Delta Lake best practices to improve performance, some of which even take place automatically. However, no matter how well your physical data layer is optimized within Databricks, your client application still needs to extract/fetch the query results via some interface. This is why you should keep the three aspects mentioned above in mind to reduce this bottleneck to a minimum.

Resources:

[1]: Dremio. 2022. An Introduction to Apache Arrow Flight SQL. [ONLINE] Available at: https://www.dremio.com/subsurface/an-introduction-to-apache-arrow-flight-sql/. [Accessed 23 October 2022].
