In-Depth Benchmarking: Databricks SQL Connector vs. SQLAlchemy

Databricks SQL SME
DBSQL SME Engineering
7 min read · Jul 4, 2024

Authors

Claudio Tito, Delivery Solutions Architect @ Databricks
Gary Diana, Solutions Architect @ Databricks

Introduction

Tools that streamline and enhance interaction with big data platforms are pivotal in the rapidly evolving landscape of data engineering and analysis. This blog presents a comprehensive benchmarking exercise, conducted in Databricks Notebooks, comparing the native Databricks SQL Connector for Python (DBSQL Connector) and its SQLAlchemy dialect under identical conditions, focusing on how each performs as queries return an increasing number of rows, without any performance tuning.

Background

Numerous Databricks customers rely on the DBSQL Connector, which allows the execution of SQL commands across Databricks clusters and warehouses and uses Thrift, a framework often adopted to build scalable, multi-language applications. Its design (adhering to PEP 249 — Python Database API Specification v2.0) ensures a seamless, more performant, and straightforward setup compared to other Python SQL libraries like pyodbc.
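Because the connector implements PEP 249, code written against it has the same shape as any other DB-API 2.0 driver. As a runnable illustration, the sketch below uses the standard library's sqlite3 module (also PEP 249-compliant) as a stand-in; with the real connector you would call databricks.sql.connect(server_hostname=..., http_path=..., access_token=...) instead, and the rest of the code would be unchanged.

```python
import sqlite3  # stand-in for `from databricks import sql`; both implement PEP 249


def fetch_all(connection, query: str) -> list:
    """The connect -> cursor -> execute -> fetchall shape shared by DB-API drivers."""
    cursor = connection.cursor()
    cursor.execute(query)
    return cursor.fetchall()


connection = sqlite3.connect(":memory:")
rows = fetch_all(connection, "SELECT 1 AS one, 'a' AS letter")
print(rows)  # [(1, 'a')]
connection.close()
```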

Simultaneously, the DBSQL Connector supports a specialized SQLAlchemy dialect — extending the popular Python SQL toolkit by providing Object-Relational Mapping (ORM): an abstraction layer that allows users to interact with database objects in a programming language rather than by writing SQL statements. The ORM aspect of SQLAlchemy prioritizes database interaction portability across different vendors rather than peak performance, distinguishing it from the more direct DBSQL connector.
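The dialect identifies a warehouse with a databricks:// URL assembled from the same credentials the direct connector uses, mirroring the URL that measure_sqlalchemy_performance constructs later in this post. A small sketch (the token, hostname, and HTTP path below are placeholders):

```python
def build_databricks_url(access_token: str, server_hostname: str, http_path: str) -> str:
    """Assemble the SQLAlchemy URL expected by the Databricks dialect."""
    return (
        f"databricks://token:{access_token}@{server_hostname}"
        f"?http_path={http_path}"
    )


# Placeholder credentials; substitute values from your own workspace.
url = build_databricks_url("dapi-example", "example.cloud.databricks.com",
                           "/sql/1.0/warehouses/abc123")
print(url)
```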

By testing these connectors in their “bare” forms, we intend to highlight their fundamental capabilities and performance characteristics, providing valuable insights into their optimal use cases in real-world scenarios. The results from these benchmarks are designed to guide decision-makers in selecting the appropriate SQL dialect for their specific needs, balancing between performance efficiency and flexibility.

Methodology

To ensure a comprehensive assessment, we measured the duration of several key events: connection establishment, query execution, result fetching, and total processing time. The tests were run under strict conditions, without performance tuning or client-side caching, to accurately represent the out-of-the-box experience and the intrinsic capabilities of each dialect.

The benchmark tests were designed and executed within the Databricks environment, specifically within the same regional settings as the Delta tables:

  • This choice was made to minimize potential network latencies that could skew the data.
  • The tests varied in complexity, involving queries that fetched different numbers of rows — from 100,000 to 8 million.

One hundred twenty tests were conducted, generating 20 distinct data points for each row limit set in the queries. We recorded the size of the serialized results set, providing insight into the efficiency of data retrieval and manipulation by each SQL dialect.

All the above benchmarks were completed using a Large DBSQL endpoint. The Python code ran on an i3.xlarge single-node cluster with 30 GB of RAM and 4 CPU cores, running DBR 13.3 LTS with Apache Spark™ 3.4.1 and Scala 2.12.

Testing Framework

We wrote two dedicated functions to measure and compare the performance of the DBSQL Connector and the SQLAlchemy dialect. These functions were designed to capture the timing data for connection, execution, fetch, and total processing time for various query limits.

DBSQL Connector Performance Measurement:

  • The function measure_databricks_sql_performance is engineered specifically for the DBSQL connector. This function turns off server-side caching to ensure the performance metrics reflect the connector’s raw capabilities without any enhancements from cached results.
from datetime import datetime
from typing import Dict, Optional
import logging
import pickle

import pandas as pd
from databricks import sql

logger = logging.getLogger(__name__)

# `dbsql_config` (connection parameters) and `disable_cached_results_str`
# (the SET statement that turns off result caching) are defined earlier
# in the notebook.


def measure_databricks_sql_performance(query: str) -> Optional[Dict[str, float]]:
    start_time = datetime.now()
    try:
        with sql.connect(**dbsql_config) as connection:
            with connection.cursor() as cursor:
                execution_start = datetime.now()
                cursor.execute(disable_cached_results_str)  # turn off result caching
                cursor.execute(query)
                execution_end = datetime.now()
                data = cursor.fetchall()
                fetch_end = datetime.now()

        # Serialize the result set to measure its size in MB.
        df = pd.DataFrame(data)
        serialized_df = pickle.dumps(df)
        data_size_bytes = len(serialized_df)
        data_size_mb = data_size_bytes / (1024 ** 2)

        timings = {
            'result_size_mb': data_size_mb,
            'connection_time': (execution_start - start_time).total_seconds(),
            'execution_time': (execution_end - execution_start).total_seconds(),
            'fetch_time': (fetch_end - execution_end).total_seconds(),
            'total_time': (fetch_end - start_time).total_seconds()
        }
        return timings
    except Exception as e:
        logger.error(f"Error in Databricks SQL performance measurement: {e}")
        return None

SQLAlchemy Performance Measurement:

In contrast, the measure_sqlalchemy_performance function handles the SQLAlchemy dialect by constructing the database URL dynamically and managing database connections through the SQLAlchemy engine.

from sqlalchemy import create_engine, text

# `connect_args` (additional engine keyword arguments) is defined earlier
# in the notebook.


def measure_sqlalchemy_performance(query: str) -> Optional[Dict[str, float]]:
    # Build the Databricks dialect URL dynamically from the shared config.
    db_url = (
        f"databricks://token:{dbsql_config['access_token']}"
        f"@{dbsql_config['server_hostname']}?http_path={dbsql_config['http_path']}"
    )

    start_time = datetime.now()
    try:
        engine = create_engine(db_url, **connect_args)
        with engine.connect() as conn:
            execution_start = datetime.now()
            conn.execute(text(disable_cached_results_str))  # turn off result caching
            result = conn.execute(text(query))
            execution_end = datetime.now()
            data = result.fetchall()
            fetch_end = datetime.now()

        # Serialize the result set to measure its size in MB.
        df = pd.DataFrame(data)
        serialized_df = pickle.dumps(df)
        data_size_bytes = len(serialized_df)
        data_size_mb = data_size_bytes / (1024 ** 2)

        timings = {
            'result_size_mb': data_size_mb,
            'connection_time': (execution_start - start_time).total_seconds(),
            'execution_time': (execution_end - execution_start).total_seconds(),
            'fetch_time': (fetch_end - execution_end).total_seconds(),
            'total_time': (fetch_end - start_time).total_seconds()
        }
        engine.dispose()  # release pooled connections
        return timings
    except Exception as e:
        logger.error(f"Error in SQLAlchemy performance measurement: {e}")
        return None
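A harness along these lines (a sketch, not the authors' exact code; the table name and the intermediate row limits are illustrative) would generate the 120 measurements described in the methodology, 20 runs for each row limit, with measure_fn standing in for either of the two functions above:

```python
from statistics import mean

# Illustrative limits spanning the 100,000-to-8,000,000 row range in the post.
ROW_LIMITS = [100_000, 500_000, 1_000_000, 2_000_000, 4_000_000, 8_000_000]
RUNS_PER_LIMIT = 20


def run_benchmark(measure_fn, row_limits=ROW_LIMITS, runs=RUNS_PER_LIMIT) -> dict:
    """Call `measure_fn` `runs` times per row limit and average each timing."""
    results = {}
    for limit in row_limits:
        # Illustrative table name; substitute your own Delta table.
        query = f"SELECT * FROM samples.tpch.lineitem LIMIT {limit}"
        timings = [t for t in (measure_fn(query) for _ in range(runs)) if t is not None]
        if timings:
            results[limit] = {key: mean(t[key] for t in timings) for key in timings[0]}
        else:
            results[limit] = None
    return results
```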

Key Implementation Details:

  1. Caching Disabled: We explicitly turn off results set caching to isolate the performance of the SQL execution.
  2. Universal Data Serialization: Both functions use pickle to serialize the data frames into bytes, ensuring that the size measurements are consistent and comparable between the two SQL dialects.

  3. Consistent Timing Metrics: Each function measures and returns detailed timing information, capturing the nuances of each step in the database query process across different query sizes.
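The serialization step both functions share can be isolated as a small helper; this sketch assumes pandas is available, as it is in the notebooks above:

```python
import pickle

import pandas as pd


def serialized_size_mb(rows) -> float:
    """Pickle the rows as a DataFrame and report the payload size in MB."""
    df = pd.DataFrame(rows)
    return len(pickle.dumps(df)) / (1024 ** 2)


size = serialized_size_mb([(i, f"row-{i}") for i in range(1_000)])
print(f"{size:.4f} MB")
```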

Results

The benchmark results revealed broadly consistent performance between the DBSQL Connector for Python used on its own and used with SQLAlchemy. Here, we discuss the key findings, focusing on each component of the query process, from connection to fetching results, for queries scaling from 100,000 to 8,000,000 rows. The test results are visualized in Databricks Lakeview.

Key Performance Metrics

Connection Times: Both SQLAlchemy and the DBSQL Connector established connections quickly, with SQLAlchemy taking 0.193 seconds and the DBSQL Connector 0.071 seconds. This difference is negligible in practice for most scenarios, though it may factor into the decision for low-latency use cases.

Benchmarking Connection Time in Seconds

Execution Times: It should be noted that because our query benefits from disk cache (https://docs.databricks.com/en/optimizations/disk-cache.html), the execution time isn’t a very meaningful metric. For completeness, we present it below:

Benchmarking Execution Time in Seconds

Fetch Times: This metric varied more significantly as the row count increased. The DBSQL Connector consistently outperformed SQLAlchemy, which aligns with expectations due to the overhead associated with SQLAlchemy’s ORM.

Benchmarking Fetch Time in Seconds

Total Times: Fetch times were the most significant component of total times. While both connectors scaled well, DBSQL consistently showed faster results, underscoring its efficiency in handling large datasets without performance tuning. However, it is important to remember that this includes execution time, which, as described above, isn’t meaningful in this situation.

Benchmarking Total Query Time in Seconds

Data Handling and Serialization: The benchmarks also measured the serialized data size, which remained consistent across both connectors for each limit, peaking at about 1.34 GB for 8 million rows. This consistency is essential for contextualizing timing results, as it ensures that the connector’s assessment is based on handling equivalent data volumes.

Serialized Data by Engine

Analytical Observations

  • ORM Overhead: As anticipated, SQLAlchemy’s ORM layer introduces some overhead, which is particularly visible in fetch times. This overhead is a trade-off for the flexibility and developer productivity benefits SQLAlchemy provides. However, considering the volume of rows and the size of data being handled, this overhead remains within an acceptable and manageable range.
  • Performance Without Tuning: Remarkably, both connectors delivered robust performance without any specific performance tuning, efficiently handling up to 500,000 rows in seconds and managing over 800 MB of data.
  • CloudFetch: The strong performance of both DBSQL and SQLAlchemy in our benchmarks is significantly influenced by their use of the CloudFetch feature, which optimizes the retrieval of large query results by leveraging cloud storage within the Databricks environment (see https://learn.microsoft.com/en-us/azure/databricks/integrations/odbc/capability).

Conclusions

This blog examined the performance of the Databricks SQL Connector, both on its own and when integrated with SQLAlchemy. Testing with data volumes exceeding 1 GB demonstrated the connector’s efficiency in managing large result sets, and although SQLAlchemy slightly reduces performance, the difference is minimal. The additional features and ease of use offered by SQLAlchemy often justify this minor performance impact in many real-world scenarios:

  • Databricks recommends using SQLAlchemy when its features are beneficial or for audiences already comfortable with it.
  • The standalone Databricks SQL Connector is the better choice for scenarios requiring maximum performance or where SQLAlchemy’s features are unnecessary.

Further research could investigate how various factors impact performance by examining the effects of network latency, which could provide a more nuanced understanding of each tool’s performance under diverse conditions.
