Announcing ScalarDB Analytics with PostgreSQL

Akihiro Okuno
Published in Scalar Engineering
7 min read · Aug 17, 2023

With the recent release of ScalarDB 3.9, we introduced ScalarDB Analytics with PostgreSQL, which lets users run unified analytical queries across the databases that ScalarDB manages. This article gives an overview of ScalarDB Analytics with PostgreSQL and explains how to use it.

What is ScalarDB Analytics with PostgreSQL?

ScalarDB is a general-purpose transaction manager that allows users to perform transaction processing across multiple databases, including relational and NoSQL databases. ScalarDB primarily targets workloads that process a large number of simple transactions; thus, it supports only a limited class of operations, namely key-based CRUD and a simple scan operation. Because of these limitations, users who want to perform advanced processing, such as ad-hoc analysis of the data that ScalarDB manages, have to implement complex operations like joins and aggregations on the application side themselves.

We introduced ScalarDB Analytics with PostgreSQL to support a wide range of queries, including joins and aggregations, against ScalarDB-managed databases, enabling advanced processing such as ad-hoc analysis. As the name suggests, ScalarDB Analytics with PostgreSQL uses PostgreSQL to execute queries over the data that ScalarDB manages, so users can run any query that PostgreSQL supports. It reads data directly from the databases that ScalarDB manages, so data written through ScalarDB is immediately readable without any ETL processing. This read path is implemented with foreign data wrappers (FDWs), a PostgreSQL extension framework for accessing external data.

The relationship between ScalarDB and ScalarDB Analytics with PostgreSQL is illustrated below. Users read and write data through ScalarDB as usual, and when they want to run advanced queries, they query through ScalarDB Analytics with PostgreSQL. PostgreSQL reads data directly from the ScalarDB-managed databases via FDWs, so ScalarDB Analytics with PostgreSQL can read data written through ScalarDB without any additional operations.

ScalarDB Analytics with PostgreSQL has several limitations. First, it supports only read operations. PostgreSQL itself supports write operations, of course, but it cannot write to a database under ScalarDB management through an FDW, so attempts to execute commands such as INSERT will cause an error. Second, reads through ScalarDB Analytics with PostgreSQL are guaranteed only the Read Committed isolation level, while reads and writes through ScalarDB guarantee the strongest isolation level, strict serializability. Under Read Committed isolation, every row retrieved by a read is guaranteed to have been committed by a previous transaction, but the results may still contain read skew and other anomalies.
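In practice, the read-only restriction looks like the following hypothetical session: reads through the view work as usual, while writes are rejected (the table and column names reuse the example later in this article, and the exact error text may differ):

```sql
-- Reads through the WAL-interpreted view work like any PostgreSQL query.
SELECT c_name, c_acctbal
FROM dynamons.customer
LIMIT 10;

-- Writes are rejected because the underlying FDW is read-only.
INSERT INTO dynamons.customer (c_custkey, c_name)
VALUES (1, 'Alice');
-- ERROR: the FDW does not implement insert support
```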

How does ScalarDB Analytics with PostgreSQL work?

The architecture of ScalarDB Analytics with PostgreSQL is as follows:

The boxes at the bottom are the databases that ScalarDB manages. To read the data written to these databases from PostgreSQL, ScalarDB Analytics with PostgreSQL uses FDWs. The PostgreSQL community develops and maintains many FDW implementations, such as jdbc_fdw and cassandra_fdw, and ScalarDB Analytics with PostgreSQL uses such community-maintained FDWs internally. However, ScalarDB also supports databases that existing FDWs don't cover, such as DynamoDB. To read data from those databases, we developed our own FDW, scalardb_fdw, which uses the Java Native Interface (JNI) to embed ScalarDB as a library inside the FDW and reads data from the external databases via ScalarDB scan operations.
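The setup for scalardb_fdw can be sketched roughly as follows. The server name, config file path, and column list here are illustrative, not the exact DDL the tool emits; the namespace and table_name options mirror the foreign-table listing shown later in this article:

```sql
-- Install the FDW and point it at a ScalarDB configuration file
-- (server name and path are illustrative).
CREATE EXTENSION scalardb_fdw;

CREATE SERVER scalardb FOREIGN DATA WRAPPER scalardb_fdw
    OPTIONS (config_file_path '/etc/scalardb/scalardb.properties');

CREATE USER MAPPING FOR PUBLIC SERVER scalardb;

-- Map a ScalarDB table to a PostgreSQL foreign table.
CREATE FOREIGN TABLE dynamons._customer (
    c_custkey integer,
    c_name    text
    -- ... remaining user columns and WAL metadata columns ...
) SERVER scalardb
  OPTIONS (namespace 'dynamons', table_name 'customer');
```

Schema Importer (described below) generates this kind of configuration automatically, so users normally never write it by hand.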

Since scalardb_fdw reads data through ScalarDB, it inherently supports every database that ScalarDB supports. For databases that existing FDWs also cover, there is therefore a choice of which FDW to read with. With ScalarDB Analytics with PostgreSQL, Schema Importer (described below) determines which FDW to use and configures it automatically, so users don't need to think about this choice. Schema Importer prefers an existing FDW when one is available and falls back to scalardb_fdw otherwise, because an FDW developed for a particular database often supports more complex scans and thus provides better performance for complex queries.

In PostgreSQL, users define a foreign table to treat external data as if it were a PostgreSQL table. ScalarDB Analytics with PostgreSQL creates one foreign table for each table defined in ScalarDB. These ScalarDB tables contain the columns that the user defines plus write-ahead logging (WAL) metadata columns that ScalarDB uses internally. ScalarDB Analytics with PostgreSQL also uses these WAL metadata columns to create a WAL-interpreted view.

ScalarDB Analytics with PostgreSQL creates a WAL-interpreted view for each foreign table. The WAL-interpreted view is responsible for:

  • Interpreting the WAL metadata in the ScalarDB table to show only valid rows under the Read Committed isolation level to the user.
  • Providing the same schema as the table on the ScalarDB side without the WAL metadata.

Because of this, users can treat the WAL-interpreted views as equivalent to the tables in ScalarDB.
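Conceptually, a WAL-interpreted view is a projection plus a visibility rule over the foreign table: for a committed row it exposes the current values, and otherwise it falls back to the last committed (before_) image. The sketch below illustrates the idea only; the actual view that Schema Importer generates is more involved, and the tx_state value used for "committed" (3) is an assumption here:

```sql
-- Sketch of a WAL-interpreted view over the foreign table shown
-- later in this article. Key columns have no before_ image.
CREATE VIEW dynamons.customer AS
SELECT
    c_custkey,
    CASE WHEN tx_state = 3 THEN c_name    ELSE before_c_name    END AS c_name,
    CASE WHEN tx_state = 3 THEN c_acctbal ELSE before_c_acctbal END AS c_acctbal
    -- ... and likewise for the remaining user columns ...
FROM dynamons._customer;
```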

Schema Importer is a CLI tool that automatically configures the database objects described above. It reads the ScalarDB configuration file, retrieves the schemas of the tables defined in ScalarDB, and creates the corresponding foreign tables and WAL-interpreted views, in that order.

Example of ScalarDB Analytics with PostgreSQL

The following example describes what kind of WAL-interpreted view Schema Importer actually creates.

Suppose the tables dynamons.customer, postgresns.order, and cassandrans.lineitem are created in ScalarDB and are stored in DynamoDB, PostgreSQL, and Cassandra, respectively.

If you run Schema Importer for these tables, it creates a foreign table and a WAL-interpreted view for each of the three tables. For instance, for the `dynamons.customer` table, a foreign table named `dynamons._customer` and a view named `dynamons.customer` are created:

                            Foreign table "dynamons._customer"
Column | Type | Collation | Nullable | Default | FDW options
------------------------+------------------+-----------+----------+---------+-------------
c_custkey | integer | | | |
c_name | text | | | |
c_address | text | | | |
c_nationkey | integer | | | |
c_phone | text | | | |
c_acctbal | double precision | | | |
c_mktsegment | text | | | |
c_comment | text | | | |
tx_id | text | | | |
tx_version | integer | | | |
tx_state | integer | | | |
tx_prepared_at | bigint | | | |
tx_committed_at | bigint | | | |
before_tx_id | text | | | |
before_tx_version | integer | | | |
before_tx_state | integer | | | |
before_tx_prepared_at | bigint | | | |
before_tx_committed_at | bigint | | | |
before_c_name | text | | | |
before_c_address | text | | | |
before_c_nationkey | integer | | | |
before_c_phone | text | | | |
before_c_acctbal | double precision | | | |
before_c_mktsegment | text | | | |
before_c_comment | text | | | |
Server: multi_storage_dynamodb
FDW options: (namespace 'dynamons', table_name 'customer')

View "dynamons.customer"
Column | Type | Collation | Nullable | Default
--------------+------------------+-----------+----------+---------
c_custkey | integer | | |
c_name | text | | |
c_address | text | | |
c_nationkey | integer | | |
c_phone | text | | |
c_acctbal | double precision | | |
c_mktsegment | text | | |
c_comment | text | | |

The dynamons.customer view has the same schema as the original table in ScalarDB. Users can query this view as if it were a table in ScalarDB. Since PostgreSQL supports a variety of analytical queries, users can run, for example, the following queries:

SELECT
    l_returnflag,
    l_linestatus,
    sum(l_quantity) AS sum_qty,
    sum(l_extendedprice) AS sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
    avg(l_quantity) AS avg_qty,
    avg(l_extendedprice) AS avg_price,
    avg(l_discount) AS avg_disc,
    count(*) AS count_order
FROM
    cassandrans.lineitem
WHERE
    to_date(l_shipdate, 'YYYY-MM-DD') <= date '1998-12-01' - 3
GROUP BY
    l_returnflag,
    l_linestatus
ORDER BY
    l_returnflag,
    l_linestatus;

l_returnflag | l_linestatus | sum_qty | sum_base_price | sum_disc_price | sum_charge | avg_qty | avg_price | avg_disc | count_order
--------------+--------------+---------+--------------------+--------------------+--------------------+---------------------+--------------------+---------------------+-------------
A | F | 1519 | 2374824.6560430005 | 1387363.5818635763 | 1962762.9341866106 | 26.6491228070175439 | 41663.590456894744 | 0.4150182982456142 | 57
N | F | 98 | 146371.22954200002 | 85593.92837883368 | 121041.52567369482 | 32.6666666666666667 | 48790.409847333336 | 0.4098473333333333 | 3
N | O | 5374 | 8007373.247144971 | 4685645.630765834 | 6624209.157932242 | 24.4272727272727273 | 36397.15112338623 | 0.414759749999999 | 220
R | F | 1461 | 2190869.967642001 | 1284177.8484816086 | 1814150.7929095028 | 25.1896551724137931 | 37773.62013175864 | 0.41323520689655185 | 58
(4 rows)
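Since each ScalarDB-backed table surfaces as an ordinary view, queries can also join across the underlying databases in a single statement. The following sketch joins the DynamoDB-backed customer table with the Cassandra-backed lineitem table through the PostgreSQL-backed order table; the o_custkey and o_orderkey column names are assumptions based on the TPC-H-style naming of the example schema:

```sql
-- A cross-database join: customer lives in DynamoDB, order in
-- PostgreSQL, and lineitem in Cassandra. "order" is a reserved
-- word in SQL, so the table name must be quoted.
SELECT
    c.c_mktsegment,
    count(*) AS num_lineitems,
    sum(l.l_extendedprice * (1 - l.l_discount)) AS revenue
FROM dynamons.customer c
JOIN postgresns."order" o ON o.o_custkey = c.c_custkey
JOIN cassandrans.lineitem l ON l.l_orderkey = o.o_orderkey
GROUP BY c.c_mktsegment
ORDER BY revenue DESC;
```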

Wrap-up

This article introduced ScalarDB Analytics with PostgreSQL and how it works. To further enhance the value of the data that ScalarDB manages, Scalar plans to enable data analysis not only with PostgreSQL but also in various other environments in the future.


Akihiro Okuno
Scalar Engineering

Software Engineer, Ph.D., specializing in database engineering.