How to bring your Oracle Autonomous Database data into the Lakehouse

Laurent Leturgez
8 min read · Sep 20, 2022

Throughout my career, I have worked a lot with Oracle Databases, first as a developer and then as a consultant. I am a two-time Oracle Certified Master and have been part of the Oracle ACE program (as an alumnus now 😉).

Today, after four years as a cloud and data architect, I'm still working with data as a Solutions Architect at Databricks … and guess what, I ❤️ that.

At Databricks, I have learned a lot about the data architecture known as the Lakehouse and how it can unlock business value from customers' data. How? Easy ... all enterprise data sits in one place, and that data can be used for every analytics workload in the company, from data ingestion (including streaming) to data engineering, data science, and finally data and business analytics. To do that, Databricks provides a single platform where all of these people can collaborate in various languages (Python, SQL, Scala, R), built on a solid foundation of popular open source projects (Apache Spark, MLflow, Delta Lake, etc.) and standards.

OK, that is the Lakehouse. But beyond that, most of my previous customers were Oracle fans and used it in many workloads.

Some of them run OLTP workloads on Oracle on premises or on Oracle Autonomous Database and want to ingest and transform their data into the Lakehouse for data engineering, data science, and data analytics purposes.

Others used it for analytics but decided to move to a modern data platform like Databricks. They wanted to keep using Oracle for their data exposure layer but adopt Databricks for their ETL and data science workloads.

Because Databricks is built on standards, it can be fully integrated with Oracle in many ways. In this blog post, I will show how to use Databricks with Oracle Autonomous Database on Oracle Cloud Infrastructure.

Connect Databricks to Oracle Autonomous Database (ADB): Batch mode

In this part, I will explain how to connect Databricks to ADB and copy data from ADB to a Delta Table in batch mode.

To do that, I will use a Databricks notebook to connect to ADB, read data from Oracle, and write it into the bronze layer of our Lakehouse.

I will use JDBC to connect to ADB and transfer data to the Lakehouse. But as connecting to ADB can be quite specific, I will explain how to configure Databricks, the ADB, and everything you have to take care of in order to get your data into the Lakehouse.

For this example, I used an Oracle Cloud Infrastructure (OCI) Always Free tier account, which provides up to two Autonomous Databases, and more specifically the Autonomous Transaction Processing (ATP) flavor of Autonomous Database, which is built for OLTP workloads.

Step #1, spin up a cluster on Databricks and configure it

For the cluster, I used an "All Purpose" cluster running a Databricks Runtime (DBR) 10.4 LTS with 2 worker nodes (and a driver). But sizing will depend on your workload, data volume, Oracle partitioning, etc.
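
If you prefer to script this, here is a minimal sketch using the Databricks Clusters API from Python; the workspace URL, token, and node type are placeholders/assumptions (the node type in particular is cloud-specific).

```python
# Hypothetical sketch: create a small all-purpose cluster running DBR 10.4 LTS
# through the Databricks Clusters API. URL, token and node type are placeholders.
import requests

host = "https://<your-workspace-url>"
token = "<your-personal-access-token>"

resp = requests.post(
    f"{host}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "cluster_name": "oracle-adb-ingest",
        "spark_version": "10.4.x-scala2.12",  # DBR 10.4 LTS
        "node_type_id": "i3.xlarge",          # assumption: AWS node type, adapt to your cloud
        "num_workers": 2,
        "autotermination_minutes": 60,
    },
)
resp.raise_for_status()
print(resp.json()["cluster_id"])
```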

Then, I installed a couple of libraries on the cluster.

  • The first bunch of jars is related to JDBC connectivity. For that purpose, I used the Zipped JDBC driver (ojdbc8.jar) and Companion Jars archive available at this URL: https://www.oracle.com/fr/database/technologies/appdev/jdbc-downloads.html.
    Download it, uncompress it, and install all the libraries on the cluster. To do that, you can use the UI, the API, or the CLI (a sketch using the API is shown after this list).
    Please note that when you upload jars to the cluster, the library is copied into DBFS (the Databricks File System, a distributed file system mounted into a Databricks workspace and available on Databricks clusters; DBFS is an abstraction on top of scalable object storage that provides an optimized FUSE (Filesystem in Userspace) interface that maps to native cloud storage API calls).
    This enables you to reuse those jars on another cluster or at the next cluster startup.
  • As the DBR uses JDK 8 (https://docs.databricks.com/release-notes/runtime/10.4.html), I downloaded and installed the JCE extension files available at this URL: https://www.oracle.com/java/technologies/javase-jce8-downloads.html
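
As an illustration of the API route, here is a minimal sketch that installs uploaded jars on a cluster through the Databricks Libraries API; the workspace URL, token, cluster id, and DBFS paths are placeholders to adapt to your own upload.

```python
# Hypothetical sketch: install previously uploaded Oracle jars on a cluster
# through the Databricks Libraries API. All values below are placeholders.
import requests

host = "https://<your-workspace-url>"
token = "<your-personal-access-token>"
cluster_id = "<your-cluster-id>"

jars = [
    "dbfs:/FileStore/jars/ojdbc8.jar",   # example DBFS locations of the uploaded jars
    "dbfs:/FileStore/jars/ucp.jar",
    "dbfs:/FileStore/jars/oraclepki.jar",
]

resp = requests.post(
    f"{host}/api/2.0/libraries/install",
    headers={"Authorization": f"Bearer {token}"},
    json={"cluster_id": cluster_id, "libraries": [{"jar": j} for j in jars]},
)
resp.raise_for_status()
```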

Step #2, configure Oracle Autonomous Database

Once the autonomous database has been created on OCI, there are three things to do:

  • Enable network ACLs and allow the following CIDR: 0.0.0.0/0. (A smaller range is possible depending on your underlying network infrastructure and the network architecture between your Databricks deployment (AWS, Azure, or GCP) and OCI.) To do that, go to the main page of your ADB, locate the Network part in the "Autonomous Database information" tab, and click "Edit" next to "Access control list".
  • Because I will use user/password authentication, I will not use mTLS. To disable it, on the ADB main page, in the Network section, click the "Edit" link related to Mutual TLS (mTLS) authentication and untick the related setting.
  • Note the TLS connection string in the "DB Connection" part; it will be used to configure the Spark session.

Step #3, connect to the database, query it, and put the data into a Delta Table

Once the configuration is done on both sides, Databricks and OCI, you can use a notebook (or a Java/Scala program) in your Databricks environment to initiate a JDBC connection to your Autonomous Database.

For that purpose, you will need to have:

  • a JDBC URL which targets the TLS endpoint of your choice
  • the Oracle JDBC driver class to use: oracle.jdbc.driver.OracleDriver
  • a username and a password. For the password, I recommend storing it in a secure secret scope (Databricks-backed or backed by an Azure Key Vault).

See the Databricks documentation to configure and use Databricks-backed secret scopes, or the Azure Key Vault-backed secret scope documentation if you go that route; both cover secrets management in Databricks in detail.
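
As a minimal illustration, assuming you already created a Databricks-backed secret scope named oracle holding the ADB password under the key adb_password (both names are assumptions), the notebook can read it like this:

```python
# Assumption: a secret scope "oracle" with a key "adb_password" was created beforehand,
# e.g. with the legacy CLI: databricks secrets put --scope oracle --key adb_password
adb_password = dbutils.secrets.get(scope="oracle", key="adb_password")
```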

The following code creates a JDBC connection to the Autonomous Database and builds a Spark DataFrame from the target table (SH.CUSTOMERS).
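
Since the original notebook snippet is not embedded here, below is a minimal PySpark sketch of that read. The JDBC URL (built from the TLS connection string noted in step #2), the ADMIN user, and the secret scope/key names are placeholders to adapt.

```python
# Minimal sketch of the JDBC read; runs in a Databricks notebook where `spark`
# and `dbutils` are predefined. All host/service/user values are placeholders.
jdbc_url = (
    "jdbc:oracle:thin:@(description=(retry_count=20)(retry_delay=3)"
    "(address=(protocol=tcps)(port=1521)(host=adb.<region>.oraclecloud.com))"
    "(connect_data=(service_name=<prefix>_<dbname>_high.adb.oraclecloud.com))"
    "(security=(ssl_server_dn_match=yes)))"
)

df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("driver", "oracle.jdbc.driver.OracleDriver")
    .option("dbtable", "SH.CUSTOMERS")
    .option("user", "ADMIN")  # adapt to your database user
    .option("password", dbutils.secrets.get(scope="oracle", key="adb_password"))
    .option("fetchsize", "10000")  # see the note below about the driver's default fetch size
    .load()
)
```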

NB: this is a really basic example. If, for instance, you have to deal with a partitioned table on the Oracle side, you will need to take the partitioning options into account to align Spark partitioning with the underlying partitioning strategy. See the options partitionColumn, lowerBound, upperBound, numPartitions (and, for pushdown, pushDownAggregate and pushDownLimit) in the Spark documentation: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
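
For illustration only, a partition-aware read could look like the sketch below; CUST_ID, the bounds, and the partition count are example values, not recommendations.

```python
# Hypothetical partition-aware read: Spark opens numPartitions parallel JDBC
# connections, each reading a range of the (numeric) partition column.
df_part = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("driver", "oracle.jdbc.driver.OracleDriver")
    .option("dbtable", "SH.CUSTOMERS")
    .option("user", "ADMIN")
    .option("password", dbutils.secrets.get(scope="oracle", key="adb_password"))
    .option("partitionColumn", "CUST_ID")  # numeric, date or timestamp column
    .option("lowerBound", "1")             # example bounds, ideally the min/max of CUST_ID
    .option("upperBound", "100000")
    .option("numPartitions", "8")
    .option("fetchsize", "10000")
    .load()
)
```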

NB: Especially with the Oracle JDBC driver, it is a best practice to set fetchsize, because the driver's default fetch size is as low as 10 rows.

Once the DataFrame is created, you can easily write its content to a Delta Table:
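
Here again the original snippet is not embedded, so this is a minimal sketch; the target table name bronze.customers (and the existence of a bronze schema) is an assumption.

```python
# Minimal sketch: persist the DataFrame as a Delta table in the bronze layer.
# Assumes a "bronze" schema/database already exists in the metastore.
(
    df.write.format("delta")
    .mode("overwrite")
    .saveAsTable("bronze.customers")
)
```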

It is now easy to query the table and, optionally, present the result directly on a plot. Or even better, run some analytical queries that combine this data with data coming from other systems, or featurize it and then train an ML model on top of it … this is the Lakehouse.
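
For example, still using the hypothetical bronze.customers table (and a column from the SH.CUSTOMERS sample schema):

```python
# Quick aggregation on the bronze table; display() in a Databricks notebook can
# render the result as a table or a plot.
result = spark.sql("""
    SELECT CUST_STATE_PROVINCE, COUNT(*) AS nb_customers
    FROM bronze.customers
    GROUP BY CUST_STATE_PROVINCE
    ORDER BY nb_customers DESC
""")
display(result)
```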

Connect Databricks to Oracle and import data in real time using CDC

A key point of the Lakehouse is the ability to process data in real time. Data can come from various messaging and queueing services (such as Event Hubs, Kafka, etc.) as well as from an RDBMS like Oracle. But to do that with a database, solutions usually need to extract changes from the transaction log files and translate them into DML statements.

Delta Lake is a foundation of the Databricks platform and of the Lakehouse. As an ACID-compliant format, it makes it easy to apply these changes to Delta Tables.

Another foundation of Databricks is its ability to integrate with many ISV partners and use their solutions, thanks to standards.

When talking about Change Data Capture (CDC) solutions that are Oracle-compatible and tightly integrated with Databricks, you can use Fivetran, HVR, or the Arcion platform.

Fivetran is a Databricks partner that is part of the Partner Connect program. They recently acquired HVR, which provides a very interesting solution to extract CDC data from Oracle log files and insert it into target systems, and Databricks is one of those targets.

HVR is compatible with Autonomous Database.

If you are looking for more information about HVR, and especially about connecting HVR to Oracle, see the HVR documentation.

Arcion, like Fivetran, is part of the Partner Connect program, so the solution (especially Arcion Cloud) is tightly integrated with Databricks.

Arcion provides a solution for real-time enterprise database replication, and it comes in two flavors:

  • Arcion Cloud is a fully managed solution to replicate your data. It is the best fit for sources and targets that are available in the cloud.
  • Arcion is also available as a self-hosted version that can run on a VM. This is better suited for source databases that run on premises. It runs as fast as the Arcion Cloud version; you just have to manage the source and target declarations and a couple of extra parameters.

Configuring Arcion to replicate Oracle to Databricks in real time is easy, especially with Arcion Cloud.

In the following videos, you can see how to replicate Oracle to Databricks with Arcion Cloud and with the Arcion self-hosted version.

Video: Oracle to Databricks real-time replication using Arcion Cloud and Partner Connect
Video: Oracle to Databricks real-time replication using Arcion self-hosted

Oracle has been used for years and is still widely used. With the Lakehouse, you can pull data from many data sources, including Oracle databases, regardless of whether they run on premises or in the cloud (even on Oracle Cloud Infrastructure). This data can be pulled in real time from Oracle sources and used at scale in the Lakehouse for various workloads (data analytics, data science, and machine learning).

With all your data in one place, whether structured, semi-structured, or unstructured, you will be able to deliver the value you have always been looking for by implementing descriptive, predictive, and prescriptive analytics in a single place … the Lakehouse.
