Moving On-Prem Oracle Databases to Snowflake

A Simple Way to Export from Oracle and Import to Snowflake


by Yu Fai Wong

Snowflake is a cloud-based data warehouse that delivers an outstanding performance-to-price ratio. However, in order to fully utilize it, you have to move data into it, either from your on-premise sources or cloud-based sources. Something that our firm, Hashmap, gets asked to do a lot is just this: set up the process and data pipeline to move source relational databases into Snowflake, mostly from Oracle and SQL Server RDBMS instances.

So, how do you export your data into a format that Snowflake can ingest, and how can you import this data into Snowflake as an initial load? At first glance, it may seem like this is a simple task; if done correctly it is, and I’ll try to help you avoid pitfalls along the way by showing you a simple way to export from Oracle RDBMS and load the data into Snowflake.

As a note, the focus of this post is an initial, one-time load. I won’t be discussing requirements for change data capture (CDC) or incremental loading.

An Example Use Case for an Initial One-Time Load

Use cases that come to mind for an initial load only are Auditing and Reporting. For example, you might want to audit all the transactions for a bank, but only from a specific year. This is a case where CDC wouldn’t be used, but you also don’t want to run unnecessary queries against a source system.

So let’s get started.

Step 1: Export from Oracle

While there are many ways to export data from an Oracle database, ideally we want a format that won’t require any transformations when we import into Snowflake. CSV, XML, and JSON are possible options, but CSV is the simplest. There are ways to export CSV using just select statements; however, that process is error-prone and doesn’t handle edge cases. Instead, we’ll use a free tool provided by Oracle called SQL Developer Command Line (SQLcl).

A note that my testing was done with Oracle 12.2.0.3 (aka 19c), and although I haven’t tested other Oracle versions, all versions currently under Oracle premier or extended support should work similarly (using the standard JDBC connection), including 11.2, 12.1, 12.2, and 12.2.0.2 (aka Oracle 18c).

SQLcl is packaged with SQL Developer, which is provided with every Oracle version.

Download SQLcl

SQLcl is essentially SQL*Plus with some additional features and capabilities that will make the exporting of data easier. To get SQLcl, it’s a simple download from oracle.com.

Connect to Your Database

Once you’ve unzipped the program, you can connect to your target database in the same way as SQL*Plus.

.\sql.exe username@database
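If the target isn’t defined in a tnsnames.ora entry, SQLcl also accepts the standard EZConnect form; the host, port, and service name below are placeholders:

.\sql.exe username@//dbhost:1521/ORCLPDB1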

Outputting the CSV File

First, since we plan to use the SPOOL functionality to write out files, we want to turn off FEEDBACK, since that setting controls the row-count message printed at the end of a select.

Second, in order to use SQLcl’s CSV output, you need to set the SQLFORMAT to CSV. There are other format options (html, xml, json); you can look them up with HELP SET SQLFORMAT.

Third, depending on your computer’s country and language settings, Oracle may default to certain formats. For the export to be a valid CSV, and for the date and time values to import cleanly, it’s important to adjust the NLS settings.

NLS_NUMERIC_CHARACTERS determines the decimal and thousands separator. The thousands separator is not normally shown, but the decimal one is.

Now, we are ready to output. All that’s left is to SPOOL to the file you want and run the select statement. The files can be split by filtering in the select statement. The /*+ PARALLEL */ comment is an Oracle optimizer hint that tells the database to execute the query in parallel; with more complicated conditions, it can speed up your query significantly.
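Putting the pieces together, a minimal SQLcl session looks something like the sketch below; the table, spool file name, and date/time formats are examples only, not taken from the original setup.

-- suppress the "n rows selected" message so it doesn't end up in the file
SET FEEDBACK OFF
-- write query results as CSV
SET SQLFORMAT csv
-- force predictable number and date/time formatting regardless of regional settings
ALTER SESSION SET NLS_NUMERIC_CHARACTERS = '.,';
ALTER SESSION SET NLS_DATE_FORMAT = 'YYYY-MM-DD HH24:MI:SS';
ALTER SESSION SET NLS_TIMESTAMP_FORMAT = 'YYYY-MM-DD HH24:MI:SS.FF';
ALTER SESSION SET NLS_TIMESTAMP_TZ_FORMAT = 'YYYY-MM-DD HH24:MI:SS.FF TZH:TZM';
-- spool the query output to a local file (EMPLOYEES is an example table)
SPOOL employees.csv
SELECT /*+ PARALLEL */ * FROM hr.employees;
SPOOL OFF

Whatever date and timestamp formats you choose here, keep a note of them; they need to match the timestamp format you give Snowflake during the copy in Step 2.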

You should now have a CSV file with a column header.

Step 2: Import to Snowflake

For this example, we’ll be using SnowSQL to stage our files into Snowflake, and then to copy into a table. SnowSQL is Snowflake’s easy-to-use, command-line client for connecting to Snowflake to execute SQL queries and perform all DDL and DML operations, including loading data into and unloading data out of database tables.

Connecting to Snowflake

In order to connect to a Snowflake instance, you need to provide the connection info. The easiest way to do this is to fill out the config file stored in your home directory, i.e. ~/.snowsql/config. On Windows, this will be in your user directory.

Your configuration will depend on the way you log in, but it will look similar to the below example if you’re using single sign-on.
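As a sketch only, a named connection for single sign-on could look like the following; the account, user, and warehouse values are placeholders, while the database and schema match the ones used later in this post.

[connections.oracle_migration]
accountname = <your_account_identifier>
username = <your_username>
authenticator = externalbrowser
warehousename = <your_warehouse>
dbname = HASHMAP_TRAINING_DB
schemaname = IMPORT_TEST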

Once the connection is configured, simply run SnowSQL.
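With a named connection like the sketch above, that means passing the connection name to the -c flag:

snowsql -c oracle_migration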

Using the PUT Command

The command to upload a file to a Snowflake stage is called PUT. There are many options for the PUT command, including where to stage the files. For this example, we’ll stage the file directly in the table’s internal stage in Snowflake. This assumes you’ve already created the table in Snowflake. As a note, PUT does not support uploading files to external (Amazon S3, Google Cloud Storage, or Microsoft Azure) stages.
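As a sketch, uploading the exported file to the EMPLOYEES table stage used throughout this example could look like this; the local file path is a placeholder, and PUT gzip-compresses the file by default:

put file://C:\exports\employees.csv @HASHMAP_TRAINING_DB.IMPORT_TEST.%EMPLOYEES;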

The reference documentation for the PUT command is available in the Snowflake documentation.

Checking Out Staged Files with the LIST Command

To see the files currently staged, you can run a command called LIST.

list @HASHMAP_TRAINING_DB.IMPORT_TEST.%EMPLOYEES;

Copying the CSV Into Snowflake

We’re now ready to copy directly into the table in Snowflake with the following command:
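The statement below is a sketch rather than the exact command from the original walkthrough: the table and table stage match the ones used above, while the timestamp format and file pattern are examples that must match what your export actually produced.

copy into HASHMAP_TRAINING_DB.IMPORT_TEST.EMPLOYEES
  from @HASHMAP_TRAINING_DB.IMPORT_TEST.%EMPLOYEES
  pattern = '.*employees.*[.]csv[.]gz'
  file_format = (type = csv
                 field_optionally_enclosed_by = '"'
                 skip_header = 1
                 timestamp_format = 'YYYY-MM-DD HH24:MI:SS.FF');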

A Quick Review of the Options

From: Specifies the stage to load from, and (not shown) can also contain an inner select statement over the staged files to reorder columns, apply transformations, or supply default values if necessary. Refer to the copy into command reference for examples.

File_format: Specifies the format of the file to import. It contains other sub-options that must be filled.

Type: CSV is what we’re using, but XML and JSON are also possible.

Field_optionally_enclosed_by: Specifies the character that encloses a single CSV field.

Skip_header: The number of rows to skip as part of the header. Since we output with column names, we need to skip one row.

Timestamp_format: This string tells Snowflake how to parse DateTime formats and timezones.

Pattern: A regex that indicates which files to copy from.

Validation_Mode: Not shown, but a way to validate the import file without committing anything to the table itself. Possible options include RETURN_<N>_ROWS, where <N> is the number of rows to attempt, and RETURN_ERRORS and RETURN_ALL_ERRORS, which return the errors encountered during the import.

You Are Done!

Once the copy completes, you can run a select in Snowflake and see the imported data.
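For example, a quick sanity check against the target table used above:

select count(*) from HASHMAP_TRAINING_DB.IMPORT_TEST.EMPLOYEES;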

A Quick Q&A On the Approach and Options

Are there any source Oracle database size limitations or restrictions?

Theoretically, there shouldn’t be, but the reality can be quite different. The main size limitations and restrictions actually come from the output file: because it’s uncompressed text, it may take up quite a bit of disk space. If the amount of data is very large, you may want to consider a solution using ETL software to more robustly handle the creation of staging files and any errors that might occur.

If the query is going to produce a large number of rows, you may have to window your output appropriately, as shown in the sketch below. Whether it’s through a function or a key, keeping the file sizes smaller will make the import easier. Snowflake recommends files of around 100 MB at a time.
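As a purely hypothetical sketch of windowing by a date key (the table and column names are made up for illustration), each run spools one slice of the data to its own file:

SPOOL transactions_2019.csv
SELECT /*+ PARALLEL */ *
  FROM bank.transactions
 WHERE txn_date >= DATE '2019-01-01'
   AND txn_date <  DATE '2020-01-01';
SPOOL OFF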

What about other Oracle database versions besides 19c?

As mentioned above, my testing was done with Oracle 12.2.0.3 (aka 19c), and although I haven’t tested other Oracle versions, all versions currently under Oracle premier or extended support should work similarly (using the standard JDBC connection), including 11.2, 12.1, 12.2, and 12.2.0.2 (aka Oracle 18c).

If you have one of those versions, you can test this out by connecting with SQLcl and using the SET SQLFORMAT command followed by a SELECT.

What is the latency associated with an on-prem Oracle to Snowflake one-time load?

The two biggest sources of latency will likely be your internet uplink speed and the query running time. Because Snowflake is a cloud data warehouse (it does not operate on-premise), transferring the files will usually take far longer than exporting the CSV or copying from the cloud stage into the tables themselves. Fortunately, SnowSQL compresses the CSV files before transferring them, but other cloud solutions may not be as optimized.

Since this method uses SPOOL, it requires a query to be run to output to a file. The more complex the query, the longer it may run. For the more complex queries, consider using a prebuilt view to improve performance.

Does it matter if my Oracle database is on-premise or in the cloud?

This was really aimed at on-premise Oracle databases. If the Oracle database is in a cloud instance, run the exports from that instance if possible, to avoid unnecessarily moving the CSV data along a cloud -> local -> cloud path.

What about doing a one-time load from an Oracle RAC environment?

RAC was not tested as part of this effort, but I anticipate that an Oracle RAC instance will work with this method. If the query is large, you will need to modify the way you generate output, since you will want to create a number of smaller files rather than one large file.

Does this approach allow for any type of CDC?

This approach could be adapted to a CDC model if you have an appropriate way to window your output (such as IDs or timestamps), but you may want to consider a more robust CDC solution such as Attunity, HVR, or other options, especially if you have a lot of RDBMS sources that will be incrementally loaded to Snowflake. This blog post was not meant to address CDC requirements; it covers only a one-time load.

Where could I use this approach where CDC is not required?

It was mentioned above, but auditing and reporting are use cases that come to mind. For example, you might want to audit all the transactions for a bank, but only from a specific year. This is a case where CDC won’t be used, but you also may not want to run unnecessary queries against a live source database.

Does this approach work only with Snowflake or could it be used for other cloud data warehouses?

Since the output is CSV, you can import it into anything you want. You would just need to follow the cloud data warehouse specific methods of import.

Final Thoughts

The devil is always in the details. Small errors in the formatting of timestamps, numbers, and columns can all contribute to failing loads.

For instance, if I had known that regional formatting was going to be an issue, I would have set my region to English (US) from the start. That is something to consider if the export is being run from a VM. Oracle’s SQLcl has its own ideas of what the regional formatting should be, irrespective of your operating system’s settings.

Hopefully, what I’ve shared through my experience helps you out and adds another low-cost method to your Snowflake toolbag for quickly getting initial one-time Oracle database loads completed (and avoiding pitfalls) so that you can continue to get the most value out of Snowflake!

Need Snowflake Cloud Data Warehousing and Migration Assistance?

If you’d like additional assistance in this area, Hashmap offers a range of enablement workshops and consulting service packages as part of our consulting service offerings, and would be glad to work through your specifics in this area.

How does Snowflake compare to other data warehouses? Our technical experts have implemented over 250 cloud/data projects in the last 3 years and conducted unbiased, detailed analyses across 34 business and technical dimensions, ranking each cloud data warehouse.
