Importing SQL Server data into BigQuery

A list of four approaches for a one-off data dump from an RDBMS like SQL Server to BigQuery, and an in-depth look at how to use Apache Sqoop and Google Dataproc to accomplish this.

Jonathan Merlevede
datamindedbe
15 min read · May 15, 2020


I recently had to migrate around 90GB of data from a SQL Server database to BigQuery. That’s about 600 million records, spread out over 420 tables. Although this seems like a straightforward task, it turns out that SQL Server backups can only be restored into another SQL Server instance. What to do?

Migration. Source: Ellen26 on Pixabay.

By browsing the internet (one of my top skills ;-)) and asking around about how to migrate SQL Server data, I compiled a list of four common approaches, ordered by decreasing number of times I saw them mentioned:

  1. Export the data from SQL Server in CSV format. Upload the CSVs to Cloud Storage, and import using BigQuery’s import from CSV feature.
  2. Use a managed, commercial ETL tool such as Stitch.
  3. Use Python with Pandas to load data from SQL Server (using pyodbc and SQL Server’s ODBC driver) and save to BigQuery (using pandas_gbq or the official BigQuery client library). Approaches using other programming languages or libraries, such as Spark, also fall in this category.
  4. Use Apache Sqoop to extract the data to Avro files on Cloud Storage, then import these files using BigQuery’s import from Avro feature.

Each of these approaches can work, and might be the best solution depending on your specific circumstances. I briefly looked into each of them before settling on Apache Sqoop. In the remainder of this story, I’ll go over all four approaches and explain why I ruled out the first three. I’ll then explain how to use Sqoop on GCP.

This story applies equally to other structured database sources such as an Oracle database or a MySQL database, and might even apply to you if your data’s destination is not BigQuery.

Four approaches

CSV file imports and Pandas

Yes, you read that correctly! (source, license).

I’m discussing these two approaches together. Migration using CSV files or Pandas looks easy, but requires more effort than expected: you either have to export the CSV files, or set up a Python/Pandas environment and write some custom glue logic, and in both cases you have to deal with type-related issues. I did not end up using either approach, mainly because it’s easy to run into issues with schema auto-detection and schema incompatibilities between SQL Server and BigQuery.

Data types often cause problems when using methods that rely on schema auto-detection

  • From SQL Server, the only straightforward way to export to CSV was to export one table at a time, using a graphical user interface. At 420 tables, that’s a lot of clicking. For Pandas, we can get the list of all tables in the source database with some basic scripting.
  • CSV files do not include type information. When importing large CSV files using schema auto-detection, BigQuery looks at only a small subset of the data (100 (!) rows) to determine data types. I found that the inferred data types are often wrong for columns that are frequently absent (NULL). Records that do not match the inferred schema do not get inserted.
  • Pandas infers the data type of nullable integers as floats, which is annoying and can cause rounding issues when working with very large integer values.
  • Although Pandas uses the SQL Server types and not type inference, I did experience undesired behavior with chunked reads when a chunk contained a column consisting of only NULL values. Writing such a chunk caused the BigQuery table to be created with the wrong data type, preventing insertion of later chunks.¹ In my case, “NULL chunks” were fairly common, as columns that were added later would be NULL for all old records.
  • Issues related to types can be addressed by specifying data types manually (by passing a schema to bq load or by using dtype and pd.Int64Dtype; see the sketch after this list), but we did not want to do this for 420 tables. If you are dealing with a smaller number of tables, or if you are more patient than I am, then this might be fine for you.
  • You can run into problems when columns in SQL Server have names that are illegal in BigQuery. You end up writing ad-hoc rules for rewriting column names. Not hard, and necessary, but annoying.
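
To give an idea of what the Pandas route involves, here is a minimal sketch of a chunked copy of a single table, with a nullable integer column forced to Pandas’ nullable Int64 type and the upload done through the official BigQuery client library. It assumes pandas, SQLAlchemy, pyodbc, pyarrow and google-cloud-bigquery are installed; the connection string, table and column names are placeholders, and you would still have to repeat this (and pick the right dtypes) for every table.

# Minimal sketch of the Pandas approach (not what I ended up using).
# Connection details, table and column names are placeholders.
import pandas as pd
import sqlalchemy
from google.cloud import bigquery

engine = sqlalchemy.create_engine(
    "mssql+pyodbc://user:password@sqlhost/mydb"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)
client = bigquery.Client()

# Force nullable integer columns to Pandas' nullable Int64 type, so that NULL
# values do not silently turn the column into floats.
dtypes = {"some_nullable_int_column": pd.Int64Dtype()}

for chunk in pd.read_sql("SELECT * FROM exampletable", engine, chunksize=50000):
    chunk = chunk.astype(dtypes)
    # The client library serializes the dataframe to Parquet and runs a load
    # job; the first chunk creates the table, later chunks are appended.
    client.load_table_from_dataframe(
        chunk, "my-project.my_dataset.exampletable"
    ).result()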

Instead of using Pandas, you can use Spark, which also uses the schema of the source database to determine the column types of the resulting data frame, and which is better equipped to copy very large tables that do not fit in memory. If you are familiar with Spark and have an environment set up, using Spark is a good option.

For both Pandas and Spark, instead of writing to BigQuery directly, consider dumping your database to Parquet or Avro files on Cloud Storage and then using bq load to load the data into BigQuery. I describe how to do this in more detail in the last part of this story. This is actually similar to how pandas_gbq and the BigQuery client library work under the hood,² but it gives you more control and lets you load all chunks of a chunked read with a single load command.³ A rough sketch of this route with Spark follows below.
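
The sketch assumes a Spark environment with the SQL Server JDBC driver and the Cloud Storage connector available; the host, credentials, bucket and table names are placeholders.

# Rough sketch: read a table over JDBC with Spark and dump it as Parquet on
# Cloud Storage, to be loaded into BigQuery afterwards with bq load.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sqlserver-dump").getOrCreate()

# Spark takes the column types from the source schema, so no type inference
# is involved.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://sqlhost;database=mydb")
    .option("dbtable", "exampletable")
    .option("user", "myuser")
    .option("password", "mypassword")
    .load()
)

# Write the table as Parquet to Cloud Storage. A single command such as
#   bq load --source_format=PARQUET my_dataset.exampletable "gs://my-bucket/output/exampletable/*.parquet"
# then loads it into BigQuery.
df.write.mode("overwrite").parquet("gs://my-bucket/output/exampletable")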

¹: Pandas uses SQLAlchemy for reading data, which should fetch data from SQL Server using the correct types derived from the SQL Server types. The issues that I encountered were likely not due to Pandas and pyodbc but due to pandas_gbq, although I didn’t investigate this further. Note that Google now supports loading and saving Pandas dataframes directly from the official BigQuery client library, which offers some advantages over pandas_gbq and is similarly easy to use; consider using the official Google library instead.
²: pandas_gbq uploads the data as CSV, whereas the BigQuery client library uses Parquet.
³: Use load wildcards to import multiple files with the same schema into a single data set using a single load command.

Managed solutions

As for managed solutions, I only looked into Stitch, which I have used before and like to work with. I opted not to use it because of speed and an overly clicky interface. This is mostly because Stitch is targeted at keeping a source (SQL Server) and destination (BigQuery) synchronized, not at one-off data transfers.

Stitch seems targeted at keeping a source and destination synchronized, not one-off data transfers

  • Data type conversion using Stitch is great. Unlike the CSV or Pandas option, Stitch does not infer the data types from the data, but looks at the table’s schema in SQL Server to determine the appropriate BigQuery data types.
  • Every table has to be added to the import configuration individually and manually. For every table I selected, I had to choose how I wanted to keep source and destination in sync (I didn’t want to keep them in sync at all!). For me, this was the main reason not to use Stitch. It is likely possible to automate this using Stitch’s API.
Each table has to be added to the sync manually, which requires you to click a lot.
  • Stitch was slower than I expected it to be. This could be a concern if your database is really big. During my test, it took Stitch 80 minutes to extract 2.5GB of data (eight million rows). Extrapolation of the extraction time to my ~90GB of data yields an extraction time of two days. If your database has 1TB of data, that would be about three weeks. Because the import using Sqoop was much quicker, I know that my database is not the limiting factor here.
  • Cost could be a concern. Historical imports are free when you use an integration you haven’t used before, and Stitch’s trial offers unlimited data volume for 14 days. However, if you’re already using Stitch and have tried out the SQL Server integration, your historical import could count towards your monthly quota. If this is the case, try contacting Stitch (or create a new account, if this is allowed).

Apache Sqoop

Apache Sqoop is a FOSS tool running on Hadoop, specifically designed for transferring data from structured databases to Parquet or Avro files on HDFS (or into Apache Hive). Unlike CSV files, data in Avro files is typed, and like Stitch, Sqoop looks at the schema of the origin database to determine the correct data type to use in the Avro file. Once we have Avro files, it is easy to import them into BigQuery using bq load (for free!).

The Sqoop project is not active, but its latest release is stable and mostly feature-complete

I didn’t opt for Sqoop at first, because it seemed like the Sqoop project had been slowly dying over the last couple of years. The latest stable release, Sqoop 1.4.7, dates from the end of 2017, and this is still the case today; while Sqoop 2 has been underway for a long while, a stable release does not seem close (some providers have even deprecated support for Sqoop 2 in favor of Sqoop 1!). However, the 1.4.7 release packs all the features I needed and is stable. Sqoop allowed me to quickly copy my database without any issues.

Although Sqoop supports incremental loads as well as full data dumps, I did not use or test its incremental loading functionality.

Using Sqoop to import data into BigQuery

Scoop with a Q. Hehe. Source

Time to get started! In what follows, I will assume that you are familiar with GCP and have set up the Google Cloud SDK (gcloud) on your computer. I also assume Unix-style syntax and standard tooling, although it should be trivial to adapt the commands to run on a Windows machine.

Sqoop does not come pre-installed on Dataproc, so running Sqoop on GCP requires you to provide the Sqoop JARs and their dependencies to a Dataproc environment yourself. I based my approach on Daniel De Leo’s excellent story “Moving Data with Apache Sqoop in Google Cloud Dataproc”. Dataproc now runs a newer version of Hadoop and comes with different pre-installed libraries and defaults, some of which are incompatible with Sqoop, so I had to change a few small things. Using newer Dataproc image versions (>1.3) probably also works, but I didn’t test this and just used the default one.

Dataproc does not come with Sqoop pre-installed

Note that Dataproc’s release notes of August 9, 2019 mention that Sqoop 1.4.6 comes pre-installed with Dataproc. However, this does not seem to be the case for currently supported Dataproc releases. In fact, as far as I could see, even the 1.4.12-ubuntu18 release mentioned in the release notes does not come with Sqoop pre-installed! I spun up a cluster running this version of Dataproc, but did not find any trace of a Sqoop library. There’s also no mention of Sqoop being removed in the release notes…

Preparation

Start by downloading and unpacking Sqoop 1.4.7 (built to run on Hadoop 2.6.0) and the JDBC driver required to connect to SQL Server:

wget https://apache.belnet.be/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
tar xvfz sqoop*
wget https://repo1.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/8.2.1.jre8/mssql-jdbc-8.2.1.jre8.jar

To store Sqoop’s output and host its dependencies, you will also need a Cloud Storage bucket. I created a new one for this purpose:

BUCKET="gs://sqoop-bucket-1337"
gsutil mb "$BUCKET"

Once the bucket is created, copy in all the jars we downloaded (including Sqoop’s dependencies in lib):

gsutil cp sqoop-1.4.7.bin__hadoop-2.6.0/lib/* $BUCKET/jars
gsutil cp sqoop-1.4.7.bin__hadoop-2.6.0/sqoop-1.4.7.jar $BUCKET/jars
gsutil cp *.jar $BUCKET/jars

You are now ready to start a Dataproc cluster and submit a Sqoop job to it.

Running Sqoop

Sqoop runs on top of Hadoop. Fortunately, Google Cloud Platform makes spinning up a fully-featured Hadoop cluster through Google Dataproc as easy as a single call to gcloud dataproc clusters create. Spin up a “cluster” consisting of a single machine that acts as both the master and the only worker node as follows:

CLUSTERNAME="mycluster"
gcloud dataproc clusters create "$CLUSTERNAME" \
--region europe-west1 \
--zone europe-west1-d \
--subnet default \
--single-node \
--master-machine-type n1-standard-2 \
--master-boot-disk-size 200 \
--image-version 1.3-deb9 \
--project "$PROJECT" \
--properties dataproc:dataproc.conscrypt.provider.enable=false

Wait until the cluster has finished spinning up (< 1 minute). Then, submit a Sqoop job to your cluster as follows (instead of providing only the libraries that we need, here we just provide all of the libraries that come packaged as dependencies with the default Sqoop download; some of these could be removed):

libs=`gsutil ls $BUCKET/jars | paste -sd, --`
gcloud dataproc jobs submit hadoop \
--region europe-west1 \
--cluster="$CLUSTERNAME"\
--jars=$libs \
--class=org.apache.sqoop.Sqoop \
-- \
import-all-tables \
-Dmapreduce.job.user.classpath.first=true \
--connect "jdbc:sqlserver://$SQL_HOST;database=$SQL_DB" \
--username "$SQL_USER" \
--password "$SQL_PASS" \
--num-mappers 1 \
--as-avrodatafile \
--warehouse-dir "$BUCKET/output"

Of course, change the parameters passed to Dataproc specifying the node size, the disk image used and so on as you please. Do not forget to define or replace variables such as $SQL_HOST in this command with suitable values. If your database is not a SQL Server database, change the string passed to --connect to the appropriate JDBC connect string. If you do not want to copy all tables from your database, or want to do an incremental copy, don’t use the import-all-tables argument and pass in the appropriate Sqoop arguments instead.

You can now follow up on the job in the Console and cancel it if something goes wrong. Do not forget to shut down the Dataproc cluster once the job is done.

Note on security. Running the dataproc jobs submit command above exposes the password of your SQL Server in plain text. It will be visible to anyone who has access to the Dataproc logs. This is generally not recommended, but it worked for me, as I was copying the entire database in order to shut it down. If you want to keep your password more secure, you can use Sqoop’s --password-file option and point it to a file that contains your password. You can scp the password file to the VM running the cluster, or host the file in a bucket and set strict IAM policies on it. You can even use CMEK encryption if you feel the need!

Importing data into BigQuery

If all goes well, you will end up with a bunch of files inside $BUCKET/output, named $BUCKET/output/[table-name]/part-m-00000.avro. You can easily import these files into BigQuery using bq load, at no cost!

Start by creating a dataset to import your tables into, using the Console or bq:

DATASET=ilovesqoop
bq mk $DATASET

You then want to run a bq load command for every table you import. As an example, to import the table exampletable you can execute:

bq --sync=false --project_id=$PROJECT \
load \
--source_format=AVRO \
"${DATASET}.exampletable" \
"gs://${BUCKET}/output/exampletable/part-m-00000.avro"

I created a small Python script import-bq.py that prints out the import commands required to import all the tables. (Did I mention I had to import 420 of them?)
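
A minimal sketch of what such a script could look like: it lists the per-table directories that Sqoop created under $BUCKET/output and prints one bq load command per table (the wildcard on the part files is an assumption that also covers runs with more than one mapper).

#!/usr/bin/env python3
"""Print a bq load command for every table that Sqoop exported to the bucket."""
import subprocess
import sys

project, bucket, dataset = sys.argv[1:4]

# Every table ends up in its own directory under <bucket>/output/.
listing = subprocess.run(
    ["gsutil", "ls", f"{bucket}/output/"],
    capture_output=True, text=True, check=True,
)

for path in listing.stdout.splitlines():
    table = path.rstrip("/").rsplit("/", 1)[-1]
    print(
        f"bq --sync=false --project_id={project} load --source_format=AVRO "
        f"{dataset}.{table} {path}part-m-*.avro"
    )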

Run the script as ./import-bq.py $PROJECT $BUCKET $DATASET; you will see a single bq load command for every table to be imported.

You can of course adapt the script to your own needs. If you prefer, you can easily make the script perform the imports directly by changing the print call into a call to subprocess.run.

Caveats and arcane options

Source. Pixabay License.

You might also have noticed that there are two somewhat arcane options present in the commands above. For me, these were necessary to make Sqoop run properly on the Hadoop environment provided by Dataproc. Unfortunately, I waited a bit too long to write this story: I was so successful at migrating the data that I no longer have a SQL Server to connect to, and can no longer copy in the exact error messages that I got when submitting a job without these settings. One of the options might not apply to you, for example because you connect to your server over an unsecured connection, or because the errors were caused by the SQL Server I was connecting to being old.

  • --properties dataproc:dataproc.conscrypt.provider.enable=false. This disables the use of Conscrypt as Java’s security provider. Older versions of Dataproc did not use Conscrypt by default (source). With Conscrypt enabled, I believe I got an error similar to com.microsoft.sqlserver.jdbc.SQLServerException: The driver could not establish a secure connection to SQL Server by using Secure Sockets Layer (SSL) encryption. You can find a description of this problem and the solution here.
  • -Dmapreduce.job.user.classpath.first=true. I added this option to get rid of an error related to Avro (org.apache.avro.reflect.ReflectData.addLogicalTypeConversion), as described here. This option instructs Hadoop to prefer the versions of libraries that you added to the classpath yourself over the libraries it is pre-packaged with. Note that I have since learned that a better option might have been -Dmapreduce.job.classloader=true, which enables classloader isolation and instructs Hadoop to use a different classloader for Sqoop than for its internal workings. Let me know in the comments if you tried this instead of the classpath.first option and whether it works!

I also had to add the option -Djdk.tls.client.protocols=TLSv1 to my job submission command. I think the need for this option was specific to my SQL Server, so I did not include it in the command above. If you run into TLS issues, try adding it. The default TLS version used by Java 8 clients is 1.2; this option downgrades it to version 1.0.

I initially also wanted to export to Parquet instead of Avro. However, Sqoop 1.4.7 uses an old version of the Kite library to accomplish this, which did not work properly on the new Hadoop environment provided by Dataproc. I ran into issues similar to the problem mentioned on Stack Overflow here. Instead of trying to get a newer version of Kite that does work, I opportunistically switched to the Avro format.

Speeding up Sqoop: using bigger clusters

You might have noticed that I used a Dataproc “cluster” consisting of only a single node and wondered why. Does increasing the number of nodes increase the speed of the import job? Of course it can … provided that you set it up correctly! The full answer to this question is related to the --num-mappers argument passed to the Sqoop job.

Running a parallel Sqoop import requires you to tell it how it can split up a table

To get a Sqoop job to run on multiple nodes, you need to set --num-mappers to a value greater than one. This requires Sqoop to know how it can “split up” tables, so that multiple nodes can work on the same table at the same time. Sqoop can only figure out by itself how to split tables that have a single-column primary key (no composite keys); for other tables you have to specify a split column yourself using --split-by. You will therefore probably not be able to simply run Sqoop using import-all-tables in parallel without a little bit of configuration. I didn’t bother and just used a single node and a single mapper. A full import of my database (90GB, 600 million rows and 420 tables) took around 5 hours, and I would expect speed to increase a little when using a larger node.

Sqoop running on a single small-ish node imports around 20k records per second

As an alternative solution to increasing --num-mappers, you might look into submitting a separate import job for every table, which would also allow balancing the load over multiple nodes.
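
If you want to go down that route, here is a hypothetical sketch: one gcloud dataproc jobs submit call per table, submitted with --async so the jobs can run side by side on the cluster. The table list and variable values are placeholders, and the switch from --warehouse-dir to a per-table --target-dir is an assumption on top of the command shown earlier.

# Hypothetical sketch: submit one Sqoop import job per table instead of a
# single import-all-tables job. All values below are placeholders that mirror
# the variables used earlier in this story.
import subprocess

CLUSTERNAME = "mycluster"
BUCKET = "gs://sqoop-bucket-1337"
LIBS = "..."  # comma-separated list of jar URIs, as built with gsutil ls
SQL_HOST, SQL_DB = "sqlhost", "mydb"
SQL_USER, SQL_PASS = "myuser", "mypassword"
TABLES = ["exampletable", "anothertable"]  # e.g. queried from the source database

for table in TABLES:
    subprocess.run([
        "gcloud", "dataproc", "jobs", "submit", "hadoop",
        "--region", "europe-west1",
        f"--cluster={CLUSTERNAME}",
        f"--jars={LIBS}",
        "--class=org.apache.sqoop.Sqoop",
        "--async",  # return immediately so jobs for different tables run concurrently
        "--",
        "import",
        "-Dmapreduce.job.user.classpath.first=true",
        "--connect", f"jdbc:sqlserver://{SQL_HOST};database={SQL_DB}",
        "--username", SQL_USER,
        "--password", SQL_PASS,
        "--table", table,
        "--num-mappers", "1",
        "--as-avrodatafile",
        "--target-dir", f"{BUCKET}/output/{table}",
    ], check=True)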

Note that if you do use multiple mappers, you will get more than one Avro file per table; this should not be a problem, but it’s something to keep in mind when importing the data into BigQuery (the script generating the import statements above should work fine).

Summary and concluding remarks

When migrating your SQL server data to BigQuery, you have many options. I discussed four of them: to export and import your data as CSV, to copy the data using Pandas, to use a managed service like Stitch, or to use Apache Sqoop. Each of these approaches certainly has its merits, but I found Sqoop to be an excellent option for performing a one-off data dump with minimal effort and explained why.

Despite Sqoop not having received a lot of love over the past years, it worked just fine. Sqoop managed to copy my database in about five hours using a cluster consisting of a single undersized node, whereas Stitch would have taken over two days. It’s free, cheap to run, and it converted the data types more or less the way I wanted (datetimes were cast to epoch timestamps), without requiring any intervention on my part. Once you know how, setting up a copy operation using Sqoop requires only two commands, which in a way makes it easier to use than any paid-for solution I’ve come across.

Although Sqoop was specifically designed for the purpose of bridging operational, relational databases and big-data technology, it took a while before I even learned of its existence. Maybe this story can contribute to changing that.

Did this work for you or did you spot any errors? Let me know in the comments!

I work at Data Minded, an independent data engineering and data analytics consultancy based in Leuven, Belgium. Doing stuff like this is our bread and butter. We do it while walking in the park. Or something like that. Feel free to contact us!
