Database Migration: Migrating from DynamoDB to Google Cloud Spanner (Part 1)

Harsh Muniwala · Published in Petabytz · 9 min read · Jun 27, 2019

Amazon DynamoDB

Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability. DynamoDB lets you offload the administrative burdens of operating and scaling a distributed database, so that you don’t have to worry about hardware provisioning, setup and configuration, replication, software patching, or cluster scaling. Also, DynamoDB offers encryption at rest, which eliminates the operational burden and complexity involved in protecting sensitive data. For more information, see DynamoDB Encryption at Rest.

With DynamoDB, you can create database tables that can store and retrieve any amount of data, and serve any level of request traffic. You can scale up or scale down your tables’ throughput capacity without downtime or performance degradation, and use the AWS Management Console to monitor resource utilization and performance metrics.

Google Cloud Spanner

Google Cloud Spanner is a distributed relational database service that runs on Google Cloud. It is designed to support global online transaction processing deployments, SQL semantics, highly available horizontal scaling and transactional consistency.

Interest in Google Cloud Spanner centers on the cloud database’s ability to provide both availability and consistency. These traits are usually considered at odds with each other, with data designers typically making tradeoffs to emphasize either availability or consistency. The trade-off has been described most vividly in the CAP Theorem, which underpinned a general move to NoSQL databases for availability and scalability in web and cloud systems. In pursuing both system availability and data consistency, Google Cloud Spanner combines SQL and NoSQL traits.

How to Migrate from DynamoDB to Cloud Spanner

This guide is primarily intended for app owners who want to move from a NoSQL system to Cloud Spanner, a fully relational, fault-tolerant, highly scalable SQL database system that supports transactions. If your Amazon DynamoDB table usage is consistent in terms of types and layout, mapping it to Cloud Spanner is straightforward. If your Amazon DynamoDB tables contain arbitrary data types and values, it might be simpler to move to other NoSQL services, such as Cloud Datastore or Firebase.
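To get a sense of what this mapping looks like, each item in the sample Amazon DynamoDB table used later in this tutorial becomes one Cloud Spanner row, and each top-level attribute becomes a column. The sketch below illustrates the idea; the attribute names match the sample schema created later in this tutorial, but the values are made up for illustration only.

# One item from the sample DynamoDB "Migration" table, in DynamoDB's
# low-level JSON form (illustrative values only).
dynamodb_item = {
    'Username':     {'S': 'aallen2538'},
    'PointsEarned': {'N': '4341'},
    'ReminderDate': {'S': '2018-06-08'},
    'Subscribed':   {'BOOL': False},
    'Zipcode':      {'N': '84981'},
}

# The equivalent Cloud Spanner row: the partition key becomes the primary
# key, numeric strings become INT64 values, and the date string becomes DATE.
spanner_row = {
    'Username':     'aallen2538',    # STRING(1024) NOT NULL, primary key
    'PointsEarned': 4341,            # INT64
    'ReminderDate': '2018-06-08',    # DATE
    'Subscribed':   False,           # BOOL
    'Zipcode':      84981,           # INT64
}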

Objectives

  • Migrate data from Amazon DynamoDB to Cloud Spanner.
  • Create a Cloud Spanner database and migration table.
  • Map a NoSQL schema to a relational schema.
  • Create and export a sample dataset that uses Amazon DynamoDB.
  • Transfer data between Amazon S3 and Cloud Storage.
  • Use Cloud Dataflow to load data into Cloud Spanner.

Costs

This tutorial uses the following billable components of Google Cloud Platform:

  • GKE
  • Cloud Pub/Sub
  • Cloud Storage
  • Cloud Dataflow

Cloud Spanner charges are based on the number of node-hours and the amount of data stored during the monthly billing cycle.

In addition to GCP resources, this tutorial uses the following Amazon Web Services (AWS) resources:

  • Amazon EMR
  • AWS Lambda
  • Amazon S3
  • Amazon DynamoDB

These services are only needed during the migration process.

Preparing the environment

In this tutorial, you run commands in Cloud Shell. Cloud Shell gives you access to the command line in GCP, and includes the Cloud SDK and other tools that you need for GCP development. Cloud Shell can take several minutes to initialize.

  1. Activate Cloud Shell.
  2. Set the default Compute Engine zone. For example, us-central1-b.
    (gcloud config set compute/zone us-central1-b)
  3. Clone the GitHub repository containing the sample code. (git clone https://github.com/GoogleCloudPlatform/dynamodb-spanner-migration.git)
  4. Go to the cloned directory. (cd dynamodb-spanner-migration)
  5. Create a Python virtual environment. (virtualenv --python python2 env)
  6. Activate the virtual environment. (source env/bin/activate)
  7. Install the required Python modules. (pip install -r requirements.txt)

Configuring AWS access

In this tutorial, you create and delete Amazon DynamoDB tables, Amazon S3 buckets, and other resources. To access these resources, you first need to create the required AWS Identity and Access Management (IAM) permissions. You can use a test or sandbox AWS account to avoid affecting production resources in the same account.

Create an AWS IAM role for AWS Lambda

In this section, you create an AWS IAM role that AWS Lambda uses at a later step in the tutorial.

  1. In the AWS console, go to the IAM section, click Roles, and then select Create role.
  2. Under Choose the service that will use this role, click Lambda, and then select Next: Permissions.
  3. In the Policy Type box, enter AWSLambdaDynamoDBExecutionRole.
  4. Select the AWSLambdaDynamoDBExecutionRole checkbox, and then click Next: Review.
  5. In the Role name box, enter dynamodb-spanner-lambda-role, and then click Create role.

Create an AWS IAM user

Follow these steps to create an AWS IAM user with programmatic access to AWS resources, which are used throughout the tutorial.

  1. While you are still in the IAM section of the AWS console, click Users, and then select Add User.
  2. In the User name box, enter dynamodb-spanner-migration.
  3. Under Access type, click Programmatic access.
  4. Click Next: Permissions.
  5. Click Attach existing policies directly and select the following two policies: ( AmazonDynamoDBFullAccesswithDataPipeline, AmazonS3FullAccess )
  6. Click Next: Review, and then click Create user.
  7. Click Show to view the credentials. The access key ID and secret access key are displayed for the newly created user. Leave this window open for now because the credentials are needed in the following section. Safely store these credentials because with them, you can make changes to your account and affect your environment. At the end of this tutorial, you can delete the IAM user.

Configure AWS command-line interface

  1. In Cloud Shell, configure the AWS Command Line Interface (CLI). (aws configure)
  2. The following output appears:

$ aws configure
AWS Access Key ID [None]: PASTE_YOUR_ACCESS_KEY_ID
AWS Secret Access Key [None]: PASTE_YOUR_SECRET_ACCESS_KEY
Default region name [None]: us-west-2
Default output format [None]:
user@project:~/dynamodb-spanner$

  • Enter the access key ID and secret access key of the AWS IAM user that you created.
  • In the Default region name field, enter us-west-2. Leave other fields at their default values.

3. Close the AWS IAM console window.

Preparing the Amazon DynamoDB table

In the following section, you create an Amazon DynamoDB source table and populate it with data.

In Cloud Shell, create an Amazon DynamoDB table that uses the sample table attributes.

aws dynamodb create-table --table-name Migration \
    --attribute-definitions AttributeName=Username,AttributeType=S \
    --key-schema AttributeName=Username,KeyType=HASH \
    --provisioned-throughput ReadCapacityUnits=75,WriteCapacityUnits=75

Verify that the table status is ACTIVE.

aws dynamodb describe-table --table-name Migration \
    --query 'Table.TableStatus'

Populate the table with sample data.

python make-fake-data.py --table Migration --items 25000
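The make-fake-data.py script in the cloned repository handles this step. As a rough sketch of the general approach only (the helper names and value ranges below are assumptions, not the script's actual code), such a generator can be written with boto3's batch writer:

import random
import string

import boto3


def random_item():
    """Build one fake item matching the Migration table's attributes (illustrative values)."""
    username = ''.join(random.choice(string.ascii_lowercase) for _ in range(10))
    return {
        'Username': username,
        'PointsEarned': random.randint(0, 10000),
        'ReminderDate': '2018-%02d-%02d' % (random.randint(1, 12), random.randint(1, 28)),
        'Subscribed': random.choice([True, False]),
        'Zipcode': random.randint(10000, 99999),
    }


def load_items(table_name, count):
    table = boto3.resource('dynamodb').Table(table_name)
    # batch_writer buffers put requests and sends them in batches of 25.
    with table.batch_writer() as batch:
        for _ in range(count):
            batch.put_item(Item=random_item())


if __name__ == '__main__':
    load_items('Migration', 25000)

boto3 picks up the region and credentials that you configured with aws configure earlier in this tutorial.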

Creating a Cloud Spanner database

You create a single-node instance, which is appropriate for testing and the scope of this tutorial. For a production deployment, refer to the documentation for Cloud Spanner instances to determine the appropriate node count to meet your database performance requirements.

In this example, you create a table schema at the same time as the database. It is also possible, and common, to carry out schema updates after you create the database.

Create a Cloud Spanner instance in the same region where you set the default Compute Engine zone. For example, us-central1.

gcloud spanner instances create spanner-migration \
    --config=regional-us-central1 --nodes=1 \
    --description="Migration Demo"

Create a database in the Cloud Spanner instance along with the sample table.

gcloud spanner databases create migrationdb \
    --instance=spanner-migration \
    --ddl "CREATE TABLE Migration ( \
        Username STRING(1024) NOT NULL, \
        PointsEarned INT64, \
        ReminderDate DATE, \
        Subscribed BOOL, \
        Zipcode INT64 \
    ) PRIMARY KEY (Username)"
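To confirm that the table was created before you load any data, you can query it from the console, with gcloud, or from code. The following is a minimal, hedged sketch using the google-cloud-spanner Python client library; the tutorial itself does not require this library, so installing it (for example, pip install google-cloud-spanner) is an extra assumption.

from google.cloud import spanner

# Connect to the instance and database created above.
client = spanner.Client()
database = client.instance('spanner-migration').database('migrationdb')

# The table exists but holds no rows yet, so the count should be 0.
with database.snapshot() as snapshot:
    for row in snapshot.execute_sql('SELECT COUNT(*) FROM Migration'):
        print('Rows in Migration: %d' % row[0])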

Preparing the migration

Rather than pausing the source database during the migration, the next sections show you how to export the Amazon DynamoDB source table and set up Cloud Pub/Sub replication to capture any changes to the database that occur while you export it.

Stream changes to Cloud Pub/Sub

You use an AWS Lambda function to stream database changes to Cloud Pub/Sub.
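The repository's ddbpubsub.py, which you package and deploy in the steps below, implements this function. As a rough sketch of the shape such a handler might take (the structure and names here are assumptions for illustration, not the repository's actual code), it decodes a service account key from an environment variable and forwards each stream record to the Cloud Pub/Sub topic:

import base64
import json
import os

from google.cloud import pubsub_v1
from google.oauth2 import service_account

# SVCACCT, PROJECT, and TOPIC are the environment variables that the
# "aws lambda create-function" step later in this section sets on the function.
_credentials = service_account.Credentials.from_service_account_info(
    json.loads(base64.b64decode(os.environ['SVCACCT'])))
_publisher = pubsub_v1.PublisherClient(credentials=_credentials)
_topic_path = _publisher.topic_path(os.environ['PROJECT'], os.environ['TOPIC'])


def lambda_handler(event, context):
    """Forward each DynamoDB stream record to the Cloud Pub/Sub topic."""
    records = event.get('Records', [])
    for record in records:
        # With StreamViewType=NEW_AND_OLD_IMAGES, record['dynamodb'] carries the
        # keys plus the old and new item images; eventName is INSERT, MODIFY, or REMOVE.
        payload = json.dumps({
            'eventName': record['eventName'],
            'dynamodb': record['dynamodb'],
        }).encode('utf-8')
        _publisher.publish(_topic_path, data=payload)
    return 'Processed %d records' % len(records)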

  1. In Cloud Shell, enable Amazon DynamoDB streams on your source table. (aws dynamodb update-table --table-name Migration --stream-specification StreamEnabled=true,StreamViewType=NEW_AND_OLD_IMAGES)
  2. Set up a Cloud Pub/Sub topic to receive the changes.

gcloud pubsub topics create spanner-migration

3. Create a Cloud IAM service account to push table updates to the Cloud Pub/Sub topic.

gcloud iam service-accounts create spanner-migration \
    --display-name="Spanner Migration"

4. Create a Cloud IAM policy binding so that the service account has permission to publish to Cloud Pub/Sub. Replace GOOGLE_CLOUD_PROJECT with the name of your GCP project.

gcloud projects add-iam-policy-binding $GOOGLE_CLOUD_PROJECT \
    --role roles/pubsub.publisher \
    --member serviceAccount:spanner-migration@$GOOGLE_CLOUD_PROJECT.iam.gserviceaccount.com

5. Create credentials for the service account.

gcloud iam service-accounts keys create credentials.json \
    --iam-account spanner-migration@$GOOGLE_CLOUD_PROJECT.iam.gserviceaccount.com

6. Prepare and package the AWS Lambda function to push Amazon DynamoDB table changes to the Cloud Pub/Sub topic.

pip install --ignore-installed --target=lambda-deps google-cloud-pubsub==0.35
cd lambda-deps; zip -r9 ../pubsub-lambda.zip *; cd -
zip -g pubsub-lambda.zip ddbpubsub.py

7. Create a variable to capture the Amazon Resource Name (ARN) of the Lambda execution role that you created earlier.

LAMBDA_ROLE=$(aws iam list-roles \
    --query 'Roles[?RoleName==`dynamodb-spanner-lambda-role`].[Arn]' \
    --output text)

8. Use the pubsub-lambda.zip package to create the AWS Lambda function.

aws lambda create-function --function-name dynamodb-spanner-lambda \
    --runtime python2.7 --role $LAMBDA_ROLE \
    --handler ddbpubsub.lambda_handler --zip-file fileb://pubsub-lambda.zip \
    --environment Variables="{SVCACCT=$(base64 -w 0 credentials.json),PROJECT=$GOOGLE_CLOUD_PROJECT,TOPIC=spanner-migration}"

9. Create a variable to capture the ARN of the Amazon DynamoDB stream for your table.

STREAMARN=$(aws dynamodb describe-table \
    --table-name Migration \
    --query "Table.LatestStreamArn" \
    --output text)

10. Attach the Lambda function to the Amazon DynamoDB table.

aws lambda create-event-source-mapping --event-source-arn $STREAMARN \
    --function-name dynamodb-spanner-lambda --enabled \
    --starting-position TRIM_HORIZON

11. To optimize responsiveness during testing, add --batch-size 1 to the end of the previous command, which triggers the function every time you create, update, or delete an item.

Export the Amazon DynamoDB table to Amazon S3

  1. In Cloud Shell, create a variable for a bucket name that you use in several of the following sections. (BUCKET=$DEVSHELL_PROJECT_ID-dynamodb-spanner-export)
  2. Create an Amazon S3 bucket to receive the DynamoDB export. (aws s3 mb s3://$BUCKET)
  3. In the AWS Management Console, click Data Pipeline.
  4. Click Create new pipeline to define the export job.
  5. In the Name field, enter Export to Amazon S3.
  6. For the Source, select the following: Build using a template -> Export DynamoDB table to Amazon S3
  7. In the Parameters section, define the following:
  8. In the Source DynamoDB table name field, enter Migration.
  9. In the Output S3 folder field, click the Folder icon and select the [YOUR-PROJECT-ID]-dynamodb-spanner-export Amazon S3 bucket that you just created, where [YOUR-PROJECT-ID] represents your GCP project ID.
  10. To consume all available read-capacity during the export, in the DynamoDB read throughput ratio field, enter 1. In a production environment, you adjust this value so that it doesn't hinder live operations.
  11. In the Region of your DynamoDB table field, enter the name of the region, for example, us-west-2.
  12. To start the backup jobs immediately, in the Schedule section for Run, click On pipeline activation.
  13. Under Pipeline Configuration, in the Logging field, enter Disabled. If you are following this guide to migrate a production table, leave this option enabled and pointed at a separate Amazon S3 bucket for logs to help you troubleshoot errors. Leave the other parameters at their default values.
  14. To begin the backup process, click Activate.
  15. If you are prompted to address validation warnings, click Activate. In a production situation, you set a maximum duration for the job and enable logging.
  16. Click Refresh to update the status of the backup process. The job takes several minutes to create the resources and finish exporting. In a production environment, you can speed this process up by modifying the Data Pipeline jobs to use more EMR resources.
  17. When the process finishes, look at the output bucket.
  • aws s3 ls --recursive s3://$BUCKET

The export job is done when there is a file named _SUCCESS.
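If you prefer to check from code instead of scanning the listing, here is a small, hedged boto3 sketch; replace the placeholder bucket name with the value of $BUCKET:

import boto3

# Replace with the value of $BUCKET, i.e. [YOUR-PROJECT-ID]-dynamodb-spanner-export.
bucket = boto3.resource('s3').Bucket('[YOUR-PROJECT-ID]-dynamodb-spanner-export')

# The Data Pipeline export writes a _SUCCESS marker when it finishes.
done = any(obj.key.endswith('_SUCCESS') for obj in bucket.objects.all())
print('Export finished' if done else 'Export still running')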
