Copy Hadoop Data — Hive to S3 Bucket

John Thuma
Published in DataSeries
2 min read · Nov 23, 2018

WHAT IS S3: S3 stands for “Simple Storage Service” and is offered by Amazon Web Services (AWS). It provides simple-to-use object storage via a web service. AWS offers a web-based UI for S3 as well as the AWS CLI (command line interface).
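If you have the AWS CLI installed and configured with your credentials (via aws configure), a quick sanity check is to list the buckets in your account. A minimal sketch, assuming the CLI is already set up:

aws s3 ls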

WHY USE S3: Many organizations are moving data to the cloud because it is a more affordable option than storing it locally. Some organizations are leveraging S3 from Amazon Web Services (AWS) so that they can easily use the data from other compute environments such as Hadoop, an RDBMS, or take your pick of EC2 services to crunch data.

HOW TO MOVE DATA TO S3 FROM HDFS: Follow the process documented below:

If you have ever wanted to move data from a Hadoop environment into an S3 bucket, there is a very simple way to do it. It requires two steps:

STEP 1: Create an S3 Bucket

STEP 2: Use the distcp utility to copy data from your Hadoop platform to the S3 bucket created in STEP 1.

Below are the details for each STEP!

STEP 1: Create an S3 Bucket

1. Sign in to the AWS Management Console.

2. Under Storage & Content Delivery, choose S3 to open the Amazon S3 console.

3. From the Amazon S3 console dashboard, choose Create Bucket.

4. In Create a Bucket, type a bucket name in Bucket Name.

5. Select the region you want to use.

6. Click Create.
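If you prefer the command line, the bucket can also be created with the AWS CLI. A sketch, using a hypothetical bucket name and region:

aws s3 mb s3://my-hadoop-landing-zone --region us-east-1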

STEP 2: Move your data from Hadoop to the new S3 Bucket.

1. Open a terminal session on the source Hadoop system.
2. Use distcp to copy data from HDFS to the new S3 bucket.

It will look something like this:

hadoop distcp -Dfs.s3a.access.key=AKIAHIDEHIDEHIDEHIDE -Dfs.s3a.secret.key=RealLYHidE+ReallYHide+ReallyHide hdfs://{yoursystemname}:{port}/user/hive/warehouse/databaseDirectory/datadirectory/ s3a://{yourbucket}/{somedirectoryStructure}/

Let’s dissect the statement. It has three parts:

1. hadoop distcp -Dfs.s3a.access.key=AKIAHIDEHIDEHIDEHIDE -Dfs.s3a.secret.key=RealLYHidE+ReallYHide+ReallyHide

NOTE: This is the Hadoop distributed copy (distcp) command. It allows you to copy data into and out of a Hadoop system. The access key and secret key are found in your IAM settings within AWS. They are a security measure to safeguard the data in your bucket.

2. hdfs://{yoursystemname}:{port}/user/hive/warehouse/databaseDirectory/datadirectory/

NOTE: This is the HDFS location of your data, the SOURCE to be copied to S3. Note that if the data is partitioned, there will be many subdirectories under the datadirectory directory. Your path might be different from mine.
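To see exactly what will be copied (including any partition subdirectories), you can list the source path first with the standard HDFS shell. A sketch, using the same placeholder path as above:

hdfs dfs -ls -R hdfs://{yoursystemname}:{port}/user/hive/warehouse/databaseDirectory/datadirectory/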

3. s3a://{yourbucket}/{somedirectoryStructure}/

NOTE: This is the S3 bucket TARGET where the data from the SOURCE in part 2 will be copied.
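Putting the three parts together with the placeholders filled in, the command might look like the sketch below. The host name, port, warehouse path, bucket, and keys are all hypothetical; substitute your own values:

hadoop distcp -Dfs.s3a.access.key=AKIAEXAMPLEEXAMPLE -Dfs.s3a.secret.key=ExampleSecretExampleSecret hdfs://namenode01:8020/user/hive/warehouse/sales.db/orders/ s3a://my-hadoop-landing-zone/hive/sales/orders/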

Once the copy is complete, go into the AWS console, take a look in your bucket, and verify that the data is there.
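You can also verify from the command line with the AWS CLI, assuming it is configured with credentials that can read the bucket:

aws s3 ls s3://{yourbucket}/{somedirectoryStructure}/ --recursive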

