Store Hadoop Tables as .csv into Edge Node Using Pig

Danny Wang
Aug 28, 2017 · 1 min read

If you have a table called mydb.mytable in Hadoop and you want to save it as a single .csv file on one machine, this article is for you. The solution here works even for large tables. First, start Pig with HCatalog support, specifying the queue to run on:

pig -useHCatalog -Dmapred.job.queue.name=my-queue

Replace my-queue with the queue name of your Hadoop cluster.

You should now be in Pig’s command-line interface (the Grunt shell). Run the following two statements one after another; they write the table out as multiple .csv part files in HDFS:

a = LOAD 'mydb.mytable' USING org.apache.hive.hcatalog.pig.HCatLoader();

STORE a INTO '/your_path/your_filename.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'NOCHANGE', 'WRITE_OUTPUT_HEADER');

Once both statements finish successfully, exit Pig and run the following to merge the part files into a single local file:

hdfs dfs -getmerge /your_path/your_filename.csv /your_local_path/your_filename.csv

After that you should see the merged file in your your_local_path folder (on the edge node). One caveat, based on my understanding of WRITE_OUTPUT_HEADER: it writes a header row into each part file, so the merged file can contain a repeated header line wherever two parts were joined.
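If that happens, one quick fix is to keep the first line and drop any later copies of it. Here is a minimal shell sketch, assuming no data row is ever identical to the header:

# Keep the first line (the header) and drop any later line identical to it.
awk 'NR==1 {header=$0; print; next} $0 != header' /your_local_path/your_filename.csv > /your_local_path/your_filename_clean.csv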

I learnt this method from my co-worker Hengliang. The good thing about it is that it doesn’t load everything into memory before writing the files, which avoids out-of-memory errors.

If you don’t have Pig, you can use Spark instead. If you are using Spark 2.0+, .csv output is built into the DataFrame writer, so simply:

df.write.option("header", "true").csv("/path/file.csv")
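For context, here is a minimal end-to-end PySpark sketch of the same export. The table name and path are placeholders, and coalesce(1) is my addition to get a single output file rather than one part file per partition (it funnels everything through one task, so only use it when the result comfortably fits on one node):

from pyspark.sql import SparkSession

# Hive support is needed to resolve mydb.mytable by name.
spark = (SparkSession.builder
         .appName("export-table-to-csv")
         .enableHiveSupport()
         .getOrCreate())

df = spark.table("mydb.mytable")

# coalesce(1) produces a single part file inside the output directory.
df.coalesce(1).write.option("header", "true").csv("/your_path/your_filename.csv")

Note that Spark writes a directory at that path containing the part file(s), not a bare file with that name.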

If you are using an older Spark (1.x), the .csv() writer does not exist yet, so you need an external package called spark-csv (com.databricks:spark-csv).
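A sketch of what that looks like, assuming the shell was launched with --packages com.databricks:spark-csv_2.10:1.5.0 (adjust the Scala and package versions to match your cluster):

# With spark-csv, you write through its data source name instead of .csv().
df.write.format("com.databricks.spark.csv") \
    .option("header", "true") \
    .save("/path/file.csv")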
