Store Hadoop Tables as .csv onto an Edge Node Using Pig
If you have a table called mydb.mytable in Hadoop and you want to save it as a single .csv file on one machine, this article is for you. The solution here works for large tables too.
First, start Pig with HCatalog support:

pig -useHCatalog -Dmapred.job.queue.name=my-queue

Replace my-queue with the queue name of your Hadoop cluster.
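If you prefer not to type the statements interactively, Pig can also run them from a script file. A minimal sketch, assuming you have saved the two statements from the next step into a hypothetical file named export_table.pig:

# run the same export non-interactively (export_table.pig is a made-up name)
pig -useHCatalog -Dmapred.job.queue.name=my-queue -f export_table.pig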
If you launched Pig interactively, you should now be in its Grunt shell. Run the following two statements one after another (they collect the table into multiple .csv part files in your Hadoop file system):
-- load the Hive table through HCatalog
a = LOAD 'mydb.mytable' USING org.apache.hive.hcatalog.pig.HCatLoader();

-- write it out comma-separated, with a header row in each output file
STORE a INTO '/your_path/your_filename.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'NOCHANGE', 'WRITE_OUTPUT_HEADER');
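Despite the .csv suffix, the STORE path above is a directory of part files on HDFS, not a single file. You can confirm this with a quick listing (a sketch, using the placeholder path from above):

hdfs dfs -ls /your_path/your_filename.csv
# expect something like part-m-00000, part-m-00001, ... plus a _SUCCESS marker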
Once the statements above finish successfully, exit Pig and execute the following to merge the part files into a single local file:
hdfs dfs -getmerge /your_path/your_filename.csv /your_local_path/your_filename.csv

After that you should see the file in your your_local_path folder (on the edge node).
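A couple of quick sanity checks on the merged file are worthwhile. Note that WRITE_OUTPUT_HEADER puts a header row in every part file, so if the job produced several part files the merged file may contain repeated header lines; the checks below (a sketch, reusing the placeholder paths) help you spot that:

# peek at the header and first few rows
head -5 /your_local_path/your_filename.csv
# count lines to compare against the table's row count (remember the header)
wc -l /your_local_path/your_filename.csv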
I learnt this method from my co-worker Hengliang. The good thing about it is that it doesn't load everything into memory before writing the files, thus avoiding out-of-memory errors.
If you don't have Pig, you can use Spark. If you are using Spark 2.0+, then it is simply:
df.write.option("header", "true").csv("/path/file.csv")

If you are using an older Spark, you need to install/import a package called spark-csv (from Databricks) to get the same CSV writer.
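As with Pig, the .csv() call above writes a directory of part files rather than a single file, so the same getmerge trick applies. A sketch, reusing the placeholder paths from the snippet:

# Spark also writes a directory of part files; merge them to one local file
hdfs dfs -getmerge /path/file.csv /your_local_path/file.csv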