Data Packaging: Part-3: Store the data in HDFS and MongoDB

Khalil Hanna
3 min read · May 9, 2020


This is part 3 of the Data Packaging series.

Data Packaging series Contents

Store the data in HDFS and MongoDB

  • Store the data in HDFS as Parquet

First, read the unzipped data from the volume, then push it into HDFS as Parquet. To do that, run the following commands in the Spark shell:

>>> acc_data = spark.read.csv()  # pass the path to the unzipped CSV files on the volume
>>> acc_data.write.parquet("hdfs://hadoop/acc_data_parquet")
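For reference, a slightly fuller version of the same step is sketched below; the CSV path on the volume and the header/schema options are assumptions about how the unzipped data is laid out, so adjust them to match the actual files.

# Assumed location of the unzipped CSV data on the shared volume (adjust as needed).
acc_data = spark.read.csv("/volume/data/accidents.csv", header=True, inferSchema=True)

# Write the DataFrame to HDFS as Parquet, using the same target path as above.
acc_data.write.parquet("hdfs://hadoop/acc_data_parquet")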

To check the data file on HDFS, open http://localhost:50070, navigate to “Utilities” in the main bar, and select “Browse the file system”; the stored Parquet directory should appear in the file browser.

  • Store the data in MongoDB

First, in the Spark shell, create a SparkSession configured with the MongoDB connector so that the data can be pushed to MongoDB:

spark = SparkSession \
    .builder \
    .appName("mongodb") \
    .master("spark://master:7077") \
    .config("spark.mongodb.input.uri", "mongodb://root:password@mongo/test.coll?authSource=admin") \
    .config("spark.mongodb.output.uri", "mongodb://root:password@mongo/test.coll?authSource=admin") \
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.4.0") \
    .getOrCreate()
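Note that spark.jars.packages is only applied when a session is created, so if the connector jar is not resolved this way, an alternative (assuming the same Spark master and connector version) is to pass the package when launching the shell:

pyspark --master spark://master:7077 \
    --packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.0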
  • Read the data from the volume and store it in MongoDB
>>> acc_mongo = spark.read.csv()  # pass the path to the CSV data on the volume
>>> acc_mongo.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save()
  • Read the data from HDFS and store it in MongoDB
acc_mongo = spark.read.parquet("hdfs://hadoop/acc_data_parquet")
acc_mongo.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save()
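Either write can be double-checked from the same shell by reading the collection back through the connector; this minimal sketch relies on the spark.mongodb.input.uri configured above.

# Read test.coll back via the connector and show a few rows.
check = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
check.show(5)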

To inspect the stored data, open http://localhost:8181 to access Mongo Express.

Then click on the “test” database to browse the stored collection.

Automated Storage

  • Copy the scripts into the volume

The following command creates a directory named “script” in the volume and copies all the required scripts into it:

$ docker run --rm -v /scripts:/script \
    -v project-scripts-volume:/volume busybox \
    cp -r /script/ /volume
  • Checking the scripts in the volume:
$ docker run -it --rm -v project-scripts-volume:/volume busybox ls -l /volume/script
  • Store the data in HDFS as Parquet

Execute the hdfs_store.py script as follows:

docker run -t --rm \
-v project-scripts-volume:/volume \
--network=spark-network \
mjhea0/spark:2.4.1 \
bin/spark-submit \
--master spark://master:7077 \
--class endpoint \
/volume/script/hdfs_store.py
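The post does not list hdfs_store.py itself; based on the manual shell steps above, a minimal sketch of what it might contain (the CSV path on the volume is an assumption) could look like this:

from pyspark.sql import SparkSession

# Build a session against the standalone master used throughout the series.
spark = SparkSession.builder \
    .appName("hdfs_store") \
    .master("spark://master:7077") \
    .getOrCreate()

# Assumed location of the unzipped CSV data on the shared volume.
acc_data = spark.read.csv("/volume/data/accidents.csv", header=True)

# Persist the data to HDFS as Parquet, matching the path used earlier.
acc_data.write.parquet("hdfs://hadoop/acc_data_parquet")

spark.stop()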
  • Store the data in MongoDB

Execute the mongodb_store.py script as follows:

docker run -t --rm \
-v project-scripts-volume:/volume \
--network=spark-network \
mjhea0/spark:2.4.1 \
bin/spark-submit \
--master spark://master:7077 \
--class endpoint \
--packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.0 \
/volume/script/mongodb_store.py
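Similarly, mongodb_store.py is not shown in the post; under the same assumptions it might look roughly like the sketch below, with the connector package supplied via --packages in the command above.

from pyspark.sql import SparkSession

# Session configured with the MongoDB output URI, mirroring the manual shell setup.
spark = SparkSession.builder \
    .appName("mongodb_store") \
    .master("spark://master:7077") \
    .config("spark.mongodb.output.uri",
            "mongodb://root:password@mongo/test.coll?authSource=admin") \
    .getOrCreate()

# Read the Parquet data written to HDFS in the previous step.
acc_data = spark.read.parquet("hdfs://hadoop/acc_data_parquet")

# Append the rows to the test.coll collection through the connector.
acc_data.write.format("com.mongodb.spark.sql.DefaultSource") \
    .mode("append") \
    .save()

spark.stop()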
