Data Packaging: Part-3: Store the data in HDFS and MongoDB
This is part-3 of the Data Packaging series.
Data Packaging series Contents
- Introduction
- Part-1: Downloading the data & Creating Volume
- Part-2: Services on Spark Cluster
- Part-3: Store the data in HDFS and MongoDB
- Part-4: Deploy Jupyter Notebook in cluster (coming soon)
Store the data in HDFS and MongoDB
- Store the data in HDFS as parquet
First, read the unzipped data from the volume, then push it into HDFS as parquet. To achieve that, run the following commands in the Spark shell:
>>> acc_data = spark.read.csv()
>>> acc_data.write.parquet("hdfs://hadoop/acc_data_parquet")
To check the data file on HDFS, open http://localhost:50070, navigate to “Utilities” in the menu bar, and select “Browse the file system”; the page below will open.
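The write can also be confirmed programmatically instead of through the web UI, by reading the parquet files back in the same Spark shell. A minimal sketch (the path matches the write above; the helper function name is my own):

```python
# Read the parquet data back from HDFS as a quick sanity check.
PARQUET_PATH = "hdfs://hadoop/acc_data_parquet"

def verify_parquet():
    # pyspark is imported lazily so the sketch can be read without it installed.
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet(PARQUET_PATH)
    df.printSchema()   # column names and types survive the round trip
    print(df.count())  # row count should match the original CSV
```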
- Store the data in MongoDB
First, run the following command in the Spark shell to enable pushing the data to MongoDB:
spark = SparkSession \
.builder \
.appName("mongodb") \
.master("spark://master:7077") \
.config("spark.mongodb.input.uri", "mongodb://root:password@mongo/test.coll?authSource=admin") \
.config("spark.mongodb.output.uri", "mongodb://root:password@mongo/test.coll?authSource=admin") \
.config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.11:2.4.0')\
.getOrCreate()
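The same connection URI appears twice in the configuration above (once for input, once for output), so it is easy for the two copies to drift apart. A small convenience helper (my own addition, not part of the original setup) keeps them in sync:

```python
def mongo_uri(user, password, host, db, coll, auth_source="admin"):
    """Assemble a MongoDB connection URI of the form used in the Spark config."""
    return f"mongodb://{user}:{password}@{host}/{db}.{coll}?authSource={auth_source}"

# Matches the URI used for both spark.mongodb.input.uri and output.uri above:
uri = mongo_uri("root", "password", "mongo", "test", "coll")
```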
- Read the data from the volume and store it in MongoDB
>>> acc_mongo = spark.read.csv()
>>> acc_mongo.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save()
- Read the data from the HDFS and store it in MongoDB
acc_mongo = spark.read.parquet("hdfs://hadoop/acc_data_parquet")
acc_mongo.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save()
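After the write completes, the collection can be read back through the same connector to confirm the append. A sketch, assuming the `spark.mongodb.input.uri` configured above (the function name is my own):

```python
# Read the collection back from MongoDB to verify the write.
MONGO_INPUT_URI = "mongodb://root:password@mongo/test.coll?authSource=admin"

def read_back():
    # pyspark is imported lazily so the sketch can be read without it installed.
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()
    df = (spark.read
          .format("com.mongodb.spark.sql.DefaultSource")
          .option("uri", MONGO_INPUT_URI)
          .load())
    print(df.count())  # should match the number of rows written
    return df
```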
Open http://localhost:8181 for Mongo Express.
Then click “test” to open the database.
Automated Storage
- Copy the scripts into the volume
The following command creates a directory named “script” in the volume and copies all the required scripts into it:
$ docker run --rm -v /scripts:/script \
    -v project-scripts-volume:/volume busybox \
    cp -r /script/ /volume
- Checking the scripts in the volume:
$ docker run -it --rm -v project-scripts-volume:/volume busybox ls -l /volume/script
- Store the data in HDFS as parquet
Execute the hdfs_store.py script as follows:
docker run -t --rm \
-v project-scripts-volume:/volume \
--network=spark-network \
mjhea0/spark:2.4.1 \
bin/spark-submit \
--master spark://master:7077 \
--class endpoint \
/volume/script/hdfs_store.py
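The contents of hdfs_store.py are not shown in this post; a minimal sketch of what it could look like, assuming the unzipped CSVs live under a hypothetical /volume/data path and HDFS is reachable at hdfs://hadoop as above:

```python
# hdfs_store.py -- sketch: read CSVs from the volume, write parquet to HDFS.
CSV_SOURCE = "/volume/data"  # hypothetical input location on the mounted volume
PARQUET_TARGET = "hdfs://hadoop/acc_data_parquet"

def main():
    # pyspark is imported lazily so the sketch can be read without it installed.
    from pyspark.sql import SparkSession
    spark = (SparkSession.builder
             .appName("hdfs_store")
             .master("spark://master:7077")
             .getOrCreate())
    acc_data = spark.read.csv(CSV_SOURCE, header=True)
    acc_data.write.mode("overwrite").parquet(PARQUET_TARGET)
    spark.stop()

if __name__ == "__main__":
    main()
```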
- Store the data in MongoDB
Execute the mongodb_store.py script as follows:
docker run -t --rm \
-v project-scripts-volume:/volume \
--network=spark-network \
mjhea0/spark:2.4.1 \
bin/spark-submit \
--master spark://master:7077 \
--class endpoint \
--packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.0 \
/volume/script/mongodb_store.py
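Likewise, mongodb_store.py is not shown; a sketch under the same assumptions, using the connector configuration from the interactive session above (the /volume/data input path is hypothetical):

```python
# mongodb_store.py -- sketch: read CSVs from the volume, append into MongoDB.
CSV_SOURCE = "/volume/data"  # hypothetical input location on the mounted volume
MONGO_URI = "mongodb://root:password@mongo/test.coll?authSource=admin"

def main():
    # pyspark is imported lazily so the sketch can be read without it installed.
    from pyspark.sql import SparkSession
    spark = (SparkSession.builder
             .appName("mongodb_store")
             .master("spark://master:7077")
             .config("spark.mongodb.output.uri", MONGO_URI)
             .getOrCreate())
    acc_mongo = spark.read.csv(CSV_SOURCE, header=True)
    (acc_mongo.write
     .format("com.mongodb.spark.sql.DefaultSource")
     .mode("append")
     .save())
    spark.stop()

if __name__ == "__main__":
    main()
```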