Data Packaging: Part-3: Store the data in HDFS and MongoDB
This is part-3 of the Data Packaging series.
Data Packaging series Contents
- Introduction
- Part-1: Downloading the data & Creating Volume
- Part-2: Services on Spark Cluster
- Part-3: Store the data in HDFS and MongoDB
- Part-4: Deploy Jupyter Notebook in cluster (coming soon)
Store the data in HDFS and MongoDB
- Store the data in HDFS as parquet
First, read the unzipped data from the volume, then push it into HDFS as parquet. To achieve that, run the following commands in the Spark shell:
>>> acc_data = spark.read.csv()
>>> acc_data.write.parquet("hdfs://hadoop/acc_data_parquet")
To check the data file on HDFS, open http://localhost:50070, navigate to “Utilities” in the menu bar, and select “Browse the file system”; the page below will open.
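The write can also be confirmed programmatically instead of through the web UI, by reading the parquet files back in the same Spark shell. A minimal sketch (the path matches the write above; the helper function name is my own):

```python
# Read the parquet data back from HDFS as a quick sanity check.
PARQUET_PATH = "hdfs://hadoop/acc_data_parquet"

def verify_parquet():
    # pyspark is imported lazily so the sketch can be read without it installed.
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet(PARQUET_PATH)
    df.printSchema()   # column names and types survive the round trip
    print(df.count())  # row count should match the original CSV
```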
- Store the data in MongoDB
First, run the following command in the Spark shell to enable pushing the data to MongoDB:
spark = SparkSession \
.builder \
.appName("mongodb") \
.master("spark://master:7077") \
.config("spark.mongodb.input.uri", "mongodb://root:password@mongo/test.coll?authSource=admin") \
.config("spark.mongodb.output.uri", "mongodb://root:password@mongo/test.coll?authSource=admin") \
.config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.11:2.4.0')\
.getOrCreate()
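The same connection URI appears twice in the configuration above (once for input, once for output), so it is easy for the two copies to drift apart. A small convenience helper (my own addition, not part of the original setup) keeps them in sync:

```python
def mongo_uri(user, password, host, db, coll, auth_source="admin"):
    """Assemble a MongoDB connection URI of the form used in the Spark config."""
    return f"mongodb://{user}:{password}@{host}/{db}.{coll}?authSource={auth_source}"

# Matches the URI used for both spark.mongodb.input.uri and output.uri above:
uri = mongo_uri("root", "password", "mongo", "test", "coll")
```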
- Read the data from the volume and store it in MongoDB
>>> acc_mongo = spark.read.csv()
>>> acc_mongo.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save()
- Read the data from the HDFS and store it in MongoDB
acc_mongo = spark.read.parquet("hdfs://hadoop/acc_data_parquet")
acc_mongo.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save()
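After the write completes, the collection can be read back through the same connector to confirm the append. A sketch, assuming the `spark.mongodb.input.uri` configured above (the function name is my own):

```python
# Read the collection back from MongoDB to verify the write.
MONGO_INPUT_URI = "mongodb://root:password@mongo/test.coll?authSource=admin"

def read_back():
    # pyspark is imported lazily so the sketch can be read without it installed.
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()
    df = (spark.read
          .format("com.mongodb.spark.sql.DefaultSource")
          .option("uri", MONGO_INPUT_URI)
          .load())
    print(df.count())  # should match the number of rows written
    return df
```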
Open http://localhost:8181 for Mongo Express.
Then click “test” to open the database.
Automated Storage
- Copy the scripts into the volume
The following command creates a directory named “script” in the volume and copies all the required scripts into it:
$ docker run --rm -v /scripts:/script \
    -v project-scripts-volume:/volume busybox \
    cp -r /script/ /volume
- Checking the scripts in the volume:
$ docker run -it --rm -v project-scripts-volume:/volume busybox ls -l /volume/script
- Store the data in HDFS as parquet
Execute the hdfs_store.py script as follows:
docker run -t --rm \
-v project-scripts-volume:/volume \
--network=spark-network \
mjhea0/spark:2.4.1 \
bin/spark-submit \
--master spark://master:7077 \
--class endpoint \
/volume/script/hdfs_store.py
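The contents of hdfs_store.py are not shown in this post; a minimal sketch of what it could look like, assuming the unzipped CSVs live under a hypothetical /volume/data path and HDFS is reachable at hdfs://hadoop as above:

```python
# hdfs_store.py -- sketch: read CSVs from the volume, write parquet to HDFS.
CSV_SOURCE = "/volume/data"  # hypothetical input location on the mounted volume
PARQUET_TARGET = "hdfs://hadoop/acc_data_parquet"

def main():
    # pyspark is imported lazily so the sketch can be read without it installed.
    from pyspark.sql import SparkSession
    spark = (SparkSession.builder
             .appName("hdfs_store")
             .master("spark://master:7077")
             .getOrCreate())
    acc_data = spark.read.csv(CSV_SOURCE, header=True)
    acc_data.write.mode("overwrite").parquet(PARQUET_TARGET)
    spark.stop()

if __name__ == "__main__":
    main()
```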
- Store the data in MongoDB
Execute the mongodb_store.py script as follows:
docker run -t --rm \
-v project-scripts-volume:/volume \
--network=spark-network \
mjhea0/spark:2.4.1 \
bin/spark-submit \
--master spark://master:7077 \
--class endpoint \
--packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.0 \
/volume/script/mongodb_store.py
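Likewise, mongodb_store.py is not shown; a sketch under the same assumptions, using the connector configuration from the interactive session above (the /volume/data input path is hypothetical):

```python
# mongodb_store.py -- sketch: read CSVs from the volume, append into MongoDB.
CSV_SOURCE = "/volume/data"  # hypothetical input location on the mounted volume
MONGO_URI = "mongodb://root:password@mongo/test.coll?authSource=admin"

def main():
    # pyspark is imported lazily so the sketch can be read without it installed.
    from pyspark.sql import SparkSession
    spark = (SparkSession.builder
             .appName("mongodb_store")
             .master("spark://master:7077")
             .config("spark.mongodb.output.uri", MONGO_URI)
             .getOrCreate())
    acc_mongo = spark.read.csv(CSV_SOURCE, header=True)
    (acc_mongo.write
     .format("com.mongodb.spark.sql.DefaultSource")
     .mode("append")
     .save())
    spark.stop()

if __name__ == "__main__":
    main()
```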