Mongo DB + PySpark + Tableau 2021

Xavier Ivan Aguas
Published in Allient · Apr 22, 2021 · 7 min read

COMPLETE GUIDE

In this article, I present a small project that combines three technologies for data analysis and visualization: MongoDB for the database, PySpark for building the ETLs (Extract, Transform, Load), and Tableau for data visualization.

In this study, we analyze the behavior of natural disasters in order to obtain a spatial understanding of the areas with the greatest concentration of events.

The data for the analysis is a compendium of the telluric phenomena recorded from 1965 to 2016, documented through different sensors installed to observe their behavior and impact.

The data were obtained from Kaggle. This is the dataset used for the analysis: earthquakes dataset.

The data set contains 23,412 records that have been taken from the National Earthquake Information Center (NEIC), which is one of the centers in charge of the study and understanding of seismic phenomena and their effects.

This is the diagram of the proposal where the three tools mentioned above are combined.

Image 1. Software Architecture

Before starting, we are going to install MongoDB on a Mac through the following steps:

Prerequisites

Install Xcode Command-Line Tools

xcode-select --install

Install Homebrew

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Installing MongoDB 4.4 Community Edition

brew tap mongodb/brew
brew install mongodb-community@4.4

Alternatively, you can specify a previous version of MongoDB if desired. You can also maintain multiple versions of MongoDB side by side in this manner.

Run MongoDB Community Edition

To verify that the installation was successful, start the database service:

brew services start mongodb-community@4.4

and finally, open the mongo shell:

mongo

Image 2. MongoDB working.

If you installed MongoDB on macOS Catalina, you will probably run into problems. I solved the issue by following this article: Solution

To browse the databases created in MongoDB more comfortably, we use NoSQLBooster for MongoDB, which makes it easier to manage databases and run queries.

Image 3. NoSQLBooster For MongoDB Dashboard.

Installing PySpark

PySpark is the Python API for Apache Spark, an open-source framework for parallel computing on clusters. It is used especially to speed up the iterative computation of large amounts of data or very complex models.

Image 4. PySpark helps you interface with Resilient Distributed Datasets (RDDs) in Apache Spark and Python programming language.

Now we are going to install PySpark on macOS through the following steps:

Install Java 8

Spark requires Java 8:

brew install openjdk@8

and

sudo ln -sfn /usr/local/opt/openjdk@8/libexec/openjdk.jdk /Library/Java/JavaVirtualMachines/openjdk-8.jdk

Install Scala

Apache Spark is written in Scala, so Scala is required to run it.

brew install scala

Install Spark

brew install apache-spark

After installing the packages, it is good to check your system with

brew doctor

Install PySpark

If Python 3 is not already installed, you can install it with Homebrew:

brew install python3

Finally, install the Python Spark API, PySpark:

pip3 install pyspark
pip3 install findspark
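
Once both packages are installed, a quick sanity check is to start a Spark session from Python. This is just a minimal sketch; it assumes the Homebrew and pip installs above completed successfully.

# Quick check that PySpark is usable from Python
import findspark
findspark.init()  # locates the Spark installation and adds it to sys.path

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("install-check").getOrCreate()
print(spark.version)  # prints the installed Spark version
spark.stop()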

Create ETLs with PySpark

Now we are going to write the script that creates the ETLs with PySpark.

Review dataset:

Image 5. Review column names.
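
The post shows this step as a screenshot (Image 5). A minimal sketch of the same step, assuming the Kaggle file is saved locally as database.csv, might look like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("quake-etl").getOrCreate()

# Load the earthquake CSV; keep everything as strings for now
df = spark.read.csv("database.csv", header=True, inferSchema=False)

# Review the column names and a few rows
print(df.columns)
df.show(5)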

Then, we remove the columns that do not matter in this study:

Image 6. Columns for the analysis.
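
Again, the exact selection appears only in the screenshot (Image 6), so the column list below is an assumption; the idea is simply to keep the fields needed for the analysis:

# Keep only the columns relevant to this study (assumed list)
quake_df = df.select(
    "Date", "Latitude", "Longitude", "Type",
    "Depth", "Magnitude", "Magnitude Type", "ID"
)
quake_df.show(5)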

Create a new column with the year of each natural disaster and build a new dataframe:

Image 7. Dataframe with year column.
Image 8. New dataframe with the number of disasters per year.
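
A rough sketch of these two steps, continuing from the previous snippet (the Date format is assumed to be MM/dd/yyyy, e.g. "01/02/1965"; rows in another format become null and are dropped):

from pyspark.sql import functions as F

# Take the year as the last "/"-separated token of the Date string
quake_df = quake_df.withColumn(
    "Year", F.element_at(F.split(F.col("Date"), "/"), -1).cast("int")
).filter(F.col("Year").isNotNull())

# New dataframe with the number of earthquakes recorded per year
quake_freq = quake_df.groupBy("Year").count().withColumnRenamed("count", "Counts")
quake_freq.orderBy("Year").show(5)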

Now, it is important to convert the relevant columns from string to numeric types:

Image 9. Dataframe with transformed variables (string to number).
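
Since the CSV was read as plain strings, the numeric fields can be cast explicitly. A short sketch, continuing from the snippets above:

from pyspark.sql import functions as F

# Cast the string columns that hold numeric values to doubles
for col_name in ["Latitude", "Longitude", "Depth", "Magnitude"]:
    quake_df = quake_df.withColumn(col_name, F.col(col_name).cast("double"))

quake_df.printSchema()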

We calculate the maximum and average magnitude in each year:

Image 10. Second dataframe ready!
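
A possible sketch of this aggregation, joining the yearly maximum and average magnitude onto the per-year counts built earlier:

from pyspark.sql import functions as F

# Maximum and average magnitude recorded in each year
max_avg = quake_df.groupBy("Year").agg(
    F.max("Magnitude").alias("Max_Magnitude"),
    F.avg("Magnitude").alias("Avg_Magnitude"),
)

# Attach the aggregates to the per-year counts (second dataframe)
quake_freq = quake_freq.join(max_avg, on="Year", how="inner")
quake_freq.orderBy("Year").show(5)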

Finally, we send the ETL results to MongoDB with the following code.
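
The original post embeds this step as a gist. A minimal sketch using the MongoDB Spark Connector might look like the following; the connector package version and the collection names (quakes, quake_freq) are assumptions, while the database name Quake is the one used later in Tableau:

# The MongoDB Spark Connector must be on the classpath when the SparkSession
# is created, e.g. by starting PySpark with:
#   pyspark --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1

# Write the cleaned earthquake data to the local MongoDB instance
(quake_df.write.format("mongo")
    .mode("overwrite")
    .option("uri", "mongodb://127.0.0.1:27017/Quake.quakes")
    .save())

# Write the per-year summary dataframe as a second collection
(quake_freq.write.format("mongo")
    .mode("overwrite")
    .option("uri", "mongodb://127.0.0.1:27017/Quake.quake_freq")
    .save())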

Then, we open the NoSQLBooster for MongoDB program and check that the ETL results have been added.

Image 11. ETLs stored in MongoDB.

Tableau

Tableau is a visual analytics platform that transforms the way we use data to solve problems. Plus, it enables individuals and organizations to get the most out of data.

Through simple functions such as drag and drop, anyone can easily access and analyze data, and even create reports and share this information with other users.

Installing Tableau

To install Tableau, you need to download it from the following link: https://www.tableau.com/products/desktop/download

Tableau Desktop: Start your free 14-day trial

Image 12. Tableau Desktop.

Go to the “To a Server” menu and select “Other Databases (ODBC)”. You can see that there are no drivers installed to connect to the MongoDB database.

Image 13. No MySQL drivers installed to connect the MongoDB BI Connector.

MongoDB BI Connector uses MySQL drivers. If you don’t have MySQL drivers installed, you need to install them.

iODBC 3.52.12 or later must be installed on the macOS system before you can install Connector/ODBC. We recommend that you install the latest → MySQL 8.0 driver.

Image 14. Instructions to install MongoDB BI Connector.

To download the driver, you need to specify the computer’s operating system and the version. → DRIVER MAC

Image 15. Download MongoDB Connector for BI (Version 2.14.3 macOS x64).

Install BI Connector on macOS

The MongoDB Connector for Business Intelligence (BI) allows users to create queries with SQL and to visualize, graph, and report on their MongoDB Enterprise data using existing relational business intelligence tools.

To set up MongoDB Connector for BI with a business intelligence tool such as Tableau, follow the steps on this page.

Prerequisites

OpenSSL

To install OpenSSL via Homebrew, run the following commands (if you already installed Homebrew earlier, only the last two are needed):

/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
brew update
brew install openssl

Install the MongoDB Connector for BI

Extract the downloaded .zip archive.

Image 16. Extract the downloaded .zip archive and copy the files in the bin folder.

Then, find the bin folder (system) through the Finder:

  1. Open Finder
  2. Press Command+Shift+G to open the “Go to Folder” dialog
  3. Enter the following path: /usr/local/bin

Image 17. Find the bin folder (system) through Finder.

Copy the files from the bin folder that you unzipped earlier and paste them into the bin folder on your system.

Image 18. Copy files from the downloaded archive into the bin folder.

Install the programs within the bin/ directory into a directory listed in your system PATH. If a prior version exists, overwrite the binaries.

sudo install -m755 bin/mongo* /usr/local/bin/

Finally, open the terminal and write the following code:

mongosqld

mongosqld accepts incoming requests from a SQL client and proxies those requests to a mongod or mongos instance.

You have now launched the BI Connector, and you can access the data in the MongoDB database through Tableau.

Image 19. Launch the BI Connector.

Now, we select the driver shown in Image 20 and specify the database. The database is “Quake”.

Image 20. First configuration to connect MongoDB and Tableau.

We can see that the data stored in MongoDB is read through Tableau quickly and easily.

Image 21. ETLs read from MongoDB.

Based on the processed data, we can observe the area of the globe known as the Ring of Fire. In these areas, investors must consider complementary elements in their supply chains: assets tend to have higher associated prices, and insurers apply more severe policies to classify the risk.

Image 22. Dashboard on Tableau 2021.

By using visual elements such as charts, graphs, and maps, data visualization tools provide an accessible way to view and understand trends, outliers, and patterns in your data.

In the world of big data, data visualization tools and technologies are essential for analyzing large amounts of information and making data-driven decisions.

Thanks for your time reading this post 😃 !!

If you need to know a little more about BIG DATA (analysis and visualization), do not hesitate to contact me on LinkedIn.

Github repo: https://github.com/kabirivan/mongo-pyspark-tableau

Coming Soon
In the second part of this post, we will review machine learning with PySpark. The same natural disaster dataset will be used for that study (earthquake prediction).

Can we help you?

We are ready to listen to you. If you need help creating the next big thing, you can contact our team on our website or at info@jrtec.io
