MongoDB + PySpark + Tableau 2021
COMPLETE GUIDE
On this occasion, I am going to present a very interesting article in which three technologies are combined for data analysis and visualization. To carry out this small project, MongoDB is used for the database, PySpark for the ETL (Extract, Transform, Load) process, and Tableau for data visualization.
In this study, the behavior of natural disasters is analyzed in order to gain a spatial understanding of the areas with the greatest concentration of events.
The data for the analysis is a compendium of the telluric phenomena recorded from 1965 to 2016, documented through different sensors installed to observe their behavior and impact.
The data were obtained from Kaggle; the dataset used for the analysis is the earthquakes dataset.
The dataset contains 23,412 records taken from the National Earthquake Information Center (NEIC), one of the centers in charge of the study and understanding of seismic phenomena and their effects.
This is the diagram of the proposal where the three tools mentioned above are combined.
Before starting, we are going to install MongoDB on macOS through the following steps:
Prerequisites
Install Xcode Command-Line Tools
xcode-select --install
Install Homebrew
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Installing MongoDB 4.4 Community Edition
brew tap mongodb/brew
brew install mongodb-community@4.4
Alternatively, you can specify a previous version of MongoDB if desired. You can also maintain multiple versions of MongoDB side by side in this manner.
Run MongoDB Community Edition
To verify that the installation was successful, start the database:
brew services start mongodb-community@4.4
and finally open the mongo shell:
mongo
If you installed MongoDB on macOS Catalina, you will probably run into problems: Catalina made the root volume read-only, so MongoDB's default /data/db directory can no longer be created there. The problem can be solved by following this article: Solution
To browse the databases created in MongoDB more comfortably, we use NoSQLBooster for MongoDB, which makes it easier to manage databases and run queries.
Installing pySpark
PySpark is the Python API for Apache Spark, an open-source framework for parallel computing on clusters. It is used especially to speed up iterative computation over large amounts of data or very complex models.
Now we are going to install PySpark on macOS through the following steps:
Install Java 8
Spark requires Java 8:
brew install openjdk@8
and
sudo ln -sfn /usr/local/opt/openjdk@8/libexec/openjdk.jdk /Library/Java/JavaVirtualMachines/openjdk-8.jdk
Install Scala
Apache Spark is written in Scala, which is required to run it.
brew install scala
Install Spark
brew install apache-spark
After installing the packages, it is good to check your system with
brew doctor
Install pySpark
If Python 3 is not already installed, you can install it with Homebrew:
brew install python3
Finally, you need to install pySpark, the Python API for Spark:
pip3 install pyspark
pip3 install findspark
Create ETLs with pySpark
Now we are going to configure our file to create ETLs through pySpark.
Review the dataset:
Then, we remove the columns that do not matter in this study:
Create a new column with the year of the natural disaster and create a new dataframe:
Now, it is important to change the data from string to numerical:
We calculate the maximum and average magnitude in each year:
Finally, we send the results of the ETL process to MongoDB.
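A sketch of the load step, assuming a local mongod on the default port and the mongo-spark connector package (the connector version and the collection name "quakes" are assumptions; the database name "Quake" follows this article):

```python
from pyspark.sql import SparkSession

# spark.jars.packages pulls the MongoDB Spark connector at session start;
# spark.mongodb.output.uri points writes at the "Quake" database.
spark = (SparkSession.builder
         .appName("etl-to-mongo")
         .config("spark.jars.packages",
                 "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1")
         .config("spark.mongodb.output.uri",
                 "mongodb://127.0.0.1/Quake.quakes")
         .getOrCreate())

# Stand-in for the aggregated dataframe produced by the previous steps.
df = spark.createDataFrame([("1965", 6.0)], ["Year", "Magnitude"])

# Write the dataframe into the configured MongoDB collection.
(df.write.format("mongo")
   .mode("overwrite")
   .save())
```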
Then, we open the NoSQLBooster for MongoDB program and check that the ETL results have been added.
Tableau
Tableau is a visual analytics platform that transforms the way we use data to solve problems. Plus, it enables individuals and organizations to get the most out of data.
Through simple functions such as drag and drop, anyone can easily access and analyze data, and even create reports and share this information with other users.
Installing Tableau
To install Tableau, you need to download it from the following link: https://www.tableau.com/products/desktop/download.
Tableau Desktop: Start your free 14-day trial
Go to the “To a Server” menu and select “Other Databases (ODBC)”. You will see that there are no drivers installed to connect to the MongoDB database.
MongoDB BI Connector uses MySQL drivers. If you don’t have MySQL drivers installed, you need to install them.
iODBC 3.52.12 or later must be installed on the macOS system before you can install Connector/ODBC. We recommend that you install the latest MySQL 8.0 driver.
To download the driver, you need to specify the computer’s operating system and the version: DRIVER MAC
Install BI Connector on macOS
The MongoDB Connector for Business Intelligence (BI) allows users to create queries with SQL and to visualize, graph, and report on their MongoDB Enterprise data using existing relational business intelligence tools.
To set up MongoDB Connector for BI with a business intelligence tool such as Tableau, follow the steps on this page.
Prerequisites
OpenSSL
To install OpenSSL via Homebrew, run the following command:
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
brew update
brew install openssl
Install the MongoDB Connector for BI
Extract the downloaded .zip archive.
Then, find the bin folder (system) through the Finder:
- Open Finder
- Press Command+Shift+G to open the dialogue box
- Input the following search:
/usr/local/bin
Copy the files from the bin folder that you unzipped earlier and paste into the bin folder on your system.
Install the programs within the bin/ directory into a directory listed in your system PATH. If a prior version exists, overwrite the binaries.
sudo install -m755 bin/mongo* /usr/local/bin/
Finally, open the terminal and write the following code:
mongosqld
mongosqld accepts incoming requests from a SQL client and proxies those requests to a mongod or mongos instance. By default, it connects to the mongod on localhost:27017 and listens for SQL clients on 127.0.0.1:3307.
You are now ready to launch the BI Connector and can access the data in the MongoDB database through Tableau.
Now, we select the driver shown in Image 20 and specify the database. The database is “Quake”.
We can see that the data stored in MongoDB is read through Tableau quickly and easily.
Based on the processed data, we can observe an area of the globe known as the Ring of Fire. In these areas, investors must consider complementary elements in terms of the supply chain: assets tend to have higher associated prices, and insurance providers apply more severe policies to classify the risk.
By using visual elements such as charts, graphs, and maps, data visualization tools provide an accessible way to view and understand trends, outliers, and patterns in your data.
In the world of big data, data visualization tools and technologies are essential for analyzing large amounts of information and making data-driven decisions.
Thanks for your time reading this post 😃 !!
If you need to know a little more about BIG DATA (analysis and visualization), do not hesitate to contact me on LinkedIn.
Github repo: https://github.com/kabirivan/mongo-pyspark-tableau
Coming soon:
In the second part of this post, we will review machine learning with pySpark. The same natural disaster dataset will be used for that study (Prediction EarthQuake).
Can we help you?
We are ready to listen to you. If you need some help for creating the next big thing, you can contact our team on our website or at info@jrtec.io