Visual Exploratory Data Analysis (EDA), Part 1

May 6, 2019

The Bible Corpus, provided by Oswin Rahadiyan Hartono on Kaggle, offers comprehensive indexing for seven different translations of the Bible.

I am going to briefly go over how I use visualization during my EDA.

Libraries I import

First, import Pandas, Seaborn, and Pyplot as pd, sns, and plt, respectively.
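A minimal sketch of that import step, using the aliases named above. It assumes all three libraries are already installed in the environment.

```python
import pandas as pd               # data loading and tabular manipulation
import seaborn as sns             # statistical plotting built on Matplotlib
import matplotlib.pyplot as plt   # figure sizing, axis labels, display
```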

Loading the CSVs using Pandas

**Fig 1.** The Darby English Version (DEV) did not load properly. See the error message below.

**Fig 2.** The Darby English Version also failed to load here, because of the error message below.

**Fig 3.** Because of the error above, we will visualize the contents of six versions instead of all seven.
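One way to keep going when a single file fails to load is to read each CSV inside a try/except and set the failures aside. The sketch below is an assumption about how that could look, not the code from the figures: `load_versions` is a hypothetical helper, and the exception types it catches are a guess, since the post does not say which error the Darby file raised.

```python
import pandas as pd


def load_versions(paths):
    """Try to read each CSV; return (loaded, failed) dicts keyed by version name.

    `paths` maps a version name to a file path (both chosen by the caller).
    """
    loaded, failed = {}, {}
    for name, path in paths.items():
        try:
            loaded[name] = pd.read_csv(path)
        except (OSError, UnicodeDecodeError, pd.errors.ParserError) as exc:
            # Keep the exception so it can be inspected later instead of
            # stopping the whole analysis on one bad file.
            failed[name] = exc
    return loaded, failed
```

Calling it with a dictionary of version names and file paths returns the versions that loaded, plus the error for each one that did not, so the rest of the analysis can proceed with whatever is available.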

Exploring the keys for the translations: version_keys

**Fig 4.** Calling up the dataset features for version keys. There is not much to plot here, but it is good to know how version keys are organized in the dataset.

Next up for exploration: key_abbrev

First, I display the first 10 rows of the dataset using key_abbrev.head(10)

**Fig 5.** I can see the different abbreviations for the book names under the ‘a’ column. It might be worth plotting them to better understand how the books are organized.
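To make the call concrete, here is a small sketch on a toy stand-in for key_abbrev. The column names ‘a’, ‘b’, and ‘p’ come from this post, but the rows are illustrative values typed in by hand, and treating ‘p’ as a 0/1 flag is an assumption.

```python
import pandas as pd

# Toy stand-in for key_abbrev; illustrative values, not the real file.
key_abbrev = pd.DataFrame({
    "a": ["Gen", "Ge", "Gn", "Exo", "Ex", "Lev", "Le", "Num", "Nu", "Deu", "Dt", "Jos"],
    "b": [1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6],   # book ID
    "p": [1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],   # assumed 0/1 flag
})

# head(10) returns the first 10 rows with all columns intact.
first_rows = key_abbrev.head(10)
print(first_rows)
```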

Plotting key_abbrev

**Fig 6.** I use Pyplot and Seaborn to set the figure size, and keyword arguments (kwargs) to set my x and y axes.

**Fig 8.** I use the **font_scale** kwarg to set the scale of the x-axis labels. I also set the ci kwarg to False to remove the confidence interval. See **Fig 9** below.

**Fig 10.** I use Pyplot to set the axis descriptions and sizes. See **Fig 11** below for the resulting graph.

Informed by the graph above, I wanted to see the name and count for Book ID #35 using key_abbrev.

**Fig 12.** Recall the column names above

**Fig 13.** It looks like the ID number under column ‘b’ is collapsed by the default display settings to save screen space.

**Fig 14.** I use this command to raise the maximum number of rows that can be displayed.

**Fig 15.** Now I can see that there is only one abbreviation, ‘Hab’, for the book labeled ID #35.
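A sketch of what such a command and the follow-up lookup might look like. The exact command in the figure is not recoverable, so `pd.set_option("display.max_rows", None)` is an assumption, and the data below is a hand-typed toy frame rather than the real file.

```python
import pandas as pd

# Toy stand-in for key_abbrev; illustrative values, not the real file.
key_abbrev = pd.DataFrame({
    "a": ["Gen", "Ge", "Gn", "Exo", "Ex", "Hab"],
    "b": [1, 1, 1, 2, 2, 35],
    "p": [1, 0, 0, 1, 0, 1],
})

# None removes the row limit, so printing shows every row instead of
# truncating the middle of the frame with "...".
pd.set_option("display.max_rows", None)

# Select only the rows whose book ID is 35.
book_35 = key_abbrev[key_abbrev["b"] == 35]
print(book_35)
print(len(book_35))  # how many abbreviations map to this book ID
```

Filtering on ‘b’ directly avoids scrolling through the full table, though lifting the row limit is still useful for scanning everything once.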

Plotting different features from key_abbrev

**Fig 16.** This time, I use columns ‘b’ and ‘a’ instead of the ‘b’ and ‘p’ used in **Fig 6**.

Figures 17–19 can get pretty messy, but I included them here to show where I started. After trying out multiple scale sizes for the abbreviation names listed on the y axis, I realized that I might not even need them, as the count for each book abbreviation remains consistent.

Notice the stair-shaped graph patterns as we move from the Old Testament books to the New Testament books.

So after trying out a few ways of plotting features ‘a’ and ‘b’, I decided to dramatically increase the figure size, as shown in Fig 20.

Clearly label the axes to explain the story behind the graph using code in Fig 22:

The figures above show how graphs can be used to visualize features of a dataset and supplement EDA.

Stay tuned for Part 2, where I will use different types of graphs to learn more about the rest of the datasets.

Written by M L