Visual Exploratory Data Analysis (EDA) Part 1

M L
3 min read · May 6, 2019

Using a Bible dataset

The Bible Corpus, provided by Oswin Rahadiyan Hartono on Kaggle, offers comprehensive indexing for seven different translations of the Bible.

I am going to briefly go over how I use visualization during my EDA.

Libraries I import

First, I import pandas, Seaborn, and Pyplot as pd, sns, and plt, respectively.
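As a minimal sketch, the imports described above would look roughly like this. The aliases are the ones named in the text; the comments are my own summary of what each library is used for in this post.

```python
import pandas as pd              # DataFrames and CSV loading
import seaborn as sns            # statistical plots built on top of Matplotlib
import matplotlib.pyplot as plt  # figure sizing and axis labels
```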

Loading the CSVs using Pandas

Fig 1. The Darby English Version (DEV) did not load properly. See the error message below.
Fig 2. The Darby English Version also failed to load, due to the error message below.
Fig 3. Because of the error above, we will proceed to visualize the contents of six versions instead of all seven.
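One way to keep going when a single file refuses to load is to collect the failures instead of stopping at the first one. The sketch below is not the code from the figures: the helper name load_translations, the bible_corpus/ folder, and the file names are all assumptions made for illustration, and the set of exceptions caught is a guess at the usual culprits (missing file, encoding problem, malformed rows).

```python
import pandas as pd


def load_translations(paths):
    """Read each CSV in `paths` (name -> file path).

    Returns two dicts: the DataFrames that loaded, and the exceptions
    raised by the files that did not.
    """
    loaded, failed = {}, {}
    for name, path in paths.items():
        try:
            loaded[name] = pd.read_csv(path)
        except (OSError, UnicodeDecodeError,
                pd.errors.ParserError, pd.errors.EmptyDataError) as exc:
            # Keep the error so it can be inspected later, then move on.
            failed[name] = exc
    return loaded, failed


if __name__ == "__main__":
    # Hypothetical paths; adjust to wherever the Kaggle files were saved.
    paths = {
        "kjv": "bible_corpus/t_kjv.csv",
        "dby": "bible_corpus/t_dby.csv",
    }
    loaded, failed = load_translations(paths)
    print("loaded:", sorted(loaded))
    print("failed:", {name: type(exc).__name__ for name, exc in failed.items()})
```

With this pattern, the versions that load can be explored right away while the failing file is investigated separately.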

Exploring the keys for the translations: version_keys

Fig 4. Calling up the dataset features for version_keys. There is not much to plot here, but it is good to know how the version keys are organized in the dataset.

Next up for exploration: key_abbrev

First, I display the first 10 rows of the dataset using key_abbrev.head(10)

Fig 5. I can see the different abbreviations for the book names under the ‘a’ column. It might be worth plotting them to better understand how the books are organized.
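For readers without the dataset at hand, a first look of this kind can be sketched on a small stand-in table. The rows below are made up for illustration, not taken from the real file; only the column names ‘a’, ‘b’, and ‘p’ come from the post.

```python
import pandas as pd

# Made-up stand-in for key_abbrev: 'a' = abbreviation, 'b' = book ID,
# 'p' = a 0/1 flag column (its exact meaning is an assumption here).
key_abbrev = pd.DataFrame({
    "a": ["Gen", "Ge", "Gn", "Exo", "Ex", "Lev",
          "Le", "Num", "Nu", "Deu", "Dt", "Hab"],
    "b": [1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 35],
    "p": [1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

preview = key_abbrev.head(10)        # first 10 rows only
print(preview)
print(key_abbrev.columns.tolist())   # column names
print(key_abbrev.dtypes)             # column types
```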

Plotting key_abbrev

Fig 6. I use Pyplot and Seaborn to set the figure size, and keyword arguments (kwargs) to set my x and y axes.
Fig 7. There is a lot going on here. I would like to clean it up by getting rid of the confidence interval lines and creating a gap between the numbers on the x axis.
Fig 8. I use the font_scale kwarg to set the right scale size for the x axis labels. I also set the ci kwarg to False to get rid of the confidence interval. See Fig 9 below.
Fig 9. I can clearly see the numbers on the x axis, and the graph is not as cluttered as in Fig 7. One more adjustment I would make is to clearly label the axes.
Fig 10. I use Pyplot for the axis descriptions and axis label size. See Fig 11 below for the resulting graph.
Fig 11. Now that the abbreviation keys are clearly graphed and labeled, I can see that Book ID #35 has the highest number of abbreviations.

Informed by the graph above, I wanted to see the name and count for Book ID #35 using key_abbrev.

Fig 12. Recall the column names above
Fig 13. It looks like the rows for the ID numbers under column ‘b’ are truncated by the default display setting to save screen space.
Fig 14. I use this command to raise the maximum number of rows of the dataset that can be displayed.
Fig 15. Now I can see that there is only one abbreviation for the book of ‘Hab’, also labeled as ID #35.
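A sketch of the same check, again on made-up stand-in rows rather than the real file. Here pd.option_context lifts the row limit only inside the with block; pd.set_option("display.max_rows", ...) would change it for the whole session. Filtering on the book ID and counting rows per book avoids scrolling through the full table at all.

```python
import pandas as pd

# Made-up stand-in for key_abbrev.
key_abbrev = pd.DataFrame({
    "a": ["Gen", "Ge", "Gn", "Exo", "Ex", "Hab"],
    "b": [1, 1, 1, 2, 2, 35],
    "p": [1, 0, 0, 1, 0, 1],
})

# Show every row instead of the truncated default view.
with pd.option_context("display.max_rows", None):
    print(key_abbrev)

# Rows for a single book, and the number of abbreviations per book ID.
book_35 = key_abbrev[key_abbrev["b"] == 35]
abbrevs_per_book = key_abbrev["b"].value_counts()

print(book_35)
print(abbrevs_per_book)
```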

Plotting different features from key_abbrev

Fig 16. This time, using columns ‘b’ and ‘a’ instead of ‘b’ and ‘p’ from Fig 6
Fig 17
Fig 18
Fig 19

Figures 17–19 can get pretty messy, but I included them here to show where I started. After trying out multiple scale sizes for the abbreviation names listed on the y axis, I realized that I might not even need them, as the count for each book abbreviation remains consistent.

Notice the stair-shaped graph patterns as we move from the Old Testament books to the New Testament books.

So after trying out a few ways of plotting features ‘a’ and ‘b’, I decided to dramatically increase the figure size, as shown in Fig 20.

Fig 20
Fig 21. This graph clearly shows an increase in the count of Book IDs in the key_abbrev dataset

I then clearly label the axes to explain the story behind the graph, using the code in Fig 22:

Fig 22
Fig 23. Ordered from left to right

The figures above show how graphs can be used to visualize features of a dataset and supplement EDA.

Stay tuned for Part 2, where I will be using different types of graphs to learn more about the rest of the datasets.
