Tanvi Agarwal
Published in The Startup
4 min read · Aug 14, 2020

Simple Data Visualization

Data science is partly the art of storytelling: it shows people who do not work with data how beautiful and useful it can be, by transforming it into an understandable form. Data visualization is one of the strongest tools, or steps, in data science for translating data into a form that everyone can understand.

This post is for beginners who have just started using data visualization for EDA (exploratory data analysis).

What is Data Visualization?

Data visualization is the graphical representation of information and data, making it useful and understandable to everyone. It relies on visual tools such as charts, graphs and maps.

Today we are surrounded by huge amounts of data from every aspect of life: social, technical, personal and medical. To deal with it, data scientists perform various steps to transform the data into a usable form, and data visualization is one of the ways to give data a form that everyone can read. As the saying goes, “A picture is worth a thousand words”, and the same holds for data.

Data visualization is typically used at two points when studying a dataset and training a model: while performing EDA, and later, at the conclusion of the analysis, to check correctness, accuracy, predictions and so on. EDA (Exploratory Data Analysis) is the step in the data science methodology in which the analyst explores the data to become familiar with it, performing whatever manipulations are needed to remove discrepancies. This step isn’t complete without visualizations.

Data visualization is commonly done with libraries such as matplotlib and seaborn, or with applications such as Tableau. In this post I will focus on matplotlib.

>>import matplotlib.pyplot as plt

Dataset Used

Implementing what you learn is the best way to understand it, so I will use an example dataset to show how visualization helps.

Dataset taken is: https://www.kaggle.com/saurograndi/airplane-crashes-since-1908/downloa

Understanding Dataset

Understanding the dataset is the next step before any actual visualization. The dataset “Airplane Crashes Since 1908” has 5268 entries and 13 features.
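As a rough sketch of how those two numbers can be read off with pandas: the snippet below loads a CSV and prints its shape. The three rows are a made-up stand-in so that the snippet runs on its own; only the column names follow the article, and with the real download you would pass the path of the saved file to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical three-row sample, NOT real data from the Kaggle file.
# With the real download, replace `sample_csv` with the path to the saved CSV.
sample_csv = io.StringIO(
    "Date,Time,Operator,Aboard,Fatalities\n"
    "01/15/1950,10:30,Operator A,10,4\n"
    "06/02/1950,,Operator B,20,20\n"
    "03/09/1951,08:00,,5,0\n"
)

# parse_dates turns the Date column into datetime values, which makes
# grouping by year possible later on.
data = pd.read_csv(sample_csv, parse_dates=["Date"])

n_rows, n_cols = data.shape
print(f"entries: {n_rows}, features: {n_cols}")
print(data.dtypes)   # data type of every column
print(data.head())   # first few rows
```

On the full dataset, `data.shape` should report the 5268 entries and 13 features mentioned above.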

Features Include

Let’s see what the dataset looks like:

Dataset

Data Manipulation

Data manipulation is a crucial task: it is the process of changing the data to eliminate discrepancies, removing missing values or replacing them, so that the data is easier to work with and study.

To do this we need to check for discrepancies such as missing values, outliers and so on. In this article I focus on missing values and how to deal with them.

After checking for missing values, I found the following result:

% of Missing Values

This picture shows the percentage of missing values in each feature column of the dataset. The columns with a significant amount of missing data, and therefore candidates for removal, are Time, Flight #, Route, Registration, cn/In and Summary. However, we are not going to remove Summary, as it holds important information about the individual entries.
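One common way to get such per-column percentages is sketched below. This is not necessarily how the original chart was produced, and the toy frame is made up: only the column names follow the article.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration.
data = pd.DataFrame({
    "Date": ["01/15/1950", "06/02/1950", "03/09/1951", "11/20/1951"],
    "Time": [None, "10:30", None, None],
    "Aboard": [10, 20, np.nan, 5],
    "Fatalities": [4, 20, 1, 0],
})

# isnull() marks missing cells as True; the mean of a boolean column is the
# fraction of True values, so multiplying by 100 gives a percentage.
missing_pct = (data.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct.round(2))
```

Sorting in descending order puts the sparsest columns first, which makes the candidates for removal easy to spot.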

So we delete Time, Flight #, Route, Registration and cn/In, and then drop the rows with missing values in the remaining features to get a clean dataset for visualization.
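A minimal sketch of that cleanup step, again on a made-up frame. The column names and the name `data_copy` follow the article; `sparse_columns` is a name introduced here for illustration, and the exact calls used in the original notebook may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the article's column names and invented values.
data = pd.DataFrame({
    "Date": pd.to_datetime(["1950-01-15", "1950-06-02", "1951-03-09"]),
    "Time": [None, "10:30", None],
    "Flight #": [None, None, "101"],
    "Route": ["A - B", None, None],
    "Registration": [None, "X-1", None],
    "cn/In": [None, None, None],
    "Aboard": [10, 20, np.nan],
    "Fatalities": [4, 20, 1],
    "Summary": ["text", "text", "text"],
})

# Columns judged too sparse to keep (Summary is deliberately retained).
sparse_columns = ["Time", "Flight #", "Route", "Registration", "cn/In"]

# Drop the sparse columns first, then drop any row that still has a missing value.
data_copy = data.drop(columns=sparse_columns).dropna().reset_index(drop=True)

print(data_copy)
print(data_copy.isnull().sum())  # every count should now be zero
```

Dropping the sparse columns before calling `dropna()` matters: done the other way round, nearly every row would be discarded because of gaps in columns that are being thrown away anyway.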

Data Visualization

Finally, let’s draw a simple graph to examine the average survival rate.

In the dataset we have after data manipulation, we calculate the survival rate as:

>># Survival rate per crash, in percent
>>data_copy["Survival Rate"] = 100 * (data_copy["Aboard"] - data_copy["Fatalities"]) / data_copy["Aboard"]

>># Mean survival rate over all crashes
>>data_copy_mean = data_copy["Survival Rate"].mean()

>># Mean survival rate per year; assumes data_copy["Date"] already holds datetime values
>>survival_per_year = data_copy.groupby(data_copy["Date"].dt.year)["Survival Rate"].mean()

>>survival_per_year.plot(legend=False)

>>plt.ylabel("Average Survival Rate, %")

>>plt.xlabel("Year")

>>plt.title("Average Survival Rate/Year")

>>plt.xticks(list(range(1908, 2009, 10)), rotation="vertical")

>># Green dashed line marking the overall mean
>>plt.axhline(y=data_copy_mean, color="g", linestyle="--")

>>plt.show()

Running this code gives the following graph:

Graph for Average Survival Rate

The overall average survival rate, drawn as the green dashed line, is ~16.75%.
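To make the arithmetic concrete, here is a small worked example on three made-up crashes (the numbers are invented and are not taken from the dataset). It also shows that the mean over all crashes and the mean of the yearly averages are generally different quantities, so it is worth being clear about which one a reference line represents.

```python
import pandas as pd

# Three hypothetical crashes, two in 1950 and one in 1951.
data_copy = pd.DataFrame({
    "Date": pd.to_datetime(["1950-01-15", "1950-06-02", "1951-03-09"]),
    "Aboard": [10, 20, 4],
    "Fatalities": [5, 20, 1],
})

# Survival rate per crash: 100 * survivors / aboard -> 50.0, 0.0, 75.0
data_copy["Survival Rate"] = (
    100 * (data_copy["Aboard"] - data_copy["Fatalities"]) / data_copy["Aboard"]
)

# Mean over all crashes: (50 + 0 + 75) / 3
overall_mean = data_copy["Survival Rate"].mean()

# Mean per year: 1950 -> (50 + 0) / 2, 1951 -> 75
per_year = data_copy.groupby(data_copy["Date"].dt.year)["Survival Rate"].mean()

# Mean of the yearly means: (25 + 75) / 2
mean_of_years = per_year.mean()

print(per_year)
print(f"overall mean: {overall_mean:.2f}, mean of yearly means: {mean_of_years:.2f}")
```

In the article's code the horizontal line is drawn at `data_copy["Survival Rate"].mean()`, which corresponds to `overall_mean` in this sketch.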

Conclusion:

  1. Identified missing values.
  2. Dealt with missing values.
  3. Performed visualization using the matplotlib.pyplot library.
  4. Found the average survival rate with the help of the graph: ~16.75%.

For the full code, see: https://www.kaggle.com/tanvi05/datavisualization-airplane-crashes-since-1908


A techie gal with a writer's heart, just trying to help everyone as I learn, explore and share.