How can you make up your data? (Part-2)

Ahmet Talha Bektaş
6 min read · Nov 21, 2022


Photo by Amy Shamblen on Unsplash

This is the second part of a two-part series. If you want, you can start with the first part.

Seaborn

As we mentioned in the previous part, data scientists generally use the seaborn library for data visualization.

This graph was created by me from the Kaggle dataset.

Seaborn is a library for creating statistical graphs. It is built on top of matplotlib, so we can also use matplotlib tools with seaborn.

Before we start coding, you can find the notebook for this article in my Kaggle or on my GitHub.

Let’s import the necessary libraries!

import matplotlib.pyplot as plt
#We imported matplotlib as "plt".
%matplotlib inline
import seaborn as sns
#We imported seaborn as "sns".

Seaborn provides example datasets (they are downloaded the first time you load them). If you want to see the dataset names:

sns.get_dataset_names()

Output:

I will use the titanic dataset again, but this time I will load the version that comes with seaborn.

titanic=sns.load_dataset("titanic")

EDA

If you don’t know what “EDA” is, I highly recommend reading this article.
Or if you want to learn more about filtering data, you should read this article.

titanic.head()

Output:

titanic.tail()

Output:

titanic.sample(5)

Output:

titanic.info()

Output:

titanic.shape

Output:

titanic.isnull().sum()

Output:

I haven’t shown you how to fill in missing data yet; thus, I will not fill in the missing age values. Instead, I will keep only the rows where age is not null.

titanic_full_age=titanic[titanic["age"].notnull()]
titanic_full_age.isnull().sum()

Output:
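The same filtering can also be written with pandas' dropna. Here is a minimal sketch on a small made-up DataFrame (the name `toy` and its values are assumptions for illustration, not the real titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the titanic data.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "fare": [7.25, 8.05, 53.10, 13.00],
})

# Boolean-mask filtering, as used in this article.
by_mask = toy[toy["age"].notnull()]

# dropna with subset= only checks the listed columns for missing values.
by_dropna = toy.dropna(subset=["age"])

# Both approaches keep exactly the same rows.
print(by_mask.equals(by_dropna))
print(len(toy), "rows before,", len(by_dropna), "rows after")
```

Both versions leave missing values in other columns untouched, because only "age" is checked.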

lineplot

To monitor changes over both short and long time periods, line plots are utilized. Line plots perform better than bar graphs when there are smaller changes. Line plots can be used to compare changes for multiple groups over the same time period.

sns.lineplot(data=titanic_full_age[["age","fare"]])
#We asked for a line plot of "titanic_full_age" with only 2 columns.
#With data in this wide form, seaborn draws one line per column
#and automatically puts the DataFrame index on the x-axis.

Output:

sns.lineplot(data=titanic_full_age,x="age",y="fare");
#Our data is "titanic_full_age"

#Put "age" column values on the x-axis.

#Put "fare" column values on the y-axis.

#In addition if you put ";" at the end of the code,
#you will not see something like "<AxesSubplot:xlabel='age', ylabel='fare'>"

Output:
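Many passengers share the same age, so there are several fare values for one x position. By default, lineplot draws the mean of those values as the line and a confidence band around it. The small sketch below reproduces that averaging step with pandas on made-up numbers (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy data: several "fare" values share the same "age".
toy = pd.DataFrame({
    "age":  [20, 20, 30, 30, 30, 40],
    "fare": [10.0, 30.0, 5.0, 10.0, 15.0, 80.0],
})

# One mean "fare" per distinct "age": these are the points
# that the line passes through with the default settings.
mean_fare = toy.groupby("age")["fare"].mean()
print(mean_fare)
```

If you want every observation drawn without averaging, lineplot accepts estimator=None.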

Seaborn has built-in styles, which you select with set_style. A style controls the background and the grid of the figure. The five built-in styles are darkgrid, whitegrid, dark, white, and ticks.

sns.set_style("darkgrid")
sns.lineplot(data=titanic_full_age,x="age",y="fare");

Output:

sns.set_style("whitegrid")
sns.lineplot(data=titanic_full_age,x="age",y="fare");

Output:

sns.set_style("white")
#Switching to the plain white style (no grid) for the following graphs.

scatterplot

Use a scatter plot to see how two numeric variables relate to each other; each point is one observation.

sns.scatterplot(x="age",y="fare",hue="survived",data=titanic_full_age);
#Create a scatter plot and colour the points by the "survived" column

Output:

histplot

One common graphing tool is the histogram. It is employed to present interval-scaled summaries of discrete or continuous data. It is frequently used to conveniently depict the main characteristics of the data distribution.

sns.histplot(data=titanic_full_age, x="who");
#Create a histogram and put "who" column values on the x-axis.

Output:

sns.histplot(data=titanic_full_age, y="who", fill=False);
#Create a histogram and put "who" column values on the y-axis.
#Do not fill the inside of rectangles!

Output:

sns.histplot(data=titanic_full_age, x="who",kde=True);
#kde=kernel density estimate
#We can easily see the distribution by "kde"

Output:

sns.histplot(data=titanic_full_age, x="pclass", hue="survived");
#We can also split the bars by another column with "hue".
#In this case, we compare "pclass" values for each value of the "survived" column.

Output:

However, where the blue and orange bars overlap, the colours blend into a dark shade that is hard to read.

sns.histplot(data=titanic_full_age, x="pclass", hue="survived",multiple="stack");
#You can use the multiple="stack" parameter to stack the bars on top of each other.

Output:

Or

sns.histplot(data=titanic_full_age, x="pclass", hue="survived", element="step");
#Or you can use the element="step" parameter to draw the histogram as steps.

Output:

displot

displot is very similar to histplot: it is a figure-level function that draws a histogram by default.

sns.displot(titanic_full_age.age,bins=10,kde=True);
#bins=n means split the data into "n" intervals, so the graph has "n" rectangles.

Output:

sns.displot(titanic_full_age.age,bins=50,kde=True);

Output:
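To see what the bins parameter does to the data, you can reproduce the binning step with NumPy. This is only a conceptual sketch on made-up ages (the `ages` array is an assumption), not seaborn's exact internal code:

```python
import numpy as np

# Hypothetical ages, not taken from the titanic dataset.
ages = np.array([2, 5, 18, 22, 25, 31, 38, 40, 54, 80])

# bins=4 splits the range from min to max into 4 equal-width intervals.
counts, edges = np.histogram(ages, bins=4)

print(edges)   # 5 edges delimit 4 bins
print(counts)  # number of ages falling into each bin
```

More bins give narrower rectangles and a more detailed, but also noisier, picture of the same data.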

boxplot

Boxplots are used to display the distributions of numerical data values, particularly when comparing them across various groups. They are designed to give high-level information at a glance and provide details like the symmetry, skew, variance, and outliers of a set of data.

sns.boxplot(x="age",y="sex",data=titanic_full_age,hue="survived");
#Creating a boxplot

Output:
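The parts of a boxplot can be computed by hand. The sketch below uses the common 1.5 × IQR rule for the whiskers, which is also the default in seaborn, on made-up numbers (the `values` series is an assumption for illustration):

```python
import pandas as pd

# Hypothetical values with one obvious outlier.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

q1 = values.quantile(0.25)      # lower edge of the box
median = values.quantile(0.50)  # line inside the box
q3 = values.quantile(0.75)      # upper edge of the box
iqr = q3 - q1                   # interquartile range = height of the box

# Points beyond these fences are drawn as individual outlier markers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3, iqr)
print(outliers.tolist())
```

The whiskers end at the most extreme data points that are still inside the fences.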

countplot

To display the number of observations in each category as bars, use the countplot function.

sns.countplot(x=titanic_full_age.embarked);
#It is automatically counting values. You don't need to write ".value_counts()"

Output:

sns.countplot(y=titanic_full_age.embarked);
#Passing the column directly to y, without using the data parameter

Output:

sns.countplot(x=titanic_full_age.embarked,hue=(titanic_full_age["survived"]));

Output:
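The bar heights of a countplot are the same numbers that ".value_counts()" returns. Here is a minimal sketch on a made-up column (the `embarked` series below is an assumption, not the real data):

```python
import pandas as pd

# Hypothetical embarkation ports for six passengers.
embarked = pd.Series(["S", "C", "S", "Q", "S", "C"])

# One count per category, sorted from most to least frequent.
counts = embarked.value_counts()
print(counts)
```

This is useful when you need the exact numbers in addition to the picture.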

barplot

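A barplot lets you quickly compare a numeric variable across groups. By default, each bar shows the mean of the y values in that group, and the line on top of each bar is an error bar.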

sns.barplot(data=titanic_full_age, x="pclass", y="age");
#creating a bar plot

Output:

sns.barplot(data=titanic_full_age, x="pclass", y="age",hue="survived");

Output:

heatmap

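A heatmap shows a matrix of values as a grid of coloured squares. This is a very important graph type that data scientists use a lot, most often to display a correlation matrix.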

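sns.heatmap(titanic_full_age.corr(numeric_only=True), annot = True);
#heatmap expects a matrix of values; here we give it the correlation matrix.
#numeric_only=True (pandas 1.5+) skips text columns such as "sex"; newer pandas versions raise an error without it.
#annot=True means show the correlation values in the squares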

Output:

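sns.heatmap(titanic_full_age.corr(numeric_only=True), annot = False,cmap="crest");
#numeric_only=True (pandas 1.5+) keeps only the numeric columns for the correlation
#annot=False means do not show numbers inside of squares
#cmap=colour map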

Output:
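If you are not sure what the correlation matrix looks like before it is drawn, you can print it first. This is a minimal sketch on a made-up DataFrame (the `toy` frame and the "deck_rank" column are assumptions for illustration); it needs pandas 1.5 or newer because of numeric_only:

```python
import pandas as pd

# Hypothetical toy frame with numeric columns and one text column.
toy = pd.DataFrame({
    "age": [10, 20, 30, 40],
    "fare": [5.0, 10.0, 15.0, 20.0],   # rises together with age
    "deck_rank": [4, 3, 2, 1],         # hypothetical column, falls as age rises
    "sex": ["male", "female", "female", "male"],
})

# numeric_only=True leaves the text column "sex" out of the calculation.
corr = toy.corr(numeric_only=True)
print(corr)
```

Values close to 1 mean two columns rise together, values close to -1 mean one falls as the other rises, and values near 0 mean no linear relationship. The heatmap simply colours these numbers.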

pairplot

sns.pairplot(titanic, hue="survived");
#Creating a pair plot.

plt.savefig("titanic.png",dpi=300)
#saving graph by matplotlib

Output:

sns.pairplot(titanic, hue="survived", kind="kde");
#By default, the panels outside the diagonal are scatter plots,
#but you can change this with the "kind" parameter.

#Valid values for kind are 'scatter', 'kde', 'hist' and 'reg'.

plt.savefig("titanic_kde.png",dpi=300)

Output:

Now you know how to make up your data. If you want to become more of an expert in seaborn, you should explore this website.

Conclusion

Data visualization is one of the most crucial skills for a data scientist. With these two parts, you have learned how to present your data with graphs. I hope you enjoyed it 😊

“How we can deal with missing values?” has been published! You should read that article to learn how to handle missing values!

Author:

Ahmet Talha Bektaş

If you want to ask me anything, you can easily contact me!

📧My email

🔗My LinkedIn

💻My GitHub

👨‍💻My Kaggle

📋 My Medium
