Day 7 of 100DaysofML

Published in 100DaysofMLcode · 6 min read · Jun 23, 2020

Visualization in Python. This is a fundamentally important topic because we work with data at every step of our journey in Data Science or Machine Learning. Understanding the data is essential, and plotting it is one of the most direct ways to do that.
Two of the most prominent plotting libraries in Python are matplotlib and seaborn. The key difference between the two is the level of abstraction. Matplotlib is the lower-level library and is commonly used for basic plots such as bar graphs and pie charts, while seaborn is built on top of matplotlib and provides higher-level functions for statistical visualizations. We shall discuss the code and a few simple examples below. Another key thing to note is that the two can be used together on the same plot.
Some commonly used matplotlib.pyplot functions, which also work on seaborn plots because seaborn draws onto matplotlib figures, are:
1. .plot() - Used to draw a line graph from the data passed to it
2. .xlabel({Whatever x-axis label}) - Used to specify the label on the x-axis
3. .ylabel({Whatever y-axis label}) - Used to specify the label on the y-axis
4. .show() - Used to display the graph that has been created
5. .axis() - Used to get or change the axis limits
6. .figure() - Used to create a new figure
7. .title() - Used to give a title to the plot or graph
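
To see how the functions listed above fit together, here is a minimal sketch. The years and scores are made-up values used only for illustration, and the labels are arbitrary.

```python
import matplotlib.pyplot as plt

# Made-up sample values, purely for illustration
years = [2016, 2017, 2018, 2019, 2020]
scores = [3, 5, 4, 6, 8]

fig = plt.figure(figsize=(8, 4))   # new figure, size given as (width, height) in inches
plt.plot(years, scores)            # draw the data as a line
plt.xlabel("Year")                 # label on the x-axis
plt.ylabel("Score")                # label on the y-axis
plt.title("Score by year")         # title of the plot
plt.axis([2015, 2021, 0, 10])      # axis limits as [xmin, xmax, ymin, ymax]
# In a plain script (outside a notebook), call plt.show() here to display the figure
```

In a notebook with %matplotlib inline, the figure appears below the cell without an explicit call to plt.show().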

Alright. Let's get our hands dirty. We are going to do just the basics so that it is easy to follow along, covering the essentials that most people use for data visualization in Data Science.
Let's start by importing our libraries into our environment. Since we will be using a mix of matplotlib.pyplot and seaborn, install (if you haven’t) and import both of them. We also import pandas, which is used below to read the datasets.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

%matplotlib inline is a Jupyter/IPython magic command that makes the plots created by matplotlib appear inline in the notebook, directly below the cell that produced them.
The next important step is your DATA. Import your data using pandas. If you need help with using pandas, refer to Day 6 of my publication. Get your own datasets imported and adjust the variable names to match my code. I shall be importing my own datasets, but I shall include a snippet of each dataset as well. Alright then.

First, I’m going to start off by using a FIFA dataset which contains the statistics of teams recorded over different years. A snippet of its head() is given below:

Snippet of first 5 values of **FIFA dataset**

1. Lineplot: A lineplot is a great way of understanding the variation in data over a period of time. The code below shows how it is used.

fifa_data=pd.read_csv('fifa.csv',sep=',')
plt.figure(figsize=(16,6))
sns.lineplot(data=fifa_data)

figsize=(16,6) defines the width and height of the figure in inches, in that order.
.lineplot(data={data}) creates the lineplot of the above data, and it can be visualized as:

Lineplot of different teams over the given years

Lineplots are very useful when we want to compare trends in data over a period of time and, as I mentioned earlier, seaborn (imported as sns) makes such plots easy to create.

Let's look at another example using Spotify data on the songs that are being played. We pass parse_dates=True so that pandas parses the "Date" index as dates instead of plain strings.

spotify_data = pd.read_csv(spotify_filepath, index_col="Date", parse_dates=True)
spotify_data.head()
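
To check what index_col and parse_dates actually do, here is a quick sketch that reads a tiny inline CSV instead of a file. The song names and stream counts are made up, and io.StringIO only stands in for a file path.

```python
import io
import pandas as pd

# Tiny inline CSV standing in for the Spotify file (all values are made up)
csv_text = """Date,Song A,Song B
2017-01-06,100,200
2017-01-07,110,190
2017-01-08,120,180
"""

# index_col="Date" makes the Date column the row index;
# parse_dates=True asks pandas to parse that index as dates
streams = pd.read_csv(io.StringIO(csv_text), index_col="Date", parse_dates=True)

print(type(streams.index))   # shows the index type after parsing
print(streams.head())
```

With a date index in place, seaborn can treat the x-axis as time rather than as a set of unrelated text labels.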

sns.lineplot(data=spotify_data)

On plotting the lineplot of the data, we obtain:

2. Barcharts: You have most likely worked with barcharts before and, in case you haven't, I’d suggest reading up on what barcharts are used for.
For the following barcharts, I’ve used a flight delay dataset. You can find any of these datasets on the internet.

flight_data=pd.read_csv('flight.csv',sep=',')

The syntax of barplot is pretty straightforward. It has 2 main parameters: x for the values that go on the x-axis and y for the values that go on the y-axis. Here, we take only one specific column ('NK') and plot it against the index of the dataframe.
sns.barplot(x={},y={})

plt.figure(figsize=(10,6))
plt.title("Average Arrival Delay for Spirit Airlines Flights, by Month")
sns.barplot(x=flight_data.index, y=flight_data['NK'])
plt.ylabel("Arrival delay (in minutes)")

Barplot for flight_data for the given column

You can always play around with the columns used for the x and y axes to get more familiar.
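
Here is a self-contained sketch of the same barplot with made-up numbers. It assumes the month is used as the index of the dataframe, which is what makes the index a sensible choice for the x-axis. The "Month" column name and the delay values are assumptions for illustration only.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up delays; "Month" and the carrier codes are assumed column names
delays = pd.DataFrame(
    {"Month": [1, 2, 3, 4], "NK": [11.4, 5.2, 3.6, 7.3], "AA": [6.9, 7.5, 6.1, 5.8]}
).set_index("Month")

plt.figure(figsize=(10, 6))
plt.title("Average arrival delay for carrier NK, by month (made-up data)")
ax = sns.barplot(x=delays.index, y=delays["NK"])   # one bar per month
plt.ylabel("Arrival delay (in minutes)")
```

Swapping delays["NK"] for delays["AA"] plots the other carrier without changing anything else.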

3. Heatmaps: A heatmap displays a table of numbers as a grid of color-coded cells, which makes patterns in the data easy to spot. I shall explain with an example. Let us take the same dataset and plot a heatmap for it:

plt.figure(figsize=(10,7))
plt.title("Average Arrival Delay for Each Airline, by Month")
sns.heatmap(data=flight_data, annot=True)
plt.xlabel("Airline")

annot=True - This ensures that the values for each cell appear on the chart.

What can we infer from these?
With seaborn's default colormap, darker cells represent lower values. The months towards the end of the year are relatively dark, which suggests that the airlines are, on average, better at keeping to schedule during this period. The lightest, almost white cells are the largest delays, and they stand well outside the range of the other values.
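
For a version you can run without the flight file, here is a small sketch with made-up delays. It assumes every column is numeric and that the months form the index, since a heatmap colors every cell of the dataframe it receives. The fmt=".1f" argument is my own addition to keep the annotations to one decimal place.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up delays: rows are months, columns are (hypothetical) airline codes
delays = pd.DataFrame(
    {"AA": [6.9, 7.5, 6.1], "NK": [11.4, 5.2, 3.6], "UA": [6.4, -0.3, 1.9]},
    index=pd.Index([1, 2, 3], name="Month"),
)

plt.figure(figsize=(6, 4))
plt.title("Average arrival delay for each airline, by month (made-up data)")
ax = sns.heatmap(data=delays, annot=True, fmt=".1f")   # annot=True writes each value in its cell
ax.set_xlabel("Airline")
```

Negative values, such as flights that arrive early on average, are colored on the same scale as the rest.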

4. Scatter Plots: As the term suggests, a scatter plot draws each observation as a point at its (x, y) position. It helps in visually understanding the relationship between two variables in a large dataset and can also be used to identify outliers.
For the following plot, we are using an insurance based dataset:

insurance_data = pd.read_csv(insurance_filepath)

The insurance data is read and then the head() is displayed as:

The syntax for scatter plots is also quite straightforward: the columns to use for x and y are passed in.
sns.scatterplot(x={},y={})

sns.scatterplot(x=insurance_data['bmi'],y=insurance_data['charges'])

There are many variations with lines and colors that can be applied to scatter plots, and there is a lot to explore once you get hands-on. Some related functions are .swarmplot(), .lmplot() etc.

Scatter plot for the given insurance dataset
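
As a rough sketch of the "lines and colors" variations, the example below colors the points by a category using hue and adds a regression line with regplot, which is the axes-level counterpart of lmplot. The data is made up, and the "smoker" column is a hypothetical category that I am assuming for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up stand-in for the insurance data; "smoker" is an assumed column
insurance = pd.DataFrame({
    "bmi": [22.1, 27.9, 33.0, 25.7, 30.4, 36.2],
    "charges": [3200, 16800, 4400, 3900, 38000, 42000],
    "smoker": ["no", "yes", "no", "no", "yes", "yes"],
})

fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(12, 5))

# Colors: hue assigns one color per category
sns.scatterplot(x=insurance["bmi"], y=insurance["charges"],
                hue=insurance["smoker"], ax=ax_left)

# Lines: regplot draws the points plus a fitted regression line
sns.regplot(x=insurance["bmi"], y=insurance["charges"], ax=ax_right)
```

The left plot shows whether the groups separate, while the right one shows the overall trend between the two columns.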

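5. Histograms: Histograms look quite similar to bar plots but there are significant differences: a histogram groups the values of a single numeric column into bins and shows how many values fall into each bin. For this example, we are using the IRIS dataset: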

iris_data = pd.read_csv(iris_filepath, index_col="Id")
iris_data.head()

sns.distplot(a=iris_data['Petal Length (cm)'], kde=False)

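The above code plots the histogram for us.
a= chooses the column we'd like to plot (in this case, we chose 'Petal Length (cm)').
kde=False is something we'll always provide when creating a histogram with distplot, as leaving it out also draws a kernel density estimate (KDE) curve over the bars. Note that distplot is deprecated as of seaborn 0.11, where sns.histplot is the suggested replacement for histograms.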

Histogram for the given **‘Petal Length’** from dataset

That’s it for today. Keep Learning.

Cheers.


Written by Charan Soneji