Data Visualization

Naveen Kumar
Published in Praemineo
8 min read · Nov 27, 2018

Data visualization refers to the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization is an accessible way to see and understand trends, outliers, and patterns in data.

There are many ways to visualize data. How do we decide which visualization is best for a specific data set?

It depends on what you would like to show:

→ Comparison

→ Distribution

→ Relationship

→ Composition

Let’s discuss some Python data visualization libraries.

Matplotlib: Matplotlib is a Python library used to create 2D graphs and plots from Python scripts. Its pyplot module makes plotting easy by providing features to control line styles, font properties, axis formatting, and so on. It supports a wide variety of graphs and plots, including histograms, bar charts, and scatter plots.

Bar Plots: A bar plot is a chart that uses bars to show comparisons between categories of data. Bar plots are most effective when you are trying to visualize categorical data that has few (probably < 10) categories.

Regular Bar Plot:

Bar plots are generally used for comparison, so let’s take an example in which we compare some programming languages by their usage.
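A regular bar plot along these lines can be sketched as follows. The language names and usage values are invented for illustration and are not the figures from the original example.

```python
import matplotlib.pyplot as plt

# Hypothetical usage scores, invented for illustration only (not survey data)
languages = ["Python", "C++", "Java", "Perl", "Scala", "Lisp"]
usage = [10, 8, 6, 4, 2, 1]

fig, ax = plt.subplots()
# One bar per category; the bar height encodes the usage value
bars = ax.bar(languages, usage, color="steelblue")
ax.set_ylabel("Usage")
ax.set_title("Programming language usage")
```

In a script, calling plt.show() afterwards displays the figure.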

Grouped Bar Plot: Grouped bar plots allow us to compare multiple categorical variables.

Take an example in which we want to analyze gaming scores separately for men and women, where the men’s and women’s bars together form a group.
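One way to draw such a grouped bar plot is to shift the two series half a bar width to either side of each group position. This is a minimal sketch; the group labels and scores are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical game scores per group, invented for illustration
groups = ["G1", "G2", "G3", "G4", "G5"]
men_scores = [20, 35, 30, 35, 27]
women_scores = [25, 32, 34, 20, 25]

x = np.arange(len(groups))  # one position per group
width = 0.35                # width of each individual bar

fig, ax = plt.subplots()
# Shift each series by half a bar width so the two bars sit side by side
men_bars = ax.bar(x - width / 2, men_scores, width, label="Men")
women_bars = ax.bar(x + width / 2, women_scores, width, label="Women")
ax.set_xticks(x)
ax.set_xticklabels(groups)
ax.set_ylabel("Score")
ax.set_title("Scores by group and gender")
ax.legend()
```

In a script, calling plt.show() afterwards displays the figure.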

Stacked Bar Plot: A stacked bar graph (or stacked bar chart) is a chart that uses bars to show comparisons between categories of data, with the added ability to break down and compare parts of a whole. Each bar in the chart represents a whole, and segments in the bar represent different parts or categories of that whole.

Let’s take the same example used for the grouped bar plot, but now we want to see the men’s and women’s scores stacked into a total score for each group, so that we can compare groups with one another.
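A stacked version can be sketched by drawing the second series on top of the first with the bottom argument of bar. The scores are the same kind of hypothetical values as before, invented for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical game scores per group, invented for illustration
groups = ["G1", "G2", "G3", "G4", "G5"]
men_scores = [20, 35, 30, 35, 27]
women_scores = [25, 32, 34, 20, 25]

fig, ax = plt.subplots()
men_bars = ax.bar(groups, men_scores, label="Men")
# bottom= starts each women's segment where the men's segment ends
women_bars = ax.bar(groups, women_scores, bottom=men_scores, label="Women")
ax.set_ylabel("Score")
ax.set_title("Total score by group")
ax.legend()
```

The full height of each bar is the group total, and the coloured segments show how much each gender contributes to it.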

Line Plots: A line chart or line graph displays the evolution of one or several numeric variables. Data points are connected by straight line segments. It is similar to a scatter plot except that the measurement points are ordered (typically by their x-axis value) and joined by straight line segments. A line chart is often used to visualize a trend in the data over intervals of time – a time series – thus the line is often drawn chronologically.

Eg: Let’s take a data set that gives the populations of India and Pakistan from 1960 to 2010. A line graph helps us analyze how their populations changed from 1960 to 2010.
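A line plot of this kind might look like the sketch below. The two series use placeholder values under the neutral labels "Country A" and "Country B"; they are not real population figures and would be replaced with the actual data.

```python
import matplotlib.pyplot as plt

years = [1960, 1970, 1980, 1990, 2000, 2010]
# Placeholder populations in millions, invented for illustration only
country_a = [100, 120, 145, 175, 210, 250]
country_b = [40, 50, 62, 77, 95, 116]

fig, ax = plt.subplots()
# Points are ordered by year and joined by straight line segments
ax.plot(years, country_a, marker="o", label="Country A")
ax.plot(years, country_b, marker="o", label="Country B")
ax.set_xlabel("Year")
ax.set_ylabel("Population (millions)")
ax.set_title("Population over time")
ax.legend()
```

In a script, calling plt.show() afterwards displays the figure.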

Pie charts: A Pie Chart (or Pie Graph) is a special chart that uses “pie slices” to show relative sizes of data. The chart is divided into sectors, where each sector shows the relative size of each value.

Eg: Take four programming languages, each with a size (think of the size as a score on an exam paper). The pie chart divides the circle into sectors in proportion to each language’s share of the total size.
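A pie chart for four languages could be sketched as follows. The language names and scores are hypothetical; the shares list simply makes explicit the percentages that the chart displays.

```python
import matplotlib.pyplot as plt

# Hypothetical scores, invented for illustration
languages = ["Python", "Java", "C++", "JavaScript"]
scores = [40, 30, 20, 10]

# Each sector's angle is proportional to its share of the total
shares = [100 * s / sum(scores) for s in scores]

fig, ax = plt.subplots()
ax.pie(scores, labels=languages, autopct="%1.1f%%", startangle=90)
ax.axis("equal")  # equal aspect ratio so the pie is drawn as a circle
ax.set_title("Share of total score")
```

In a script, calling plt.show() afterwards displays the figure.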

Histograms: A histogram is a type of graph that is widely used in mathematics, especially in statistics. The histogram represents the frequency of occurrence of specific phenomena which lie within a specific range of values, which are arranged in consecutive and fixed intervals. The frequency of the data occurrence is represented by a bar, hence it looks very much like a bar graph.

Normal Histogram: Something very important to note is that histograms are not bar charts. In a bar chart, the height of the bar represents a numerical value, but each bar itself represents a category, something that cannot be counted, averaged, or summed. In a histogram, each bar covers a range of a numeric variable, and its height is the number of values that fall in that range.

Eg: Let’s take a simple example for better understanding: we have data with numeric values and want to see its distribution using a normal histogram.

import matplotlib.pyplot as plt
from numpy.random import normal
# Draw 1000 samples from a standard normal distribution
gaussian_numbers = normal(size=1000)
plt.hist(gaussian_numbers)
plt.title("Gaussian Histogram")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()

Seeing one distribution is helpful to give us a shape of the data, but how about two?

Overlaid Histogram: An overlaid histogram compares the distributions of two numeric variables by drawing them on the same axes.

Eg: The following example shows the distributions of Gaussian and uniform data.

import matplotlib.pyplot as plt
from numpy.random import normal, uniform
gaussian_numbers = normal(size=1000)
uniform_numbers = uniform(low=-3, high=3, size=1000)
# density=True replaces the removed normed=True and scales each histogram to unit area
plt.hist(gaussian_numbers, bins=20, histtype='stepfilled', density=True, color='b', label='Gaussian')
# alpha=0.5 makes the second histogram translucent so the overlap stays visible
plt.hist(uniform_numbers, bins=20, histtype='stepfilled', density=True, color='r', alpha=0.5, label='Uniform')
plt.title("Gaussian/Uniform Histogram")
plt.xlabel("Value")
plt.ylabel("Probability density")
plt.legend()
plt.show()

The two distributions cover a similar range of values but have different shapes (the third color is where they overlap).

Seaborn: Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

Scatter Plots: Scatter plots are used to plot data points on a horizontal and a vertical axis in the attempt to show how much one variable is affected by another.

Eg: Let’s take the tips dataset, which has total_bill and tip columns, and look at the relationship between these two values.

import seaborn as sns; sns.set()
import matplotlib.pyplot as plt
tips = sns.load_dataset("tips")
ax = sns.scatterplot(x="total_bill", y="tip", data=tips)
plt.show()

Colour Groupings: Colour grouping distinguishes categories within the data. Taking the same example, we can plot total_bill against tip with the points coloured by time, Lunch or Dinner.

ax = sns.scatterplot(x="total_bill", y="tip", hue="time",
                     data=tips)

Size Encoding: Size encoding maps an additional variable to the marker size, so larger values are drawn as larger points.

Eg: Taking the same total_bill and tip example, this plot also encodes the size column (the number of people in the party) as both the colour and the size of each point.

cmap = sns.cubehelix_palette(dark=.3, light=.8, as_cmap=True)
ax = sns.scatterplot(x="total_bill", y="tip",
                     hue="size", size="size",
                     sizes=(20, 200), palette=cmap,
                     legend="full", data=tips)

Heatmap: A heatmap is a two-dimensional graphical representation of data where the individual values that are contained in a matrix are represented as colors.

Eg: Let’s take the flights dataset, which has the number of passengers by month and year, and analyze the passenger count month-wise for each year.

import seaborn as sns; sns.set()
flights = sns.load_dataset("flights")
# Reshape to a month x year table; recent pandas versions require keyword arguments here
flights = flights.pivot(index="month", columns="year", values="passengers")
ax = sns.heatmap(flights)

Annotate each cell with the numeric value using integer formatting:

ax = sns.heatmap(flights, annot=True, fmt="d")

This heatmap shows that there were 112 passengers in January 1949, 115 in January 1950, and so on.

Box Plots: The box plot is probably one of the most common types of graphic. It gives a nice summary of one or several numeric variables. The line that divides the box into two parts represents the median of the data. The ends of the box show the upper and lower quartiles. The extreme lines show the highest and lowest values excluding outliers. Note that a box plot hides the number of values behind each box.

Eg: For a better understanding of box plots, let’s take the tips dataset, which has a total_bill column, and use a box plot to analyze the statistical distribution of total_bill.
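The numbers a box plot draws can be computed directly. The sketch below uses a small hypothetical sample and assumes the common 1.5 × IQR rule for outliers, which is the default in Matplotlib and Seaborn.

```python
import numpy as np

# Hypothetical sample, invented for illustration
values = np.array([2, 4, 4, 5, 6, 7, 8, 9, 10, 30])

# Box: lower quartile, median, upper quartile
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are treated as outliers
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
inliers = values[(values >= low_fence) & (values <= high_fence)]
outliers = values[(values < low_fence) | (values > high_fence)]

# Whiskers extend to the most extreme values that are not outliers
whisker_low, whisker_high = inliers.min(), inliers.max()

print(q1, median, q3)             # 4.25 6.5 8.75
print(whisker_low, whisker_high)  # 2 10
print(outliers)                   # [30]
```

Here the value 30 lies above the upper fence of 15.5, so it would be drawn as a separate outlier point and the upper whisker would stop at 10.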

import seaborn as sns
sns.set(style="whitegrid")
tips = sns.load_dataset("tips")
ax = sns.boxplot(x=tips["total_bill"])

Statistical distribution of the numeric column total_bill on a single box plot.

And if we want the statistical distribution of total_bill for each day, we can draw one box per day, each in a different colour.

ax = sns.boxplot(x="day", y="total_bill", data=tips)

Violin Plots: A Violin Plot is used to visualize the distribution of the data and its probability density. This chart is a combination of a Box Plot and a Density Plot that is rotated and placed on each side, to show the distribution shape of the data.

A violin plot carries much of the same summary as a box plot (the median and the interquartile range) and adds the estimated density of the data on each side.
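To compare the two chart types directly, the same data can be drawn as a box plot and a violin plot side by side. This sketch uses Matplotlib's own boxplot and violinplot on synthetic data generated for illustration, not the tips dataset.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data, generated for illustration only
rng = np.random.default_rng(0)
data = rng.normal(loc=20, scale=5, size=200)

# Shared y-axis so the two summaries line up
fig, (ax_box, ax_violin) = plt.subplots(1, 2, sharey=True)
box = ax_box.boxplot(data)
ax_box.set_title("Box plot")
# showmedians=True marks the median, which the box plot also shows
violin = ax_violin.violinplot(data, showmedians=True)
ax_violin.set_title("Violin plot")
```

Both panels mark the same median; the violin additionally shows where the values are concentrated. In a script, calling plt.show() afterwards displays the figure.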

Eg: Taking the same tips dataset, let’s see how total_bill is distributed on a violin plot.

import seaborn as sns
sns.set(style="whitegrid")
tips = sns.load_dataset("tips")
ax = sns.violinplot(x=tips["total_bill"])

Statistical distribution of the numeric column total_bill on a single violin plot.

And if we want the statistical distribution of total_bill for each day, we can draw one violin per day, each in a different colour and showing its own probability density.

ax = sns.violinplot(x="day", y="total_bill", data=tips)

These are some quick and easy data visualizations using the Matplotlib and Seaborn Python libraries. Thank you for reading. I hope you enjoyed this post and learned something new and useful. If you liked it, please hold the clap button and share it with your friends.
