A Short Introduction to Data Visualization

Alexander Osadchenko
softplus-publication
Nov 28, 2021

Humans learn and understand more by visualizing. According to the University of California, 80% of human learning happens through the visual representation of information. The most obvious example of data visualization is a geographical map. It is easier to find your destination with a map than with a written description of where you need to go.

This simple example with geographical maps is already enough to define data visualization.

Data visualization is a technique for translating information into a visual context so that it is easier for us humans to understand.

Why do we need data visualization? Who uses data visualization?

The main field where data visualization is used is statistical analysis. We can show how salaries have changed over the years, compare salaries in different countries, or analyze the median salary worldwide. The idea is that we can make decisions based on this data. If you see on the graph that your salary is lower than the median, you should probably look for a better-paid position. For businesses, it is especially important to understand whether the market is growing, whether the app is profitable, and what position the business occupies in the market.

Plots Description

Today there are many types of plots that help us represent data visually. I will list and explain the ones that are most popular and most useful for statistical analysis and machine learning.

Line Graph

The line graph is probably one of the most popular graphs. In machine learning, it is the usual way to plot a time series. We see it on television when people discuss stock price changes or when we see statistics on global warming over the past 100 years.

On the x-axis (Ox), we usually see a time span, and on the y-axis (Oy), we see some quantitative variable.

Average temperature in Bucharest from January 2019 to May 2020

Example of code:

df.plot.line(x='Date', y='AvgTemperature', figsize = (20, 10))
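The snippet above does not show what df looks like, so here is a minimal, self-contained sketch. The column names come from the article; the six monthly values are invented only to make the example runnable and are not the real Bucharest data.

```python
import pandas as pd

# Hypothetical monthly averages, invented only to make the example runnable.
df = pd.DataFrame({
    "Date": pd.date_range("2019-01-01", periods=6, freq="MS"),
    "AvgTemperature": [-1.5, 2.0, 8.5, 12.0, 17.5, 22.0],
})

# Time goes on the x-axis, the quantitative variable on the y-axis.
ax = df.plot.line(x="Date", y="AvgTemperature", figsize=(20, 10))
ax.set_xlabel("Date")
ax.set_ylabel("Average temperature")
```

In a script, call matplotlib's plt.show() afterwards to open the figure window.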

Area Chart

An area chart is a variant of a line graph with a semi-transparent colored area under the line. Area charts are usually used to show how multiple values develop over time, for example, the price of energy produced from different types of resources.

Area chart of positive temperatures between January 2019 and May 2020

Example of code:

df = df[df['AvgTemperature'] >= 0]
df.plot.area(x='Date', y='AvgTemperature', figsize=(20, 10))
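Since area charts are most useful with several series, here is a rough sketch of the energy price case mentioned earlier. The prices DataFrame, its column names, and all the numbers are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical energy prices per source; the numbers are invented for illustration.
prices = pd.DataFrame(
    {
        "Coal": [40, 42, 45, 43],
        "Gas": [55, 60, 58, 62],
        "Solar": [30, 28, 27, 25],
    },
    index=pd.Index([2018, 2019, 2020, 2021], name="Year"),
)

# Each column becomes one semi-transparent band. stacked=False overlays the
# bands, so every series is read against the same zero baseline.
ax = prices.plot.area(stacked=False, alpha=0.4, figsize=(20, 10))
ax.set_ylabel("Price")
```

With the default stacked=True, pandas piles the bands on top of each other instead, which shows the total but makes the individual series harder to compare.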

Bar Chart

A bar chart represents categorical data. It needs at least two variables: one that defines the categories and one that holds the value to compare. We usually use it to show how a value differs between categories, for example, to compare the median salary in different countries.

On the x-axis (Ox), we see the variable that defines the categories, and on the y-axis (Oy), the single variable that serves as the criterion for comparing them.

Comparison of maximum temperature between Moscow, Russia and Bucharest, Romania

Example of code:

max_temp_df.plot.bar(x='Country', y='AvgTemperature', rot=0)
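The article does not show how max_temp_df was built. One plausible way, assuming the raw data has one row per reading with Country and AvgTemperature columns, is a groupby that keeps the maximum per country. The readings below are invented for illustration.

```python
import pandas as pd

# Hypothetical readings; column names follow the article, values are invented.
df = pd.DataFrame({
    "Country": ["Russia", "Russia", "Romania", "Romania"],
    "AvgTemperature": [21.0, 27.5, 25.0, 31.0],
})

# One row per category: the highest temperature recorded for each country.
max_temp_df = df.groupby("Country", as_index=False)["AvgTemperature"].max()

# One bar per country, with the labels kept horizontal (rot=0).
ax = max_temp_df.plot.bar(x="Country", y="AvgTemperature", rot=0)
```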

Histogram

A histogram looks very similar to a bar chart, but it is used differently. First of all, a histogram works with numerical data rather than the categorical data used in a bar chart. Its main goal is to show the frequency of numerical data graphically: the values are grouped into ranges (bins), and each bar shows how many values fall into its bin. Histograms encode data by both length and width, that is, by area, so the area of each bar should be proportional to the frequency of its bin. Data analysts frequently use this plot to check whether the data is normally distributed, that is, whether it follows a Gaussian distribution. For example, we can use a histogram to analyze a year of property sales by price range, starting with 0 to 20k euros and ending with 100k euros and more.

Distribution of heights measured in inches

Example of code:

import matplotlib.pyplot as plt
# height holds the numeric values to bin, e.g. df['Height']
plt.hist(height, bins=10)
plt.xlabel('Bins')
plt.ylabel('Frequency')
plt.title('Histogram for height')
plt.show()
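To see what the histogram actually counts, the same binning can be done without drawing anything. This is a small sketch using NumPy's histogram function; the heights are synthetic values drawn from a normal distribution, not the article's dataset.

```python
import numpy as np

# Synthetic heights in inches, drawn from a normal distribution for illustration.
rng = np.random.default_rng(0)
height = rng.normal(loc=66, scale=4, size=1000)

# counts[i] is the number of values between edges[i] and edges[i + 1].
# Ten bins are delimited by eleven edges.
counts, edges = np.histogram(height, bins=10)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.1f} to {right:6.1f}: {count}")
```

For roughly normal data, the counts rise toward the middle bins and fall off on both sides, which is the bell shape we look for in the plot.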

Scatter Plot

This type of graph is rarely used in public reports, and we do not usually see it in the mass media. A scatter plot is more specific to statistics and is used to find correlations between two variables. For example, we can look for a correlation between a person’s height and weight. In general, we will see a positive correlation, but this is not always the case. If we try to find a correlation between the color of a cat’s eyes and the type of food it likes, we will probably find none at all.

A strong correlation between Height (inch) and Weight (pounds).

Example of code:

weight = df['Weight']
height = df['Height']
plt.scatter(height, weight)
plt.xlabel('Height')
plt.ylabel('Weight')
plt.title("Height vs Weight")
plt.show()
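A scatter plot shows a correlation visually, and it can help to put a number on it as well. The sketch below computes Pearson's correlation coefficient with pandas; the six measurements are invented for illustration and only the column names come from the article.

```python
import pandas as pd

# Hypothetical measurements, invented for illustration.
df = pd.DataFrame({
    "Height": [60, 62, 65, 68, 70, 73],
    "Weight": [110, 120, 140, 155, 170, 190],
})

# Pearson's r ranges from -1 to 1. Values near 1 mean a strong positive
# linear relationship; values near 0 mean no linear relationship.
r = df["Height"].corr(df["Weight"])
print(f"Pearson r = {r:.3f}")
```

Keep in mind that Pearson's r only measures linear relationships, so the scatter plot is still worth looking at.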

Pie Chart

The pie chart is one of the most common charts. It is frequently used to show how a whole is divided among categories, for example, the distribution of petrol consumption between countries. The arc length of each slice is proportional to the quantity it represents.

Distribution of the males and females in the dataset

Example of code:

labels = df.Gender.unique()
sizes = [(df.Gender == labels[0]).sum(), (df.Gender == labels[1]).sum()]
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90)
plt.title("Checking if data is balanced or not with a pie chart")
plt.show()
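The code above counts exactly two categories by hand. As an alternative sketch, pandas' value_counts produces one count per category however many there are. The Gender values below are invented for illustration.

```python
import pandas as pd

# Hypothetical Gender column, invented for illustration.
df = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male", "Female"]})

# value_counts returns one count per category, however many categories exist.
counts = df["Gender"].value_counts()

# The index of counts supplies the slice labels.
ax = counts.plot.pie(autopct="%1.1f%%", startangle=90)
ax.set_ylabel("")  # hide the default axis label next to the pie
```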

Boxplot

The boxplot is one of the most unfamiliar plots for the casual user and one of the most important for machine learning engineers. It displays the distribution of the data based on five values: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. In the center, we see the median, the middle value of the dataset: the number of values to its left equals the number of values to its right. The 50% of the data closest to the median lies between Q1 and Q3. The "minimum" and "maximum" here are not the raw extremes of the data but the whisker limits, which by the usual convention are Q1 - 1.5 × IQR and Q3 + 1.5 × IQR, where IQR = Q3 - Q1. Everything below the minimum or above the maximum is considered an outlier. For normally distributed data, this is only about 0.7% of the values: those that are too high or too low compared with the rest. For example, suppose we want to compare real estate prices in the Republic of Moldova. Even without a plot, we can guess which values will be the outliers: probably the prices higher than 100 000 euros.

Boxplot for weight (pounds)

Example of code:

weight = df['Weight']
plt.boxplot(weight)
plt.ylabel('Weight')
plt.title("Boxplot for weight")
plt.show()
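The values a boxplot is built from can also be computed directly, which makes the outlier rule easier to follow. This is a minimal sketch assuming the common 1.5 × IQR convention for the limits; the weights are hypothetical, with one deliberately extreme value.

```python
import numpy as np

# Hypothetical weights in pounds, with one deliberately extreme value.
weight = np.array([120, 135, 140, 150, 155, 160, 165, 170, 180, 260])

# Quartiles and the interquartile range (the height of the box).
q1, median, q3 = np.percentile(weight, [25, 50, 75])
iqr = q3 - q1

# Limits beyond which a value counts as an outlier under the 1.5 * IQR rule.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = weight[(weight < lower) | (weight > upper)]
print("Q1:", q1, "median:", median, "Q3:", q3)
print("limits:", lower, upper)
print("outliers:", outliers)
```

Note that matplotlib draws each whisker to the most extreme data point that is still inside these limits, so the whisker ends on the plot can sit slightly inside the computed values.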

Conclusion

It is crucial for analysts and machine learning engineers to understand plots. Plots give insight into what step to take next, such as cleaning the outliers out of the data or checking whether variables are correlated. They also help you be more pragmatic and avoid being misled by media that frequently try to manipulate your opinion with made-up charts.
