Essential Libraries To Have In Your Toolbox For Data Science And ML — Series #3 — Visualization Libraries

Kaan Ceylan
9 min read · Apr 15, 2022

Welcome to the 3rd post of the series! This time we’re diving into visualization. We’re going to learn how different visualization techniques help us make business decisions, while taking a look at some of the most popular data visualization libraries you can use with Python. If you’ve missed the previous posts, you can check them out here and here. They are not essential for making sense of this one, but I highly recommend them, as visualization is part of EDA and the posts all tie together. Let’s get started!

The Importance Of Visualization and Some Use Cases

Making sense of a bunch of numbers that go on for thousands of rows is not an easy task for anyone, so we use various visualization techniques to gain better insight into the data that we’re working with. There are many benefits to being proficient with data visualization and producing high-quality graphs. It lets you convey your ideas more easily. This is especially important if the people you’re working with or presenting to are not familiar with statistics and are not used to reading raw statistical values. By plotting your ideas, you will usually have an easier time when presenting to your customers or managers. Being a good storyteller is very important when conveying abstract ideas or findings, and what better way to tell your story than with some visually pleasing charts?

By visualizing information, we turn it into a landscape that you can explore with your eyes, a sort of information map. And when you’re lost in information, an information map is kind of useful. ―David McCandless

Charts go everywhere data goes: you can find them in economics, sports, weather forecasting, business analytics, you name it. A better visualization usually leads to better decisions, and better decisions lead to improved products, increased customer retention rates, lower churn rates and everything in between. So without further ado, let’s get into the details.

Popular Chart Types

Let’s first talk a little bit about some of the most popular and useful chart types, then we can get to implementing them and get our hands dirty.

Bar Chart

The bar chart, or bar graph, is probably the most widely used chart type there is. You’ve probably seen it a million times by now. It is usually used to visualize the distribution of categorical variables.

For a continuous variable, you can also use “kernel density estimation” (KDE) to visualize the distribution as a smooth curve, which often results in a less cluttered chart that is easier to interpret.

Bar charts (I’ll refer to them as plots from now on) are plotted along two axes and are usually used to examine the categorical variables of the data. One axis (usually the X axis) shows the different categories, while the other (usually the Y axis) shows the count. They’re relatively easy to read: the higher the bar, the more frequently that category is observed in the data.
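To make the description above concrete, here is a minimal sketch of a count-based bar plot in Matplotlib. The list of days is made up for illustration (it is not the real tips data), and counting with collections.Counter is just one convenient way to get the frequencies.

```python
from collections import Counter

import matplotlib.pyplot as plt

# Made-up categorical data: the day on which each bill was recorded
days = ["Sat", "Sun", "Thur", "Sat", "Fri", "Sun", "Sat", "Thur", "Sat", "Sun"]

counts = Counter(days)                # category -> number of occurrences
categories = list(counts.keys())      # goes on the X axis
frequencies = list(counts.values())   # goes on the Y axis

fig, ax = plt.subplots()
bars = ax.bar(categories, frequencies)
ax.set_xlabel("Day")
ax.set_ylabel("Count")
ax.set_title("Number of bills per day")
# plt.show()  # uncomment when running interactively
```

The tallest bar belongs to the most frequent category, which is exactly how you would read the chart by eye.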

If the data that you’re looking into is continuous rather than categorical, the values are grouped into bins (this variant is usually called a histogram), and you can choose the number or width of the bins to control the interval that each one covers.

If you set your bin widths correctly (there are some rules of thumb for choosing bin sizes) and set your variables and axes correctly, bar plots are a pretty good way to see the distribution of your data and get an overview of any skewness or outliers.
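As a rough sketch of how bin choice works in practice, the snippet below plots the same made-up, right-skewed variable twice: once with an explicit bin count and once with the Freedman-Diaconis rule of thumb, which Matplotlib hands off to NumPy when you pass bins="fd". The data and variable names are my own assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)
# Made-up right-skewed continuous variable (think: bill totals)
values = rng.lognormal(mean=3.0, sigma=0.4, size=500)

fig, (ax_fixed, ax_rule) = plt.subplots(1, 2, figsize=(10, 4))

# Explicit choice: 20 equal-width bins
counts_fixed, edges_fixed, _ = ax_fixed.hist(values, bins=20)
ax_fixed.set_title("20 bins")

# Rule of thumb: Freedman-Diaconis picks the bin width from the IQR
counts_rule, edges_rule, _ = ax_rule.hist(values, bins="fd")
ax_rule.set_title("Freedman-Diaconis bins")
# plt.show()  # uncomment when running interactively
```

Too few bins hide the skew and too many make the plot noisy, so it is worth trying a couple of settings before settling on one.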

Line Plots

Line plots may be less common than bar plots, but they are one of the simplest types of plots to get right and to read. They are mostly used with time-series data (data recorded over a period of time). In that case the X axis represents time and the Y axis represents the data values. The points in time are usually separated by a fixed interval.

Line plots are especially useful if the data that you’re plotting changes exponentially, which can be hard to visualize properly on a linear axis. Most plotting libraries let you switch an axis to a logarithmic scale through a function or parameter, which gives you a better-looking graph that is easier to interpret.
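Here is a small sketch of that idea in Matplotlib, where the switch is the set_yscale("log") method on the axes. The doubling series is invented for the example.

```python
import numpy as np
import matplotlib.pyplot as plt

days = np.arange(0, 30)
# Made-up quantity that doubles every 3 days
values = 10 * 2 ** (days / 3)

fig, (ax_lin, ax_log) = plt.subplots(1, 2, figsize=(10, 4))

ax_lin.plot(days, values)
ax_lin.set_title("Linear Y axis")

ax_log.plot(days, values)
ax_log.set_yscale("log")   # exponential growth becomes a straight line
ax_log.set_title("Logarithmic Y axis")
# plt.show()  # uncomment when running interactively
```

On the linear axis the early values are squashed flat against the bottom, while on the logarithmic axis the constant growth rate shows up as a straight line.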

You can also use line plots to plot two or more variables, using different colors or line styles (dashed, dotted, etc.), to discover whether there is a relationship between them.

If you are dealing with multiple variables or categories, it is a really good idea to include a legend in your plot, which will let you make sense of the data more easily.

Above, a line plot is used to visualize the change in stock prices, with dates on the X axis and closing prices on the Y axis. The legend on the right side of the plot shows which company each line represents.
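If you want to build a similar multi-line chart yourself, a minimal Matplotlib sketch could look like this. The two companies and their prices are entirely made up, and placing the legend to the right of the axes is just one layout choice.

```python
import matplotlib.pyplot as plt

months = ["2022-01", "2022-02", "2022-03", "2022-04", "2022-05"]
# Made-up closing prices for two hypothetical companies
prices = {
    "Company A": [100, 104, 101, 110, 115],
    "Company B": [80, 78, 85, 83, 90],
}
# A different line style per series, on top of the automatic colors
styles = {"Company A": "-", "Company B": "--"}

fig, ax = plt.subplots()
for name, series in prices.items():
    ax.plot(months, series, linestyle=styles[name], label=name)

ax.set_xlabel("Month")
ax.set_ylabel("Closing price")
# Put the legend just outside the right edge of the plot
legend = ax.legend(loc="center left", bbox_to_anchor=(1.0, 0.5))
# plt.show()  # uncomment when running interactively
```

The label passed to each plot call is what ends up in the legend, so naming your series as you draw them saves work later.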

Scatter Plots

Scatter plots are almost always used to discover the relationship between different variables, whether they have any correlation, positive or negative. They are similar to line plots but the points are not connected with straight lines, just plotted on the plane. The two axes represent two different variables.

If most of the points follow a shape that looks like a forward slash /, we can say that the plotted variables have a positive correlation. If they form a backslash \, the correlation is negative. The more tightly the points cluster around that line, the stronger the correlation. Here’s how they actually look on the scatterplot.

You can even add a third dimension to your scatter plot by changing the sizes of the dots on the plot depending on the frequency of that value, thus also describing the distribution of the data along with the relationship between the variables. Alternatively you can incorporate a categorical variable into your plot by changing the color of the dots based on the category they belong to. Don’t forget to adjust the opacity value so that you can better see the overlapping points if there are any.
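Here is a sketch of those options with Matplotlib’s scatter function. All the data is randomly generated, and I’m mapping the dot size to a third numeric variable (party size) rather than to a value’s frequency, which is a slight variation on the idea above.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=1)
n = 200
# Made-up data, loosely shaped like restaurant bills and tips
total_bill = rng.uniform(5, 50, size=n)
tip = 0.15 * total_bill + rng.normal(0, 1.0, size=n)
party_size = rng.integers(1, 7, size=n)   # 1 to 6 people
is_dinner = rng.random(n) < 0.6           # categorical variable

fig, ax = plt.subplots()
for mask, label, color in [
    (is_dinner, "Dinner", "tab:blue"),
    (~is_dinner, "Lunch", "tab:orange"),
]:
    ax.scatter(
        total_bill[mask],
        tip[mask],
        s=20 * party_size[mask],  # dot size encodes a third variable
        color=color,              # color encodes the category
        alpha=0.5,                # opacity keeps overlapping dots visible
        label=label,
    )
ax.set_xlabel("Total bill")
ax.set_ylabel("Tip")
ax.legend()

# Pearson correlation between the two plotted variables
corr = np.corrcoef(total_bill, tip)[0, 1]
# plt.show()  # uncomment when running interactively
```

Because the tip was generated as a fraction of the bill plus some noise, the points form a clear forward-slash shape and the correlation comes out strongly positive.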

There is a caveat though. You should be careful not to overplot. This usually occurs when there are too many data points with values that are really close to each other. You can’t miss it, as it looks really ugly and prevents you from seeing the relationship between your variables. It looks something like this.

If you find yourself in this situation, you can plot a subset of the values instead, taking care that the sample is drawn evenly (for example, at random) from the whole dataset. A representative sample shows the same relationship between the two variables and gives you a more readable plot.
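A quick sketch of that subsampling step, assuming your two variables live in NumPy arrays: draw random row indices without replacement and check that the correlation in the sample stays close to the one in the full data. The data and the sample size of 2,000 are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
n = 100_000
# Made-up, heavily overlapping data with a positive relationship
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(scale=0.6, size=n)

# Uniform random sample of row indices, without replacement
sample_idx = rng.choice(n, size=2_000, replace=False)
x_small, y_small = x[sample_idx], y[sample_idx]

corr_full = np.corrcoef(x, y)[0, 1]
corr_sample = np.corrcoef(x_small, y_small)[0, 1]
print(f"full: {corr_full:.3f}, sample: {corr_sample:.3f}")
```

You would then pass x_small and y_small to your scatter plot instead of the full arrays.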

Boxplots / Whisker Plots

Box plots are without a doubt one of the best plotting solutions if you’re looking to pick out the outliers in your data. You may need a little bit of practice to get used to interpreting box plots but once you get used to it, they offer a lot of insight.

It calculates the median and quartiles of the variable and plots them in a way that lets you see the distribution and outliers of your data in one plot. Here’s an example.

What the box plot shows you is basically all the data points divided into 4 equal parts. You can easily see the median and the 1st / 3rd quartile values forming the edges of the box plot.

The IQR (Interquartile Range) is an important metric when plotting the box plot and determining which data points are outliers.

  • IQR = Q3 - Q1, where Q3 (the upper edge of the box) is the median of the upper half of the data and Q1 (the lower edge of the box) is the median of the lower half.

The boundaries for determining outliers are Q1 - (1.5 x IQR) and Q3 + (1.5 x IQR). Anything that falls below the first value or above the second is considered an extreme case, or outlier.
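The outlier rule can be sketched in a few lines of NumPy. This uses the common 1.5 x IQR convention, with the lower fence at Q1 - 1.5 x IQR and the upper fence at Q3 + 1.5 x IQR. The sample values are made up, and np.percentile interpolates linearly by default, so its quartiles can differ slightly from the “median of each half” definition on small samples.

```python
import numpy as np

# Made-up sample with one suspiciously large value
values = np.array([10, 12, 12, 13, 14, 15, 15, 16, 18, 45])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences is flagged as an outlier
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(q1, q3, iqr, lower_fence, upper_fence, outliers)
```

Here only the value 45 lies outside the fences, which is the point a box plot would draw as a separate dot beyond the whisker.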

The height of the box also shows how densely the values of your variable are distributed. The shorter the box, the closer together the middle 50% of the data points are.

Violin Plots

Though it is one of the less popular plot types, in my opinion the violin plot is really useful, as it lets you see the distribution, how dense the data points are across different ranges, and the extreme values, all in the same plot. Be aware though: if you plot the violin with too small a sample from your data, it might lead you in the wrong direction, as it will look smoother than the data really is.
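To see what a violin plot adds over a box plot, here is a small side-by-side sketch in Matplotlib using its boxplot and violinplot functions. Both groups are randomly generated, and the second one is deliberately bimodal so that the difference stands out.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=3)
# Two made-up groups: one with a single peak, one with two peaks
group_a = rng.normal(20, 4, size=300)
group_b = np.concatenate(
    [rng.normal(12, 2, size=150), rng.normal(28, 2, size=150)]
)

fig, (ax_box, ax_violin) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

box = ax_box.boxplot([group_a, group_b])
ax_box.set_title("Box plot")

violin = ax_violin.violinplot([group_a, group_b], showmedians=True)
ax_violin.set_title("Violin plot")

# Both functions place the groups at X positions 1 and 2
for ax in (ax_box, ax_violin):
    ax.set_xticks([1, 2])
    ax.set_xticklabels(["Group A", "Group B"])
# plt.show()  # uncomment when running interactively
```

The box plot summarizes group B as one wide box, while the violin shows its two separate bumps, which is the kind of density detail a box alone hides.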

Visualization Libraries For Python And Some Examples

Lastly I want to go through the visualization libraries for Python and show you a couple of examples. I am linking a notebook where I have used the tips dataset from Kaggle as an example to showcase some of the plots from different libraries. You can take a look at both the code and the interactive plots here. (You will need to open the notebook in Colab for the interactive Plotly plots.)

The most commonly used visualization library for Python is Matplotlib. It is really comprehensive and lets you do pretty much all you want in terms of different graph types, but it has a steeper learning curve compared to other visualization libraries. Plotting complex graphs in Matplotlib usually takes more time and more lines of code than it does with other libraries, and the default results are usually not that pretty, but it does give you more control over your plots. If you’re looking to use your graphs in a presentation, I’d suggest you don’t work with Matplotlib.

Probably the most popular high-level library for visualization in Python is Seaborn. It is built on top of Matplotlib and it’s easier to use because it works at a higher level. You can get more beautiful graphs in less time and with less code. The downside is that you are limited in your choice of graph types compared to Matplotlib, but most of the time that’s not a problem, as the most commonly used graph types are available in Seaborn.
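To give a feel for how little code Seaborn needs, here is a minimal sketch that draws a box plot straight from a pandas DataFrame. The table is a tiny hand-made stand-in shaped like the tips dataset (the numbers are made up), so nothing has to be downloaded.

```python
import pandas as pd
import seaborn as sns

# Small hand-made table in the shape of the tips dataset (values are made up)
tips = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Sat", "Sat", "Sun", "Sun"],
    "total_bill": [14.5, 22.1, 17.3, 12.9, 25.6, 31.2, 20.4, 28.7],
})

sns.set_theme()  # apply Seaborn's default styling

# One call: Seaborn groups by "day" and labels the axes from the column names
ax = sns.boxplot(data=tips, x="day", y="total_bill")
ax.set_title("Total bill by day")
```

Since the function returns a regular Matplotlib Axes object, you can still fine-tune the result with Matplotlib methods afterwards.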

Another popular visualization library is Plotly, which is probably the one that lets you create the best-looking plots, and they’re interactive too! This is especially great for presentations, letting you zoom in on specific parts of your graphs to take a closer look.

Lastly you can use Altair and Bokeh. I don’t have any experience with them so I’m not in a position to make any comments.

To Sum Up

Thanks for making it all the way to the end! A blog post on visualization should be colorful and entertaining and I tried my best to do that while still providing useful information. As you can see, there are a lot of options in Python that can get your visualization tasks done in different ways and they all have their advantages. If you don’t mind tweaking each little detail to get more control over your plots, you can go with Matplotlib. If you want to get things done fast and want your plots to look pretty, you can go with Seaborn and Plotly. The choice is yours!

If you’ve found this post helpful or spotted a mistake, I’d really appreciate any feedback. You can reach out to me on Twitter or leave a comment on Medium. Thank you for your time and I’ll see you on the next post!

Kaan Ceylan

Aspiring ML Engineer. Enthusiast of data and everything about it. Doing my best to document my self-learning journey while hoping to help others.