Data Visualization For Data Science Beginners

Mehul Gupta
Data Science in your pocket
6 min read · Oct 18, 2019

For me, Data Science has always meant four things: Data Analytics, Statistics, Machine Learning and Data Visualization (though the last is really a part of analytics).

Yet visualization is almost always ignored by beginners. So this time, a quick walkthrough on how to master data visualization and become the ultimate data magician.

Before beginning, let's understand why we need visualization.

It can provide you some great help in:

  • Interpreting data better and making it memorable
  • Noticing correlations
  • Spotting outliers
  • Feature Engineering
  • Cause-Effect relations

And much more!!

So I guess now we have at least a reason why we need Data Visualization in our inventory.

So, where should we start?

If you pick up any online course in Python, you will probably meet matplotlib first and then gradually upgrade to seaborn.

Some of you might see Bokeh as well.

But to be honest, Tableau is just unbeatable!!! Trust me on this. I will be starting with Tableau in my next post!!

What are the things we should keep in mind while visualizing stuff?

I thought of writing some points, then found this.

JUST BANG ON!!

For now, let's explore some plots and their significance.

1. Bar Plots

It is amongst the most popular plots we often encounter. It is used to compare numerical data over some categories/groups.

Example: if we need to compare the marks scored (or the number of students who passed) across different subjects, a bar plot is the natural choice. In the image below, the y-axis can be taken as Marks and the x-axis as Subjects (A, B, C, D, E).
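A bar plot like this can be sketched with matplotlib's `bar`. The marks below are made-up values and `plot_bar` is a hypothetical helper name, so treat it as an illustration under those assumptions.

```python
# Hypothetical data: marks scored in each subject (made up for illustration)
marks = {"A": 72, "B": 85, "C": 64, "D": 90, "E": 58}


def plot_bar(data):
    """Draw one bar per category, with bar height equal to the value."""
    # Imported here so the data above is usable even without matplotlib
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.bar(list(data.keys()), list(data.values()))
    ax.set_xlabel("Subject")
    ax.set_ylabel("Marks")
    ax.set_title("Marks by subject")
    return fig
```

Calling `plot_bar(marks).savefig("bar.png")` should write the chart to a file.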

2. Grouped Bar Plots

It can be the case that we need to compare Marks and Attendance (both numerical) together across Subjects. Grouped bar plots fit here: multiple numerical columns are compared side by side within each group.
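One way to get grouped bars in matplotlib is to shift each series by half a bar width around the group's position. The values and the helper names `bar_positions` and `plot_grouped` below are assumptions made for this sketch.

```python
subjects = ["A", "B", "C"]
marks = [72, 85, 64]        # hypothetical values
attendance = [90, 75, 80]   # hypothetical values (percent)


def bar_positions(n_groups, width):
    """Return x positions for the left and right bar of each group."""
    left = [i - width / 2 for i in range(n_groups)]
    right = [i + width / 2 for i in range(n_groups)]
    return left, right


def plot_grouped(subjects, marks, attendance, width=0.4):
    import matplotlib.pyplot as plt  # lazy import, requires matplotlib

    left, right = bar_positions(len(subjects), width)
    fig, ax = plt.subplots()
    ax.bar(left, marks, width=width, label="Marks")
    ax.bar(right, attendance, width=width, label="Attendance")
    # Put one tick label in the middle of each pair of bars
    ax.set_xticks(list(range(len(subjects))))
    ax.set_xticklabels(subjects)
    ax.legend()
    return fig
```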

3. Stacked Bar Plots

If you want to compare Marks across both sections (say A, B, C) and subjects, you have stacked bar plots. Here, we plot numerical data against groups and subgroups. For us, groups are sections and subgroups are subjects, or vice versa.
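Stacking works by starting each segment where the previous one ended, which matplotlib exposes through the `bottom` argument of `bar`. A minimal sketch with made-up marks (the subject names and helper names are hypothetical):

```python
subjects = ["Maths", "Physics", "Chemistry"]
# Hypothetical marks per section, one value per subject
marks_by_section = {"A": [70, 65, 80], "B": [60, 75, 70], "C": [85, 55, 65]}


def stack_bottoms(series):
    """For each series, return the running total lying beneath it."""
    bottoms = []
    running = [0] * len(series[0])
    for values in series:
        bottoms.append(list(running))
        running = [r + v for r, v in zip(running, values)]
    return bottoms


def plot_stacked(subjects, marks_by_section):
    import matplotlib.pyplot as plt  # lazy import, requires matplotlib

    bottoms = stack_bottoms(list(marks_by_section.values()))
    fig, ax = plt.subplots()
    for (section, values), bottom in zip(marks_by_section.items(), bottoms):
        ax.bar(subjects, values, bottom=bottom, label=f"Section {section}")
    ax.set_ylabel("Marks")
    ax.legend()
    return fig
```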

4. Histogram

Histograms can be considered as plotting a binned version of a variable.

What is binning?

Binning is a way to group a number of more or less continuous values into a smaller number of “bins” (ranges). For example, if you have data about a group of people, you might want to arrange their ages into intervals, as done above.

It should be used when we need to find the frequency in particular intervals/ranges of a column.
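To make binning concrete, here is a small sketch that counts ages per 10-year interval by hand and then hands the same bin edges to matplotlib's `hist`. The ages are invented, and `bin_counts` and `plot_histogram` are hypothetical helper names.

```python
ages = [22, 25, 31, 37, 41, 44, 45, 52, 58, 63]  # hypothetical ages


def bin_counts(values, start, width, n_bins):
    """Count values falling in [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for value in values:
        index = int((value - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


def plot_histogram(values, edges):
    import matplotlib.pyplot as plt  # lazy import, requires matplotlib

    fig, ax = plt.subplots()
    ax.hist(values, bins=edges, edgecolor="black")
    ax.set_xlabel("Age")
    ax.set_ylabel("Frequency")
    return fig
```

With bins starting at 20 and 10 years wide, `bin_counts(ages, 20, 10, 5)` gives the frequency for 20-29, 30-39 and so on; `plot_histogram(ages, [20, 30, 40, 50, 60, 70])` draws the same bins.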

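But it looks just like a bar plot!!

Yes, though they look similar, a major difference exists: a histogram shows how often the values of a single numerical column fall into each interval, while a bar plot compares a value across separate categories.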

5. Pie Chart

Like a bar plot, it compares numerical data across categories, but with a difference: it shows each category as a fraction of the whole (percentages rather than raw numbers). In our example, it can be used when we need the percentage of students who passed in each subject and not only the counts.
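The "fraction of the whole" idea is just each count divided by the total. The sketch below computes those percentages and passes the raw counts to matplotlib's `pie`, which does the same normalisation; the pass counts and helper names are made up for illustration.

```python
# Hypothetical number of students who passed each subject
passed = {"A": 40, "B": 30, "C": 20, "D": 10}


def percentages(counts):
    """Convert raw counts into percentages of the total."""
    total = sum(counts.values())
    return {name: 100 * count / total for name, count in counts.items()}


def plot_pie(counts):
    import matplotlib.pyplot as plt  # lazy import, requires matplotlib

    fig, ax = plt.subplots()
    # autopct prints each slice's share as a percentage label
    ax.pie(list(counts.values()), labels=list(counts.keys()), autopct="%1.1f%%")
    ax.set_title("Share of passed students by subject")
    return fig
```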

6. Box & Whiskers Plot

Box plots provide a lot of information about any numerical data column. Its main purpose is to give an idea/summary of the distribution of the data.

The Lower Quartile refers to the 25th percentile, while the Upper Quartile refers to the 75th percentile. When we need a summary of the Histogram, we will be using the Box & Whiskers plot.
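The numbers a box plot summarises can be computed with Python's `statistics.quantiles` before drawing anything. This is a sketch on invented marks; `five_number_summary` and `plot_box` are hypothetical names, and the "inclusive" method is chosen here as an assumption about how the quartiles should be interpolated.

```python
import statistics

marks = [40, 45, 50, 55, 60, 65, 70, 75, 80]  # hypothetical marks


def five_number_summary(values):
    """Return (min, lower quartile, median, upper quartile, max)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


def plot_box(values):
    import matplotlib.pyplot as plt  # lazy import, requires matplotlib

    fig, ax = plt.subplots()
    ax.boxplot(values)
    ax.set_ylabel("Marks")
    return fig
```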

7. Scatter Plots

Don’t these plots look similar to something from high school?

It is basically an X, Y coordinate plot between two numerical data columns, which can be helpful for spotting the trend and tracking down the regression line.
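As a rough illustration of "tracking down the regression line", the sketch below fits a straight line by ordinary least squares and draws it over the scattered points. The Temp and Sales figures are invented, and `fit_line` and `plot_scatter` are hypothetical helpers.

```python
temps = [20, 22, 25, 27, 30]        # hypothetical temperatures
sales = [205, 215, 255, 265, 305]   # hypothetical sales


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def plot_scatter(xs, ys):
    import matplotlib.pyplot as plt  # lazy import, requires matplotlib

    slope, intercept = fit_line(xs, ys)
    fig, ax = plt.subplots()
    ax.scatter(xs, ys)
    # Draw the fitted line between the smallest and largest x
    line_x = [min(xs), max(xs)]
    ax.plot(line_x, [slope * x + intercept for x in line_x], color="red")
    ax.set_xlabel("Temp")
    ax.set_ylabel("Sales")
    return fig
```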

8. Bubble Chart

In the above example, we were plotting X and Y coordinates (Sales and Temp). What if a Z also comes up, like Sales, Temp & production units? Here comes the Bubble Chart. It is a variant of the Scatter plot where the 3rd dimension is represented using bubble size.
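In matplotlib a bubble chart is a regular `scatter` call with the third column passed as the marker size `s` (an area in points squared). The sketch below scales made-up production units into sizes; `bubble_sizes` and `plot_bubble` are hypothetical helper names.

```python
temps = [20, 22, 25, 27, 30]       # hypothetical X values
sales = [205, 215, 255, 265, 305]  # hypothetical Y values
units = [10, 20, 15, 40, 30]       # hypothetical third dimension


def bubble_sizes(values, max_area=1000):
    """Scale values so the largest one maps to max_area."""
    top = max(values)
    return [max_area * value / top for value in values]


def plot_bubble(xs, ys, zs):
    import matplotlib.pyplot as plt  # lazy import, requires matplotlib

    fig, ax = plt.subplots()
    # alpha keeps overlapping bubbles readable
    ax.scatter(xs, ys, s=bubble_sizes(zs), alpha=0.5)
    ax.set_xlabel("Temp")
    ax.set_ylabel("Sales")
    return fig
```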

9. Strip Plot

We have just observed scatter plots. They are just X and Y coordinate plots where both X and Y are continuous.

But what if we want to have one of them categorical?

Strip plots help us draw scatter plots between a numerical and a categorical data column. To avoid overlapping points, a small random jitter may be added to each point's position along the categorical axis.
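A hand-rolled strip plot makes the jitter idea visible: each category gets an integer position and every point is nudged by a small random offset along that categorical axis, leaving the plotted values untouched. The data and helper names below are assumptions for this sketch; seaborn's `stripplot` offers the same thing ready-made.

```python
import random

# Hypothetical marks for two sections
marks_by_section = {"A": [55, 60, 60, 72, 80], "B": [45, 58, 66, 66, 90]}


def jittered_positions(category_index, n_points, jitter=0.1, seed=0):
    """Return n_points x positions scattered around category_index."""
    rng = random.Random(seed)  # seeded so the plot is reproducible
    return [category_index + rng.uniform(-jitter, jitter) for _ in range(n_points)]


def plot_strip(groups, jitter=0.1):
    import matplotlib.pyplot as plt  # lazy import, requires matplotlib

    fig, ax = plt.subplots()
    for index, (name, values) in enumerate(groups.items()):
        xs = jittered_positions(index, len(values), jitter, seed=index)
        ax.scatter(xs, values, alpha=0.7)
    ax.set_xticks(list(range(len(groups))))
    ax.set_xticklabels(list(groups.keys()))
    ax.set_xlabel("Section")
    ax.set_ylabel("Marks")
    return fig
```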

10. Swarm Plot

Swarm plots are also among the variants of scatter plots. They are quite similar to strip plots, but with a difference: to avoid overlap, the points are placed next to each other along the categorical axis (horizontally, when the categories are on the x-axis). It is mostly used for plotting categorical vs numerical data.

11. Violin Plots

Amongst the most used plots of this family, it plots categorical vs numerical data just as strip & swarm plots do. But!!

It is basically a combination of a box and whiskers plot (see the center) and a Kernel Density Estimation plot of the variable.

What the heck is Kernel Density Estimation? In short, it is a way to estimate a smooth probability density curve from a set of data points.

To understand KDE, this is the best resource you can find on the internet.
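For a feel of what KDE does, here is a bare-bones Gaussian version: every data point contributes a small bell curve and the density is their average. The bandwidth, the data and the names `gaussian_kde` and `plot_violin` are assumptions for this sketch; matplotlib's `violinplot` performs its own density estimate internally.

```python
import math

# Hypothetical marks for two sections
marks_by_section = {"A": [55, 60, 60, 72, 80], "B": [45, 58, 66, 66, 90]}


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density at a point x."""
    norm = 1 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def plot_violin(groups):
    import matplotlib.pyplot as plt  # lazy import, requires matplotlib

    fig, ax = plt.subplots()
    ax.violinplot(list(groups.values()), showmedians=True)
    # violinplot places the violins at positions 1, 2, ...
    ax.set_xticks(list(range(1, len(groups) + 1)))
    ax.set_xticklabels(list(groups.keys()))
    ax.set_ylabel("Marks")
    return fig
```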

12. TreeMap

Treemaps are an alternative way of visualizing a hierarchical (tree) structure of groups & subgroups, while also displaying the quantity for each category via area size. Each category is assigned a rectangle, with its subcategory rectangles nested inside of it.

13. HeatMap

HeatMaps can be considered as a close variant of Bar Plot with a stark difference.

Instead of using bars, it uses colors to signify the quantity. It also provides a scale showing which color represents which quantity.

Do choose the color spectrum wisely: pick one that is relevant to the data, not any random set.
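A minimal heatmap sketch with matplotlib's `imshow`, using the sequential "viridis" colormap and a colorbar as the color scale. The grid of marks and the helper names are made up for illustration.

```python
sections = ["A", "B", "C"]
subjects = ["Maths", "Physics", "Chemistry"]
# Hypothetical marks: one row per section, one column per subject
marks = [[70, 65, 80], [60, 75, 70], [85, 55, 65]]


def value_range(matrix):
    """Smallest and largest value, i.e. the ends of the color scale."""
    flat = [value for row in matrix for value in row]
    return min(flat), max(flat)


def plot_heatmap(matrix, row_labels, col_labels):
    import matplotlib.pyplot as plt  # lazy import, requires matplotlib

    fig, ax = plt.subplots()
    image = ax.imshow(matrix, cmap="viridis")
    ax.set_xticks(list(range(len(col_labels))))
    ax.set_xticklabels(col_labels)
    ax.set_yticks(list(range(len(row_labels))))
    ax.set_yticklabels(row_labels)
    # The colorbar is the scale mapping colors back to quantities
    fig.colorbar(image, ax=ax, label="Marks")
    return fig
```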

14. Waterfall Plot

Now here comes one of the most unique plots for cumulative data visualization. It helps us to visualize the cumulative numerical data column over categorical data.

This can be very helpful for data dealing with time series.

The adjacent plot shows the cumulative Net Cash Flow (numerical) over months (categorical, time-related).
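The cumulative logic of a waterfall plot can be sketched as follows: each month's bar starts where the running total stood after the previous month. The cash flow figures and the names `waterfall_bars` and `plot_waterfall` are invented for this example.

```python
from itertools import accumulate

months = ["Jan", "Feb", "Mar", "Apr"]
changes = [100, -30, 50, -20]  # hypothetical net cash flow per month


def waterfall_bars(changes):
    """Return (bottoms, totals): where each bar starts and ends up."""
    totals = list(accumulate(changes))
    bottoms = [0] + totals[:-1]
    return bottoms, totals


def plot_waterfall(months, changes):
    import matplotlib.pyplot as plt  # lazy import, requires matplotlib

    bottoms, _ = waterfall_bars(changes)
    colors = ["green" if change >= 0 else "red" for change in changes]
    fig, ax = plt.subplots()
    # Negative changes draw downwards from the previous total
    ax.bar(months, changes, bottom=bottoms, color=colors)
    ax.set_ylabel("Cumulative net cash flow")
    return fig
```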

15. Gantt Plot

Again a very useful and unique visualizer, it plots start and end times for certain events (hence the first love of management people).

In the adjacent plot, it can be observed how an entire project and its timeline can be visualized using Gantt plots.
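A simple Gantt chart can be approximated with horizontal bars whose left edge is the start time and whose length is the duration. The tasks below are hypothetical, as are the helper names `end_times` and `plot_gantt`.

```python
# Hypothetical project tasks: (name, start day, duration in days)
tasks = [("Planning", 0, 3), ("Development", 3, 8), ("Testing", 9, 4)]


def end_times(tasks):
    """Map each task name to the day on which it finishes."""
    return {name: start + duration for name, start, duration in tasks}


def plot_gantt(tasks):
    import matplotlib.pyplot as plt  # lazy import, requires matplotlib

    names = [name for name, _, _ in tasks]
    starts = [start for _, start, _ in tasks]
    durations = [duration for _, _, duration in tasks]
    fig, ax = plt.subplots()
    ax.barh(names, durations, left=starts)
    ax.invert_yaxis()  # list the first task at the top
    ax.set_xlabel("Day")
    return fig
```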

I guess this much will be enough for day 1 of visualization.
