For me, Data Science has always meant four things: Data Analytics, Statistics, Machine Learning and Data Visualization (though that last one is really a part of analytics).
But visualization gets ignored by beginners, not just often but almost always. So this time, here is a quick walkthrough on how to master data visualization and become the ultimate data magician.
Before beginning, let's understand why we need visualization.
It can provide you some great help in:
- Interpreting data better and making it memorable
- Noticing correlations
- Spotting outliers
- Feature Engineering
- Cause-Effect relations
And much more!!
So I guess we now have at least one reason to keep data visualization in our toolkit.
So, where should we start?
If you pick up any online course in Python, you will probably meet matplotlib first and then gradually move up to seaborn.
Some of you might see Bokeh as well.
But to be honest, Tableau is just unbeatable!!! Trust me on this. I will be starting with Tableau in my next post!!
What are the things we should keep in mind while visualizing stuff?
I thought of writing down some points, but then I found this.
Just bang on!!
For now, let's explore some plots and their significance.
1. Bar Plots
It is among the most popular plots we encounter. It is used to compare numerical data across categories or groups.
Example: if we need to compare the number of students who passed in different subjects, a bar plot does the job. In the image below, the y-axis can be taken as Marks and the x-axis as Subjects (A, B, C, D, E).
2. Grouped Bar Plots
Sometimes we need to compare two numerical columns, say Marks and Attendance, together across Subjects. That is where grouped bar plots come in: multiple numerical data columns can be compared side by side for each group.
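Under the hood, a grouped bar plot is just several bar plots drawn side by side with a small horizontal shift. Here is a minimal sketch of that idea, assuming matplotlib (the `ax` argument is expected to be a matplotlib Axes). The helper names `grouped_positions` and `draw_grouped_bars` and the sample numbers are made up for illustration.

```python
def grouped_positions(n_groups, n_series, width=0.8):
    """Return, for each series, the x position of its bar in every group.

    Each group is centred on an integer (0, 1, 2, ...) and the bars of one
    group together span `width`.
    """
    bar_width = width / n_series
    positions = []
    for s in range(n_series):
        # Shift series s so that the whole cluster stays centred on the group.
        offset = (s - (n_series - 1) / 2) * bar_width
        positions.append([g + offset for g in range(n_groups)])
    return positions, bar_width


def draw_grouped_bars(ax, groups, series, width=0.8):
    """Draw one bar per (group, series) pair on a matplotlib Axes."""
    positions, bar_width = grouped_positions(len(groups), len(series), width)
    for xs, (label, values) in zip(positions, series.items()):
        ax.bar(xs, values, width=bar_width, label=label)
    ax.set_xticks(range(len(groups)))
    ax.set_xticklabels(groups)
    ax.legend()


# Illustrative data, not taken from any real class.
subjects = ["A", "B", "C"]
scores = {"Marks": [72, 65, 80], "Attendance": [90, 85, 70]}
```

With matplotlib installed, `fig, ax = plt.subplots()` followed by `draw_grouped_bars(ax, subjects, scores)` and `plt.show()` should draw the chart.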
3. Stacked Bar Plots
If you want to compare Marks across different sections (suppose A, B, C) and subjects at the same time, you have stacked bar plots. Here, we can plot numerical data against groups and subgroups. For us, the groups are sections and the subgroups are subjects,
or vice versa.
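The trick behind stacking is that every segment starts where the previous one ended. A rough sketch of that bookkeeping follows, assuming matplotlib's `bar` with its `bottom` argument for the drawing part. The function names and the marks are hypothetical.

```python
def stacked_bottoms(series):
    """For each series, return the height at which its segments start.

    `series` maps a subgroup name to one value per group; segments are
    stacked in dictionary order.
    """
    n_groups = len(next(iter(series.values())))
    running = [0] * n_groups  # current top of each stack
    bottoms = {}
    for name, values in series.items():
        bottoms[name] = list(running)
        running = [top + v for top, v in zip(running, values)]
    return bottoms, running


def draw_stacked_bars(ax, groups, series):
    """Draw the stack on a matplotlib Axes using the `bottom` argument."""
    bottoms, _ = stacked_bottoms(series)
    for name, values in series.items():
        ax.bar(groups, values, bottom=bottoms[name], label=name)
    ax.legend()


# Illustrative marks: groups are sections, subgroups are subjects.
sections = ["A", "B", "C"]
marks = {"Maths": [70, 60, 80], "Physics": [65, 75, 50], "English": [85, 70, 90]}
bottoms, totals = stacked_bottoms(marks)
print(bottoms["English"])  # where the English segment starts in each section
print(totals)              # full height of each stacked bar
```

Swapping the roles (subjects as groups, sections as subgroups) only means reshaping the dictionary; the stacking logic stays the same.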
4. Histogram
A histogram can be considered as a plot of a binned version of a variable.
What is binning?
Binning is a way to group a number of more or less continuous values into a smaller number of "bins" or ranges. For example, if you have data about a group of people, you might want to arrange their ages in intervals as done above.
It should be used when we need to find the frequency of values within particular intervals or ranges of a column.
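The binning step itself is easy to do by hand, which helps to see what a histogram really counts. This is only a sketch: the ages and bin edges are invented, and `bin_counts` is a hypothetical helper, not a library function.

```python
def bin_counts(values, edges):
    """Count how many values fall into each [edges[i], edges[i+1]) interval.

    The last interval also includes its right edge, which is the convention
    NumPy and matplotlib use for histograms.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (last and v == edges[-1]):
                counts[i] += 1
                break
    return counts


# Illustrative ages, binned into 20-year intervals.
ages = [3, 17, 21, 25, 34, 38, 41, 47, 52, 60, 66, 80]
edges = [0, 20, 40, 60, 80]
for low, high, count in zip(edges, edges[1:], bin_counts(ages, edges)):
    print(f"{low}-{high}: {count}")
```

In practice you would let the library do this: `plt.hist(ages, bins=edges)` in matplotlib bins and draws in one call.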
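But it looks just like a bar plot!!
Yes, they look similar, but some major differences exist. A histogram shows how the values of a single numerical column are distributed across ranges, while a bar plot compares a numerical value across separate categories.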
5. Pie Chart
It again compares numerical data against a category, just like a bar plot, but with a difference: it helps us compare data as a fraction of the whole (percentages rather than raw numbers). In our example, it can be used when we need the percentage of students who passed in certain subjects and not only the raw numbers.
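The conversion from raw numbers to shares of the whole is a one-liner. A small sketch, with made-up pass counts and a hypothetical helper name:

```python
def percentages(counts):
    """Convert raw counts per category into percentages of the total."""
    total = sum(counts.values())
    return {name: 100 * value / total for name, value in counts.items()}


# Illustrative number of students who passed each subject.
passed = {"A": 30, "B": 45, "C": 15, "D": 10}
shares = percentages(passed)
for subject, share in shares.items():
    print(f"Subject {subject}: {share:.1f}%")
```

matplotlib can do the same conversion while drawing, for example `plt.pie(list(passed.values()), labels=list(passed), autopct="%1.1f%%")`.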
6. Box & Whiskers Plot
Box plots provide a lot of information about any numerical data column. Their main purpose is to give a summary of the distribution of the data.
The Lower Quartile refers to the 25th percentile, while the Upper Quartile refers to the 75th percentile. When we need a summary of a histogram, we use the box and whiskers plot.
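To see which numbers a box plot is built from, here is a rough sketch using Python's `statistics` module. Two assumptions of mine, not from the text above: the whiskers follow the common 1.5 x IQR rule (matplotlib's default), and quartiles use the "inclusive" method. Other tools may compute quartiles slightly differently, and the marks are invented.

```python
import statistics


def box_summary(values):
    """Compute the quartiles, whisker ends and outliers behind a box plot."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range: the height of the box
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers stop at the most extreme points that are still inside the fences.
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": sorted(v for v in values if v < low_fence or v > high_fence),
    }


# Illustrative marks with one suspicious entry.
marks = [35, 42, 48, 55, 61, 67, 70, 74, 120]
summary = box_summary(marks)
for name, value in summary.items():
    print(name, value)
```

This is also why box plots are handy for spotting outliers: anything beyond the whiskers gets drawn as a separate point.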
7. Scatter Plots
Don't these plots look like something from high school?!
It is basically an X, Y coordinate plot between two numerical data columns, which can be helpful for tracking down the regression line.
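If you want the regression line itself, and not just an eyeballed trend, ordinary least squares gives the slope and intercept directly. A minimal sketch with a hypothetical `fit_line` helper and invented temperature and sales numbers:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel out).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Illustrative data: temperature vs. sales.
temps = [14, 16, 19, 22, 25]
sales = [210, 260, 300, 370, 410]
slope, intercept = fit_line(temps, sales)
print(f"sales ~ {slope:.1f} * temp + {intercept:.1f}")
```

Drawing `ax.scatter(temps, sales)` and then a line through `slope * t + intercept` for the smallest and largest temperature puts both on the same plot.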
8. Bubble Chart
In the above example, we were plotting X, Y coordinates (Sales and Temp). What if a Z also comes up, like Sales, Temp and production units? Here comes the Bubble Chart. It is a variant of the scatter plot where the third dimension is represented using bubble size.
9. Strip Plot
We have just looked at scatter plots. They are X, Y coordinate plots where both X and Y are continuous.
But what if we want one of them to be categorical?
Strip plots help us draw a scatter plot between a numerical and a categorical data column. To avoid overlapping points, a small random jitter may be added along the categorical axis.
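Jitter is nothing more than a little random noise on the position of each point. Here is a sketch of how it could be done by hand; the helper name, the jitter width and the data are my own assumptions for illustration.

```python
import random


def jittered_positions(categories, jitter=0.1, seed=0):
    """Map each categorical label to a numeric x position plus small noise.

    Only the categorical axis is jittered; the numerical values are left
    untouched so the plot still shows the real data.
    """
    rng = random.Random(seed)  # fixed seed keeps the plot reproducible
    # Number the categories in order of first appearance: A -> 0, B -> 1, ...
    order = {label: i for i, label in enumerate(dict.fromkeys(categories))}
    return [order[c] + rng.uniform(-jitter, jitter) for c in categories]


# Illustrative data: identical marks would sit on top of each other otherwise.
sections = ["A", "A", "A", "B", "B", "C"]
marks = [70, 70, 72, 55, 55, 90]
xs = jittered_positions(sections)
for x, section, mark in zip(xs, sections, marks):
    print(f"section {section}: x={x:.3f}, mark={mark}")
```

Passing `xs` and `marks` to `ax.scatter` gives a basic strip plot. seaborn wraps all of this in `sns.stripplot(x=sections, y=marks, jitter=True)`.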
10. Swarm Plot
Swarm plots are another variant of the scatter plot. They are quite similar to strip plots, but with a difference: to avoid overlap, the points are placed horizontally next to each other. They are mostly used for plotting categorical vs numerical data.
11. Violin Plots
Among the most used scatter plot variants, it helps to plot categorical vs numerical data just as strip and swarm plots do. But!!
It is basically a combination of a box and whiskers plot (see the center) and a Kernel Density Estimation plot of the variable.
What the heck is Kernel Density Estimation?
To understand KDE, this is the best resource you can find on the internet.
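In the meantime, the core idea fits in a few lines: put a small bell curve on every data point and average them to get a smooth estimate of the distribution. The sketch below assumes a Gaussian kernel and a bandwidth picked by hand; libraries such as seaborn choose the bandwidth automatically, and the marks here are invented.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density of `samples` at any point x."""
    n = len(samples)
    # Normalisation so that the estimated density integrates to 1.
    norm = 1 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Illustrative marks; the bandwidth of 4.0 is an arbitrary choice.
marks = [52, 55, 58, 60, 61, 63, 70, 85]
density = gaussian_kde(marks, bandwidth=4.0)
for x in range(50, 91, 10):
    print(x, round(density(x), 4))
```

A violin plot draws this curve twice, mirrored around the center line. A larger bandwidth gives a smoother, flatter violin, while a smaller one gives a bumpier shape.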
12. Treemap
Treemaps are an alternative way of visualizing a hierarchical (tree) structure of groups and subgroups while also displaying the quantity for each category via area size. Each category is assigned a rectangle, with its subcategory rectangles nested inside it.
13. Heatmap
Heatmaps can be considered a close variant of the bar plot, with a stark difference.
Instead of using bars, a heatmap uses colors to signify the quantity. It also provides a scale showing which color represents which quantity.
Do choose the color spectrum wisely: pick one that is relevant to the data, not a random set of colors.
14. Waterfall Plot
Now here comes one of the most unique plots for cumulative data. It helps us visualize the cumulative values of a numerical data column over categorical data.
This can be very helpful for time series data.
The adjacent plot shows the cumulative Net Cash Flow (numerical) over months (categorical, time-related).
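The "floating" bars of a waterfall plot come from a running total: each bar starts at the level where the previous one ended. A quick sketch with invented monthly figures and a hypothetical helper name:

```python
from itertools import accumulate


def waterfall_segments(changes):
    """Turn per-period changes into (start, end) pairs for floating bars."""
    ends = list(accumulate(changes))  # running total after each period
    starts = [0] + ends[:-1]          # each bar begins at the previous total
    return list(zip(starts, ends))


# Illustrative net cash flow per month.
months = ["Jan", "Feb", "Mar", "Apr"]
net_cash_flow = [100, -30, 50, -20]
segments = waterfall_segments(net_cash_flow)
for month, (start, end) in zip(months, segments):
    print(f"{month}: bar from {start} to {end}")
```

With matplotlib, one way to draw it is `ax.bar(months, net_cash_flow, bottom=[start for start, _ in segments])`, since a bar with a negative height simply extends downward from its `bottom`.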
15. Gantt Plot
Again a very useful and unique visualizer, it plots the start and end times of events (hence the first love of management people).
In the adjacent plot, you can see how an entire project and its timeline can be visualized using a Gantt plot.
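A Gantt plot is essentially a horizontal bar chart where every bar has its own starting point. The sketch below only prepares the numbers; the task names and dates are invented, and `gantt_rows` is a hypothetical helper.

```python
from datetime import date


def gantt_rows(tasks, project_start):
    """Convert (name, start, end) tasks into (name, offset_days, duration_days)."""
    rows = []
    for name, start, end in tasks:
        offset = (start - project_start).days  # where the bar begins
        duration = (end - start).days          # how long the bar is
        rows.append((name, offset, duration))
    return rows


# Illustrative project plan.
tasks = [
    ("Collect data", date(2024, 1, 1), date(2024, 1, 10)),
    ("Clean data", date(2024, 1, 8), date(2024, 1, 20)),
    ("Build charts", date(2024, 1, 20), date(2024, 1, 31)),
]
rows = gantt_rows(tasks, project_start=date(2024, 1, 1))
for name, offset, duration in rows:
    print(f"{name}: starts on day {offset}, lasts {duration} days")
```

In matplotlib these rows map onto `ax.barh(names, durations, left=offsets)`, where `left` shifts each bar to its start day.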
I guess this much will be enough for day 1 of visualization.