Simple Guide to Data Visualization

AI+ Port Harcourt
Published in The Startup
7 min read · Sep 13, 2020

How do you get information across through visualization?

Photo by Mihály Köles on Unsplash

Data visualization is the graphic representation of data and information. Data visualization makes use of charts, graphs, software or other visualization tools to provide a quick overview of data and show trends and relationships that exist.

Data visualization is not just about plotting charts or making colorful images. The goal is to pass information to the end users and to help them:

  • Visualize trends in a dataset
  • Easily recognize outliers
  • Recognize data patterns
  • Understand relationships between variables

Data Visualization for Data Scientists

Data science has found application in various industries, leading to the employment of data scientists, analysts and engineers in varying capacities. Different industries favor different charts, which is why it is good to have an understanding of chart types for visualizations.

In finance, line charts are very useful as they show trends in prices and can also be used to forecast future trends. Gantt charts are used to keep track of project and event start and end times and have found useful application in project management.

In data science, Exploratory Data Analysis is an important process that often employs visual methods to summarize data in order to gain insight from data (the whole data science process is aimed at gaining insight from data).

Data visualization is carried out with visualization tools. These tools include websites and services that offer visualization, visualization software, and visualization libraries for programming languages.

Data visualization tools include:

  1. Power BI
  2. Tableau
  3. Adaptive Insights
  4. Plotly
  5. Google Charts

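Power BI, Tableau and Adaptive Insights can be used without any coding experience. Plotly and Google Charts are usually driven from code, although they need far less of it than building charts from scratch.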

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is the process of summarizing the important aspects of data, often using visualization. EDA is an important part of the data science process. It seeks to check assumptions and test hypotheses, as well as discover patterns and trends in the data.

Popular programming languages for data visualization include:

  1. Python
  2. R
  3. MATLAB

Popular visualization libraries include:

  • Matplotlib
  • Seaborn
  • Bokeh
  • ggplot

I recall having to make a presentation during my internship. It was nothing difficult, and my tools were simply Excel and PowerPoint. I was already familiar with both, so I did not feel a presentation would be a challenge. However, when I submitted my ‘finished work’ to my supervisor, he made so many corrections that I felt beat. My good work was not so good after all. Let me tell you some of the mistakes I made:

I did not label my charts properly

Rookie move, I know. Today I cannot believe I did that, and I will simply put it down as an omission.

My chart proportions did not match the presentation

Let me explain. This was a presentation, I was going to use a projector, and my charts were too big. I did not know that was even possible.

I had to learn how to proportion my charts with the rest of my presentation so that they looked good.

Looks do matter when it comes to visualization because the eyes perceive the data before the mind even starts to process it. If you are carrying out Exploratory Data Analysis in a notebook environment that will never be published on any platform, you might not give this much thought. Notebooks are also easier to navigate and, often with little effort, your visualizations will look good.

I used the wrong plots to represent my data

My data was nominal (categorical): the quantity of some items (I can’t remember what exactly), and I had used a line plot.

In my defense, I thought it looked really good, and the colors were splendid. When my supervisor looked at it, he asked me what I was plotting. First strike. The whole point of visualizing data is to give a quick, easy-to-understand overview of the data. Recall that I did not even label my charts, so I can only imagine my supervisor trying to make sense of it all.

If your visualization confuses the audience, you should probably (in fact, definitely) rethink your charts and plots.

Back to the story: when I told him what I wanted to plot, he simply turned it into a bar chart. He explained that bar charts are better for visualizing the quantity of items, while line charts should be used to show how one quantitative variable changes with another, for example over time. You’ll see what I mean soon enough.

Some important lessons I learned:

  • Bigger is not better
  • Your plot should be easy to understand
  • Label plots properly
  • Understand chart relationships
  • Take your data type into consideration when choosing charts

The presentation medium also matters. Graphic designers are familiar with the RGB and CMYK color models. RGB is the color model used by screens, while CMYK is used for print. Graphic designers often have to convert their work from RGB to CMYK to get a good idea of how it will look in print.

With advancements in technology, this conversion may not be necessary, and because most presentations are made on screens, you may never have to worry about it, especially as a data scientist.

Common Charts

Charts give a quick summary of data and can be used to show the relationship between variables or features. Most charts have two axes, the x-axis and the y-axis. By convention, the x-axis (horizontal) represents the independent variable while the y-axis (vertical) represents the dependent variable.

Below are some common charts, their applications and the information that can be derived from them. We will be using Python and Matplotlib throughout for consistency. However, I advise you to explore other visualization tools and libraries.

I also make use of the Titanic dataset. I refer to this dataset frequently because it is very popular, especially amongst beginners, and can be accessed via Kaggle.

BAR CHART

The bar chart is used to represent categorical data. On one axis, we plot the nominal or discrete variable, while on the other axis, we have the dependent quantitative variable. The quantitative values may be discrete or continuous. The heights (or lengths) of the bars are proportional to their values. Bar charts may be horizontal or vertical (column charts), and the bars may be placed side by side or stacked.

The following information can be obtained from bar charts:

  • Highest and lowest values in a dataset
  • The distribution of data (normal, binomial, etc.), using a related type of chart, the histogram
  • The count or measure of each category
  • Comparisons between categories
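
As a minimal sketch of a bar chart in Matplotlib, assuming made-up passenger counts per class rather than the real Titanic figures (the variable names are illustrative as well), something like the following could be used:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt

# Made-up passenger counts per ticket class (illustrative, not the real Titanic figures).
passenger_classes = ["First", "Second", "Third"]
passenger_counts = [30, 25, 60]

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(passenger_classes, passenger_counts, color="steelblue")

# Label everything so the reader does not have to guess what is plotted.
ax.set_title("Passengers per class (illustrative data)")
ax.set_xlabel("Passenger class")
ax.set_ylabel("Number of passengers")

# The tallest bar corresponds to the category with the highest count.
largest_class = passenger_classes[passenger_counts.index(max(passenger_counts))]
fig.tight_layout()
```

In a notebook the figure is displayed automatically; in a script, `fig.savefig("bar_chart.png")` would write it to a file (the file name is just an example).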

LINE CHART

The line chart is used to visualize the relationship between two quantitative variables, one independent and one dependent. With the line chart, it is easy to spot changes in data over time. The data points are represented by markers connected by lines or curves. The direction of the line also denotes the type of relationship between the variables: a rising line indicates a positive relationship and a falling line a negative one.

The line chart can be used to:

  • Display the historical trend of data
  • Forecast future trends
  • Track changes that occur over a period

A line plot with a lot of data can be quite messy, so take note of that. However, line plots are very useful for time-series data, which is why they are widely used for viewing stock prices and forex rates.
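
A line chart of a short time series can be sketched as follows. The closing prices below are invented purely for illustration, and the variable names are assumptions rather than part of any real dataset:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt

# Invented closing prices for five trading days (illustrative only).
trading_days = [1, 2, 3, 4, 5]
closing_prices = [100.0, 102.5, 101.0, 104.0, 106.5]

fig, ax = plt.subplots(figsize=(6, 4))
# Markers show the individual data points; the line shows the trend between them.
(line,) = ax.plot(trading_days, closing_prices, marker="o", label="Closing price")

ax.set_title("Closing price over one week (illustrative data)")
ax.set_xlabel("Trading day")
ax.set_ylabel("Price")
ax.legend()

# Overall change across the period, i.e. the direction of the trend.
price_change = closing_prices[-1] - closing_prices[0]
fig.tight_layout()
```

A positive `price_change` matches the rising line in the plot; with many series on one chart, plotting fewer lines or using separate subplots keeps it readable.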

SCATTER PLOT

The scatter plot is similar to the line plot, but without the lines. It is a plot of two quantitative variables against each other. An advantage of the scatter plot is that clusters can easily be identified.

Other information that can be obtained from scatter plots includes:

  • The relationship between the numerical variables
  • Clusters in the data
  • Outliers

Variables may be positively correlated, negatively correlated or not correlated at all. They can also be strongly or weakly correlated, depending on how closely the data points lie to a line of best fit.
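
To make the idea of correlation and outliers concrete, here is a small sketch of a scatter plot. The ages and fares are made up (they are not taken from the Titanic dataset), and the last point is deliberately placed far from the rest to act as an outlier:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Made-up ages and fares; the last point is a deliberate outlier.
ages = np.array([22, 25, 30, 35, 40, 45, 50, 48])
fares = np.array([10.0, 12.0, 15.0, 18.0, 20.0, 24.0, 26.0, 90.0])

# Pearson correlation coefficient: near +1 or -1 is strong, near 0 is weak.
r = np.corrcoef(ages, fares)[0, 1]

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(ages, fares, alpha=0.8)
ax.set_title(f"Age vs fare (illustrative data, r = {r:.2f})")
ax.set_xlabel("Age (years)")
ax.set_ylabel("Fare")

# Point out the value that sits far away from the others.
outlier_index = int(np.argmax(fares))
ax.annotate(
    "possible outlier",
    xy=(ages[outlier_index], fares[outlier_index]),
    xytext=(25, 80),
    arrowprops={"arrowstyle": "->"},
)
fig.tight_layout()
```

Here the outlier is picked simply as the largest fare, which is enough for a toy example; real data usually needs a more careful rule.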

GANTT CHART

The Gantt chart is a type of horizontal bar chart. Nominal data (the activities) is plotted on the y-axis and quantitative data (time) on the x-axis. The position and length of each bar are important features, as Gantt charts are often used to show the stages of a project and the relationships between activities. It is therefore an important tool for project management.

From a Gantt chart, we are able to identify:

  • Start and stop time of each activity
  • Overlapping activities
  • Order of activities
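
Matplotlib has no dedicated Gantt function, but one common approach is to draw horizontal bars with `barh` and use its `left` argument for the start time. The sketch below assumes a hypothetical four-task project measured in days; the task names and durations are invented:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt

# Hypothetical project plan: start day and duration (in days) of each task.
tasks = ["Collect data", "Clean data", "Explore data", "Build report"]
start_days = [0, 3, 6, 11]
durations = [4, 3, 5, 3]

fig, ax = plt.subplots(figsize=(7, 3))
# `left` positions each bar at its start day; the bar length is the duration.
ax.barh(tasks, durations, left=start_days, color="seagreen")
ax.invert_yaxis()  # show the first task at the top
ax.set_title("Project schedule (illustrative data)")
ax.set_xlabel("Day of project")
ax.set_ylabel("Task")

# A task overlaps the next one if the next one starts before it ends.
# Only consecutive tasks are compared here, to keep the sketch short.
end_days = [start + length for start, length in zip(start_days, durations)]
overlaps = [
    (tasks[i], tasks[i + 1])
    for i in range(len(tasks) - 1)
    if start_days[i + 1] < end_days[i]
]
fig.tight_layout()
```

With these numbers, only data collection and data cleaning run in parallel, which is visible in the chart as two bars sharing part of the x-axis.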

PIE CHARTS

Pie charts are circular charts used to illustrate numerical proportions of categorical variables. The proportions can be represented in degrees totaling 360, in percentages totaling 100%, or in decimals totaling 1. Each slice of the pie represents a part of a whole.

Pie charts are useful for:

  • Comparison
  • Showing compositions of a whole
  • Proportions of classes
  • Quick summary of data
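
As a rough sketch, a pie chart of two categories can be drawn as follows. The survival counts are made up for illustration and are not the real Titanic numbers:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt

# Made-up counts per category (illustrative, not the real Titanic figures).
labels = ["Survived", "Did not survive"]
counts = [80, 120]

# Each slice is the category's share of the whole, here expressed in percent.
shares = [100 * count / sum(counts) for count in counts]

fig, ax = plt.subplots(figsize=(5, 5))
# autopct prints the percentage inside each slice.
ax.pie(counts, labels=labels, autopct="%1.1f%%", startangle=90)
ax.set_title("Share of passengers by outcome (illustrative data)")
fig.tight_layout()
```

Pie charts read best with a handful of slices; with many categories, a bar chart is usually easier to compare.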

HISTOGRAMS

Histograms are bar charts where the values are grouped into ranges called bins. A tall bar indicates that many values fall within that range, while a short bar indicates the opposite. Histograms are especially useful for displaying the distribution of the data (normal, binomial, Poisson, etc.).

Histograms are used to:

  • Identify the ranges with the highest and lowest counts
  • Support statistical analysis (estimating the mean, median, mode, etc.)
  • Show the distribution of a dataset
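
The binning described above can be sketched with `hist`. The ages here are synthetic, drawn from a normal distribution with an assumed mean of 30 and standard deviation of 10, so the shape of the plot is known in advance:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic ages from a normal distribution (assumed parameters, fixed seed).
rng = np.random.default_rng(seed=42)
ages = rng.normal(loc=30, scale=10, size=500)

fig, ax = plt.subplots(figsize=(6, 4))
# hist groups the values into 10 bins and returns the count in each bin.
counts, bin_edges, _ = ax.hist(ages, bins=10, color="slateblue", edgecolor="black")

# Mark the mean so the center of the distribution is easy to see.
mean_age = ages.mean()
ax.axvline(mean_age, color="red", linestyle="--", label=f"Mean = {mean_age:.1f}")

ax.set_title("Distribution of ages (synthetic data)")
ax.set_xlabel("Age (years)")
ax.set_ylabel("Number of values in bin")
ax.legend()

# The tallest bar marks the range that contains the most values.
tallest_bin = int(np.argmax(counts))
tallest_range = (bin_edges[tallest_bin], bin_edges[tallest_bin + 1])
fig.tight_layout()
```

Changing `bins` changes how coarse or fine the picture is, so it is worth trying a few values before drawing conclusions about the shape.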

Conclusion

Understanding the characteristics of charts and their applications will help you make better visualizations. Knowing what to plot and how to make your plots will enable you to gain valuable insight from data.

While I focused on Matplotlib, there are other visualization libraries that can help you make simple plots.

Remember, your visualization should be clean, clutter-free and suitable for the data type you are visualizing. Don’t forget to label those charts!

Written by Anita Igbine

AI+ Port Harcourt

Data Science Nigeria PH community. Data Science and Artificial Intelligence