Data Visualization Using Matplotlib & Python For Data Science Aspirants

What Is Data Visualization In Data Science & How To Do It Using The Matplotlib Python Library?

@pramodchandrayan
Predict
11 min read, Aug 26, 2020


I Feel:

In today’s digital world data has become as important as air. Machines & humans both are literally breathing in & breathing out data, data & data….

People are consuming and generating huge volumes of data, knowingly and unknowingly, on a daily basis. It is this bombardment of digital information that businesses are trying to tap and harness to sell more and engage their customers. All types of industries are bringing a personal touch to their services and offerings to give an awesome user experience to their customers. All of this has become possible due to powerful data-science-enabled AI/ML techniques that empower our machines, allowing them to make analytical decisions based on the sea of data accessible to them.

To analyze these huge data sets, our machines make use of some really powerful data visualization packages built in Python. So we will try to cover:

1. What Is Data Visualization?

2. What are Data Visualization Packages?

3. How To Use Them?

4. Why Should You Learn Them?

We will cover these topics in this series on data visualization using Python, which we will break into several parts.

Data Visualization In Data Science:

As we know, the human mind understands more readily through images. As the saying goes, “A picture is worth a thousand words”. This is completely relevant when you are learning data science. You will be dealing with large data sets that need visual expression before you can make sense of them and uncover valuable hidden patterns.

Data visualization is a technique in the data science field that allows you to tell a compelling story by presenting data and findings in an approachable and stimulating way. It makes complex data look simple and easy to understand.

Data Visualization Tools:

We will try to cover some of the popular data visualization tools given below:

  1. Matplotlib
  2. Seaborn
  3. Plotly
  4. Pandas

Learning how to leverage these software tools to visualize data will help you make sense of data, extract meaningful information, and plot it visually to make more effective data-driven decisions.

So let’s get started with Matplotlib, which we will cover in today’s article; the rest we will cover in upcoming parts of this data visualization series.

A: Matplotlib:

As per the official Matplotlib portal:

Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits.

Matplotlib is a widely used tool for data visualization. It works great at a low level, with a MATLAB-like plotting interface, and offers you a lot of flexibility in how you write your code. Yes, it can sometimes be tedious to write more code, but it is worth it for the kind of freedom it gives.

Installing Matplotlib:

1. Using pip:

python -m pip install -U pip

python -m pip install -U matplotlib

2. Using a Scientific Python Distribution:

There are many third-party scientific Python distributions.

Anaconda is my personal favorite. It is one of the popular Python data science distributions; it gives you a hassle-free installation of data science packages and comes pre-loaded with NumPy, SciPy, pandas, Matplotlib, Plotly, etc. I would recommend that you install it, and you will be all set quickly.

You can install any package from the Conda command prompt/terminal, though you may need to visit the package’s official site to get the exact command format.

  • conda install PackageName

For Matplotlib:

  • conda install matplotlib
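Whichever route you take, a quick way to confirm the installation worked is to import the package and print its version. This is only a small sanity check, not part of the official installation steps:

```python
import matplotlib

# If this import succeeds, Matplotlib is installed in the active environment.
# The version string tells you which release you are running.
print(matplotlib.__version__)
```

If the import fails with a ModuleNotFoundError, the package was most likely installed into a different Python environment than the one you are running.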

The various types of data visualization that Matplotlib provides are:

  1. Lines, bars and markers
  2. Images, contours & fields
  3. Pie & polar charts
  4. Statistical level Plotting

& many more.

Matplotlib is widely used for line charts, bar charts, histograms, pie charts, etc.

For details, visit the gallery section of the Matplotlib 3.1.0 documentation (matplotlib.org). This gallery contains examples of the many things you can do with Matplotlib.

Plotting With Matplotlib: Let’s Learn By Examples:

As discussed, Matplotlib supports various kinds of plots, ranging from scatter plots to bar charts to histograms. The selection is contextual and depends on our data visualization requirements, such as comparing groups, comparing two quantitative variables to each other, or understanding a data distribution.

We will cover a few popular plotting techniques here:

Basic Requirements :

Before we get our hands dirty with some real examples, we need a few installations in place:

Install Anaconda Distribution:

1. First, you need to ensure Anaconda is installed:

Use the link given below to learn the installation process. It is easy and you can get started quickly: Installation, Anaconda 2.0 documentation (docs.anaconda.com)

On Windows, macOS, and Linux, it is best to install Anaconda for the local user, which does not require administrator…

Launch Jupyter Notebook:

Once you are done installing the Anaconda distribution, open Anaconda Navigator on your computer and launch Jupyter Notebook. We will be using the Jupyter notebook to code our examples.

Check for the Prerequisite Package Installation:

Go to the Environments menu option and you will see the various pre-installed packages on the right. For example, search for pandas and you will see that it is pre-installed. Similarly, you can type in any required package to find it and, if it is not already installed, install it through Anaconda Navigator. Check that Matplotlib, NumPy, pandas, seaborn, etc. are installed, and install any that are missing.

Once you are done with the required package installation, let’s get started with our first plot, the bar chart:

Some Key Points About Matplotlib To Remember:

Matplotlib has an important module named pyplot, which provides the functions used to plot figures. The Jupyter notebook is a convenient place to run the plots; it gives a hassle-free experience and is easy to get started with. We import matplotlib.pyplot as plt so that we can call the module’s functions through the short alias plt.

  • Import the required libraries, and load the dataset to plot using pandas pd.read_csv().
  • Use plt.plot() for plotting a line chart; other chart types have their own functions that are used in place of plot. All plotting functions require data, which is passed in through the function’s parameters.
  • Use plt.xlabel() and plt.ylabel() for labeling the x-axis and y-axis respectively.
  • Use plt.xticks() and plt.yticks() for setting the tick positions and labels on the x-axis and y-axis respectively.
  • Use plt.legend() to show a legend identifying the plotted variables.
  • Use plt.title() for setting the title of the plot.
  • Use plt.show() for displaying the plot.
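To see how these functions fit together, here is a minimal sketch of a line chart. The month and sales values are made up purely for illustration; only the plt calls themselves come from the points above.

```python
import matplotlib.pyplot as plt

# Made-up sample data, used here for illustration only.
months = [1, 2, 3, 4, 5]
sales = [10, 14, 12, 18, 21]

# plt.plot draws a line chart; label is the text plt.legend() will show.
line, = plt.plot(months, sales, marker="o", label="Sales")

plt.xlabel("Month")                                      # x-axis label
plt.ylabel("Units sold")                                 # y-axis label
plt.xticks(months, ["Jan", "Feb", "Mar", "Apr", "May"])  # tick positions and their labels
plt.legend()                                             # legend built from the label above
plt.title("Monthly sales (sample data)")
plt.show()                                               # display the figure
```

In a Jupyter notebook the chart appears right below the cell once plt.show() runs.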

1. Bar Chart Plotting:

Bar Plotting Example :

#Here we import the matplotlib package with alias name as plt
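The original listing for this example is not reproduced here, so the following is only a minimal sketch of what a basic bar plot might look like. The language names and user counts are hypothetical values chosen for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical categories and values, chosen only for illustration.
languages = ["Python", "R", "Java", "Scala"]
users = [85, 40, 55, 20]

# plt.bar draws one bar per category; the second argument sets the bar heights.
bars = plt.bar(languages, users, color="steelblue")

plt.xlabel("Language")
plt.ylabel("Users")
plt.title("Bar chart (sample data)")
plt.show()
```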

Copy the above code, paste it into your Jupyter notebook, and run it; you will be able to see the bar plot as shown below:

Explanation:

After we import the Matplotlib data visualization package, its submodule pyplot provides the bar method, which helps you plot a basic bar graph.

The plt.bar method can be better understood through the explanation given below.

So to make a bar plot:

The bars are positioned at x with the given alignment. Their dimensions are given by width and height. The vertical baseline is bottom (default 0).

Each of x, height, width, and bottom may either be a scalar applying to all bars, or it may be a sequence of length N providing a separate value for each bar.
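The scalar-versus-sequence behaviour is easiest to see in a stacked bar chart. The sketch below uses made-up quarterly numbers for two hypothetical products: width is passed as a scalar and applies to every bar, while bottom is passed as a sequence so that each upper bar starts where its lower bar ends.

```python
import matplotlib.pyplot as plt

# Made-up quarterly figures for two hypothetical products.
x = [0, 1, 2, 3]
product_a = [5, 7, 6, 8]
product_b = [3, 2, 4, 1]

# width is a scalar, so the same width applies to all bars.
lower = plt.bar(x, product_a, width=0.6, label="Product A")

# bottom is a sequence of length N, giving a separate baseline for each bar.
upper = plt.bar(x, product_b, width=0.6, bottom=product_a, label="Product B")

plt.xticks(x, ["Q1", "Q2", "Q3", "Q4"])
plt.ylabel("Units")
plt.legend()
plt.title("Stacked bars (sample data)")
plt.show()
```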

For more detail, visit: matplotlib.pyplot.bar, Matplotlib 3.1.0 documentation (matplotlib.org)

The optional arguments color, edgecolor, linewidth, xerr, and yerr can be either scalars or sequences of length equal…

2. Histogram:

A histogram is a plot of the frequency distribution of a numeric array, made by splitting it into small equal-sized bins. Histograms are used to estimate the distribution of the data, with the frequency of values assigned to a value range called a bin.

If you want to mathematically split a given array into bins and frequencies, use NumPy’s histogram() function. If you want to see the distribution of numeric values visually, you can use the .hist() plot method to create a simple histogram.
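As a small illustration of the first option, the sketch below splits a made-up array into four equal-width bins with np.histogram and prints the frequency of each bin along with the bin edges.

```python
import numpy as np

# Small made-up array of values.
data = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])

# Split the range of the data (1 to 9) into 4 equal-width bins.
counts, bin_edges = np.histogram(data, bins=4)

print(counts)     # how many values fall into each bin
print(bin_edges)  # 5 edges delimit the 4 bins
```

Note that there is always one more edge than there are bins, and that the last bin includes its right edge.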

Matplotlib provides the functionality to visualize Python histograms out of the box with a versatile wrapper around NumPy’s histogram():

Example:
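The original listing is not reproduced here, so this is only a minimal sketch of a histogram drawn with plt.hist and bins='auto'. The data is a synthetic, normally distributed sample generated with NumPy, which is an assumption made for illustration; any numeric array would do.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic sample (an assumption for this sketch): 500 values around 50.
rng = np.random.default_rng(seed=0)
data = rng.normal(loc=50, scale=10, size=500)

# plt.hist returns the counts, the bin edges and the drawn bar patches.
counts, bin_edges, patches = plt.hist(data, bins="auto", edgecolor="black")

plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title("Histogram with bins='auto' (synthetic data)")
plt.show()
```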

Explanation:

The pyplot.hist() function in Matplotlib lets you draw the histogram. It takes the array as its required input, and you can optionally specify the number of bins. A histogram plot uses the bin edges on the x-axis and the corresponding frequencies on the y-axis. In the chart above, passing bins='auto' chooses between two algorithms to estimate the “ideal” number of bins. At a high level, the goal of the algorithm is to choose a bin width that generates the most faithful representation of the data.

Output of the histogram code mentioned above:

3. Scatter Plot:

A scatter plot is a type of plot or mathematical diagram using Cartesian coordinates to display values for typically two variables for a set of data. If the points are coded (color/shape/size), one additional variable can be displayed. The data are displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis.

A scatter plot can suggest various kinds of correlations between variables with a certain confidence interval. For example, weight and height, the weight would be on the y-axis, and height would be on the x-axis. Correlations may be positive (rising), negative (falling), or null (uncorrelated). If the pattern of dots slopes from lower left to upper right, it indicates a positive correlation between the variables being studied. If the pattern of dots slopes from the upper left to lower right, it indicates a negative correlation.
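The three cases can also be checked numerically. The sketch below builds synthetic data (an assumption made purely for illustration) with a rising, a falling, and an unrelated pattern, and computes the Pearson correlation coefficient for each using np.corrcoef.

```python
import numpy as np

# Synthetic data, generated only to illustrate the three cases.
rng = np.random.default_rng(seed=1)
x = rng.uniform(0, 10, size=200)
noise = rng.normal(0, 1, size=200)

y_rising = 2 * x + noise                  # slopes from lower left to upper right
y_falling = -2 * x + noise                # slopes from upper left to lower right
y_unrelated = rng.normal(0, 1, size=200)  # no relationship with x

for name, y in [("rising", y_rising), ("falling", y_falling), ("unrelated", y_unrelated)]:
    # np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r between x and y.
    r = np.corrcoef(x, y)[0, 1]
    print(f"{name}: r = {r:.2f}")
```

A coefficient close to +1 goes with the rising pattern, close to -1 with the falling pattern, and close to 0 with the uncorrelated one.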

Scatter plot method format (main parameters of plt.scatter):

x, y : array_like, shape (n, )

The data positions.

s : scalar or array_like, shape (n, ), optional

The marker size in points**2. Default is rcParams['lines.markersize'] ** 2.

c : color, sequence, or sequence of color, optional

The marker color.

For more detail about using the scatter plot method, please refer to: matplotlib.pyplot.scatter, Matplotlib 3.1.0 documentation (matplotlib.org)

Scatter Plot Example:
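The original listing is not reproduced here, so this is only a minimal sketch of a scatter plot using the x, y, s and c parameters described above. The height and weight values are synthetic and exist only to give the plot a visible positive trend.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic height/weight pairs (an assumption for this sketch).
rng = np.random.default_rng(seed=2)
height = rng.uniform(150, 190, size=50)                  # cm
weight = 0.9 * height - 90 + rng.normal(0, 5, size=50)   # kg, rises with height

# s sets the marker size in points**2 and c sets the marker color.
points = plt.scatter(height, weight, s=40, c="tab:blue", alpha=0.7)

plt.xlabel("Height (cm)")
plt.ylabel("Weight (kg)")
plt.title("Scatter plot (synthetic data)")
plt.show()
```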

Run the code in your Jupyter notebook and you will see the outcome as given below:

Understanding Data Visualization Through Real Data Sets :

We will be using the Automobile Dataset, which we have downloaded from Kaggle (www.kaggle.com), to understand data visualization using Matplotlib. The dataset consists of various characteristics of an auto.

Always Remember:

  1. Download the Automobile.csv file from the above link.
  2. Upload the file through Jupyter into the working directory where your current code files lie.

Plotting Histogram: Grouping Data Categorically:

We can have multiple histogram plots in the same plot. This helps you to compare the distribution of a continuous variable grouped by different categories.

To understand it, we will be using the Automobile.csv data set:

Reading Data Sets:
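The original listing is not reproduced here. A minimal sketch of reading the data with pandas might look like the following. So that it runs even without the Kaggle file, it reads a tiny made-up sample from memory; the column names make, body_style, horsepower and price are assumptions based on the columns mentioned in this article, and the rows are invented.

```python
import io
import pandas as pd

# A tiny made-up sample in the assumed column layout, so the sketch runs
# without the Kaggle file. With the real data you would use instead:
#     df = pd.read_csv("Automobile.csv")
sample_csv = io.StringIO(
    "make,body_style,horsepower,price\n"
    "toyota,hatchback,62,5348\n"
    "toyota,sedan,92,9988\n"
    "bmw,sedan,101,16430\n"
    "volvo,wagon,114,13415\n"
)

# pd.read_csv accepts a file path or any file-like object.
df = pd.read_csv(sample_csv)

print(df.head())   # first rows of the data, shown column-wise
print(df.dtypes)   # data type pandas inferred for each column
```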

When you run this code, you will see the data displayed column-wise as output.

Let’s compare the distribution of car horsepower for different car makes in the Automobile.csv data set.

Write or copy-paste the code given below into your Jupyter notebook file:
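The original listing is not reproduced here, so what follows is only a sketch of one way to draw grouped histograms. It uses a small made-up DataFrame as a stand-in for the real file, and the column names make and horsepower are assumptions; with the real data you would load df with pd.read_csv("Automobile.csv") instead.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up stand-in rows; the real data set has many more makes and cars.
df = pd.DataFrame({
    "make": ["toyota"] * 5 + ["bmw"] * 5,
    "horsepower": [62, 70, 92, 116, 116, 101, 121, 121, 182, 182],
})

# Draw one semi-transparent histogram per make on the same axes,
# so the distributions can be compared directly.
for make, group in df.groupby("make"):
    plt.hist(group["horsepower"], bins=5, alpha=0.5, label=make)

plt.xlabel("Horsepower")
plt.ylabel("Number of cars")
plt.legend(title="Make")
plt.title("Horsepower by make (sample data)")
plt.show()
```

Setting alpha below 1 keeps overlapping bars visible, which matters once several groups share the same value range.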

Below is the histogram plotted from the given set of values:

You can clearly make out that the larger concentration of horsepower lies between 110–120 hp.

Scatter Plot :

Let’s plot a data distribution using a scatter plot. Here we will try to see the price distribution based on the body_style of the car.

Copy and paste the code given below into your Jupyter notebook and run it:
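The original listing is not reproduced here, so this is only a sketch of how price could be plotted against body style. The rows are made up and the column names body_style and price are assumptions; with the real data you would load df with pd.read_csv("Automobile.csv") instead.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up stand-in rows for illustration only.
df = pd.DataFrame({
    "body_style": ["sedan", "sedan", "sedan", "hatchback", "hatchback", "wagon", "convertible"],
    "price": [9988, 13495, 16500, 6488, 7898, 8921, 17450],
})

# A categorical column on the x-axis gives one vertical strip of points per body style.
points = plt.scatter(df["body_style"], df["price"], alpha=0.6)

plt.xlabel("Body style")
plt.ylabel("Price ($)")
plt.title("Price by body style (sample data)")
plt.show()
```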

Output:

Observation :

You can see that there is a lot of data density around sedan-type cars, and their price mostly falls in the budget range of $10K to $15K. The second most common car body type comes out to be the hatchback. Wagon-type cars mostly fall in the low-budget range.

What’s Next :

There are more plots which we have not covered yet, like:

  1. Violin plot
  2. Stacked plot
  3. Stem Plot
  4. Line Plot
  5. Box Plot

We will cover these in part 2 of this series on data visualization. We will also cover “Data visualization using the Seaborn package” in detail.

When to Use What Type Of Data Visualization Plots/Charts?

I will leave you with this wonderful pictorial representation of data visualization chart types, which explains what type of graph you can choose based on your data analysis requirements:

Summing Up:

I absolutely recommend that all software engineers add an understanding of data science to their skills if they want to take advantage of the amazing opportunities this field of data engineering is poised to offer. With data engineering augmented by AI/ML techniques, you can really grow fast and become an instrument of change for your organization or your own startup.

The future will be all about data analysis, data prediction, product recommendations, and process automation. All of these will require a lot of data engineers who can help organizations make accurate, fast, and intelligent decisions regarding their services and product offerings.

@pramodchandrayan
Predict

Building @krishaq: an Agritech startup committed to revive farming, farmers and our ecology | Writes often about agriculture, climate change & technology