Data Analysis Tools for Python

Aryan Kenchappagol
Published in Analytics Vidhya
4 min read · Mar 21, 2021

Exploratory Data Analysis (EDA) is used to uncover patterns in your data and to draw conclusions from them. It is a core part of developing a Machine Learning model, and it works through analysis and visualization of the data that will be fed to the model. In this blog, we will see how to get started with Exploratory Data Analysis for Machine Learning with Python.

As a beginner, I struggled a lot while exploring and understanding the patterns present in a dataset. The EDA step cannot be skipped when you are creating a Machine Learning model, because it is the root of the whole model-building process. Without proper insight into the patterns in your data, you cannot move forward with building the model. In this blog we will look at the tools used for EDA and what each of them does. In the upcoming blogs, I will take a detailed approach to EDA through an example.

First things first: the perfect pack for starting with EDA is a set of Python modules commonly used for data analysis and visualization. These include:

import pandas as pd

Pandas helps in manipulating data held in tables (DataFrames) and in single columns (Series). Once you start importing datasets and working on them, you will notice how important pandas actually is. You do not need to know every function in this library, but a few of them go a long way in manipulating data.
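To give a feel for the "few functions" that cover most day-to-day work, here is a minimal sketch. The column names and values are made up purely for illustration; they are not from any real dataset.

```python
import pandas as pd

# Made-up values, purely for illustration.
df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180],
    "weight_kg": [50.0, 58.5, None, 80.0],
    "group": ["a", "b", "a", "b"],
})

print(df.head())      # first rows of the table
print(df.describe())  # summary statistics for the numeric columns

# Number of missing values in each column.
missing = df.isna().sum()

# Average height within each group.
group_means = df.groupby("group")["height_cm"].mean()
print(missing)
print(group_means)
```

With a real dataset you would typically start from something like pd.read_csv instead of building the DataFrame by hand.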

import numpy as np

NumPy makes it easy to work with multidimensional arrays and to apply an operation to a whole array at once instead of looping over it. It helps a lot in Deep Learning, where images have to be transformed in a specific manner to prepare the training and testing/evaluation data. NumPy is helpful in both Data Science and Deep Learning.
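As a rough sketch of the kind of image transformation meant here, the snippet below rescales and flattens a tiny batch of fake grayscale images. The array shape and the random pixel values are assumptions chosen only to keep the example small.

```python
import numpy as np

# A made-up batch of 2 grayscale "images", 4x4 pixels each, values 0-255.
rng = np.random.default_rng(seed=0)
images = rng.integers(0, 256, size=(2, 4, 4), dtype=np.uint8)

# Rescale pixel values to the range [0, 1] in one vectorized step.
scaled = images.astype(np.float32) / 255.0

# Flatten each image into one row: shape (2, 4, 4) becomes (2, 16).
flat = scaled.reshape(len(scaled), -1)

# Mean brightness of each image.
per_image_mean = flat.mean(axis=1)
print(flat.shape, per_image_mean)
```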

import matplotlib.pyplot as plt

Matplotlib helps in plotting your data so that you can understand the patterns present in your dataset. It is very hard to draw conclusions from your data without visualizing it. So, using Matplotlib, try creating bar plots, histograms, scatter plots, etc. This will help you a lot when it comes to data modelling and feature engineering in Machine Learning.
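Here is a small sketch of two of those plots, a histogram and a scatter plot, side by side. The data is randomly generated for illustration, so the variable names x and y do not refer to any real features.

```python
import matplotlib.pyplot as plt
import numpy as np

# Randomly generated data, purely for illustration.
rng = np.random.default_rng(seed=0)
x = rng.normal(loc=0.0, scale=1.0, size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)

fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: how the values of a single variable are distributed.
ax_hist.hist(x, bins=20)
ax_hist.set_title("Distribution of x")

# Scatter plot: how two variables relate to each other.
ax_scatter.scatter(x, y, s=10)
ax_scatter.set_xlabel("x")
ax_scatter.set_ylabel("y")
ax_scatter.set_title("y against x")

fig.tight_layout()
```

In a script, call plt.show() at the end to display the figure; in a notebook it is usually rendered automatically.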

import seaborn as sns

Seaborn is built on top of Matplotlib and is similar to it; the main difference is that its functions come with more convenient, higher-level parameters. Again, the main idea behind Seaborn and Matplotlib is to identify patterns in your raw data so that you can remove the data you do not need before feeding the rest to your model. This can help in improving model accuracy and in knowing the data more clearly.
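One example of such a parameter is hue, which colours the points by a column of the DataFrame in a single call. The sketch below uses a small made-up table; the column names are assumptions for illustration only.

```python
import pandas as pd
import seaborn as sns

# Made-up values, purely for illustration.
df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180, 155, 175],
    "weight_kg": [50.0, 58.5, 69.0, 80.0, 54.0, 72.5],
    "group": ["a", "b", "a", "b", "a", "b"],
})

# Column names are passed directly; hue colours the points by group
# and adds a legend.
ax = sns.scatterplot(data=df, x="height_cm", y="weight_kg", hue="group")
ax.set_title("Weight vs. height by group")
```

The returned object is a regular Matplotlib Axes, so anything you know from Matplotlib (titles, labels, limits) still applies.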

Data Analysis is nothing without data. So in order to get decent in this domain, the key is to use all these libraries together in your code. This will help you understand which functions of these modules you actually need and when to use them. I hope this helps.

A good first step is to get familiar with the basic functions of these libraries and to try out different approaches yourself on datasets.

As I mentioned earlier, both Pandas and NumPy are used for manipulating and analysing the data, so they give us a brief view of the data quickly.

Example:

Suppose I had to create a new dataset from two features, f_1 and f_2 (assuming both are of type DataFrame), for some visualization. For that we can use the concat function present in the pandas library.

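import pandas as pd
# f_1 and f_2 are assumed to be existing DataFrames
combined = pd.concat([f_1, f_2], axis=1)  # axis=1 places them side by side as columns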
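To make this runnable end to end, here is a minimal sketch in which f_1 and f_2 are tiny made-up DataFrames. It also shows one thing worth knowing: with axis=1, pandas lines the rows up by index, so mismatched indexes produce missing values. The f_3 frame is a hypothetical addition used only to demonstrate that.

```python
import pandas as pd

# Two made-up single-column DataFrames sharing the default index 0, 1, 2.
f_1 = pd.DataFrame({"f_1": [1, 2, 3]})
f_2 = pd.DataFrame({"f_2": [10.0, 20.0, 30.0]})

# axis=1 places the frames side by side as columns.
combined = pd.concat([f_1, f_2], axis=1)
print(combined)

# Rows are matched by index label, not by position. A frame with a
# shifted index therefore leaves gaps (NaN) where labels do not match.
f_3 = pd.DataFrame({"f_3": [7, 8, 9]}, index=[1, 2, 3])
misaligned = pd.concat([f_1, f_3], axis=1)
print(misaligned)
```

If the rows really do correspond by position, calling reset_index(drop=True) on each frame before concatenating avoids the gaps.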

So basically, data prepared with pandas, NumPy or other tools can be passed to the Matplotlib and Seaborn libraries for visualization.
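As one possible illustration of this integration, pandas objects have a plot method that draws onto Matplotlib axes. The sketch below counts rows per category and draws the counts as a bar plot and a pie chart; the group column and its values are made up.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up categorical column, purely for illustration.
df = pd.DataFrame({"group": ["a", "b", "a", "c", "a", "b"]})

# Count how many rows fall into each category: a -> 3, b -> 2, c -> 1.
counts = df["group"].value_counts()

fig, (ax_bar, ax_pie) = plt.subplots(1, 2, figsize=(8, 4))

# pandas draws directly onto the Matplotlib axes passed via ax.
counts.plot(kind="bar", ax=ax_bar, title="Rows per group")
counts.plot(kind="pie", ax=ax_pie, title="Share per group")

fig.tight_layout()
```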

Visualization of the data is done using Matplotlib and Seaborn. With these modules, a common first step is to draw a pairplot for the dataset.

The pairplot plots every pair of numeric columns against each other, which generally helps in identifying the nature of the relationship between the features and the target of the dataset. The pairplot is usually followed by:

  • Heatmaps
  • Histograms
  • Bar plots
  • Pie charts

So to summarize, EDA is a huge task that generally involves combining multiple functions from various modules to draw out conclusions and patterns from a dataset. EDA might seem difficult, but it is actually quite fun. I wanted to keep this blog short and introductory rather than walk through a full analysis. So in the upcoming blog I will take a brief approach to EDA with an example that includes precise visualizations for understanding the data.
