Basic tools to learn in Data Analysis with Python

Abhishek Jaiswal
Published in Analytics Vidhya
4 min read · Aug 11, 2020



Harvard Business Review labeled data scientist “the sexiest job of the 21st century.” According to LinkedIn, the career has grown rapidly and is the second-fastest-growing profession. This is a good time for anyone to start a career in the data field.

Let’s look at the basic tools everyone should know when starting data analysis with Python.

Step 1: Getting Started (Data Extraction)

You can use Jupyter Notebook (https://jupyter.org/) directly. It’s a free and open-source web application you can work on.

1. Import the Python libraries (NumPy, pandas, and Matplotlib) into your notebook.

NumPy provides multi-dimensional array and matrix data structures, along with mathematical operations on arrays such as trigonometric, statistical, and algebraic routines. Pandas is used for data manipulation and analysis; in particular, it offers data structures and operations for manipulating numerical tables and time series. Matplotlib handles plotting: it creates a figure, adds a plotting area to it, plots lines in that area, decorates the plot with labels, and more.
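The original import cell is not reproduced in the text. A minimal sketch, assuming the conventional aliases np, pd, and plt (the aliases are an assumption, not taken from the article):

```python
import numpy as np                # multi-dimensional arrays and numerical routines
import pandas as pd               # tabular data structures (DataFrame, Series)
import matplotlib.pyplot as plt   # plotting interface

# In a Jupyter notebook, plots render inline; older setups may need this magic:
# %matplotlib inline

# Quick check that NumPy is working: mean of a small array.
demo_mean = np.array([1, 2, 3]).mean()
print(demo_mean)
```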

2. Import your data into the notebook.

Here we use the pandas library to load the data set into the notebook and store it in the variable sales. Date parsing (the parse_dates option of pandas.read_csv) converts the date column from plain text into datetime values.
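The loading cell is also missing from the text, so the following is only a sketch of what it may have looked like. The inline CSV, its values, and the Date column name are assumptions made so the example runs on its own. The other column names are the ones used later in this article.

```python
import io
import pandas as pd

# Hypothetical stand-in for the article's CSV file (the real data set is larger).
csv_text = """Date,Country,State,Customer_Age,Age_Group,Unit_Cost,Revenue
2013-11-26,United States,Kentucky,40,Adults (35-64),45,800
2014-03-23,United States,California,25,Young Adults (25-34),120,900
2015-07-02,Canada,Ontario,50,Adults (35-64),45,1000
2016-01-15,United States,Kentucky,60,Adults (35-64),300,1200
"""

# parse_dates tells read_csv to convert the listed columns to datetime64.
sales = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

# With a file on disk, pass its path instead (file name is hypothetical):
# sales = pd.read_csv("sales_data.csv", parse_dates=["Date"])

print(sales.dtypes)
```

Without parse_dates, the Date column would be loaded as plain strings, and date-based filtering or resampling would not work directly.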

Output (Fig. 1)

Step 2: Data Exploration

1. sales.head() - Returns the first five rows of the sales data set.
Fig. 2

2. sales.shape - Gives the dimensions of the sales data set as a (rows, columns) tuple.

Fig. 3

3. sales.info() - Lists each column with its non-null count and data type.

Fig. 4

4. sales.describe() - Reports summary statistics for each numeric column: count, mean, standard deviation, minimum, maximum, and the 25th, 50th, and 75th percentiles.

Fig. 5
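The four exploration calls above can be tried on a small stand-in table. This is a sketch; the values below are made up for illustration and are not the article's data.

```python
import pandas as pd

# Hypothetical miniature version of the sales data set.
sales = pd.DataFrame({
    "State": ["Kentucky", "California", "Kentucky", "Oregon", "Texas", "Ohio"],
    "Customer_Age": [40, 25, 52, 33, 61, 19],
    "Unit_Cost": [45, 120, 45, 300, 10, 80],
    "Revenue": [950, 2400, 380, 5100, 120, 610],
})

first_rows = sales.head()    # first five rows; pass n to change the count
print(first_rows)

print(sales.shape)           # (number of rows, number of columns)

sales.info()                 # prints column names, non-null counts, and dtypes

summary = sales.describe()   # statistics for the numeric columns only
print(summary)
```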

Step 3: Particular Column Data Analysis

1. sales['Unit_Cost'].describe() - Describes one particular column (here Unit_Cost) from the sales data set.
Fig. 6

2. sales['Unit_Cost'].mean() - Mean of one particular column (here Unit_Cost) from the sales data set.

Fig. 7

3. sales['Unit_Cost'].median() - Median of one particular column (here Unit_Cost) from the sales data set.

Fig. 8

You can find many more DataFrame methods in the pandas reference: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html
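As a small worked example of the single-column statistics in this step, here is a sketch using made-up Unit_Cost values. Column labels are case-sensitive, so the name must match the data set exactly.

```python
import pandas as pd

# Hypothetical Unit_Cost values, chosen so the results are easy to verify by hand.
sales = pd.DataFrame({"Unit_Cost": [45, 120, 45, 300, 10, 80]})

unit_cost = sales["Unit_Cost"]     # selecting one column returns a Series

print(unit_cost.describe())        # count, mean, std, min, quartiles, max

mean_cost = unit_cost.mean()       # 600 / 6 = 100.0
median_cost = unit_cost.median()   # middle of sorted values: (45 + 80) / 2 = 62.5
print(mean_cost, median_cost)
```

The mean sits well above the median here because one large value (300) pulls it up. The plots in the next step show the same kind of skew.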

Step 4: Data Visualisation

1. sales['Unit_Cost'].plot(kind='box', vert=False, figsize=(14,6)) - Uses Matplotlib through pandas to draw a horizontal box plot of the Unit_Cost column, with a figure size of 14 by 6 inches.
Fig. 9

2. sales['Unit_Cost'].plot(kind='density', figsize=(14,6)) - Plots the same column as a density (kernel density estimate) plot.

Fig. 10

3. ax = sales['Unit_Cost'].plot(kind='density', figsize=(14,6))
ax.axvline(sales['Unit_Cost'].mean(), color='red')
ax.axvline(sales['Unit_Cost'].median(), color='green') -
Uses the same data as above, with axvline drawing a vertical line at the mean in red and at the median in green.

Fig. 11

4. ax = sales['Unit_Cost'].plot(kind='hist', figsize=(14,6))
ax.set_ylabel('Number of Sales')
ax.set_xlabel('dollars') -
Plots the same data as a histogram, labelling the Y axis "Number of Sales" and the X axis "dollars".

Fig. 12
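The plotting calls in this step can be combined into one runnable sketch. The data is made up, and the Agg backend line is only there so the script runs outside a notebook. The density plot is left out because pandas needs SciPy installed for kind='density', so the histogram carries the mean and median lines instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical Unit_Cost values for illustration only.
sales = pd.DataFrame({"Unit_Cost": [45, 120, 45, 300, 10, 80, 45, 200]})

# Horizontal box plot of the column.
box_ax = sales["Unit_Cost"].plot(kind="box", vert=False, figsize=(14, 6))
plt.close(box_ax.figure)  # close it so the next plot starts on a fresh figure

# Histogram with vertical lines at the mean (red) and median (green).
fig, ax = plt.subplots(figsize=(14, 6))
sales["Unit_Cost"].plot(kind="hist", ax=ax)
ax.axvline(sales["Unit_Cost"].mean(), color="red")
ax.axvline(sales["Unit_Cost"].median(), color="green")
ax.set_ylabel("Number of Sales")
ax.set_xlabel("dollars")
```

pandas returns a Matplotlib Axes object from plot(), which is why methods such as axvline and set_xlabel can be called on the result.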

Step 5: Correlations

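1. corr = sales.corr() - Computes the pairwise correlation between the numeric columns of the sales data set. In pandas 2.0 and later, pass numeric_only=True when the data set also contains non-numeric columns.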
Fig. 13
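A minimal sketch of the correlation step on made-up data. Selecting the numeric columns first with select_dtypes is a swapped-in variation on the article's call, used here because it behaves the same across pandas versions when text columns are present.

```python
import pandas as pd

# Hypothetical miniature sales table with one non-numeric column.
sales = pd.DataFrame({
    "State": ["Kentucky", "California", "Kentucky", "Oregon", "Texas", "Ohio"],
    "Customer_Age": [40, 25, 52, 33, 61, 19],
    "Unit_Cost": [45, 120, 45, 300, 10, 80],
    "Revenue": [950, 2400, 380, 5100, 120, 610],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
numeric = sales.select_dtypes(include="number")
corr = numeric.corr()
print(corr)
```

The result is a square table with one row and one column per numeric column. Each cell holds a value between -1 and 1, and the diagonal is always 1.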

Step 6: Column Modification

1. sales['Revenue_per_Age'] = sales['Revenue'] / sales['Customer_Age'] - Creates a new column, Revenue_per_Age, by dividing Revenue by Customer_Age.
Fig. 14

2. sales.loc[sales['State'] == 'Kentucky'] - Gets all the sales made in the state of Kentucky.

Fig. 15

3. sales.loc[(sales['Age_Group'] == 'Adults (35-64)') & (sales['Country'] == 'United States'), 'Revenue'].mean() - Gets the mean revenue of sales to the Adults (35-64) age group in the United States.

Fig. 16
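The three operations in this step, sketched on a small made-up table. The rows and the "Young Adults (25-34)" label are assumptions for illustration. Each condition in a combined filter needs its own parentheses because & binds more tightly than ==.

```python
import pandas as pd

# Hypothetical rows using the column names from this article.
sales = pd.DataFrame({
    "Country": ["United States", "United States", "Canada", "United States"],
    "State": ["Kentucky", "California", "Ontario", "Kentucky"],
    "Age_Group": ["Adults (35-64)", "Young Adults (25-34)",
                  "Adults (35-64)", "Adults (35-64)"],
    "Customer_Age": [40, 25, 50, 60],
    "Revenue": [800, 900, 1000, 1200],
})

# New column computed element-wise from two existing columns.
sales["Revenue_per_Age"] = sales["Revenue"] / sales["Customer_Age"]

# Boolean mask selects the rows where the condition holds.
kentucky = sales.loc[sales["State"] == "Kentucky"]

# Combine conditions with &, then pick one column and average it.
mask = (sales["Age_Group"] == "Adults (35-64)") & (sales["Country"] == "United States")
mean_revenue = sales.loc[mask, "Revenue"].mean()   # (800 + 1200) / 2 = 1000.0

print(kentucky)
print(mean_revenue)
```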

You can refer to the dataset used in this project here: https://drive.google.com/file/d/1dLIF5UrKR_gI0WznhTb5uTWaKOmcoOZL/view?usp=sharing

For the full pandas API reference, visit: https://pandas.pydata.org/pandas-docs/stable/reference/index.html#api

For the full NumPy API reference, visit: https://numpy.org/doc/stable/reference/

For the full Matplotlib API reference, visit: https://matplotlib.org/3.1.1/api/index.html
