Data Analysis With Pandas

Arvind Kale
4 min read · Nov 4, 2022


What is Data Analysis?

Data analysis is the technique of modifying, processing, and filtering raw data in order to obtain useful, pertinent information that supports commercial decision-making. The process offers helpful insights and statistics, frequently presented in charts, graphics, tables, and graphs, which lessen the risks associated with decision-making.

We can see a simple illustration of data analysis in daily life: whenever we make a decision, we weigh what happened in the past or what is likely to happen if we take that action. In its simplest form, data analysis means examining the past, or projecting the future, and choosing on that basis.

Types of Data Analytics

Data analytics is broken down into four basic types.

  1. Descriptive analytics: This describes what has happened over a given period of time. Has the number of views gone up? Are sales stronger this month than last?
  2. Diagnostic analytics: This focuses more on why something happened. This involves more diverse data inputs and a bit of hypothesizing. Did the weather affect beer sales? Did that latest marketing campaign impact sales?
  3. Predictive analytics: This moves to what is likely going to happen in the near term. What happened to sales the last time we had a hot summer? How many weather models predict a hot summer this year?
  4. Prescriptive analytics: This suggests a course of action. If the likelihood of a hot summer, measured as an average of these five weather models, is above 58%, we should add an evening shift to the brewery and rent an additional tank to increase output.

Types of Analysis:

We encounter two different sorts of variables when analyzing data: categorical variables and continuous variables.

A categorical variable takes one of a limited set of distinct values. It can be dichotomous (two categories) or have several categories, for example, whether a car has a 4-stroke or 2-stroke engine. There is a narrow range of possible answers to these kinds of questions.

A continuous variable, such as the cost of a house, can take any value within a range. A count such as the number of students in a class is strictly a discrete variable, but it is numerical and is usually analyzed the same way as a continuous one.
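As a rough sketch of how this distinction shows up in pandas, the snippet below separates columns by data type. The toy data and column names are assumptions made up for illustration, and treating every non-numeric column as categorical is a simplification.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "engine_type": ["4-stroke", "2-stroke", "4-stroke", "4-stroke"],
    "house_price": [250000.0, 310500.5, 199999.9, 420000.0],
    "students_in_class": [28, 31, 25, 30],
})

# Columns that are not numeric are treated as categorical here.
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# Integer and float columns are treated as numerical.
numerical_cols = df.select_dtypes(include="number").columns.tolist()

# A categorical column typically has only a few distinct values.
n_engine_types = df["engine_type"].nunique()

print(categorical_cols)
print(numerical_cols)
print(n_engine_types)
```

A numeric code column (for example, 0/1 flags) would land in the numerical list with this approach, so it is worth checking the result by eye.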

We have three sorts of analysis, based on the variable types above and the number of variables we use:

  1. Univariate Analysis
  2. Bivariate Analysis
  3. Multivariate Analysis

Univariate Analysis:

In univariate analysis, we take a single feature (column) from the data frame and analyze it, usually with a chart. There is one catch: the same charts do not suit both continuous and categorical variables.

For continuous variables, plot a histogram or KDE plot. For categorical variables, plot a bar chart or compute value counts.

For example, I ran value counts on the country column, which is categorical. Had it been a continuous variable, I would have preferred a histogram or KDE plot.
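A minimal sketch of both cases, assuming a small made-up data frame (the values and the lead_time column are hypothetical, not the original dataset): value counts for the categorical column, and binned counts, which are the table behind a histogram, for the numerical one.

```python
import pandas as pd

# Hypothetical toy data; values and column names are assumptions.
df = pd.DataFrame({
    "country": ["PRT", "GBR", "PRT", "FRA", "PRT", "GBR"],
    "lead_time": [10, 45, 3, 120, 60, 7],
})

# Categorical column: count how often each category occurs.
country_counts = df["country"].value_counts()
print(country_counts)

# Numerical column: split the range into 3 equal-width bins and count
# the values in each bin, which is what a histogram draws.
lead_time_bins = pd.cut(df["lead_time"], bins=3).value_counts().sort_index()
print(lead_time_bins)

# With matplotlib installed, the same data could be drawn with
# country_counts.plot(kind="bar") and df["lead_time"].plot(kind="hist").
```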

Bivariate Analysis:

In bivariate analysis, as the name suggests, we take two columns from the data frame. Depending on whether each column is continuous or categorical, there are three subtypes:

  1. Categorical vs Numerical: For a categorical column paired with a numerical column, plot either a box plot or a bar plot. For example, with number of bookings as the numerical column and market segment as the categorical column, I built a pivot table with pandas and then plotted a bar chart.
  2. Numerical vs Numerical: For two numerical columns, use a scatter plot or line chart, as these are the most suitable for this type of data.
  3. Categorical vs Categorical: For two categorical columns, use the pandas crosstab function, or group by one column and apply value counts to the other as the aggregation.
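The three subtypes can be sketched roughly as follows. The data frame is invented for illustration (the hotel and lead_time columns and all values are assumptions), and the correlation is used here only as a numeric stand-in for what a scatter plot would show.

```python
import pandas as pd

# Hypothetical toy data; values and column names are assumptions.
df = pd.DataFrame({
    "market_segment": ["Online", "Offline", "Online", "Corporate", "Offline", "Online"],
    "hotel": ["City", "Resort", "City", "City", "Resort", "Resort"],
    "bookings": [5, 2, 7, 3, 4, 6],
    "lead_time": [10, 45, 3, 120, 60, 7],
})

# 1. Categorical vs numerical: total bookings per market segment.
bookings_by_segment = pd.pivot_table(
    df, index="market_segment", values="bookings", aggfunc="sum"
)
print(bookings_by_segment)

# 2. Numerical vs numerical: Pearson correlation between two columns.
corr = df["bookings"].corr(df["lead_time"])
print(corr)

# 3. Categorical vs categorical: contingency table of co-occurrences.
segment_by_hotel = pd.crosstab(df["market_segment"], df["hotel"])
print(segment_by_hotel)

# The groupby alternative: value counts of one column within the other.
grouped_counts = df.groupby("market_segment")["hotel"].value_counts()
print(grouped_counts)
```

With matplotlib installed, bookings_by_segment.plot(kind="bar") and df.plot(kind="scatter", x="lead_time", y="bookings") would draw the first two cases.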

Important Pandas Functions:

  1. pd.read_csv or pd.read_excel: reads a CSV or Excel file from a source into a data frame.
  2. df.head(10) or df.tail(10): returns the first or last 10 rows, a quick way to inspect the data frame.
  3. df.isnull().sum(): counts the null values in each column.
  4. fillna: fills null values with a value you supply, commonly the column's mean, median, or mode.
  5. df.shape: an attribute (no parentheses) that gives the number of rows and columns.
  6. df.describe(): gives the count, mean, standard deviation, min, max, and percentiles of the numerical columns in one place; pass include="all" to also summarize categorical columns.
  7. df.info(): shows each column's data type and non-null count.
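Here is a short sketch that runs these functions in sequence. The CSV is an inline string with made-up values so the example is self-contained; in practice the argument to pd.read_csv would be a file path. Filling with the mean and the median is just one possible choice.

```python
import io

import pandas as pd

# Hypothetical inline CSV standing in for a real file; values are made up.
csv_text = """country,market_segment,bookings,lead_time
PRT,Online,5,10
GBR,Offline,,45
PRT,Online,7,
FRA,Corporate,3,120
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.head(2))   # first 2 rows
print(df.tail(2))   # last 2 rows
print(df.shape)     # (rows, columns); shape is an attribute, not a method

# Number of null values in each column.
null_counts = df.isnull().sum()
print(null_counts)

# Fill nulls on a copy: mean for one column, median for the other.
filled = df.copy()
filled["bookings"] = filled["bookings"].fillna(filled["bookings"].mean())
filled["lead_time"] = filled["lead_time"].fillna(filled["lead_time"].median())

print(filled.describe())  # summary statistics of the numerical columns
filled.info()             # data types and non-null counts
```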
