Data Science: Statistical Analysis (Part 2)

By Kmshilpamurali. Published in Analytics Vidhya, Jul 14, 2020 (8 min read).

It is much easier to visualize data once you know its type and measurement level. For that, check out my previous blog.

Let’s start the visualization process. First we are going to visualize categorical variables. There are 4 common ways:

  1. Frequency distribution tables
  2. Bar charts
  3. Pie charts
  4. Pareto diagrams

Frequency distribution tables:

A frequency distribution table has 2 columns: the category itself and the corresponding frequency. Imagine you own a car shop and sell only German cars. The table shows the categories of cars and their frequency, i.e. the number of units sold.

In the table, we can see that Audi has sold the most. That is the frequency distribution table. Now let’s visualize this data.
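As a minimal sketch of how such a table could be built, the snippet below counts categories with Python’s `collections.Counter`. The sales list is made up for illustration (the article’s actual numbers are not reproduced here), and the brand names other than Audi are assumptions.

```python
from collections import Counter

# Hypothetical sales records (made-up data, one entry per unit sold).
sales = ["Audi"] * 5 + ["BMW"] * 3 + ["Mercedes"] * 2

# Count how many times each category appears.
frequency = Counter(sales)

# Print the table, most frequent category first.
print(f"{'Category':<10}{'Frequency':>10}")
for category, count in frequency.most_common():
    print(f"{category:<10}{count:>10}")
```

The same counts can feed a bar chart or pie chart directly, since both only need a category and its frequency.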

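The image above shows these visualizations. Since you are probably familiar with bar and pie charts, I will move on to the Pareto diagram. A Pareto diagram is a bar chart with the categories ordered from most to least frequent, combined with a line showing the cumulative frequency. It is named after the Pareto principle, or 80/20 rule, which states that roughly 80% of the effects come from 20% of the causes.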
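The numbers behind a Pareto diagram can be sketched in a few lines: sort the categories by frequency and keep a running cumulative percentage. The unit counts below are made up for illustration.

```python
# Hypothetical unit counts per category (made-up numbers for illustration).
units_sold = {"BMW": 3, "Audi": 5, "Mercedes": 2}

total = sum(units_sold.values())

# Order the categories from most to least frequent (the bars of the diagram).
ordered = sorted(units_sold.items(), key=lambda item: item[1], reverse=True)

# Track the cumulative share of the total (the line of the diagram).
pareto_rows = []
running = 0
for category, count in ordered:
    running += count
    pareto_rows.append((category, count, round(100 * running / total, 1)))

for category, count, cumulative_pct in pareto_rows:
    print(f"{category:<10}{count:>4}{cumulative_pct:>8}%")
```

Reading the cumulative column shows how few categories account for most of the total, which is the point of the 80/20 rule.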

These are the main ways in which we can visually represent categorical data.

Coming to numerical data, let’s start with the frequency distribution table. When we deal with numerical variables, it makes much more sense to group the data into intervals and then find the corresponding frequencies. This way, we make a summary of the data that allows for a meaningful visual representation.

How do we choose these intervals?

Generally, statisticians prefer working with grouped data that contains 5 to 20 intervals. This way the summary can be useful. However, this varies from case to case: the correct choice is up to the individual and depends on the amount of data.

In our example, we divide the data into five intervals of equal length.

In our case, the length of each interval should be (100 - 1) / 5 = 19.8, which we round up to 20. Our intervals are therefore as follows,

As you saw in the table, each interval has a width of 20. For the frequency column, a number is included in an interval if that number

  1. Is GREATER than the lower bound
  2. Is LOWER than or EQUAL to the upper bound
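The interval rule above can be sketched as follows. The data values are made up (only the minimum of 1 and maximum of 100 come from the text), and two choices are my own assumptions: the first interval starts at the smallest value, and it also includes its lower bound so that the smallest value is counted somewhere.

```python
import math

# Hypothetical raw data: 20 made-up values with minimum 1 and maximum 100.
data = [1, 9, 15, 22, 28, 33, 41, 41, 47, 52,
        58, 60, 64, 70, 73, 81, 88, 92, 95, 100]

n_intervals = 5

# Interval width: (largest - smallest) / number of intervals, rounded up.
width = math.ceil((max(data) - min(data)) / n_intervals)  # 99 / 5 = 19.8 -> 20

# Assumption: the first interval starts at the smallest value.
start = min(data)
intervals = [(start + i * width, start + (i + 1) * width)
             for i in range(n_intervals)]

frequencies = []
for index, (lower, upper) in enumerate(intervals):
    # Rule from the text: lower < x <= upper.
    # Assumption: the first interval also includes its lower bound,
    # otherwise the smallest value would not be counted anywhere.
    count = sum(
        1 for x in data
        if (lower < x <= upper) or (index == 0 and x == lower)
    )
    frequencies.append(count)

for (lower, upper), count in zip(intervals, frequencies):
    print(f"{lower:>3} to {upper:<3} {count}")
```

Because each value satisfies the rule for exactly one interval, the frequencies always add up to the number of observations.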

For many analyses, it is useful to calculate the relative frequency, so let’s add another column for it. Relative frequency is the frequency divided by the total frequency. E.g. if the frequency is 2 and the total frequency is 20, the relative frequency is 2/20 = 0.1.
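Here is a small sketch of that extra column. The interval labels and counts are made up; they are only chosen so that the total is 20 and the first interval has a frequency of 2, as in the example above.

```python
# Hypothetical interval frequencies (made-up counts that sum to 20).
frequencies = {
    "1 to 21": 2,
    "21 to 41": 4,
    "41 to 61": 3,
    "61 to 81": 6,
    "81 to 101": 5,
}

total = sum(frequencies.values())

# Relative frequency = frequency / total frequency.
relative = {interval: count / total for interval, count in frequencies.items()}

for interval, share in relative.items():
    print(f"{interval:<10}{frequencies[interval]:>4}{share:>7.2f}")
```

The relative frequencies always add up to 1, which makes them easy to read as percentages.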

Now that we have summarized the raw data we can start plotting it.

The most common graph used to represent numerical data is the histogram. We’re going to use the frequency distribution table from our previous example.

Histogram

As you can see, it looks like a bar chart but conveys very different information. Each bar has a width equal to the interval and a height equal to the frequency. Notice that the bars are touching: this shows that there is continuity between the intervals, since each interval ends where the next one starts.

This is how we can build a histogram in order to represent numerical data.

So far we have covered graphs that represent only one variable, but how do we represent relationships between 2 variables? Let me introduce CROSS TABLES and SCATTER PLOTS. Once again, we divide variables into categorical and numerical.

Let’s start with categorical variables. The most common way to represent them is using cross tables or, as some statisticians call them, contingency tables. Let’s see an example.

You can see the rows showing the type of investment and the columns showing each investor’s allocation. Once we have created a cross table, we can visualize the data. A very useful chart here is a variation of the bar chart called the side-by-side bar chart.

With it, we can easily compare asset holdings for a specific investor or among investors.
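A cross table like this can be sketched with a nested dictionary, with investment types as rows and investors as columns. The investor names, asset types, and amounts below are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical holdings (made-up investors, asset types, and amounts).
holdings = [
    ("Investor A", "Stocks", 100),
    ("Investor B", "Stocks", 150),
    ("Investor A", "Bonds", 200),
    ("Investor B", "Bonds", 50),
    ("Investor A", "Real estate", 75),
    ("Investor B", "Real estate", 125),
]

# Rows are investment types, columns are investors.
cross_table = defaultdict(dict)
for investor, asset, amount in holdings:
    cross_table[asset][investor] = cross_table[asset].get(investor, 0) + amount

investors = sorted({investor for investor, _, _ in holdings})

# Print the table with a total column for each row.
print(f"{'':<12}" + "".join(f"{name:>12}" for name in investors) + f"{'Total':>8}")
for asset, row in cross_table.items():
    cells = "".join(f"{row.get(name, 0):>12}" for name in investors)
    print(f"{asset:<12}{cells}{sum(row.values()):>8}")
```

Each row of this table corresponds to one group of bars in a side-by-side bar chart, with one bar per investor.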

Finally, we would like to conclude with a very important graph, the scatter plot. A scatter plot is used to represent two numerical variables. For this example, we have gathered the reading and practical skills scores of 100 individual students. Let us look at the graph.

scatter plot

First, scores range between 200 and 800 points, which is why the plot is bounded within the range of 200 to 800. Second, our vertical axis shows practical skills while the horizontal axis shows reading skills. Each point gives us information about a particular student’s performance.

A scatter plot usually represents a lot of observations and gives the main idea of how the data is distributed. We can see that there is an obvious uptrend, i.e. lower reading scores have been achieved by students with lower practical skills scores, and higher reading scores by students with higher practical skills scores.

We can also see the students in the middle of the graph, with scores in the region of 450 to 550 on both reading and practical skills.

We have gone through all the basics of understanding data. Check out my previous blog for a basic understanding of descriptive statistics.

Let’s move on to the heart of descriptive statistics, i.e. MEASURES OF CENTRAL TENDENCY AND VARIABILITY.

There are three measures of central tendency: MEAN, MEDIAN, and MODE.

The mean is also known as the simple average. It is denoted by mu (μ) for a population and x bar (x̄) for a sample.

mean formula
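For reference, the standard definitions can be written as below. The symbols N (population size), n (sample size), and x_i (the i-th observation) are the usual textbook ones and are assumed here, since the original figure is not reproduced.

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
```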

The mean is the most common measure of central tendency, but it has a huge downside: it is easily affected by outliers. The mean alone is not enough to draw a definite conclusion, so we can calculate a second measure, the median.

The median is the middle number in an ordered data set. To find the median, arrange the observations in ascending order and take the middle value; if the number of observations is even, take the average of the two middle values. Let me introduce another measure, the mode. The mode is the value that occurs most often, and it can be used for both numerical and categorical data.
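To see how differently the three measures react to an outlier, here is a short sketch using Python’s built-in `statistics` module. The values are made up for illustration.

```python
import statistics

# Hypothetical values (made-up); the second list adds one large outlier.
values = [1, 2, 3, 3, 5, 6, 8]
values_with_outlier = values + [100]

mean_before = statistics.mean(values)
mean_after = statistics.mean(values_with_outlier)

median_before = statistics.median(values)
median_after = statistics.median(values_with_outlier)

mode_before = statistics.mode(values)
mode_after = statistics.mode(values_with_outlier)

print("mean:  ", mean_before, "->", mean_after)      # jumps from 4 to 16
print("median:", median_before, "->", median_after)  # moves from 3 to 4
print("mode:  ", mode_before, "->", mode_after)      # stays at 3
```

A single extreme value quadruples the mean here, while the median barely moves and the mode does not change at all.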

Which measure is best?

The measures of central tendency should be used together rather than independently.

Now that we have discussed the measures of central tendency, let’s move on to the MEASURES OF ASYMMETRY.

The most commonly used tool to measure asymmetry is SKEWNESS.

Formula for Skewness

Almost always you will use software that performs the calculation, so we are not going to go through the computation but rather focus on the meaning of skewness. Skewness indicates whether the data is concentrated on one side. Skewness is classified into POSITIVE SKEWNESS and NEGATIVE SKEWNESS. Let’s see an example.

positive skew

From the graph, you can see that the data points are concentrated on the left side. Also notice which side the tail is leaning to: the right. If the mean is greater than the median, we have right or positive skewness, i.e. the outliers are to the right.

Negative skewness

From the graph, you can see that the data points are concentrated on the right side, and the tail is leaning to the left. Here the mean is lower than the median. So left or negative skewness means that the outliers are to the left.

Now, if the distribution is symmetric with a single peak, there is no skew, and the mean, median, and mode are all equal. Let’s see the example.

No skew
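The three cases can be checked numerically with a small sketch. Several definitions of skewness exist; the one below is the population moment coefficient (third central moment divided by the second central moment to the power 1.5), which is my own choice and not necessarily the formula shown in the original figure. All data values are made up.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2 ** 1.5 (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical data sets (made-up values).
right_skewed = [1, 2, 2, 3, 3, 3, 4, 10]       # long tail to the right
left_skewed = [11 - x for x in right_skewed]   # mirror image, tail to the left
symmetric = [1, 2, 3, 4, 5]

print("right skewed:", skewness(right_skewed))  # positive
print("left skewed: ", skewness(left_skewed))   # negative
print("symmetric:   ", skewness(symmetric))     # zero
```

The sign of the result matches the direction of the tail, which is usually all you need to read from it.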

Why is skewness important?

Skewness tells us where the data is situated. Measures of asymmetry like skewness are the link between central tendency measures and probability theory, which helps us understand the data more completely.

Let’s discuss the MEASURES OF RELATIONSHIP BETWEEN VARIABLES: COVARIANCE AND CORRELATION COEFFICIENT.

COVARIANCE: It is a measure of the joint variability of 2 random variables.

The scatter plot above clearly shows a relationship between the variables, i.e. the two variables are correlated. The main statistic to measure this relationship is called covariance.

Covariance may be

  1. >0
  2. =0
  3. <0

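To calculate the covariance between 2 variables, we use the sample formula, since our data is a sample: s_xy = Σ (x_i - x̄)(y_i - ȳ) / (n - 1).

Covariance gives a sense of direction, i.e. a positive covariance means the two variables move together, a negative covariance means they move in opposite directions, and a covariance of zero means there is no linear relationship between them.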
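A minimal sketch of the sample covariance, written out by hand so that each step is visible. The scores are made up and only loosely follow the reading and practical skills example.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum((x - x_bar) * (y - y_bar)) / (n - 1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical scores for five students (made-up values).
reading = [300, 400, 500, 600, 700]
practical = [350, 420, 480, 610, 690]

cov_up = sample_covariance(reading, practical)          # variables move together
cov_down = sample_covariance(reading, practical[::-1])  # variables move oppositely

print("covariance, same direction:    ", cov_up)
print("covariance, opposite direction:", cov_down)
```

Only the sign is easy to interpret here. The size of the number depends on the units of the data, which is the problem the correlation coefficient addresses.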

CORRELATION COEFFICIENT:

Correlation adjusts covariance so that the relationship between the two variables becomes easy to interpret. The correlation coefficient is the covariance divided by the product of the two standard deviations, so it always lies between -1 and 1.

If the correlation coefficient is 1, there is a perfect positive correlation, i.e. the entire variability of one variable is explained by the other variable.

If the correlation coefficient between two variables is 0, there is no linear relationship between them. Independent variables have a correlation of 0, although a correlation of 0 does not by itself prove independence.

Finally, a negative correlation can be a perfect negative correlation of -1 or an imperfect negative correlation between -1 and 0.

Note: The correlation between x and y is the same as between y and x, so correlation says nothing about causality or its direction. It is very important for any analyst or researcher to understand the direction of causal relationships.
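The cases above can be sketched with a hand-written sample correlation coefficient (covariance divided by the product of the sample standard deviations). All data is made up; the last example is my own addition to show a correlation of zero between two variables that are clearly related, just not linearly.

```python
import math


def correlation(xs, ys):
    """Sample correlation: covariance / (std of x * std of y)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return cov / (s_x * s_y)


xs = [1, 2, 3, 4, 5]
perfect_positive = correlation(xs, [2, 4, 6, 8, 10])   # y = 2x
perfect_negative = correlation(xs, [10, 8, 6, 4, 2])   # y = 12 - 2x

# y = x squared: fully determined by x, yet no linear relationship.
centered = [-2, -1, 0, 1, 2]
squares = [x ** 2 for x in centered]
no_linear = correlation(centered, squares)

print("perfect positive:", round(perfect_positive, 6))
print("perfect negative:", round(perfect_negative, 6))
print("no linear relationship:", no_linear)

# Swapping the variables gives the same value.
print("symmetric:", math.isclose(correlation(xs, squares), correlation(squares, xs)))
```

Because swapping the arguments never changes the result, the coefficient cannot tell you which variable drives the other.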

Thank you for reading. I hope this article gave you a brief introduction to some data science concepts in descriptive statistical analysis.
