Intro to Statistics: Analysis of the Chinook dataset using statistics
Statistics lets us hear the story the data tells and use that information to form a more realistic picture when interpreting and analyzing the current situation, which in turn helps when assessing future decisions. When used correctly, statistics reveals trends in what happened in the past and can be useful in predicting what may happen in the future.
This project is from Gitgirl School of Data. In this session, we learned about statistics, which is simply the use of numerical values to describe real-life events. I will be using statistics to identify trends and features within the Chinook dataset, for example the total spending revenue of each country.
Measures of Central Tendency
Mean
The mean, also known as the average, is equal to the sum of all the values in the data set divided by the number of values in the data set.
Median
The median is the middle number in a set of data that has been arranged from least to greatest. When there is an even number of values, the median is the average of the two middle numbers.
Mode
The mode is the value that occurs most often in a data set.
Range
The range is the highest value minus the lowest value. Strictly speaking, it is a measure of spread rather than of central tendency.
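As a quick check of the four definitions above, here is a minimal Python sketch using the standard `statistics` module. The list `values` is made up for illustration and is not taken from the Chinook data.

```python
import statistics

# Made-up illustrative values (in $k); NOT taken from the Chinook dataset.
values = [30, 40, 40, 50, 60, 200]

mean_value = statistics.mean(values)      # sum of the values / number of values
median_value = statistics.median(values)  # even count: average of the two middle values
mode_value = statistics.mode(values)      # the value that occurs most often
range_value = max(values) - min(values)   # highest value minus lowest value

print("mean:", mean_value)
print("median:", median_value)
print("mode:", mode_value)
print("range:", range_value)
```

Because the list has six values, the median here is the average of the third and fourth values after sorting.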
To get the statistics for the Chinook dataset, I used the Analysis ToolPak add-in in Excel to generate descriptive statistics.
From the result above, the mean is $97.025k. Inspecting the raw data suggests that this mean might not accurately reflect the typical spending revenue of the countries, as most countries have spending revenue in the $37.62k to $100k range. The mean is being skewed by the six large revenue values.
The median, $44.12k, is preferred over the mean or mode when the data is skewed because it retains its position and is not strongly influenced by the extreme values. The mean loses its ability to provide the best central location for the data because the extreme values drag it away from the typical value.
The mode is $37.62k because it is the value that occurs most often in the data.
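To see why a few large values pull the mean much further than the median, here is a small sketch. The numbers are hypothetical and chosen only to mimic the shape of the situation described above; they are not the Chinook figures.

```python
import statistics

# Hypothetical revenues in $k; NOT the actual Chinook figures.
typical = [38, 38, 40, 42, 45, 50]
with_outliers = typical + [300, 500]  # add two unusually large values

mean_before = statistics.mean(typical)
median_before = statistics.median(typical)
mean_after = statistics.mean(with_outliers)
median_after = statistics.median(with_outliers)

# How far each measure moved once the large values were added.
mean_shift = mean_after - mean_before
median_shift = median_after - median_before

print("mean shift:", mean_shift)
print("median shift:", median_shift)
```

In this made-up example the median moves by only a couple of units while the mean roughly triples, which is the same pattern that makes the median the better summary for skewed revenue data.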
Probability
Probability means the likelihood of the occurrence of an event.
Examples of events include:
- Tossing a coin and getting heads
- Drawing a red pen from a pack of different colored pens
- Drawing a particular card from a deck of 52 cards
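When every outcome is equally likely, the probability of an event is the number of favorable outcomes divided by the total number of outcomes. The sketch below applies that to the three examples. The helper `probability` is a hypothetical name, and the pen counts (3 red pens in a pack of 10) are an assumption made purely for illustration.

```python
from fractions import Fraction

def probability(favorable, total):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(favorable, total)

p_head = probability(1, 2)      # fair coin: 1 head out of 2 sides
p_red_pen = probability(3, 10)  # assumed pack: 3 red pens out of 10 pens
p_ace = probability(4, 52)      # standard deck: 4 aces out of 52 cards

print("P(head):", p_head)
print("P(red pen):", p_red_pen)
print("P(ace):", p_ace)
```

Using `Fraction` keeps the results exact, so the ace probability is reduced to its simplest form automatically.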
Correlation and Regression
Correlation
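The correlation coefficient (a value between -1 and +1) tells you how strongly two variables (X and Y) are related to each other. A correlation coefficient of +1 indicates a perfect positive correlation: as variable X increases, variable Y increases, and as variable X decreases, variable Y decreases. A coefficient of -1 indicates a perfect negative correlation, where Y decreases as X increases, and a coefficient near 0 indicates little or no linear relationship.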
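A minimal sketch of how a correlation coefficient can be computed, assuming the Pearson coefficient is the one meant here. The function name `pearson_r` and the sample lists are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]                          # made-up values
r_positive = pearson_r(x, [2, 4, 6, 8, 10])  # Y rises as X rises
r_negative = pearson_r(x, [10, 8, 6, 4, 2])  # Y falls as X rises

print("positive case:", r_positive)
print("negative case:", r_negative)
```

The two cases are perfectly linear, so they land at the two ends of the -1 to +1 scale; real data usually falls somewhere in between.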
Regression
Regression analysis is a set of statistical processes for estimating the relationships among variables. It helps one understand how the typical value of the dependent variable changes when any one of the independent variables is varied while the other independent variables are held fixed.
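In the simplest case there is a single independent variable, and regression amounts to fitting a straight line through the data. Below is a minimal least-squares sketch of that case. The function name `fit_line` and the data are hypothetical and chosen so the points lie exactly on a line.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how much y changes, on average, per unit change in x.
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    # Intercept: the fitted line passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept

x = [1, 2, 3, 4, 5]   # made-up independent variable
y = [3, 5, 7, 9, 11]  # made-up dependent variable, built as y = 2x + 1

slope, intercept = fit_line(x, y)
predicted = slope * 6 + intercept  # estimate of y for a new x value of 6

print("slope:", slope)
print("intercept:", intercept)
print("prediction at x = 6:", predicted)
```

The slope is the quantity the paragraph above describes: the change in the typical value of the dependent variable for each unit change in the independent variable.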