Probability and Statistics for Data Science

Probability is one of the most important, and at times complex, fields that plays a central role in Data Science. Understanding the mathematics is just as critical as understanding the programming.

Probability is the branch of mathematics that deals with the occurrence of random events. The probability of an event is a number between 0 and 1, where 0 indicates that the event is impossible and 1 indicates that it is certain. Probability was introduced in mathematics to predict how likely events are to happen.

1.0 What is Probability?

Probability is a measure of how likely an event is to occur. Probability theory provides powerful tools for modelling and reasoning about uncertainty.

Many events cannot be predicted with total certainty, so we can only predict the chance that an event occurs. As noted earlier, the probability of an event ranges from 0 to 1, and the probabilities of all outcomes in a sample space add up to 1.

Let’s examine an example. When a single coin is tossed, there are only two possible outcomes: heads (H) or tails (T). If two coins are tossed, there are four equally likely outcomes, (H,H), (H,T), (T,H) and (T,T), which give three possible combinations: both heads, one head and one tail, or both tails.

Probability Formula
The probability formula gives the ratio of the number of favourable outcomes to the total number of possible outcomes:
P(E) = n(E) / n(S)
where n(E) is the number of favourable outcomes and n(S) is the total number of outcomes in the sample space.
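As a sketch, the ratio above can be computed exactly with Python’s `fractions` module; the function name and the die example are illustrative choices, not part of the formula itself:

```python
from fractions import Fraction

def probability(favourable: int, total: int) -> Fraction:
    """P(E) = n(E) / n(S), kept exact as a fraction."""
    return Fraction(favourable, total)

# Rolling an even number on a fair six-sided die: n(E) = 3, n(S) = 6
p_even = probability(3, 6)
print(p_even)         # 1/2
print(float(p_even))  # 0.5
```

Using `Fraction` avoids floating-point rounding and keeps the result in the same p/n form as the formula.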
There are two major types of probability:
1. Theoretical Probability
2. Experimental Probability

Theoretical Probability
Theoretical probability is based on reasoning about the possible outcomes, without running any experiment. A simple example is a fair coin toss: the theoretical probability of getting heads is ½.

Experimental Probability
This is based on the observations of an experiment. It is calculated as the number of times an event occurs divided by the total number of trials. For example, if a coin is tossed 20 times and heads is recorded 14 times, the experimental probability of heads is 14/20, or 7/10.
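The coin experiment above can be sketched with Python’s standard-library `random` module; the function name and the fixed seed are illustrative assumptions, used only to make the run repeatable:

```python
import random

def experimental_probability(trials: int, seed: int = 42) -> float:
    """Toss a fair coin `trials` times and return the observed
    frequency of heads, i.e. the experimental probability."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    heads = sum(rng.choice(["H", "T"]) == "H" for _ in range(trials))
    return heads / trials

# With more trials the estimate approaches the theoretical value 0.5
print(experimental_probability(20))
print(experimental_probability(100_000))
```

A small number of trials (like 20) can easily give 14/20; as the number of trials grows, the experimental probability converges toward the theoretical one.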

Probability of an Event
Assume an event E can occur in p ways out of n possible, equally likely ways.
The probability of the event happening is expressed as P(E) = p/n

where
p = number of simple events within E
n = total number of possible outcomes
The probability of the event not happening is expressed as P(E') = (n - p)/n = 1 - p/n
P(E') denotes the event that E does not happen.
Therefore, we can say;
P(E) + P(E’) = 1
This means that the total of all probabilities in any random test or experiment is equal to 1.
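A quick check of P(E) + P(E') = 1, using a fair die as an illustrative event (the variable names are ours):

```python
from fractions import Fraction

# Rolling a fair die: let E = "roll a six"
n = 6                  # total equally likely outcomes
p = 1                  # outcomes in E
P_E = Fraction(p, n)   # P(E)  = p/n
P_not_E = 1 - P_E      # P(E') = 1 - p/n = (n - p)/n

print(P_E, P_not_E)    # 1/6 5/6
print(P_E + P_not_E)   # 1
```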

Gaussian Distribution

Gaussian Distribution (also known as Normal Distribution) is the most important probability distribution in statistics because it fits many natural phenomena. It is often used to model variables with unknown distributions in the natural sciences. Some machine learning models, such as linear regression, assume normally distributed errors, and many others benefit from variables that have “Gaussian-like” distributions.

2.0 What is Gaussian Distribution?
Gaussian Distribution is a probability function that describes how the values of a variable are distributed. It is a symmetric distribution where most of the observations cluster around the central peak and the probabilities for values further away from the mean taper off equally in both directions. Extreme values in both tails of the distribution are similarly unlikely.

Parameters of the Gaussian Distribution
The parameters of the Gaussian distribution are the mean and the standard deviation. The Gaussian distribution does not have just one fixed shape; instead, the shape changes based on the parameter values.

Mean
The mean is the central tendency of the distribution. It defines the location of the peak for Gaussian distributions.

Standard deviation
The standard deviation is a measure of variability. It defines the width of the Gaussian distribution. It determines how far away from the mean the values tend to fall. It represents the typical distance between observations and the average. Larger standard deviations produce distributions that are more spread out.
If X is a Gaussian random variable with mean µ and standard deviation σ, then the standardized variable is U = (X - µ)/σ. µ is the Greek symbol for the mean and σ is the symbol for the standard deviation.
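A minimal sketch of the standardization U = (X - µ)/σ and the Gaussian density, using only Python’s `math` module; the function name and the example parameter values are assumptions for illustration:

```python
import math

def gaussian_pdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Density of the Gaussian distribution with mean mu
    and standard deviation sigma."""
    u = (x - mu) / sigma  # standardized value U = (X - mu) / sigma
    return math.exp(-0.5 * u * u) / (sigma * math.sqrt(2 * math.pi))

# The peak sits at the mean; a larger sigma widens and flattens the curve
print(gaussian_pdf(0.0))                        # standard normal peak, ~0.3989
print(gaussian_pdf(175.0, mu=175.0, sigma=7.0)) # lower peak: larger sigma
```

Note how the density depends on x only through the standardized value u, which is why every Gaussian can be reduced to the standard normal with µ = 0 and σ = 1.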

Statistics

Statistics is generally considered a prerequisite to the field of applied machine learning. Statistics is a collection of tools that you can use to get answers to important questions about data.

3.0 What is Statistics?
Statistics is a subfield of mathematics. It refers to a collection of methods for working with data and using data to answer questions.
When it comes to the statistical tools we use in practice, it is helpful to divide the field of statistics into two large groups of methods: descriptive statistics for summarizing data, and inferential statistics for drawing conclusions from samples of data.

Descriptive Statistics
Descriptive statistics is the term given to the analysis of data that describes, shows or summarizes data in a way we can understand. It is important because raw data alone is hard to interpret, especially when the data set is large. For example, given the scores of 500 students admitted to a particular department, we may be interested in their overall performance, and also in the distribution or spread of the scores. Descriptive statistics lets us do this: summary properties such as the mean and standard deviation are called parameters when they describe the population (the total set of instances we are interested in). Descriptive statistics also covers graphical methods for visualizing samples of data; charts and graphs provide a useful qualitative understanding of both the shape or distribution of the observations and how variables may relate to each other.
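As an illustration, Python’s standard-library `statistics` module computes these summaries directly; the scores below are a small hypothetical sample, not data from the source:

```python
import statistics

# Hypothetical exam scores for a small sample of admitted students
scores = [62, 74, 55, 81, 90, 68, 77, 59, 85, 70]

print("mean:  ", statistics.mean(scores))    # central tendency
print("median:", statistics.median(scores))  # middle value
print("stdev: ", statistics.stdev(scores))   # sample standard deviation (spread)
```

The same calls scale from ten scores to the full 500; for the whole population rather than a sample, `statistics.pstdev` would be the appropriate spread measure.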

Inferential Statistics
Inferential Statistics is the process of using data analysis to deduce properties of an underlying probability distribution. It infers properties of a population, for example by testing hypotheses and deriving estimates, under the assumption that the observed data set is sampled from a larger population. In machine learning the terminology shifts: deducing the properties of a model is referred to as training or learning rather than inference, while using a trained model to make a prediction is referred to as inference rather than prediction.
