Statistics Note

Shirsh Verma
Mar 19 · 15 min read

As Josh Wills once said, “Data Scientist is a person who is better at statistics than any programmer and better at programming than any statistician.” .

Hi Guys, my name is Shirsh and I am a aspiring Data Scientist ,currently I am building my skill sets to excel in this domain , in this blog I will be summarizing all important Statistics Concept which I have learnt till date.

Doing this will help me build my concept better and would also help guys who are new to this field.

So let’s get started

What do I mean when I say what is Probability?

1.There is 50% chance of rains tomorrow evening

2. There in .33 probability of you failing

Probability in simple words means : Unable to predict the outcomes, but in the long-run, the outcomes exhibit statistical regularity.

Example:

Tossing a fair coin — outcomes S ={Head, Tail} Unable to predict on each toss whether is Head or Tail. In the long run can predict that 50% of the time heads will occur and 50% of the time tails will occur.

The sample Space, S
The sample space, S, for a random phenomena is the set of all possible outcomes.

Event , E
The event, E, is any subset of the sample space, S. i.e. any set of
outcomes (not necessarily all outcomes) of the random phenomena

Definition: probability of an Event E.

Suppose that the sample space S = {o1, o2, o3, … oN} has a finite number, N, of outcomes. Also each of the outcomes is equally likely (because of symmetry). Then for any event E

If two events A and B are are mutually exclusive then:

Conditional Probability

Definition

Suppose that we are interested in computing the probability of event A and we have been told event B has occurred. Then the conditional probability of A given B is defined to be:

Bayes’ Theorem

Bayes’ Theorem is used to revise previously calculated probabilities based on new information. Developed by Thomas Bayes in the 18th Century. It is an extension of conditional probability.

Random Variable

A numerical outcome of any random phenomenon is called an Random Variable.

A random variable x takes on a defined set of values with different probabilities.

Random variables can be discrete or continuous

. Discrete random variables have a countable number of outcomes

. Continuous random variables have an infinite continuum of possible values.

Probability functions

A probability function maps the possible values of x against their respective probabilities of occurrence, p(x) p(x) is a number from 0 to 1.0. The area under a probability function is always 1.

Types of Probability function:

There are mainly three types of functions

Probability Mass Function (PMF)

The Probability Mass Function (PMF) also called a probability function or frequency function which characterizes the distribution of a discrete random variable. Let X be a discrete random variable of a function, then the probability mass function of a random variable X is given by

Px (x) = P( X=x ), For all x belongs to the range of X

It is noted that the probability function should fall on the condition :

Px (x) ≥ 0 and
∑xϵRange(x) Px (x) = 1

Here the Range(X) is a countable set and it can be written as { x1, x2, x3, ….}. This means that the random variable X takes the value x1, x2, x3, ….

Cumulative Distribution Function (CDF)

The Cumulative Distribution Function (CDF), of a real-valued random variable X, evaluated at x, is the probability function that X will take a value less than or equal to x. It is used to describe the probability distribution of random variables in a table. And with the help of these data, we can create a CDF plot in excel sheet easily.

In other words, CDF finds the cumulative probability for the given value. To determine the probability of a random variable, it is used and also to compare the probability between values under certain conditions. For discrete distribution functions, CDF gives the probability values till what we specify and for continuous distribution functions, it gives the area under the probability density function up to the given value specified. In this article, we are going to discuss the formulas, properties and examples of the cumulative distribution function.

The CDF defined for a discrete random variable is given as

Fx(x) = P(X≤x)

Where X is the probability that takes a value less than or equal to x and that lies in the semi-closed interval (a,b], where a < b.

Therefore the probability within the interval is written as

P(a < X ≤ b)=Fx(b)-Fx(a)

Probability Density Function (PDF)

Probability Density Function (PDF) is used to define the probability of the random variable coming within a distinct range of values, as objected to taking on anyone value. The probability density function is explained here in this article to clear the concepts of the students in terms of its definition, properties, formulas with the help of example questions. The function explains the probability density function of normal distribution and how mean and deviation exists. The standard normal distribution is used to create a database or statistics, which are often used in science to represent the real-valued variables, whose distribution are not known.

NOTE :PDF is defined for continuous random variables whereas PMF is defined for discrete random variables.

# for inline plots in jupyter
%matplotlib inline
# import matplotlib
import matplotlib.pyplot as plt
# for latex equations
from IPython.display import Math, Latex
# for displaying images
from IPython.core.display import Image
# import seaborn
import seaborn as sns
# settings for seaborn plotting style
sns.set(color_codes=True)
# settings for seaborn plot sizes
sns.set(rc={‘figure.figsize’:(5,5)})

1. Uniform Distribution

Perhaps one of the simplest and useful distribution is the uniform distribution. The probability distribution function of the continuous uniform distribution is:

Since any interval of numbers of equal width has an equal probability of being observed, the curve describing the distribution is a rectangle, with constant height across the interval and 0 height elsewhere. Since the area under the curve must be equal to 1, the length of the interval determines the height of the curve. The following figure shows a uniform distribution in interval (a,b). Notice since the area needs to be 1. The height is set to 1/(b−a).

# import uniform distribution
from scipy.stats import uniform
# random numbers from uniform distribution
n = 10000
start = 10
width = 20
data_uniform = uniform.rvs(size=n, loc = start, scale=width)
ax = sns.distplot(data_uniform,
bins=100,
kde=True,
color=’green’,
hist_kws={“linewidth”: 15,’alpha’:1})
ax.set(xlabel=’Uniform Distribution ‘, ylabel=’Frequency’)

2. Normal Distribution

Normal Distribution, also known as Gaussian distribution, is ubiquitous in Data Science. We will encounter it at many places especially in topics of statistical inference. It is one of the assumptions of many data science algorithms too.

# alternative code
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import math

mu = 0 # mean
variance = 100
sigma = math.sqrt(variance) # sigma is standard deviation
x = np.linspace(mu — 4*sigma, mu + 4*sigma, 100)
x1 = np.linspace(mu — 4*5, mu + 4*5, 100)
x2 = np.linspace(mu — 4*1, mu + 4*1, 100)
plt.plot(x, stats.norm.pdf(x, 0, sigma))
plt.plot(x1, stats.norm.pdf(x1, 5, 5))
plt.plot(x2, stats.norm.pdf(x2, -1, 1))

plt.show()

3. Poisson Distribution

Poisson random variable is typically used to model the number of times an event happened in a time interval. For example, the number of users visited on a website in an interval can be thought of a Poisson process. Poisson distribution is described in terms of the rate (μ) at which the events happen. An event can occur 0, 1, 2, … times in an interval. The average number of events in an interval is designated λ (lambda). Lambda is the event rate, also called the rate parameter. The probability of observing k events in an interval is given by the equation :

Poisson Distribution graph

From the above we can say that this Poisson distribution is forming a Bell curved shape and hence can also be called as a Discrete Normal Distribution

Probability of Multiple Random Variables

In machine learning, we are likely to work with many random variables.

For example, given a table of data, such as in excel, each row represents a separate observation or event, and each column represents a separate random variable.

Variables may be either discrete, meaning that they take on a finite set of values, or continuous, meaning they take on a real or numerical value.

As such, we are interested in the probability across two or more random variables.

This is complicated as there are many ways that random variables can interact, which, in turn, impacts their probabilities.

This can be simplified by reducing the discussion to just two random variables (X, Y), although the principles generalize to multiple variables.

And further, to discuss the probability of just two events, one for each variable (X=A, Y=B), although we could just as easily be discussing groups of events for each variable.

Therefore, we will introduce the probability of multiple random variables as the probability of event A and event B, which in shorthand is X=A and Y=B.

We assume that the two variables are related or dependent in some way.

As such, there are three main types of probability we might want to consider; they are:

Joint Probability

The probability of two (or more) events is called the joint probability. The joint probability of two or more random variables is referred to as the joint probability distribution.

For example, the joint probability of event A and event B is written formally as:

Marginal Probability

We may be interested in the probability of an event for one random variable, irrespective of the outcome of another random variable.

The probability of one event in the presence of all (or a subset of) outcomes of the other random variable is called the marginal probability or the marginal distribution. The marginal probability of one random variable in the presence of additional random variables is referred to as the marginal probability distribution.

Its formula is given by P(X=A) = summation P(X=A, Y=y_i) for all y and a fixed X=A

Descriptive Statistics

Descriptive Statistics are Used by Researchers to Report on Populations and Samples

What Are Descriptive Statistics?

Descriptive statistics are brief descriptive coefficients that summarize a given data set, which can be either a representation of the entire or a sample of a population. Descriptive statistics are broken down into measures of central tendency and measures of variability (spread). Measures of central tendency include the mean, median, and mode, while measures of variability include standard deviation, variance, minimum and maximum variables, kurtosis, and skewness.

Measures of Central Tendency

Consider the following data points.

17, 16, 21, 18, 15, 17, 21, 19, 11, 23

Since the number of observations is even (10), median is given by the average of the two middle observations (5th and 6th here).

Using Python we can achieve all of this in a seconds

import numpy as np
from scipy import stats

dataset= [17, 16, 21, 18, 15, 17, 21, 19, 11, 23]

#mean value
mean= np.mean(dataset)

#median value
median = np.median(dataset)

#mode value
mode= stats.mode(dataset)

print(“Mean: “, mean)
print(“Median: “, median)
print(“Mode: “, mode)

Measures of Dispersion (or Variability)

Measures of Dispersion describes the spread of the data around the central value (or the Measures of Central Tendency)

The most commonly used method of calculating Skewness is

If the skewness is zero, the distribution is symmetrical. If it is negative, the distribution is Negatively Skewed and if it is positive, it is Positively Skewed.

Statistical inference

The fundamental idea is to use data from a sample to infer information about the population. What I mean is, it’s impractical to actually evaluate whole of the population, so instead take a small sample of the population that effectively can represent the whole population.

For example, in a drug testing scenario, you will select a group of people or subjects such that almost equal representation exists, one on whom the new drug is being tested and one whom the old drug is being tested in order to compare.

To sum up

The process of making guess/inference about the truth from a sample is called Statistical Inference

Central Limit Theorem

If this procedure is performed many times, the central limit theorem says that the distribution of the average will be closely approximated by a normal distribution. A simple example of this is that if one flips a coin many times the probability of getting a given number of heads in a series of flips will approach a normal curve, with mean equal to half the total number of flips in each series. (In the limit of an infinite number of flips, it will equal a normal curve.)

When the coins are flipped for a very large n number of times distribution of number of heads associated with each throw has started to show a Bell Curve which approximately looks like and normal distribution.

Interval Estimation

Basic Intuition

Margin of Error and Interval Estimate

Point Estimate +/- Margin of Error

Z-distribution

In statistics, the Z-distribution is used to help find probabilities and percentiles for regular normal distributions (X). It serves as the standard by which all other normal distributions are measured. The Z-distribution is a normal distribution with mean zero and standard deviation 1

Interval Estimate of a Population Mean

After we found a point estimate of the population mean, we would need a way to quantify its accuracy. Here, we discuss the case where the population variance σ2 / std deviation σ is assumed known

Meaning of Confidence

Thanks! for reading.

AlmaBetter

AlmaBetter provides risk-free education with rewarding employment opportunities.

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store