Statistics 101 for Data Science

Mayur Bhangale
Published in SomX Labs
3 min read · Jul 8, 2016

“Data beats emotions.” — Sean Rad, Founder, Ad.ly

I remember my engineering maths professor telling me a couple of years back about the significance of statistics in computer science. It didn't make sense to me then, but it turns out to be very important now.

Data analysis is all about using the right tools at the right time. It's like painting a picture: there are many tools an artist can draw on, but how they are used makes the difference between ‘just another painting’ and a masterpiece. Statistics is what powers those tools and, once mastered, can make a data scientist's life interesting.

Below are a few basic concepts for beginners in data science.

Mean

Mean, as you probably know, is another name for the average. All you have to do is sum up all the values and divide by the number of values you have.

Mean = Sum / Number of samples

Example: If we collect data about the amount of water consumed by every person in a certain region, we can use the mean to get the average amount of water needed per person in that region.
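
As a quick sketch of the formula above, the snippet below computes the mean by hand and checks it against NumPy. The water consumption figures are made-up values for illustration, not real data.

```python
import numpy as np

# Hypothetical daily water consumption (litres) for five people; made-up values.
water_litres = [3.0, 2.5, 4.0, 3.5, 2.0]

# Mean = Sum / Number of samples
manual_mean = sum(water_litres) / len(water_litres)

# NumPy's built-in mean should agree with the manual calculation.
numpy_mean = float(np.mean(water_litres))

print(manual_mean)  # 3.0
print(numpy_mean)   # 3.0
```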

Median

The median is a little different. You calculate the median of a data set by sorting all the values and taking the one that ends up in the middle.

  • Sort the values.
  • Take the value at the midpoint.

If there is an even number of samples, take the mean of the two in the middle.
The median is less susceptible to outliers than the mean.
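
The steps above can be sketched in plain Python. The helper name `median` and the two small samples are only illustrative; in practice you would normally reach for a library function instead.

```python
def median(values):
    """Return the median of a non-empty sequence of numbers."""
    ordered = sorted(values)   # step 1: sort the values
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]    # odd count: the single middle value
    # even count: mean of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_sample = [7, 1, 5]       # sorted: 1, 5, 7
even_sample = [7, 1, 5, 3]   # sorted: 1, 3, 5, 7

print(median(odd_sample))    # 5
print(median(even_sample))   # 4.0
```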

Example: Average household income in India is Rs. 20,000, but the median is only Rs. 8,000 because the mean is skewed by a handful of billionaires.
Median better represents the ‘typical’ Indian in this example.
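
To see this effect on a small scale, here is a sketch with made-up household incomes (in thousands of rupees). The numbers are invented purely to show how one extreme value drags the mean while leaving the median where it was.

```python
import numpy as np

# Made-up monthly incomes (thousands of rupees) for nine households.
typical_incomes = np.array([6, 7, 7, 8, 8, 8, 9, 9, 10], dtype=float)

# The same sample with one extremely wealthy household added.
with_outlier = np.append(typical_incomes, 10000.0)

mean_before = float(np.mean(typical_incomes))
median_before = float(np.median(typical_incomes))
mean_after = float(np.mean(with_outlier))
median_after = float(np.median(with_outlier))

print(mean_before, median_before)  # 8.0 8.0
print(mean_after, median_after)    # 1007.2 8.0
```

The mean jumps from 8 to over 1,000, while the median does not move at all.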

In Python you can easily find the mean and median using NumPy. Below is random income data centered around 100 with a standard deviation of 20, over 10,000 data points.

import numpy as np

# 10,000 incomes drawn from a normal distribution (mean 100, standard deviation 20)
incomes = np.random.normal(100.0, 20.0, 10000)
print(np.mean(incomes))
print(np.median(incomes))

Output (exact numbers vary from run to run, since the data is random)

99.7886547312
99.624502502

Mode

Mode is the most common value in a data set. It is rarely meaningful for continuous numerical data, where exact values seldom repeat.

Example:
Amount of water consumed by people in XYZ area:

4, 5, 7, 4, 8, 4, 2, 5, 4, 4
The value 4 occurs 5 times, more than any other value, hence the mode is 4.
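
The count for this example can be checked with the standard library alone. This is just one possible sketch, using `collections.Counter` to tally how often each value occurs.

```python
from collections import Counter

# Water consumption values from the example above.
water = [4, 5, 7, 4, 8, 4, 2, 5, 4, 4]

# Tally occurrences of each value, then take the most frequent one.
counts = Counter(water)
mode_value, mode_count = counts.most_common(1)[0]

print(mode_value, mode_count)  # 4 5
```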

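Let's find the mode in Python. We use scipy.stats.mode() here. For a 2-D array it works column by column by default (axis=0), and when several values tie it returns the smallest one. The exact shape of the printed arrays can differ between SciPy versions.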

import numpy as np
from scipy import stats

a = np.array([[1, 3, 4, 2, 2, 7],
              [5, 2, 2, 1, 4, 1],
              [3, 3, 2, 2, 1, 1]])

# Mode of each column (axis=0 is the default)
m = stats.mode(a)
print(m)

Output

ModeResult(mode=array([[1, 3, 2, 2, 1, 1]]), count=array([[1, 2, 2, 2, 1, 2]]))

Variance

Variance (σ²) shows how spread out the data is. It is simply the average of the squared differences from the mean.
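
The definition can be followed step by step on a tiny sample. The five values below are made up for illustration; the result, roughly 5.04, should match what NumPy reports, since `np.var` divides by the number of samples by default.

```python
import numpy as np

# Made-up sample, chosen only to keep the arithmetic easy to follow.
data = [1, 4, 5, 4, 8]

# Step 1: the mean of the sample (4.4 here).
mean = sum(data) / len(data)

# Step 2: squared difference of each value from the mean.
squared_diffs = [(x - mean) ** 2 for x in data]

# Step 3: average of the squared differences.
manual_variance = sum(squared_diffs) / len(data)

print(manual_variance)
print(np.var(data))
```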

Example:
The figure on the left shows the arrival frequency of airplanes at an airport. Let's say around 4 arrivals per minute is what we saw on about 12 of the days we looked at. But then we have outliers: one really slow day with only one arrival per minute, and one really fast day with almost 12 arrivals per minute. So we know from this data that around 4 arrivals per minute is very likely, while 1 or 12 is very unlikely.

Continuing the income example, the variance can be computed with NumPy as:

incomes.var()

Standard Deviation

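Standard deviation (σ) is the square root of the variance. It is usually used as a way to identify extremities or outliers. Data points that lie more than one standard deviation from the mean can be considered somewhat unusual, and the further out, the more unusual: for roughly normal data about 32% of points lie beyond one standard deviation, but only about 5% lie beyond two. You can talk about how extreme a data point is by talking about ‘how many sigmas away’ from the mean it is.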

NumPy is also useful for computing the standard deviation:

incomes.std()
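
As a rough sketch of the ‘how many sigmas away’ idea, the snippet below regenerates the income data and flags points far from the mean. The fixed seed (42) and the cut-off of 2 sigmas are assumptions chosen for illustration, not part of the original notebook.

```python
import numpy as np

# Same kind of income data as before; the seed is an arbitrary choice
# that only makes the run repeatable.
rng = np.random.default_rng(42)
incomes = rng.normal(100.0, 20.0, 10000)

mean = incomes.mean()
std = incomes.std()

# Standard deviation is the square root of the variance.
print(np.isclose(std, np.sqrt(incomes.var())))

# How many standard deviations each point lies from the mean.
sigmas_away = np.abs(incomes - mean) / std

threshold = 2.0  # assumed cut-off for calling a point unusual
unusual = incomes[sigmas_away > threshold]

print(len(unusual), "of", len(incomes), "points are more than",
      threshold, "sigmas from the mean")
```

With normally distributed data, roughly 5% of the points should end up flagged at this cut-off.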

The code above can be found in an IPython notebook here.

Mayur Bhangale
SomX Labs

Co-Founder at Sourcewiz. Into NLP, Information Retrieval, Knowledge Graphs and Design.