Statistics Fundamentals Using Python Libraries


Statistics is a branch of mathematics that touches every aspect of data science. A data scientist should know the basics of statistics and the libraries used to perform statistical analysis. The following are the basic statistical measures, followed by the methods to compute them.

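  • Mean- The average of the numbers: their sum divided by their count.
  • Median- The middle value of a sorted list of numbers (the average of the two middle values when the count is even).
  • Mode- The most frequent value in the data.
  • Variance- The average of the squared differences between each value and the mean; it measures how spread out the values are.
  • Standard Deviation- The square root of the variance.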
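
To make the definitions above concrete, here is a minimal sketch that computes each measure by hand in plain Python. The sample values are made up for illustration, and the variance shown is the population form (dividing by the number of values).

```python
import math
from collections import Counter

values = [2, 4, 4, 5, 7, 8]  # small made-up sample, for illustration only

n = len(values)
mean = sum(values) / n  # sum of the values divided by their count

ordered = sorted(values)
mid = n // 2
# even count: average the two middle values; odd count: take the middle one
median = (ordered[mid - 1] + ordered[mid]) / 2 if n % 2 == 0 else ordered[mid]

# value with the highest count
mode = Counter(values).most_common(1)[0][0]

# population variance: mean of the squared deviations from the mean
variance = sum((x - mean) ** 2 for x in values) / n
std_dev = math.sqrt(variance)

print(mean, median, mode, variance, std_dev)  # 5.0 4.5 4 4.0 2.0
```

The libraries discussed below wrap these same calculations, so a small hand-checked sample like this is a handy way to verify their output.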

Three libraries we will use are:

  1. NumPy
  2. SciPy
  3. statistics (part of the Python standard library)

All three are Python libraries that provide functions for calculating statistics on numeric data. The basic difference between them is that NumPy is built for multi-dimensional arrays, and SciPy’s stats module works on those arrays as well, while the statistics module is designed for plain one-dimensional sequences and is of little use for complex N-dimensional objects.

First, let’s create some random data with NumPy: an array of 10 random integers from 1 to 100.

import numpy as np
# high is exclusive, so 101 makes the range 1 to 100 inclusive
data = np.random.randint(low=1, high=101, size=10)
print(data)

Let’s find the mean, median, variance and standard deviation using NumPy.

mean = np.mean(data)
print(mean)
median = np.median(data)
print(median)
# np.var and np.std use the population formulas by default (ddof=0)
variance = np.var(data)
print(variance)
sd = np.std(data)
print(sd)

NumPy has no direct function for finding the mode, but SciPy has a module called ‘stats’ that serves the purpose.

from scipy import stats
mode = stats.mode(data)
print(mode)
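
If SciPy is not available, the mode can also be assembled from NumPy building blocks. The following is a minimal sketch, assuming a 1-D array of made-up values; it is one possible approach, not a built-in NumPy mode function.

```python
import numpy as np

values = np.array([3, 7, 7, 2, 3, 7, 9])  # fixed made-up sample

# np.unique returns the sorted distinct values and how often each occurs
uniques, counts = np.unique(values, return_counts=True)

# argmax picks the first maximum, so ties resolve to the smallest value
mode_value = uniques[np.argmax(counts)]
mode_count = counts.max()

print(mode_value, mode_count)  # 7 3
```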

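These operations can also be done using the statistics library. Note that statistics.variance and statistics.stdev compute the sample variance and sample standard deviation (dividing by n - 1), so their results differ slightly from np.var and np.std, which divide by n by default; statistics.pvariance and statistics.pstdev are the population equivalents. If all n values in the array are unique, statistics.mode on Python versions before 3.8 raises a StatisticsError: ‘no unique mode; found n equally common values’. From Python 3.8 onward it returns the first mode encountered instead.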

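import statistics
# the statistics module targets built-in numeric types, so convert the array to a list of Python ints
values = data.tolist()
mean = statistics.mean(values)
print(mean)
median = statistics.median(values)
print(median)
mode = statistics.mode(values)
print(mode)
# sample variance and sample standard deviation (divide by n - 1)
variance = statistics.variance(values)
print(variance)
sd = statistics.stdev(values)
print(sd)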
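
The variance printed by the statistics library will usually not match the one printed by NumPy for the same data, because the two default to different formulas. The sketch below, using a made-up sample, shows the two conventions side by side; ddof is NumPy's "delta degrees of freedom" parameter.

```python
import statistics
import numpy as np

values = [2, 4, 4, 5, 7, 8]  # fixed made-up sample

pop_var = np.var(values)             # divides by n (ddof=0 is the default)
sample_var = np.var(values, ddof=1)  # divides by n - 1

# the statistics module offers both forms under separate names
print(pop_var, statistics.pvariance(values))    # population variance
print(sample_var, statistics.variance(values))  # sample variance
```

The same distinction applies to np.std versus statistics.stdev and statistics.pstdev.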

All the operations we have seen so far were done on a 1-D array. For n-D arrays we have to specify the axis we want to work along. For a 2-D array, axis = 0 performs the operation column-wise (one result per column) and axis = 1 performs it row-wise (one result per row). If no axis is specified, NumPy computes the result over the flattened array.
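
The effect of the axis argument is easiest to see on a small fixed matrix. This is a minimal sketch with made-up values chosen so the arithmetic can be checked by eye.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])  # fixed 2x3 example

col_means = np.mean(matrix, axis=0)  # one value per column -> [2.5 3.5 4.5]
row_means = np.mean(matrix, axis=1)  # one value per row    -> [2. 5.]
overall = np.mean(matrix)            # flattened array      -> 3.5

print(col_means, row_means, overall)
```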

Let’s create a 3x3 matrix with NumPy containing random integers from 1 to 10.

import numpy as np
# the upper bound is exclusive, so 11 makes the range 1 to 10 inclusive
data = np.random.randint(1, 11, size=(3, 3))
print(data)

Now let’s calculate the mean, median, variance and standard deviation column-wise.

mean = np.mean(data, axis=0)
print(mean)
median = np.median(data, axis=0)
print(median)
variance = np.var(data, axis=0)
print(variance)
sd = np.std(data, axis=0)
print(sd)

To find the most frequent elements in n-D arrays, the SciPy library can be used.

from scipy import stats
# store the result under its own name so the mode function is not shadowed
column_modes = stats.mode(data, axis=0)
print(column_modes)
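
The object returned by SciPy's mode function carries both the modal values and their counts. The sketch below, on a made-up matrix, reads them through the mode and count attributes; np.ravel is used only because the shape of the returned arrays has differed between SciPy versions.

```python
import numpy as np
from scipy import stats

matrix = np.array([[1, 2, 2],
                   [1, 3, 2],
                   [4, 3, 2]])  # fixed made-up matrix

result = stats.mode(matrix, axis=0)

# flatten so the code behaves the same regardless of the returned shape
column_modes = np.ravel(result.mode)
column_counts = np.ravel(result.count)

print(column_modes)   # most frequent value in each column
print(column_counts)  # how often that value occurs in its column
```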

Understanding these operations and libraries comes in handy for taking a quick peek at numerical data and getting to know your data better. While pre-processing the data, it’s important to know which values the data is centered around, and these simple measures can also help detect outliers.
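
As one illustration of how the mean and standard deviation can flag outliers, here is a minimal sketch using z-scores. The data and the cutoff of 2 standard deviations are assumptions made for this example; the appropriate rule depends on the data set.

```python
import numpy as np

# made-up data; 95 is the planted outlier
values = np.array([10, 12, 11, 13, 12, 11, 10, 95])

mean = np.mean(values)
sd = np.std(values)

# z-score: how many standard deviations each value lies from the mean
z_scores = (values - mean) / sd

threshold = 2.0  # assumed cutoff; 2 or 3 are common choices
outliers = values[np.abs(z_scores) > threshold]

print(outliers)  # [95]
```

Because a single extreme value inflates both the mean and the standard deviation, median-based rules are sometimes preferred on small samples.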
