Numpy uncovered: A beginner’s guide to statistics using Numpy

Md Khalid Siddiqui · Published in Analytics Vidhya · 8 min read · Sep 1, 2020

Summarizing key takeaways from my study of NumPy, the popular Python scientific computing library.

To run the code shown in this article, use an online Python compiler. This will help you understand and retain the concepts as you move along, and will also help you develop the habit of practicing coding.

Many companies hiring for Python positions specifically require knowledge of the NumPy library.

NumPy is a Python library used for working with arrays. It also has functions for working in the domains of linear algebra, Fourier transforms, and matrices. NumPy was created in 2005 by Travis Oliphant. It is an open-source project and you can use it freely. NumPy stands for Numerical Python.

In Python we have lists that serve the purpose of arrays, but they are slow to process. NumPy aims to provide an array object that is up to 50x faster than traditional Python lists.

The array object in NumPy is called ndarray, and it comes with a lot of supporting functions that make working with ndarray very easy. Arrays are used very frequently in data science, where speed and resources are very important.

Learning outcomes:

Applying Numpy to calculate statistical concepts:

  • Mean
  • Median
  • Percentiles
  • Interquartile Range
  • Outliers
  • Standard Deviation

MEAN:

Before using NumPy on a dataset, we need to convert it into an array. An array is similar to a Python list: it is written with square brackets, with the values inside separated by commas. In order to perform array operations on a list, we first need to transform it into an array.

Example:

survey_responses = [5, 10.2, 4, .3, 6.6]

We can then transform the dataset into a NumPy array using

survey_array = np.array(survey_responses)

Note: np is the conventional alias for NumPy, created with the statement import numpy as np.

Calculating mean:

survey_mean = np.mean(survey_array)

Output:

5.22
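Putting the steps above together, a minimal version of the full program might look like this (the printed value follows from the five survey responses):

```python
import numpy as np

# Survey responses collected as a plain Python list
survey_responses = [5, 10.2, 4, .3, 6.6]

# Convert the list to a NumPy array so we can use array operations
survey_array = np.array(survey_responses)

# np.mean computes the arithmetic average of all elements
survey_mean = np.mean(survey_array)
print(survey_mean)  # about 5.22
```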

0-D Arrays

0-D arrays, or scalars, are the elements in an array. Each value in an array is a 0-D array.

Eg: arr = np.array(42)

1-D Arrays

An array that has 0-D arrays as its elements is called a uni-dimensional, or 1-D, array. These are the most common and basic arrays.

Eg: arr = np.array([1, 2, 3, 4, 5])

2-D Arrays

An array that has 1-D arrays as its elements is called a 2-D array. These are often used to represent matrices, or 2nd-order tensors.

Eg: arr = np.array([[1, 2, 3], [4, 5, 6]])

3-D arrays

An array that has 2-D arrays (matrices) as its elements is called a 3-D array. These are often used to represent 3rd-order tensors.

Eg: arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])

n-D arrays

An array that has (n-1)-D arrays as its elements is called n-D array.

Check Number of Dimensions of an Array

NumPy arrays provide the ndim attribute, which returns an integer telling us how many dimensions the array has.

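A quick demonstration using the example arrays from each section above:

```python
import numpy as np

# One array of each dimensionality from the examples above
a0 = np.array(42)                                  # 0-D (scalar)
a1 = np.array([1, 2, 3, 4, 5])                     # 1-D
a2 = np.array([[1, 2, 3], [4, 5, 6]])              # 2-D
a3 = np.array([[[1, 2, 3], [4, 5, 6]],
               [[1, 2, 3], [4, 5, 6]]])            # 3-D

# ndim reports the number of dimensions of each array
print(a0.ndim, a1.ndim, a2.ndim, a3.ndim)  # 0 1 2 3
```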

Creating Higher Dimensional Arrays:

An array can have any number of dimensions.

When the array is created, you can define the number of dimensions by using the ndmin argument.

Eg: arr = np.array([1, 2, 3, 4], ndmin=5)

Note: ndim is used for finding the number of dimensions, while ndmin is used for setting a minimum number of dimensions at creation time. The two should not be confused.
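A minimal check of what ndmin does: NumPy pads the shape with leading axes of length 1 until the requested number of dimensions is reached.

```python
import numpy as np

# ndmin=5 forces the result to have at least five dimensions
arr = np.array([1, 2, 3, 4], ndmin=5)

print(arr.ndim)   # 5
print(arr.shape)  # (1, 1, 1, 1, 4)
```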

Example problem (4 parts):

1. We’re provided with data about a trial for a new allergy medication, AllerGeeThatSucks! Five participants were asked to rate how drowsy the medication made them once a day for three days, on a scale of one (least drowsy) to ten (most drowsy). Use np.mean to find the average level of drowsiness across all the trials and save the result to the variable total_mean.

allergy_trials = np.array([[6, 1, 3, 8, 2], [2, 6, 3, 9, 8], [5, 2, 6, 9, 9]])

2. Use np.mean to find the average level of drowsiness across each day of the experiment and save to the variable trial_mean.

3. Use np.mean to find the average level of drowsiness for each individual patient, to see if some were more sensitive to the drug than others, and save it to the variable patient_mean.

4. Print the variables for total_mean, trial_mean, and patient_mean on three separate lines.

Note: text following a # sign is a comment. It is ignored when the program runs and is used to explain the details or intent of the code.
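A solution sketch for all four parts (variable names come from the problem statement; the axis choices assume each row is one day of the trial and each column is one participant):

```python
import numpy as np

# Each row is one day of the trial; each column is one participant
allergy_trials = np.array([[6, 1, 3, 8, 2],
                           [2, 6, 3, 9, 8],
                           [5, 2, 6, 9, 9]])

# 1. Average drowsiness across all ratings
total_mean = np.mean(allergy_trials)

# 2. Average for each day: collapse the columns within each row
trial_mean = np.mean(allergy_trials, axis=1)

# 3. Average for each patient: collapse the rows within each column
patient_mean = np.mean(allergy_trials, axis=0)

# 4. Print each result on its own line
print(total_mean)
print(trial_mean)
print(patient_mean)
```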

Question

What is an axis in Numpy?

Answer

An axis is similar to a dimension. For a 2-dimensional array, there are 2 axes: vertical and horizontal.

When applying certain Numpy functions like np.mean(), we can specify what axis we want to calculate the values across.

For axis=0, this means that we apply a function along each “column”, or all values that occur vertically.

For axis=1, this means that we apply a function along each “row”, or all values horizontally.
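A small illustration of the two axes, using a hypothetical 2×3 array:

```python
import numpy as np

ratings = np.array([[1, 2, 3],
                    [4, 5, 6]])

# axis=0: apply the function down each column (vertically)
col_means = np.mean(ratings, axis=0)  # [2.5, 3.5, 4.5]

# axis=1: apply the function along each row (horizontally)
row_means = np.mean(ratings, axis=1)  # [2.0, 5.0]

print(col_means)
print(row_means)
```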

OUTLIERS

As we can see, the mean is a helpful way to quickly understand different parts of our data. However, the mean is highly influenced by the specific values in our data set. What happens when one of those values is significantly different from the rest?

Values that don’t fit within the majority of a dataset are known as outliers. It’s important to identify outliers because if they go unnoticed, they can skew our data and lead to error in our analysis (like determining the mean). They can also be useful in pointing out errors in our data collection.

When we’re able to identify outliers, we can then determine if they were due to an error in sample collection or whether or not they represent a significant but real deviation from the mean.

Sorting and Outliers

One way to quickly identify outliers is by sorting our data. Once our data is sorted, we can quickly glance at the beginning or end of the array to see if some values lie far beyond the expected range. We can use the NumPy function np.sort to sort our data.

Let’s take the example of 3rd-grade students’ heights, and imagine an 8th grader walked into our experiment:

>>> heights = np.array([49.7, 46.9, 62, 47.2, 47, 48.3, 48.7])

If we use np.sort, we can immediately identify the taller student since their height (62”) is noticeably outside the range of the dataset:

>>> np.sort(heights)
array([46.9, 47. , 47.2, 48.3, 48.7, 49.7, 62. ])

Reverse sorting: for any given array we can get the elements in descending order with np.sort(array_name)[::-1], i.e. sort first, then reverse the sorted result.
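A short demonstration of both orders, using the heights array from above:

```python
import numpy as np

heights = np.array([49.7, 46.9, 62, 47.2, 47, 48.3, 48.7])

# Ascending sort makes the 62" outlier easy to spot at the end
ascending = np.sort(heights)
print(ascending)

# For descending order, sort first and then reverse the sorted array
descending = np.sort(heights)[::-1]
print(descending)
```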

MEDIAN

Another key metric that we can use in data analysis is the median. The median is the middle value of a dataset that’s been ordered in terms of magnitude (from lowest to highest).

Let’s look at the following array:

np.array( [1, 1, 2, 3, 4, 5, 5])

In this example, the median would be 3, because it sits in the middle position of the sorted dataset: three values are below it and three are above it.

If the length of our dataset was an even number, the median would be the value halfway between the two central values. So in the following example, the median would be 3.5:

np.array( [1, 1, 2, 3, 4, 5, 5, 6])

But what if we had a very large dataset? It would get very tedious to count all of the values. Luckily, NumPy also has a function to calculate the median, np.median:

>>> my_array = np.array([50, 38, 291, 59, 14])
>>> np.median(my_array)
50.0

Mean vs. Median

In a dataset, the median value can provide an important comparison to the mean. Unlike a mean, the median is not affected by outliers. This becomes important in skewed datasets, datasets whose values are not distributed evenly.
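A quick sketch with a made-up dataset containing one extreme value, showing how the outlier drags the mean upward while the median stays put:

```python
import numpy as np

# Hypothetical dataset with one extreme outlier (500)
values = np.array([20, 22, 24, 25, 26, 28, 500])

# The outlier pulls the mean far above the typical value...
print(np.mean(values))    # about 92.14

# ...while the median remains representative of the bulk of the data
print(np.median(values))  # 25.0
```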

PERCENTILES

As we know, the median is the middle of a dataset: it is the number for which 50% of the samples are below, and 50% of the samples are above. But what if we wanted to find a point at which 40% of the samples are below, and 60% of the samples are above?

This type of point is called a percentile. The Nth percentile is defined as the point below which N% of the samples lie. So the point where 40% of samples are below it is called the 40th percentile. Percentiles are useful measurements because they can tell us where a particular value is situated within the greater dataset.

Let’s look at the following array:

d = [1, 2, 3, 4, 4, 4, 6, 6, 7, 8, 8]

There are 11 numbers in the dataset. The 40th percentile will have 40% of the 10 remaining numbers below it (40% of 10 is 4) and 60% of the numbers above it (60% of 10 is 6). So in this example, the 40th percentile is 4.

In NumPy, we can calculate percentiles using the function np.percentile, which takes two arguments: the array and the percentile to calculate.

Here’s how we would use NumPy to calculate the 40th percentile of array d:

>>> d = np.array([1, 2, 3, 4, 4, 4, 6, 6, 7, 8, 8])
>>> np.percentile(d, 40)
4.0

Some percentiles have specific names:

  • The 25th percentile is called the first quartile
  • The 50th percentile is called the median
  • The 75th percentile is called the third quartile

The minimum, first quartile, median, third quartile, and maximum of a dataset are called a five-number summary. This set of numbers is a great thing to compute when we get a new dataset.
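Using array d from above, the five-number summary can be computed with a handful of NumPy calls:

```python
import numpy as np

d = np.array([1, 2, 3, 4, 4, 4, 6, 6, 7, 8, 8])

# The five-number summary: minimum, first quartile, median,
# third quartile, and maximum
summary = [float(np.min(d)),
           float(np.percentile(d, 25)),
           float(np.median(d)),
           float(np.percentile(d, 75)),
           float(np.max(d))]

print(summary)
```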

The difference between the first and third quartile is a value called the interquartile range. For example, say we have the following array:

d = [1, 2, 3, 4, 4, 4, 6, 6, 7, 8, 8]

We can calculate the 25th and 75th percentiles using np.percentile:

>>> np.percentile(d, 25)
3.5
>>> np.percentile(d, 75)
6.5

Then to find the interquartile range, we subtract the value of the 25th percentile from the value of the 75th:

6.5 - 3.5 = 3

50% of the dataset will lie within the interquartile range. The interquartile range gives us an idea of how spread out our data is. The smaller the interquartile range value, the less variance in our dataset. The greater the value, the larger the variance.
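The same interquartile-range calculation in code, using np.percentile on array d:

```python
import numpy as np

d = np.array([1, 2, 3, 4, 4, 4, 6, 6, 7, 8, 8])

q1 = np.percentile(d, 25)  # first quartile
q3 = np.percentile(d, 75)  # third quartile

# The interquartile range covers the middle 50% of the data
iqr = q3 - q1
print(iqr)  # 3.0
```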

STANDARD DEVIATION

The standard deviation tells us how spread out the values in a dataset are from the mean. When the standard deviation is small, the values are less spread out and closer to the mean, and the overall shape of the dataset appears less chaotic and more level.

When the standard deviation is large, the values will be more spread out from the mean. The shape of the dataset will appear to be more uneven and chaotic as the standard deviation increases.

We can find the standard deviation of a dataset using the Numpy function np.std:

>>> nums = np.array([65, 36, 52, 91, 63, 79])
>>> np.std(nums)
17.716909687891082
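As a check on the definition, np.std (with its default settings) matches the square root of the mean squared deviation from the mean, computed by hand:

```python
import numpy as np

nums = np.array([65, 36, 52, 91, 63, 79])

# np.std computes the (population) standard deviation:
# the square root of the mean squared deviation from the mean
by_hand = np.sqrt(np.mean((nums - np.mean(nums)) ** 2))

print(np.std(nums))
print(by_hand)  # same value
```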
