Numpy Basics for Machine Learning

Aryan Chugh
Published in The Startup
7 min read · Jul 11, 2020

Introduction

This article covers the basics of data manipulation with NumPy, along with some useful statistical functions for working with numerical data. To get the best out of our data, we need tools to read and explore it, and NumPy provides the essential mathematical and statistical functions for handling it.

All the code is available in my GitHub repository; click here to visit it. I have started a series of explanatory articles covering the major topics of the data science pipeline. Visit my main GitHub repository for more such articles and code.

Topics to be covered :

  1. NumPy Basics
  2. Random Generators in NumPy
  3. Statistical Computation using NumPy

NumPy Basics

  • Creating Arrays

We can create NumPy arrays with an array generator (like NumPy's arange function) or from Python lists. We can also use NumPy's built-in functions for creating specific arrays or matrices, such as numpy.ones, numpy.eye, etc.

One of the most important attributes of a NumPy array is its dimensions, which we can get from its shape attribute. In layman's terms, the shape attribute tells us how many rows and columns our matrix has, and whether it is 2-dimensional, 3-dimensional, or spans even more dimensions.

import numpy as np

# Using inbuilt numpy range function
a = np.arange(10) # can also give start, stop and step as parameters like normal range function
print(a)
print("-------------------------")
# Using lists
a = np.array([1,2,3,4,5])
print(a)
print(type(a))
print("-------------------------")
print(a.shape)

Output

[0 1 2 3 4 5 6 7 8 9]
-------------------------
[1 2 3 4 5]
<class 'numpy.ndarray'>
-------------------------
(5,)

A clearer example of the shape attribute:

c = np.array([[1,2,3], [4,5,6]])
print(c)
print(c.shape)
print(c[1][1])

Output

[[1 2 3]
[4 5 6]]
(2, 3)
5
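
To make the shape attribute more concrete, here is a small sketch comparing 1-D, 2-D and 3-D arrays. The variable names (v, m, t) are only illustrative, and ndim and size are two related attributes that are not used elsewhere in this article.

```python
import numpy as np

# 1-D array: shape is a one-element tuple
v = np.arange(6)
print(v.shape, v.ndim)    # (6,) 1

# 2-D array: (rows, columns)
m = np.array([[1, 2, 3], [4, 5, 6]])
print(m.shape, m.ndim)    # (2, 3) 2

# 3-D array: 2 blocks, each with 3 rows and 4 columns
t = np.zeros((2, 3, 4))
print(t.shape, t.ndim)    # (2, 3, 4) 3

# size is the total number of elements, i.e. the product of the shape
print(t.size)             # 24
```

The length of the shape tuple is the number of dimensions, which is why the 1-D array above reports (6,) rather than (6, 1).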

Some inbuilt NumPy functions for data generation:

a = np.zeros((3,3)) # Shape is passed as a (rows, cols) tuple
print(a)
print("-------------------------")
b = np.ones((2,3))
print(b)
print("-------------------------")
# Array of some constants
c = np.full((3,2), 5)
print(c)
print("-------------------------")
# Identity matrix - size/square matrix
d = np.eye(4)
print(d)
print("-------------------------")
# Random matrix
randomMatrix = np.random.random((2,3))
print(randomMatrix)
# zeros, ones, eye and random return float values unless a dtype is specified; full takes its type from the fill value

Output

[[0. 0. 0.]
[0. 0. 0.]
[0. 0. 0.]]
-------------------------
[[1. 1. 1.]
[1. 1. 1.]]
-------------------------
[[5 5]
[5 5]
[5 5]]
-------------------------
[[1. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 0. 1. 0.]
[0. 0. 0. 1.]]
-------------------------
[[0.90284075 0.18544622 0.85348442]
[0.40376043 0.87817593 0.97004446]]
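
In the output above, zeros, ones and eye print floats (note the trailing dots), while full prints integers because it was given the integer 5. A minimal sketch of how to check and control the element type through the dtype attribute and the dtype parameter; the name int_zeros is only illustrative.

```python
import numpy as np

print(np.zeros((2, 2)).dtype)        # float64
print(np.ones((2, 2)).dtype)         # float64
print(np.eye(2).dtype)               # float64

# np.full takes its dtype from the fill value
print(np.full((2, 2), 5).dtype)      # an integer type (int64 on most platforms)
print(np.full((2, 2), 5.0).dtype)    # float64

# Passing dtype overrides the default
int_zeros = np.zeros((2, 2), dtype=int)
print(int_zeros)
```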
  • Reading and Updating Elements

Just like Python's sequence types, NumPy arrays support slice notation for accessing data, and they extend it to n dimensions. A colon (:) in any of the dimensions indicates that we select all the items in that dimension.

Example:

# If we want to print all the rows and just the second column (index 1) of the randomMatrix:
print(randomMatrix[:,1])

Output

[0.18544622 0.87817593]

We can give a start and stop index for each dimension in the slice notation, and a step (jump) parameter can also be given for each dimension if necessary.

Syntax: (for a 3-dimensional array)

Matrix[start:stop:step, start:stop:step, start:stop:step]

This notation is valid for n dimensions. The most important point to remember is that stop is not inclusive, so the actual range is [start, stop), i.e. from start to stop-1. Indexing starts from zero.

Example:

print(randomMatrix)
randomMatrix[1, 1:] = 1 # Selects columns 2 and 3 of 2nd row and updates them
print(randomMatrix)
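
Since randomMatrix holds random values, the effect of start, stop and step is easier to follow on a matrix with known contents. The sketch below uses a hypothetical 4x5 matrix m built with arange and reshape; the names m and sub are only illustrative.

```python
import numpy as np

m = np.arange(20).reshape(4, 5)   # 4 rows, 5 columns, values 0..19
print(m)

# Rows 0 and 2 (start=0, stop=4, step=2), columns 1 to 3 (stop=4 is excluded)
sub = m[0:4:2, 1:4]
print(sub)
# [[ 1  2  3]
#  [11 12 13]]

# Omitted bounds default to the whole axis; a negative step reverses it
print(m[:, ::-1][0])   # first row reversed: [4 3 2 1 0]

# Assigning through a slice updates the selected elements of m
m[1, 1:] = -1
print(m[1])            # [ 5 -1 -1 -1 -1]
```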

For more information on slicing and indexing of NumPy arrays please visit this website.

  • Mathematical Operations

NumPy arrays support all kinds of mathematical operations, like the addition/subtraction of arrays or a constant, multiplication/division of arrays, dot product, etc.

Most of the matrix operations on NumPy arrays are element-wise except some operations like dot and cross product.

Example:

x = np.array([[1,2], [3,4]])
y = np.array([[5,6], [7,8]])
# Element Wise Multiplication
print(x*y)
print(np.multiply(x,y))
print('------------------------------------')
# Element Wise Square root
print(x**(0.5))
print(np.sqrt(x))
print('------------------------------------')
# Dot Product
print(x.dot(y))
print(np.dot(x,y))

Output

[[ 5 12]
[21 32]]
[[ 5 12]
[21 32]]
------------------------------------
[[1. 1.41421356]
[1.73205081 2. ]]
[[1. 1.41421356]
[1.73205081 2. ]]
------------------------------------
[[19 22]
[43 50]]
[[19 22]
[43 50]]
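
The example above covers multiplication, square root and dot product. As a small complement, here is a sketch of arithmetic with a constant and of the @ operator, which for 2-D arrays gives the same matrix product as np.dot. It reuses the same x and y values; the comments list the resulting values rather than the exact printed layout.

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])

# Arithmetic with a constant is applied to every element
print(x + 10)        # values [[11, 12], [13, 14]]
print(x / 2)         # values [[0.5, 1.0], [1.5, 2.0]]

# Element-wise addition and subtraction of two arrays
print(x + y)         # values [[6, 8], [10, 12]]
print(y - x)         # values [[4, 4], [4, 4]]

# For 2-D arrays, the @ operator gives the same matrix product as np.dot
print(x @ y)         # values [[19, 22], [43, 50]]
print(np.array_equal(x @ y, np.dot(x, y)))   # True
```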

For stacking and reshaping of arrays, please click here to visit my Jupyter notebook with explanatory code.
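
For a quick idea of what reshaping and stacking look like, here is a minimal sketch. It is not taken from the linked notebook; the arrays and names (a, b, c, rows, cols) are made up for illustration, and reshape, vstack and hstack are the standard NumPy functions for these operations.

```python
import numpy as np

a = np.arange(6)            # [0 1 2 3 4 5]
b = a.reshape(2, 3)         # same 6 values arranged as 2 rows x 3 columns
print(b)

# -1 lets NumPy infer that dimension from the total size
c = a.reshape(3, -1)
print(c.shape)              # (3, 2)

# vstack adds rows, hstack adds columns
rows = np.vstack([b, b])
cols = np.hstack([b, b])
print(rows.shape)           # (4, 3)
print(cols.shape)           # (2, 6)
```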

Random Generators in NumPy

NumPy's random module is an important module that provides functions for random sampling from various probability distributions, such as the uniform and normal distributions. It is helpful in games that need random sampling, like card games or the roll of a dice. One of its most important features is that we can set the seed for random number generation to get consistent, repeatable results, which is useful in the initial phases of a project.

Some of the common functions are:

  • rand: Random values in a given shape, drawn uniformly from [0, 1)
  • randn: Return a sample (or samples) from the “standard normal” distribution
  • randint: Return random integers from low (inclusive) to high (exclusive)
  • random: Return random floats in the half-open interval [0.0, 1.0)
  • choice: Generates a random sample from a given 1-D array
  • shuffle: Shuffles the contents of a sequence in place

Examples:

a = np.arange(10) + 5
print(a)
print('------------------------------------')
np.random.shuffle(a)
print(a)

Output

[ 5  6  7  8  9 10 11 12 13 14]
------------------------------------
[ 9 5 13 6 14 12 7 11 8 10]
  • Generating Random Numbers
# Generates a mxn matrix of random floats from standard normal distribution
a = np.random.randn(2,3)
print(a)
print('------------------------------------')
# Randomly picks an element from a 1-D array or list
element = np.random.choice([1,4,6,23,9,34]) # Input must be 1-D
print(element)

Output

[[ 1.0275266  -0.51485389  0.02889368]
[-0.30376467 0.52435418 0.14559567]]
------------------------------------
9
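
The remaining functions from the list (rand, random and randint) follow the same pattern. The sketch below uses the dice and cards idea mentioned earlier; the variable names and the encoding of cards as the numbers 0 to 51 are assumptions made for illustration. No expected values are shown because the results change on every run.

```python
import numpy as np

# rand: uniform floats in [0, 1), shape given as separate arguments
u = np.random.rand(2, 3)

# random: same distribution, but the shape is passed as a tuple
r = np.random.random((2, 3))

# randint: integers from low (inclusive) to high (exclusive), here ten dice rolls
rolls = np.random.randint(1, 7, size=10)

# choice can also draw several items; replace=False means no card is drawn twice
cards = np.random.choice(np.arange(52), size=5, replace=False)

print(u.shape, r.shape)
print(rolls)
print(cards)
```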
  • Seeding in pseudo-random number generators

Seeding matters because a pseudo-random number generator derives every value it produces from a single starting number, the seed. Once the seed is picked, all subsequent random arrays or matrices are computed from it through a fixed sequence of calculations. By setting the seed manually, we therefore make all the random operations produce the same results every time we run our code snippet.

np.random.seed(1)
element = np.random.choice([1,4,6,23,9,34])
print(element)

Output

34

As we gave the seed manually, we will always get 34 as the result of our code snippet.
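
A small sketch to check this behaviour: resetting the same seed restarts the same sequence, while continuing without reseeding gives new values. The seed value 42 and the variable names are arbitrary choices for illustration. The last lines use np.random.default_rng, which is only available in newer NumPy versions (1.17 and later) and is shown here as an optional alternative, not as something this article relies on.

```python
import numpy as np

np.random.seed(42)
first = np.random.rand(3)

np.random.seed(42)          # resetting the seed restarts the same sequence
second = np.random.rand(3)

print(np.array_equal(first, second))   # True

# Without reseeding, the generator simply continues its sequence
third = np.random.rand(3)
print(np.array_equal(first, third))    # False

# Newer NumPy versions also offer a Generator object that carries its own seed
rng = np.random.default_rng(42)
print(rng.integers(1, 7, size=5))      # five dice rolls, repeatable for this seed
```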

Statistical Computation using NumPy

One of the major advantages of using NumPy arrays over normal Python lists is the built-in statistical functions, which are especially helpful for data science. NumPy provides the statistical functions needed to analyze data; some of them are listed below:

  • min, max
  • mean
  • median
  • average
  • variance
  • standard deviation

We will start with simple operations like min & max along different axes. The following code is very simple and needs no explanation.

a = np.array([[1,2,3,4], [7,6,2,0]])
print(a)
# To get minimum element of an array
print(np.min(a))
print("------------------------------------")
# To get minimum elements in each column
print(np.min(a, axis = 0))
print("------------------------------------")
# To get minimum elements in each row
print(np.min(a, axis = 1))

Output

[[1 2 3 4]
[7 6 2 0]]
0
------------------------------------
[1 2 2 0]
------------------------------------
[1 0]
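
The axis argument works the same way for max. As a short sketch on the same array, the code below also shows argmax, which returns the position of the largest value instead of the value itself (argmax is not covered elsewhere in this article).

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [7, 6, 2, 0]])

print(np.max(a))              # 7, the largest element overall
print(np.max(a, axis=0))      # [7 6 3 4], largest value in each column
print(np.max(a, axis=1))      # [4 7], largest value in each row

# argmax returns positions instead of values
print(np.argmax(a, axis=1))   # [3 0]
```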
  • Mean: Average of all the elements

Mean is nothing but taking the sum of all the elements and dividing it by the number of elements.

b = np.array([1,2,3,4,5])
# Calculating mean directly
print(np.mean(b))
print("------------------------------------")
# Mean along column
print(np.mean(a, axis=0))
print("------------------------------------")
# Mean along row
print(np.mean(a, axis=1))

Output

3.0
------------------------------------
[4. 4. 2.5 2. ]
------------------------------------
[2.5 3.75]
  • Mean vs Average

People generally use the terms mean and average loosely, and there is not a lot of difference between them. In NumPy, np.average computes the same value as np.mean by default, but it can be weighted if we want it to be. By weighted we mean that a weight is assigned to each element; each element is multiplied by its weight, the products are summed, and the sum is divided by the sum of the weights. np.mean is simple and takes no weights.

c = np.array([1,5,4,2,0])
# Mean
print(np.mean(c))
print("------------------------------------")
# Average: Simple
print(np.average(c))
print("------------------------------------")
# Average: Weighted
w = np.array([1,2,3,4,5])
print(np.average(c, weights=w))

Output

2.4
------------------------------------
2.4
------------------------------------
2.066666666666667
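
To see where 2.0666... comes from, the weighted average can be reproduced by hand. This is a minimal sketch using the same c and w; the name manual is only illustrative.

```python
import numpy as np

c = np.array([1, 5, 4, 2, 0])
w = np.array([1, 2, 3, 4, 5])

# Weighted average by hand: sum of (value * weight) divided by the sum of weights
manual = np.sum(c * w) / np.sum(w)     # 31 / 15
print(manual)
print(np.isclose(manual, np.average(c, weights=w)))   # True

# With equal weights the weighted average reduces to the plain mean
print(np.isclose(np.average(c, weights=np.ones(5)), np.mean(c)))   # True
```

Here the weighted sum is 1*1 + 5*2 + 4*3 + 2*4 + 0*5 = 31 and the weights add up to 15, which gives 31/15.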
  • Median: Middlemost value

The mean may not be a fair representation of the data, because the average is easily influenced by outliers (very small or large values in the data set that are not typical). The median is another way to measure the center of a numerical data set. In a numerical data set, the median is the point at which there is an equal number of data points whose values lie above and below the median value. Thus, the median is truly the middle of the data set.

print(np.median(c))
print("------------------------------------")
# Median: column and row wise
print(np.median(a, axis = 0))
print("------------------------------------")
print(np.median(a, axis = 1))

Output

2.0
------------------------------------
[4. 4. 2.5 2. ]
------------------------------------
[2.5 4. ]
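
The effect of outliers on the mean is easy to demonstrate. The numbers below are made up for illustration: five typical values and one extreme value added afterwards.

```python
import numpy as np

values = np.array([10, 12, 11, 13, 12])
with_outlier = np.append(values, 500)   # add one extreme value

print(np.mean(values), np.median(values))               # 11.6 12.0
print(np.mean(with_outlier), np.median(with_outlier))   # 93.0 12.0
```

A single outlier moves the mean from 11.6 to 93.0, while the median stays at 12.0.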
  • Standard Deviation
  1. This measures how far the values typically deviate from the mean (average) value.
  2. Formula: sqrt(summation((x - mean)²)/N)
  3. Variance = square of standard deviation
  4. We square the difference between each value and the mean, (x - mean), so that positive and negative deviations do not cancel each other out, without having to take absolute values
  5. A low standard deviation means that most of the numbers are close to the average. A high standard deviation means that the numbers are more spread out.
# Standard Deviation by inbuilt formula
std = np.std(c)
print(std)

Output

1.854723699099141
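
The formula above can be checked step by step against np.std. This is a minimal sketch on the same array c; the names mean, variance and std_manual are illustrative, and the ddof parameter in the last line is an extra option of np.std that is not used elsewhere in this article.

```python
import numpy as np

c = np.array([1, 5, 4, 2, 0])

mean = np.mean(c)                         # 2.4
variance = np.mean((c - mean) ** 2)       # average squared deviation, about 3.44
std_manual = np.sqrt(variance)

print(np.isclose(std_manual, np.std(c)))  # True
print(np.isclose(variance, np.var(c)))    # True

# ddof=1 divides by N - 1 instead of N (the sample standard deviation)
print(np.std(c, ddof=1))
```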

Conclusion

I hope this article gives a brief insight into using NumPy for various applications and encourages you to learn more or create something good with it.

Please feel free to browse through my GitHub repository for more interesting projects and explanatory code.

If you like any of my work, please feel free to contact me regarding any project collaboration or job opportunity.
