Numpy Essentials: Part 1
NumPy is the fundamental package for scientific computing with Python. It is one of the most basic and powerful packages for working with data in Python.
If you are going to work on data analysis or machine learning projects, then having a solid understanding of numpy is nearly mandatory.
Other data analysis packages, such as pandas, are built on top of numpy, and scikit-learn, which is used to build machine learning applications, works heavily with numpy arrays.
So what does numpy provide?
At the core, numpy provides the excellent ndarray object, short for n-dimensional array. In an 'ndarray' object, aka 'array', you can store multiple items of the same data type. It is the facilities around the array object that make numpy so convenient for performing math and data manipulations.
This is a three-part blog series. Part 1 (this tutorial), Part 2 and Part 3 cover the essential numpy functions that most of you will use on a daily basis.
1. How to create a numpy array?
There are multiple ways to create a numpy array, most of which are covered as you read on.
# Create a 1d array from a list
import numpy as np
list1 = [1,2,3,4,5,6,7]
arr1 = np.array(list1)
# Print the array and its type
print(type(arr1))
arr1
#> <class 'numpy.ndarray'>
#> array([1, 2, 3, 4, 5, 6, 7])
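The key difference between an array and a list is that arrays are designed to handle vectorized operations, while a Python list is not. A vectorized operation is applied to every item in the array at once, without an explicit loop.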
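As a minimal sketch of what this means in practice, the snippet below adds 2 to every item of a list and of an array. The names `values`, `list_result` and `arr_result` are illustrative and are not used elsewhere in this tutorial.

```python
import numpy as np

values = [1, 2, 3, 4]          # illustrative data
arr = np.array(values)

# A list needs an explicit loop or comprehension to add 2 to each item
list_result = [v + 2 for v in values]

# An array applies the operation to every item at once
arr_result = arr + 2

print(list_result)          # [3, 4, 5, 6]
print(arr_result.tolist())  # [3, 4, 5, 6]
```

Writing `values + 2` on the plain list raises a TypeError, because a list only supports `+` with another list.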
# Create a 2d array from a list of lists
list2 = [[0,1,2], [3,4,5], [6,7,8]]
arr2d = np.array(list2)
arr2d
#> array([[0, 1, 2],
#>        [3, 4, 5],
#>        [6, 7, 8]])
You may also specify the datatype by setting the dtype argument. Some of the most commonly used numpy dtypes are: 'float', 'int', 'bool', 'str' and 'object'.
# Create a float 2d array
arr2d_f = np.array(list2, dtype='float')
arr2d_f
#> array([[ 0., 1., 2.],
#>        [ 3., 4., 5.],
#>        [ 6., 7., 8.]])
You can also convert it to a different datatype using the astype method.
# Convert to 'int' datatype
arr2d_f.astype('int')
#> array([[0, 1, 2],
#>        [3, 4, 5],
#>        [6, 7, 8]])
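One detail worth knowing about astype is sketched below with illustrative values (the names `floats` and `ints` are not from this tutorial): converting floats to 'int' drops the fractional part rather than rounding, and it returns a new array instead of changing the original.

```python
import numpy as np

floats = np.array([1.7, 2.2, -1.7])   # illustrative values
ints = floats.astype('int')            # fractional part is dropped, not rounded

print(ints.tolist())   # [1, 2, -1]
print(floats.dtype)    # float64: the original array is unchanged
```

If rounding is what you want, calling `np.round` before `astype` is one option.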
If you are uncertain about what datatype your array will hold, or if you want to hold characters and numbers in the same array, you can set the dtype as 'object'.
# Create an object array to hold numbers as well as strings
arr1d_obj = np.array([1, 'a'], dtype='object')
arr1d_obj
#> array([1, 'a'], dtype=object)
Finally, you can always convert an array back to a Python list using tolist().
arr1d_obj.tolist()
#> [1, 'a']
To summarise, the main differences with python lists are:
- Arrays support vectorized operations, while lists don’t.
- Once an array is created, you cannot change its size. You will have to create a new array or overwrite the existing one.
- Every array has one and only one dtype. All items in it should be of that dtype.
- An equivalent numpy array occupies much less space than a python list of lists.
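The last two points can be checked with a rough sketch like the one below. It uses a flat list of 1000 integers for simplicity, and the names `py_list`, `list_bytes` and `arr_bytes` are illustrative. The list measurement via `sys.getsizeof` is only an approximation, and the exact numbers depend on your Python build.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
arr = np.arange(n, dtype='int64')

# Rough size of the list: the list object plus each int object it points to
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

# Size of the array's data buffer: 8 bytes per int64 item
arr_bytes = arr.nbytes

print(list_bytes, arr_bytes)

# A single dtype is enforced: mixing ints and a float gives a float array
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)   # float64
```

The array stores its items in one contiguous buffer of fixed-size values, while the list stores references to separate Python objects, which is where the difference in size comes from.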
2. How to inspect the size and shape of an array?
# Create a 2d array with 3 rows and 4 columns
list2 = [[1, 2, 3, 4],[3, 4, 5, 6], [5, 6, 7, 8]]
arr2 = np.array(list2, dtype='float')
arr2
#> array([[ 1., 2., 3., 4.],
#>        [ 3., 4., 5., 6.],
#>        [ 5., 6., 7., 8.]])

# shape
print('Shape: ', arr2.shape)
# dtype
print('Datatype: ', arr2.dtype)
# size
print('Size: ', arr2.size)
# ndim
print('Num Dimensions: ', arr2.ndim)
#> Shape: (3, 4)
#> Datatype: float64
#> Size: 12
#> Num Dimensions: 2
3. How to extract specific items from an array?
arr2
#> array([[ 1., 2., 3., 4.],
#> [ 3., 4., 5., 6.],
#> [ 5., 6., 7., 8.]])
Unlike lists, numpy arrays accept one index or slice per dimension inside the square brackets, separated by commas.
# Extract the first 2 rows and columns
arr2[:2, :2]
#> array([[ 1., 2.],
#>        [ 3., 4.]])

# The same indexing on the list of lists raises a TypeError
# list2[:2, :2]
Additionally, numpy arrays support boolean indexing.
A boolean index array is of the same shape as the array-to-be-filtered and it contains only True and False values. The values corresponding to True positions are retained in the output.
b = arr2 > 4
b
#> array([[False, False, False, False],
#>        [False, False,  True,  True],
#>        [ True,  True,  True,  True]], dtype=bool)

arr2[b]
#> array([ 5., 6., 5., 6., 7., 8.])
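Boolean index arrays can also be combined. The sketch below rebuilds the same values as arr2 under the illustrative name `arr`, so that it runs on its own, and keeps only the items that are greater than 4 and less than 8.

```python
import numpy as np

# Same values as arr2 in this section
arr = np.array([[1, 2, 3, 4], [3, 4, 5, 6], [5, 6, 7, 8]], dtype='float')

# Combine conditions with & (and) or | (or); each condition needs parentheses
mask = (arr > 4) & (arr < 8)
selected = arr[mask]

print(selected.tolist())   # [5.0, 6.0, 5.0, 6.0, 7.0]
```

Note that the result is always a flat 1d array, whatever the shape of the array being filtered.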
4. How to reverse the rows and the whole array?
Reversing an array works the same way as reversing a list, but you need to do it for every axis (dimension) if you want a complete reversal.
# Reverse only the row positions
arr2[::-1, ]
#> array([[ 5., 6., 7., 8.],
#>        [ 3., 4., 5., 6.],
#>        [ 1., 2., 3., 4.]])

# Reverse the row and column positions
arr2[::-1, ::-1]
#> array([[ 8., 7., 6., 5.],
#>        [ 6., 5., 4., 3.],
#>        [ 4., 3., 2., 1.]])
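As an alternative to slicing, the same reversals can be sketched with np.flip. This assumes a reasonably recent numpy version, since calling np.flip without an axis argument is not supported in older releases. The name `arr` is illustrative and holds the same values as arr2.

```python
import numpy as np

arr = np.array([[1, 2, 3, 4], [3, 4, 5, 6], [5, 6, 7, 8]], dtype='float')

rows_reversed = np.flip(arr, axis=0)   # same result as arr[::-1, ]
fully_reversed = np.flip(arr)          # same result as arr[::-1, ::-1]

print(np.array_equal(rows_reversed, arr[::-1, ]))       # True
print(np.array_equal(fully_reversed, arr[::-1, ::-1]))  # True
```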
5. How to represent missing values and infinity?
Missing values can be represented using the np.nan object, while np.inf represents infinity. Let's place some in arr2.
# Insert a nan and an inf
arr2[1,1] = np.nan # not a number
arr2[1,2] = np.inf # infinite
arr2
#> array([[ 1., 2., 3., 4.],
#>        [ 3., nan, inf, 6.],
#>        [ 5., 6., 7., 8.]])

# Replace nan and inf with -1. Don't use arr2 == np.nan
missing_bool = np.isnan(arr2) | np.isinf(arr2)
arr2[missing_bool] = -1
arr2
#> array([[ 1., 2., 3., 4.],
#>        [ 3., -1., -1., 6.],
#>        [ 5., 6., 7., 8.]])
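The reason for the warning against `arr2 == np.nan` is that nan never compares equal to anything, including itself. A small sketch with an illustrative array `arr`:

```python
import numpy as np

arr = np.array([1.0, np.nan, np.inf])

print(np.nan == np.nan)          # False
print((arr == np.nan).tolist())  # [False, False, False]
print(np.isnan(arr).tolist())    # [False, True, False]
print(np.isinf(arr).tolist())    # [False, False, True]
```

So an equality check silently finds no missing values at all, while np.isnan finds them.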
6. How to compute mean, min, max on the ndarray?
# mean, max and min
print("Mean value is: ", arr2.mean())
print("Max value is: ", arr2.max())
print("Min value is: ", arr2.min())
#> Mean value is: 3.58333333333
#> Max value is: 8.0
#> Min value is: -1.0
However, if you want to compute the minimum values row-wise or column-wise, pass the axis argument, for example to np.amin. The arr2.min() method accepts the same axis argument.
# Row wise and column wise min
print("Column wise minimum: ", np.amin(arr2, axis=0))
print("Row wise minimum: ", np.amin(arr2, axis=1))
#> Column wise minimum: [ 1. -1. -1. 4.]
#> Row wise minimum: [ 1. -1. 5.]

# Cumulative Sum
np.cumsum(arr2)
#> array([ 1., 3., 6., 10., 13., 12., 11., 17., 22., 28., 35., 43.])
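The axis argument is not specific to the minimum. The sketch below rebuilds arr2 as it stands at this point, under the illustrative name `arr`, and applies axis to a few other computations. Without axis, np.cumsum flattens the array first, which is why the result above is one-dimensional.

```python
import numpy as np

# Same values as arr2 at this point in the section
arr = np.array([[1, 2, 3, 4], [3, -1, -1, 6], [5, 6, 7, 8]], dtype='float')

col_min = arr.min(axis=0)            # one value per column
row_max = arr.max(axis=1)            # one value per row
row_cumsum = np.cumsum(arr, axis=1)  # running total within each row

print(col_min.tolist())     # [1.0, -1.0, -1.0, 4.0]
print(row_max.tolist())     # [4.0, 6.0, 8.0]
print(row_cumsum.tolist())  # [[1.0, 3.0, 6.0, 10.0], [3.0, 2.0, 1.0, 7.0], [5.0, 11.0, 18.0, 26.0]]
```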
7. How to create a new array from an existing array?
If you assign a slice of an array to a new variable, the new array does not get its own data. It is a view that refers to the parent array in memory.
That means any change you make to the new array is reflected in the parent array as well. To avoid disturbing the parent array, make a copy of it using copy(). All numpy arrays come with the copy() method.
# Assign portion of arr2 to arr2a. Doesn't really create a new array.
arr2a = arr2[:2,:2]
arr2a[:1, :1] = 100 # 100 will reflect in arr2
arr2
#> array([[ 100., 2., 3., 4.],
#>        [ 3., -1., -1., 6.],
#>        [ 5., 6., 7., 8.]])

# Copy portion of arr2 to arr2b
arr2b = arr2[:2, :2].copy()
arr2b[:1, :1] = 101 # 101 will not reflect in arr2
arr2
#> array([[ 100., 2., 3., 4.],
#>        [ 3., -1., -1., 6.],
#>        [ 5., 6., 7., 8.]])
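If you are unsure whether two arrays refer to the same data, np.shares_memory is one way to check. The sketch below uses illustrative names (`parent`, `view`, `copied`) rather than the arrays from this section.

```python
import numpy as np

parent = np.arange(12, dtype='float').reshape(3, 4)

view = parent[:2, :2]            # slicing returns a view on the parent's data
copied = parent[:2, :2].copy()   # copy() allocates separate data

print(np.shares_memory(parent, view))    # True
print(np.shares_memory(parent, copied))  # False

view[0, 0] = 100     # visible in parent
copied[0, 0] = 101   # not visible in parent
print(parent[0, 0])  # 100.0
```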
This covers the basics of numpy. The official numpy documentation is a great resource if you want to go further. This completes the first part of the numpy series. In the next part, I will cover the functionality needed for data analysis.