BRUSH UP YOUR NUMPY SKILLS

Published in Analytics Vidhya, Jul 2, 2021

Data science deals largely with structured tables of data. The scikit-learn package expects two-dimensional NumPy arrays as its input tables.

In this article, we’re going to review the basic concepts of the NumPy library that come in handy in bigger projects.

Shape and Dimensions of NumPy Arrays

1. To begin, import NumPy:

import numpy as np

2. Construct a 10-element NumPy array, equivalent to Python’s range(10):

np.arange(10)
 
Output: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

3. With only one pair of square brackets, the array resembles a flat Python list. This indicates that it has only one dimension. Store the array in a variable and check its shape:

array_one = np.arange(10)
array_one.shape
Output: (10,)

4. shape is a data attribute of the array. For array_one, the shape is the tuple (10,), which has length 1. The length of the tuple gives the number of dimensions, which is one in this case:

array_one.ndim   # Find the total number of dimensions of array_one
Output: 1

5. There are ten entries in the array. Call the reshape method to reshape the array:

array_one.reshape((5,2))
Output: array([[0, 1],
               [2, 3],
               [4, 5],
               [6, 7],
               [8, 9]])

6. The array is reshaped into a 5 x 2 array that looks like a list of lists (a three-dimensional NumPy array would resemble a list of lists of lists). The change was not stored, because reshape returns a new array object rather than modifying array_one in place. To keep the reshaped array, assign it back:

array_one = array_one.reshape((5,2))

7. It’s worth noting that array_one has become two-dimensional. This is to be expected, given that its shape contains two numbers and resembles a Python list of lists:

array_one.ndim
Output: 2
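
The relationship between shape, ndim, and reshape described in the steps above can be sketched as follows. The names flat, table, and inferred are illustrative and not part of the original walkthrough, and the -1 argument is an extra convenience the article does not cover.

```python
import numpy as np

flat = np.arange(10)            # one-dimensional, shape (10,)
table = flat.reshape((5, 2))    # reshape returns a new array object; flat keeps its shape

print(flat.shape, flat.ndim)    # (10,) 1
print(table.shape, table.ndim)  # (5, 2) 2

# Passing -1 asks NumPy to infer that dimension from the total number of elements.
inferred = flat.reshape((-1, 2))
print(inferred.shape)           # (5, 2)
```

Keeping the one-dimensional and two-dimensional arrays under separate names makes it easy to see that reshape does not change the array it is called on.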

Broadcasting in NumPy

8. By broadcasting, you can add any number (let’s take 1) to each array element. It’s important to note that updates to the array aren’t saved:

array_one + 1
Output: array([[ 1,  2],
               [ 3,  4],
               [ 5,  6],
               [ 7,  8],
               [ 9, 10]])

“Broadcasting” refers to the way the smaller operand (here, the single number 1) is stretched, or broadcast, across the bigger array.

9. Make a new array called array_two. Examine what happens when you multiply the array by itself (this is element-wise array multiplication, not matrix multiplication):

array_two = np.arange(10)
array_two * array_two
Output: array([ 0,  1,  4,  9, 16, 25, 36, 49, 64, 81])

10. Each component has been squared. Here, element-by-element multiplication has taken place. Here’s an illustration that’s a little more complicated:

array_two = array_two ** 2
# Note that this is equivalent to array_two * array_two
array_two = array_two.reshape((5,2))
array_two
Output: array([[ 0,  1],
               [ 4,  9],
               [16, 25],
               [36, 49],
               [64, 81]])

11. Let’s modify array_one too:

array_one = array_one + 1
array_one
Output: array([[ 1,  2],
               [ 3,  4],
               [ 5,  6],
               [ 7,  8],
               [ 9, 10]])

12. By putting an addition sign between the two arrays, you can now add array_one and array_two element by element:

array_one + array_two
Output: array([[ 1,  3],
               [ 7, 13],
               [21, 31],
               [43, 57],
               [73, 91]])
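
Broadcasting is not limited to single numbers. As a minimal sketch, assuming a hypothetical one-dimensional array called row whose length matches the number of columns, the same stretching applies across every row of a 5 x 2 array:

```python
import numpy as np

table = np.arange(10).reshape((5, 2))   # shape (5, 2)
row = np.array([100, 200])              # shape (2,), illustrative values

# The scalar 1 is broadcast to every element.
plus_one = table + 1

# The row is broadcast across all five rows because its length
# matches the last dimension of table.
shifted = table + row
print(shifted[0])   # [100 201]
print(shifted[4])   # [108 209]
```

If the lengths did not match (for example, a row of three values), NumPy would raise an error instead of guessing how to align the arrays.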

Initialization of dtypes and NumPy arrays

Aside from np.arange, there are various other ways to initialize NumPy arrays:

13. Using np.zeros, create a zeros array. The command np.zeros((2,5)) produces a 2 x 5 array of zeros:

np.zeros((2,5))
Output: array([[0., 0., 0., 0., 0.],
               [0., 0., 0., 0., 0.]])

14. Using np.ones, create an array of ones. To make sure the ones are of integer type, add a dtype argument with the value int (the older np.int alias was removed in NumPy 1.24). It’s worth noting that scikit-learn generally expects arrays with a floating-point dtype. Each element in a NumPy array has a type, which is specified by the dtype, and it is the same throughout the array. Here the integer type is used for each element in the array:

np.ones((2,5), dtype = int)
Output: array([[1, 1, 1, 1, 1],
               [1, 1, 1, 1, 1]])

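15. To allocate memory for an array of a certain size and dtype without filling in specific values, use np.empty. The contents are whatever happened to be in that memory, so the values are arbitrary and will not always be zeros (the older np.float alias was also removed in NumPy 1.24, so use float):

np.empty((2,5), dtype = float)
Output: array([[0., 0., 0., 0., 0.],
               [0., 0., 0., 0., 0.]])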

16. In short, np.zeros, np.ones, and np.empty all allocate memory for NumPy arrays. They differ only in the initial values the array holds.
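
To check which dtype an array ended up with, the dtype attribute can be inspected directly. The sketch below also uses astype to convert an integer array to floats, which is an addition not covered in the steps above. The variable names are illustrative.

```python
import numpy as np

zeros = np.zeros((2, 5))               # default dtype is float64
ones = np.ones((2, 5), dtype=int)      # Python int maps to a NumPy integer dtype
as_float = ones.astype(np.float64)     # astype returns a converted copy

print(zeros.dtype)      # float64
print(as_float.dtype)   # float64
print(np.issubdtype(ones.dtype, np.integer))  # True
```

The exact integer dtype behind int depends on the platform and NumPy version, which is why the sketch tests for "some integer type" rather than printing a specific name.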

NumPy Indexing

17. Using indexing, retrieve the values of the two-dimensional arrays:

array_one[0,0]   # Finds the value in the first row and first column
Output: 1

18. Look at the first row:

array_one[0,:]
Output: array([1, 2])

19. Now, let’s look at the 1st column:

array_one[:,0]
Output: array([1, 3, 5, 7, 9])

20. Slices can be used on both axes. Have a look at the third to fifth rows (indices 2, 3, and 4, since indexing starts at 0 and the end of a slice is excluded):

array_one[2:5, :]
Output: array([[ 5,  6],
               [ 7,  8],
               [ 9, 10]])

21. Look only at the third to fifth rows of the first column:

array_one[2:5,0]
Output: array([5, 7, 9])
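
The indexing rules used above can be collected in one short sketch. The array table is rebuilt here with the same values array_one holds at this point. The negative index is an addition that the steps above do not use.

```python
import numpy as np

table = np.arange(1, 11).reshape((5, 2))   # same values as array_one at this point

print(table[0, 0])         # 1
print(table[2:5, 0])       # [5 7 9]  (row indices 2, 3 and 4)
print(table[-1, :])        # [ 9 10]  (a negative index counts from the end)
print(table[:2, :].shape)  # (2, 2)
```

Because the end of a slice is excluded, 2:5 selects three rows, not four.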

Arrays of Booleans

NumPy also uses Boolean logic to control indexing:

22. First, make a Boolean array:

array_one > 5
Output: array([[False, False],
               [False, False],
               [False,  True],
               [ True,  True],
               [ True,  True]])

23. Placing the Boolean array inside the square brackets of the original array filters by it:

array_one[array_one > 5]
Output: array([ 6,  7,  8,  9, 10])
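
Boolean arrays can also be combined, and they can be used on the left side of an assignment. The following is a minimal sketch with illustrative names (mask, capped), not a step from the original walkthrough.

```python
import numpy as np

table = np.arange(1, 11).reshape((5, 2))

# Combine conditions with & (and) or | (or); each comparison needs its own parentheses.
mask = (table > 3) & (table % 2 == 0)
print(table[mask])      # [ 4  6  8 10]

# Assigning through a mask changes only the selected elements.
capped = table.copy()   # work on a copy so table stays unchanged
capped[capped > 5] = 5
print(capped.max())     # 5
```

Filtering with a mask always returns a one-dimensional array, since the selected elements no longer form a rectangular table.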

Operations based on Arithmetic

24. The sum method adds all of the array’s items together. Return to array_one:

array_one
Output: array([[ 1,  2],
               [ 3,  4],
               [ 5,  6],
               [ 7,  8],
               [ 9, 10]])

array_one.sum()
Output: 55

25. By row, find all the sums:

array_one.sum(axis = 1)
Output: array([ 3,  7, 11, 15, 19])

26. To find all the sums by column, use axis = 0:

array_one.sum(axis = 0)
Output: array([25, 30])

27. In a similar fashion, calculate the mean of each column. It’s worth noting that the dtype of the array of averages is float64:

array_one.mean(axis = 0)
Output: array([5., 6.])
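
One way to remember the axis argument is that it names the axis that gets collapsed. The sketch below illustrates this by checking the shapes of the results, and then reuses broadcasting to center each column. The centering step is an illustrative addition, not part of the original steps.

```python
import numpy as np

table = np.arange(1, 11).reshape((5, 2))

row_sums = table.sum(axis=1)     # collapses the column axis: one value per row
col_means = table.mean(axis=0)   # collapses the row axis: one value per column

print(row_sums.shape)    # (5,)
print(col_means.shape)   # (2,)
print(col_means.dtype)   # float64

# Subtracting the column means centers each column (broadcasting again).
centered = table - col_means
print(centered.mean(axis=0))   # [0. 0.]
```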

Null or NaN values

28. np.nan values are not accepted by scikit-learn. Take array_three for example:

array_three = np.array([np.nan, 0, 1, 2, np.nan])

29. The np.isnan function creates a Boolean array that can be used to find NaN values:

np.isnan(array_three)
Output: array([ True, False, False, False,  True])

30. Eliminate the NaN values by negating the Boolean array with the sign ~ and surrounding the expression with brackets:

array_three[~np.isnan(array_three)]
Output: array([0., 1., 2.])

31. Alternatively, set the NaN values to zero:

array_three[np.isnan(array_three)] = 0
array_three
Output: array([0., 0., 1., 2., 0.])
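
The assignment in step 31 changes array_three in place. If the original values need to be kept, one possible alternative is np.where, which builds a new array instead. This is a sketch with illustrative names (raw, dropped, filled), not the approach used in the steps above.

```python
import numpy as np

raw = np.array([np.nan, 0, 1, 2, np.nan])

dropped = raw[~np.isnan(raw)]                 # new array without the NaN entries
filled = np.where(np.isnan(raw), 0.0, raw)    # new array with NaN replaced by 0

print(dropped)               # [0. 1. 2.]
print(filled)                # [0. 0. 1. 2. 0.]
print(np.isnan(raw).sum())   # 2, because raw itself is unchanged
```

Note that dropping changes the length of the array while filling keeps it, which matters when the array has to stay aligned with other columns.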

Conclusion

Data, in its most basic form, consists of 2D tables of numbers, which NumPy excels at handling. If you lose track of the NumPy syntax, keep this in mind: scikit-learn accepts only 2D NumPy arrays of real-valued numbers with no missing (np.nan) values.

In my opinion, changing np.nan to a value works better than discarding data. It’s preferable to keep track of Boolean expressions and keep the form of the data consistent, as this results in fewer coding errors and therefore more flexibility when coding.
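
As a closing sketch, the conditions above (two dimensions, numeric values, no NaN) can be checked with a small helper before handing an array to a model. The function looks_sklearn_ready is hypothetical and written only for illustration. It is not part of NumPy or scikit-learn, and scikit-learn performs its own, more thorough input validation.

```python
import numpy as np

def looks_sklearn_ready(arr):
    """Hypothetical helper: True if arr is a 2D numeric array with no NaN values."""
    arr = np.asarray(arr)
    if arr.ndim != 2:
        return False
    if not np.issubdtype(arr.dtype, np.number):
        return False
    return not np.isnan(arr).any()

good = np.arange(10, dtype=float).reshape((5, 2))
bad = good.copy()
bad[0, 0] = np.nan

print(looks_sklearn_ready(good))           # True
print(looks_sklearn_ready(bad))            # False (contains NaN)
print(looks_sklearn_ready(np.arange(10)))  # False (one-dimensional)
```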