Fluent NumPy

Munish Goyal
Published in Analytics Vidhya
32 min read · Jul 29, 2020

Let’s uncover the practical details of NumPy

Photo by Volodymyr Hryshchenko on Unsplash

Note to the Readers: Paying attention to comments in examples would be more helpful than going through the theory itself.

· Introduction to NumPy N-dimensional array (ndarray)
· NumPy Scalars and Data Types
· NumPy Array (ndarray) Creation
· NumPy Array (ndarray) Attributes
· Indexing ndarrays
· Crashing, Stacking and Splitting
· Broadcasting in NumPy
· NumPy Array (ndarray) Methods
· NumPy Universal Functions: Math, Floating, Trigno, Bitwise, etc
· NumPy Datetimes and Timedeltas
· Applying Functions to NumPy ndarray
· Arithmetic, matrix multiplication, and comparison operations
· Iterating over Arrays: Using nditer Iterator
· Masked Arrays

NumPy’s main object is the homogeneous multidimensional array. It is a table of elements (usually numbers), all of the same type, indexed by a tuple of non-negative integers. Depending on the number of dimensions, it can be pictured as a row, a table, a rectangular cuboid, and so on.

Installation of NumPy:

The frequently used components of SciPy Stack are installed as:

python -m pip install --user numpy scipy matplotlib ipython jupyter pandas sympy nose
python -m pip install --user scikit-learn

But, you might like to install just numpy to start with.

Introduction to NumPy N-dimensional array (ndarray)

A numpy.ndarray (the type of array objects) is a (usually fixed-size) multidimensional container of items of the same type and size (that is, homogeneous). The number of dimensions and items in an array is defined by its shape, which is a tuple of N non-negative integers that specify the size of each dimension. The type of items in the array is specified by a separate data-type object (dtype), one of which is associated with each ndarray.

All ndarrays are homogeneous: every item takes up the same size block of memory, and all blocks are interpreted in exactly the same way.

Check The N-dimensional array for the overview.

NumPy Scalars and Data Types

Refer to the topics: Scalars, Data Types, and Data Type Objects (dtype).

Python defines only one type for each particular class of data (there is only one integer type, one floating-point type, etc.). In NumPy, there are 24 new fundamental Python types (the array scalar types) to describe different types of scalars.

A data type object (an instance of the numpy.dtype class) describes how the bytes in the fixed-size block of memory corresponding to an array item should be interpreted. To describe the type of scalar data, there are several built-in scalar types in NumPy for various precisions of integers, floating-point numbers, etc. An item extracted from an array, e.g., by indexing, will be a Python object whose type is the scalar type associated with the data type of the array. Note that scalar types are not data type (dtype) objects, even though they can be used in place of one whenever a data type specification is needed in NumPy.

For constructing data types, refer to Specifying and constructing data types.

Data types are objects; they have a string representation that can be used in their place as the value of a dtype argument, and they are similar to scalar types but not exactly the same. Also, the scalar types can be called as functions to convert Python numbers to array scalars. For example,
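A minimal sketch of these points (the variable names are illustrative and not from the original article):

```python
import numpy as np

# A dtype object, its string form, and the scalar type all specify the same type.
dt = np.dtype("int32")
a = np.array([1, 2, 3], dtype=dt)        # dtype object
b = np.array([1, 2, 3], dtype="int32")   # string representation
c = np.array([1, 2, 3], dtype=np.int32)  # scalar type used in place of a dtype

# A scalar type called as a function converts a Python number to an array scalar.
x = np.float32(1.5)
print(a.dtype, b.dtype, c.dtype, type(x))
```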

To determine the type of a NumPy array, look at its dtype attribute.

NumPy generally returns elements of arrays as array scalars (a scalar with an associated dtype). Array scalars differ from Python scalars, but for the most part, they can be used interchangeably. There are some exceptions, such as when code requires very specific attributes of a scalar or when it checks specifically whether a value is a Python scalar. Generally, problems are easily fixed by explicitly converting array scalars to Python scalars, using the corresponding Python type function (e.g., int, float, complex, str).

NumPy Array (ndarray) Creation

Refer to Routines in the NumPy Reference.

Creating NumPy arrays from existing data

Refer to NumPy Arrays from existing Data.

CONVERTING PYTHON ARRAY_LIKE OBJECTS TO NUMPY NDARRAYS

In general, numerical data arranged in an array-like structure in Python can be converted to arrays through the numpy.array() function.

Its most useful form is:

# `dtype` is string or `dtype` object
numpy.array(object, dtype=None)

Here, object is an array_like object (any object exposing the array interface), and dtype is the desired data type for the array. If it is not given, the type will be determined as the minimum type required to hold the objects in the sequence. This argument should generally be used only to up-cast the array; for down-casting, use the numpy.ndarray.astype() method.

For example,
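A small sketch of numpy.array() with and without an explicit dtype (the variable names are illustrative):

```python
import numpy as np

ints = np.array([1, 2, 3])                     # dtype inferred as an integer type
floats = np.array([1, 2, 3], dtype="float64")  # explicit up-cast to float64
nested = np.array([[1, 2], [3, 4.5]])          # mixed ints and floats become float64
print(ints.dtype, floats.dtype, nested.dtype, nested.shape)
```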

Most useful form of numpy.ndarray.astype() method is:

# `dtype` is `dtype` object or its string representation
<ndarray>.astype(dtype)

For example,
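A short sketch of down-casting with astype() (illustrative values; note that the conversion from float to integer truncates toward zero):

```python
import numpy as np

values = np.array([1.7, -2.2, 3.9])
as_int = values.astype("int64")  # returns a new array; fractions are dropped
print(as_int)        # values 1, -2, 3
print(values.dtype)  # float64, the original array is unchanged
```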

READING NDARRAYS FROM DISK

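Comma Separated Value (CSV) files are widely used. There are a number of ways of reading these files in Python. Within NumPy, the usual choices are the numpy.loadtxt() and numpy.genfromtxt() functions; numpy.fromfile() is intended for raw binary data or very simply formatted text files.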

I/O WITH NUMPY

NumPy provides several functions to create arrays from tabular data. One of the most important ones is genfromtxt.

In a nutshell, genfromtxt runs two main loops. The first loop converts each line of the file into a sequence of strings. The second loop converts each string to the appropriate data type. This mechanism is slower than a single loop but gives more flexibility. In particular, genfromtxt is able to take missing data into account, whereas faster and simpler functions like loadtxt cannot.
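As a hedged illustration of the missing-data handling, the sketch below feeds genfromtxt a made-up two-line CSV held in memory (the content and variable names are assumptions for the example only):

```python
import io
import numpy as np

# Hypothetical CSV content with one missing field in the second row.
text = io.StringIO("1,2,3\n4,,6")
data = np.genfromtxt(text, delimiter=",")  # the missing field becomes nan
print(data)
```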

Intrinsic NumPy Array Creation: Empty, Ones, Zeros, Sequences, and Matrices

Check Array creation routines in NumPy Reference.

Some examples:
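A few of the creation routines, as a minimal sketch (variable names are illustrative):

```python
import numpy as np

zeros = np.zeros((2, 3))             # 2x3 array of 0.0
ones = np.ones((2, 3), dtype=int)    # 2x3 array of integer 1
empty = np.empty((2, 2))             # uninitialized memory, contents are arbitrary
steps = np.arange(0, 10, 2)          # 0, 2, 4, 6, 8 (stop is exclusive)
points = np.linspace(0.0, 1.0, 5)    # 0, 0.25, 0.5, 0.75, 1 (stop is inclusive)
identity = np.eye(3)                 # 3x3 identity matrix
diagonal = np.diag([1, 2, 3])        # 3x3 matrix with 1, 2, 3 on the diagonal
```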

Generating random samples using numpy.random

Refer to Random Sampling.

There are libraries that can be used to generate arrays for special purposes, and it isn’t possible to enumerate all of them. The most common are the many array generation functions in numpy.random, which generate arrays of random values, and some utility functions that generate special matrices (e.g., diagonal).

For example,
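A minimal sketch using the Generator interface from numpy.random (the seed and variable names are arbitrary choices for the example; the original article may have used the older np.random.* functions instead):

```python
import numpy as np

rng = np.random.default_rng(seed=0)              # seeded for reproducibility
uniform = rng.random((2, 3))                     # floats in [0, 1)
normal = rng.normal(loc=0.0, scale=1.0, size=4)  # samples from a normal distribution
integers = rng.integers(low=0, high=10, size=5)  # integers in [0, 10)
```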

NumPy Array (ndarray) Attributes

Array attributes reflect information that is intrinsic to the array itself. Generally, accessing an array through its attributes allows you to get and sometimes set intrinsic properties of the array without creating a new array. The exposed attributes are the core parts of an array and only some of them can be reset meaningfully without creating a new array. Information on each attribute is given below:

The following attributes contain information about the memory layout of the array:

The data type object associated with the array can be found in the dtype attribute.

Other attributes:
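A short sketch of commonly used attributes on one illustrative array (the array itself is an assumption for the example):

```python
import numpy as np

a = np.arange(12, dtype=np.int64).reshape(3, 4)

print(a.shape)     # (3, 4)   size of each dimension
print(a.ndim)      # 2        number of dimensions
print(a.size)      # 12       total number of elements
print(a.dtype)     # int64    data type object
print(a.itemsize)  # 8        bytes per element
print(a.nbytes)    # 96       total bytes (size * itemsize)
print(a.strides)   # (32, 8)  bytes to step along each axis
print(a.T.shape)   # (4, 3)   transposed view
```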

Indexing ndarrays

Array indexing refers to any use of the square brackets, [], to index array values.

Visualizing ndarray Indexing

RELATION OF NDARRAY DIMENSIONS WITH AXIS ORIENTATION, AXIS NAMES, AND AXIS NUMBERS

Remember that:

  • The innermost elements of an n-dimensional array lie horizontally, the second-innermost elements lie vertically, and so on. So, in an n-dimensional array, axis=(n-2) represents the y-axis, and axis=(n-1) always represents the x-axis.
  • Different points on the y-axis == different row numbers
  • Different points on the x-axis == different column numbers
  • 1D array: [column] == [x]
      - an array with elements lying horizontally
      - column (x) is axis=0
      - same as how it is shown in the Python console (horizontal array)
      - has shape (length,), which is more like (1, length) than (length, 1)
  • 2D array: [row, column] == [y, x]
      - an array of arrays, with the innermost array elements lying horizontally
      - row (y) is axis=0, column (x) is axis=1
      - same as how it is shown in the Python console (tabular)
  • 3D array: [height (z, towards inside the screen), row (y, towards down the screen), column (x, towards right-side of screen)] == [z, y, x]
      - an array of arrays of arrays, with the innermost array elements lying horizontally
      - height (z) is axis=0, row (y) is axis=1, and column (x) is axis=2
      - the Python console first shows you a table at z=0, then at z=1, and so on
  • Consider the X-Y plane to be like a table.
  • Note that the last index (the x-index) represents the most rapidly changing memory location. The second-last index is the y-index, the third-last index is the z-index, and so on.
  • While re-arranging data in different shapes, always consider the X-axis as the units place (10^0) of a number. Similarly, you may consider the Y-axis as the tens place (10^1) and the Z-axis as the hundreds place (10^2).
  • Note that the shape (x,) is not the same as (x,1), but is more like (1, x), or (1, 1, x), and so on. You can think of all of these 1s at the beginning of the shape as redundant 0s that can be placed before a number.
  • The general form of the shape is: (..., z, y, x)

IMPORTANT FACTS ABOUT NDARRAY INDEXING

If we index a multi-dimensional array with fewer indices than its dimensions, we get a sub-array with fewer dimensions. If we index an n-dimensional array with r indexes, then those r indexes are taken as the first r indexes (not the last r indexes). For example, for a 3D array named x, the indexing x[2] will choose the z-coordinate (that is, the first index) as 2.

As the result of slicing (this is a special kind of indexing), the returned sub-array is not a copy of the original, but it is a view and points to the same values in memory as does the original array.

We can assign an array of the same dimensions (or a smaller array that can be broadcast to the required dimensions, or even a single element, as it can be broadcast to any required dimensions) directly to a slice (without first saving the slice to a variable and using that variable), and hence alter the original array.

If we save that slice to some other variable, it will be still a view (which means that altering any of its elements will actually alter the element from the original array). This NumPy ndarray behavior is different from Python’s basic list. But, if we assign some new object (array, just one number, or anything) to the variable itself, then it is re-assignment to that variable (and it will lose its pointer to the slice object).
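The view behavior described above can be sketched as follows (names are illustrative), with a plain Python list shown for contrast:

```python
import numpy as np

arr = np.arange(5)
view = arr[1:4]      # a view onto arr, no data is copied
view[0] = 100        # writes through to arr[1]
arr[2:4] = 7         # a scalar is broadcast into the slice

lst = [0, 1, 2, 3, 4]
part = lst[1:4]      # slicing a list makes a copy
part[0] = 100        # lst is unchanged

view = 0             # rebinding the name; arr is not touched
print(arr, lst)
```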

In case of indexing with index-arrays or indexing with boolean-arrays, the result (if you save) is a copy (not view) of original array. Also generally speaking, what is returned after indexing with index-arrays is of the same shape as index-arrays, but with type and values of the array being indexed.

Regarding peculiarities of how indexing is done with index-arrays and boolean-arrays, check corresponding sections.

One can assign directly to a subset of an array, without first saving that subset to another variable, using any form of indexing: a single index, slices, index-arrays, or boolean-arrays (yes, even with index-arrays, boolean-arrays, and a single index). The value being assigned must be shape-consistent with the selection (the same shape, or broadcastable to the shape the index produces). Refer to the section Assigning values to indexed arrays for an explanation of this concept.

Single element indexing for ndarrays

Single element indexing for a 1D array works exactly like that for other standard Python sequences. It is 0-based, and accepts negative indices for indexing from the end of the array.

Unlike lists and tuples, NumPy arrays support multi-dimensional indexing for multidimensional arrays.

Check following on how simple indexing works:
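A minimal sketch of single-element indexing (the arrays are illustrative):

```python
import numpy as np

a = np.arange(10)
first = a[2]       # 2, indexing is 0-based
last = a[-1]       # 9, negative indices count from the end

b = np.arange(12).reshape(3, 4)
elem = b[1, 2]     # 6, one index per dimension: row 1, column 2
same = b[1][2]     # 6, works too, but builds the temporary row b[1] first
corner = b[-1, -1] # 11
row = b[1]         # fewer indices than dimensions gives a sub-array
```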

Slicing an ndarray

If we index a multi-dimensional array with fewer indices than its dimensions, we get a sub-array with fewer dimensions. Each index specified selects the array corresponding to the rest of the dimensions selected. For details, check Important facts about ndarray indexing section. For example,

Check below indexing as well:

SLICING-AND-STRIDING AN NDARRAY

It is possible to slice and stride arrays (using the start:stop:step notation inside the square brackets, with stop non-inclusive) to extract arrays of the same number of dimensions, but of different sizes than the original. Slicing and striding work for NumPy arrays exactly as they do for basic Python lists and tuples, except that they can be applied to multiple dimensions as well. For example,
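A small sketch of slicing and striding in one and two dimensions (the arrays are illustrative):

```python
import numpy as np

a = np.arange(10)
evens = a[2:8:2]        # elements at positions 2, 4, 6
tail = a[::-1][:3]      # reverse, then take the first three: 9, 8, 7

b = np.arange(20).reshape(4, 5)
block = b[1:3, ::2]     # rows 1 and 2, every second column
print(evens, tail, block.shape)
```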

Indexing with Index-Arrays and Boolean-Arrays

It is possible to index NumPy arrays with another array for the purpose of selecting lists of values out of arrays into new arrays. Index arrays are a very powerful tool that allows one to avoid looping over individual elements in arrays and thus greatly improve performance. There are two ways of accomplishing this: one uses one or more arrays of index values (so these kinds of arrays are called index-arrays), and the other involves giving a boolean-array of the proper shape to indicate the values to be selected.

INDEXING WITH INDEX-ARRAYS

In indexing an array with index-arrays what is returned is a copy of the original data, not a view as one gets for slices. This is due to the reason that indexing with index-arrays can not be considered as a slice, as indexes in the index-array may be out of order or even some indexes can appear multiple times. For details, check Important facts about ndarray indexing section.

Index-arrays must be of integer type (the examples here use 1D index-arrays). Each value in an index-array indicates which element of the target array to use along the given dimension. There is a separate, same-shaped index-array for each dimension: in a similar fashion to slicing, the first index-array indexes the leftmost dimension, and so on. If fewer index-arrays are passed than the target array has dimensions, the left-out rightmost dimensions are taken as complete slices (:).

Generally speaking, what is returned when index-arrays are used is a new array with the same shape as that of each index-array, but with the type and values of the target array being indexed. But, when one or more dimensions are slices (or complete slices), then the resultant shape is different as discussed in Combining Index-Arrays or Boolean-Arrays with Slices section.

In the case of Python list, to get items from given multiple indexes one would use slice notation (p:r), or have to use operator.itemgetter function.

Indexing 1D arrays with index-arrays:

Note that for np.ndarray, x[1:4] (indexing outer-most dimension with 1:4) is similar (but not same) as x[[1, 2, 3]] but is entirely different from x[1, 2, 3] (slicing with index 1 for outer-most, 2 for 2nd outer-most, and 3 for 3rd outer-most dimension).

Check the following examples:
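A minimal sketch with a 1D target array (values chosen only for illustration):

```python
import numpy as np

x = np.arange(10, 0, -1)             # 10, 9, 8, ..., 1

by_slice = x[1:4]                    # view: 9, 8, 7
by_index = x[[1, 2, 3]]              # copy with the same values: 9, 8, 7
picked = x[np.array([3, 3, -1, 8])]  # repeats and negatives allowed: 7, 7, 1, 2

print(np.shares_memory(x, by_slice))  # True
print(np.shares_memory(x, by_index))  # False
```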

Negative index values are permitted and work as they do with single indices or slices.

It is an error to have index values out of bounds:
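A short sketch of the failure mode (the array and index values are illustrative):

```python
import numpy as np

x = np.arange(5)
try:
    x[np.array([1, 7])]   # 7 is outside the valid range 0..4
except IndexError as exc:
    message = str(exc)
    print(message)
```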

Indexing multi-dimensional arrays using index-arrays:

There are two ways in which index-arrays can be used to index multi-dimensional arrays (the concepts are already covered earlier):

  • using one 1D index-array for any one of the target array dimensions (and using : for the other dimensions): the result is a slice-like array (which is actually a copy rather than a view of the target array) with the specified dimension reduced in size
  • using 1D index-arrays, all of the same shape or broadcastable to the same shape (such as [z1, z2, z3], [y1, y2, y3], and [x1, x2, x3]), one for each target array dimension: the result is a 1D array (copy) whose length equals the size of each 1D index-array (with elements corresponding to indexes [z1, y1, x1], [z2, y2, x2], and [z3, y3, x3])

Another example,

Another example of indexing multi-dimensional array with an array of indices:

It is possible to only partially index an array with index-arrays. For example,
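A minimal sketch on an illustrative 5x7 array, with full indexing shown alongside for contrast:

```python
import numpy as np

y = np.arange(35).reshape(5, 7)

# Partial indexing: only the first (row) dimension is indexed.
rows = y[np.array([0, 2, 4])]                         # shape (3, 7)

# Full indexing: one index-array per dimension selects individual elements.
points = y[np.array([0, 2, 4]), np.array([0, 1, 2])]  # y[0,0], y[2,1], y[4,2]
print(rows.shape, points)
```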

What results is the construction of a new array where each value of the index-array selects one row from the array being indexed and the resultant array has the resulting shape (number of index elements, size of the row).

In general, the shape of the resultant array will be the concatenation of the shape of the index-array (or the shape that of all the index-array were broadcast to) with the shape of any unused dimensions (those not indexed) in the array being indexed.

INDEXING WITH BOOLEAN-ARRAYS (OR MASK-INDEX-ARRAYS)

Similar to index-arrays, when boolean-arrays are used for individual dimensions:

  • each boolean-array must be a 1D array
  • the length of each boolean-array should match the corresponding dimension for which it is used
  • all boolean-arrays must have the same number of True values (similar to same-shaped index-arrays)
  • rightmost dimensions implicitly use a complete slice if the number of passed boolean-arrays is less than the number of dimensions of the target array
  • the produced output is:
      - if slices are not used for any dimension (implicitly or explicitly; that is, there is a boolean-array for each dimension), a new 1D array whose size equals the number of True values in each boolean-array
      - if slices are used for any dimension(s), an array (copy) with a shape as discussed in the Combining Index-Arrays or Boolean-Arrays with Slices section

Apart from this, unlike index-arrays, boolean-array of the same shape as that of target array can be used for indexing: the result is 1D array (copy) of values from target array for which the corresponding index in boolean-array is True.

For example,
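A minimal sketch of the three cases on an illustrative 5x7 array (mask names are assumptions for the example):

```python
import numpy as np

y = np.arange(35).reshape(5, 7)

# Boolean-array of the same shape as the target: 1D copy of matching values.
large = y[y > 30]                       # 31, 32, 33, 34

# 1D boolean-array for the first dimension only: selects whole rows.
row_mask = np.array([False, False, False, True, True])
rows = y[row_mask]                      # shape (2, 7)

# One 1D boolean-array per dimension, each with two True values.
col_mask = np.array([True, False, False, False, False, False, True])
pairs = y[row_mask, col_mask]           # y[3, 0] and y[4, 6]
```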

Note that the length of the 1D boolean array must match with the length of the dimension (or axis) you want to slice.

In indexing an array with boolean-arrays, as with index-arrays, what is returned is a copy of the original data, not a view as one gets for slices. For details, check Important facts about ndarray indexing section.

In general, when the boolean array has fewer dimensions than the array being indexed, this is equivalent to y[b, ...], which means y is indexed by b followed by as many : as are needed to fill out the rank of y. Thus the shape of the result is one dimension containing the number of True elements of the boolean array, followed by the remaining dimensions of the array being indexed.

Combining Index-Arrays or Boolean-Arrays with Slices

Index-arrays may be combined with slices. The slice notation (such as p:r) can be used instead of an index-array for a dimension, and it also gives flexibility in the size of that particular dimension, which need not match the size of the other index-arrays.

If the slice is used for at least one dimension, then the output array has the same number of dimensions as that of the original target array (but with the reduced shape).

This case can be thought of as y[:, 1:3][[0, 2, 4], :], which means first getting a slice with y[:, 1:3], and then indexing that array with [[0, 2, 4], :].
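The equivalence can be checked directly. The sketch below assumes a 5x7 array named y, which is a guess at the article's lost example:

```python
import numpy as np

y = np.arange(35).reshape(5, 7)

combined = y[np.array([0, 2, 4]), 1:3]   # index-array for rows, slice for columns
two_step = y[:, 1:3][[0, 2, 4], :]       # slice first, then index the rows
print(combined)
print(np.array_equal(combined, two_step))  # True
```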

Likewise, slicing can be combined with boolean-arrays:

Now, consider a below mind-blowing example:

Fluidity: Determining the shape of an index operation on ndarray

Let’s discuss how to determine the dimension of the output array, which is a bit tricky. We’ll be taking the example of 3D array and we are going to select elements with given z, y, and x values ([z, y, x]).

Here are some Rules:

  • Rule: using the scalar index, or complete slice, : (only :, with no starting and ending position) makes that dimension fluid.
  • Rule: when fewer indices are provided than the number of axes, the missing indices (starting from x, then y, and so on) are considered complete slices, :.
  • Definition: fluid(r) is any integer in the closed range [0, r]; the value chosen for a dimension is whatever fits the final shape best (the one that gives the simplest shape).
  • Rule: the simplest shape is the one which has the least number of dimensions with higher values towards the right, out of all possibilities.
  • Rule: the order in which selected elements are filled in a mold is: first elements at x-axis (the right-most axis), then at the y-axis, and then at a z-axis, and so on.
  • Rule: while filling up elements, initially consider the fluid dimensions to be as large as corresponding original array, then trim un-used size of fluid dimensions.
  • Rule: Aim is to find the simplest shape from allowed shapes; for a dimension, the maximum value can be as large as the corresponding original target array dimension.
  • Note: for dimension with scalar index, its final size is fluid(1) (that is 0 or 1).
  • Note: for dimension with : as index, its final size is fluid(size of corresponding dimension in original array).

Concepts as an example:

Structural indexing tools: numpy.newaxis, and Ellipsis

To facilitate the easy matching of array shapes in expressions and in assignments, the np.newaxis object can be used within array indices to add new dimensions with a size of 1. For example,
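A minimal sketch (array names are illustrative) showing how np.newaxis changes the shape without adding elements, and how that helps combine arrays:

```python
import numpy as np

a = np.arange(3)          # shape (3,)
row = a[np.newaxis, :]    # shape (1, 3)
col = a[:, np.newaxis]    # shape (3, 1)

# A (3, 1) array and a (1, 3) array broadcast to a (3, 3) result.
grid = col + row
print(row.shape, col.shape, grid.shape)
```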

Note that in the above examples, there are no new elements in the array; only the dimensionality is increased. This can be handy for combining two arrays in a way that would otherwise require explicit reshaping operations. Also, the reshaping operation might not be that intuitive. For example,

The Ellipsis (or ...) syntax may be used to select in full any remaining unspecified dimensions.

For example,
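A minimal sketch on an illustrative 4D array:

```python
import numpy as np

z = np.arange(81).reshape(3, 3, 3, 3)

head = z[1, ...]       # same as z[1, :, :, :]
tail = z[..., 1]       # same as z[:, :, :, 1]
both = z[1, ..., 1]    # same as z[1, :, :, 1]
print(head.shape, tail.shape, both.shape)  # (3, 3, 3) (3, 3, 3) (3, 3)
```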

Dealing with variable numbers of indices within programs: Tuples, np.nonzero(), Slices, and Ellipsis

The index which is passed to NumPy’s ndarray, is basically a tuple. For example,

import numpy as np

z = np.arange(81).reshape(3,3,3,3)
indices = (1,1,1,1)
z[1, 1, 1, 1] #=> 40
z[(1, 1, 1, 1)] #=> 40 # Pay heed here
z[indices] #=> 40 # Interesting
# z[*indices] # SyntaxError before Python 3.11; same as z[indices] from 3.11 on

So one can use code to construct tuples of any number of indices and then use these within an index.

Slices can be specified within programs by using the slice() function of Python. For example,

indices = (1,1,1,slice(0,2))      # same as [1,1,1,0:2]
z[indices] #=> array([39, 40])

Similarly, you can use slice(0, 10, 3) instead of 0:10:3.

Likewise, the ellipsis can be specified by code using the Ellipsis (or ...) object:

indices = (1, Ellipsis, 1)        # same as [1,...,1]
z[indices]
# array([[28, 31, 34],
# [37, 40, 43],
# [46, 49, 52]])

For this reason, it is possible to use the output from the np.nonzero() function directly as an index since it always returns a tuple of index arrays. For example,
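A minimal sketch (the array is illustrative) showing both the function and the method form:

```python
import numpy as np

a = np.array([[0, 5, 0],
              [7, 0, 9]])

idx = np.nonzero(a)                  # tuple of index arrays, one per dimension
values = a[idx]                      # 5, 7, 9
same_idx = a.nonzero()               # method form returns the same tuple
big = a[np.nonzero(a > 6)]           # 7, 9
print(idx, values, big)
```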

As can be seen above, nonzero() can also be called as a method of ndarray.

Another example:

Refer to Searching for other searching functions.

Because of the special treatment of tuples, they are not automatically converted to an array as a list would be. For example,

z = np.arange(27).reshape(3,3,3)
z
# array([[[ 0, 1, 2],
# [ 3, 4, 5],
# [ 6, 7, 8]],
#
# [[ 9, 10, 11],
# [12, 13, 14],
# [15, 16, 17]],
#
# [[18, 19, 20],
# [21, 22, 23],
# [24, 25, 26]]])
z[(1, 1, 1)] #=> 13
z[[1, 1, 1]]
# array([[[ 9, 10, 11],
# [12, 13, 14],
# [15, 16, 17]],
#
# [[ 9, 10, 11],
# [12, 13, 14],
# [15, 16, 17]],
#
# [[ 9, 10, 11],
# [12, 13, 14],
# [15, 16, 17]]])

Assigning values to indexed arrays in NumPy

We discussed assignment to a subset of an array in the Important facts about ndarray indexing section. Let’s see it with examples:

For example,
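A minimal sketch of assigning through each kind of index (values are illustrative). The last line also shows a float being truncated when stored in an integer array:

```python
import numpy as np

x = np.arange(10)

x[2:5] = 0               # scalar broadcast into a slice
x[[7, 8]] = [70, 80]     # index-array on the left-hand side
x[x > 50] = -1           # boolean-array on the left-hand side
x[0] = 1.9               # stored as 1, because x has an integer dtype
print(x)
```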

Note that assignments may result in changes if assigning higher types to lower types (like floats to ints) or even exceptions (assigning complex to floats or ints).

Unlike some references (such as the results of indexing with index-arrays or boolean-arrays, which are copies), assignments are always made to the original data in the array. Some actions may not work as one may naively expect, though. For example,
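A sketch of one such surprise: an in-place operation with a repeated index. The use of np.add.at as the accumulating alternative is an addition of this sketch, not something the article mentions:

```python
import numpy as np

x = np.arange(0, 50, 10)            # 0, 10, 20, 30, 40
x[np.array([1, 1, 3, 1])] += 1      # index 1 is incremented only once
print(x)                            # 0, 11, 20, 31, 40

y = np.arange(0, 50, 10)
np.add.at(y, np.array([1, 1, 3, 1]), 1)   # unbuffered: repeats accumulate
print(y)                            # 0, 13, 20, 31, 40
```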

Crashing, Stacking and Splitting

Crashing an ndarray along a given axis

Crashing axis=r while using aggregating functions:
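A minimal sketch (the array is illustrative): the axis passed to an aggregating function is the one that disappears from the shape.

```python
import numpy as np

a = np.arange(24).reshape(2, 3, 4)

print(a.sum(axis=0).shape)                 # (3, 4)
print(a.sum(axis=1).shape)                 # (2, 4)
print(a.sum(axis=2).shape)                 # (2, 3)
print(a.sum(axis=(1, 2)).shape)            # (2,)
print(a.sum(axis=1, keepdims=True).shape)  # (2, 1, 4), axis kept with size 1
```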

Stacking ndarrays along a given axis

Stacking horizontally (appending columns) and vertically (appending rows) 1D and 2D arrays using numpy.hstack() and numpy.vstack() functions:
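A minimal sketch with illustrative 1D and 2D arrays:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
flat = np.hstack((a, b))     # shape (6,): 1D arrays are joined end to end
table = np.vstack((a, b))    # shape (2, 3): each 1D array becomes a row

m = np.arange(6).reshape(2, 3)
wide = np.hstack((m, m))     # shape (2, 6): columns appended
tall = np.vstack((m, m))     # shape (4, 3): rows appended
```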

Stacking multi-dimensional arrays along a given axis using numpy.concatenate():
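A minimal sketch (shapes are illustrative): all dimensions except the concatenation axis must match.

```python
import numpy as np

p = np.zeros((2, 3, 4))
q = np.ones((2, 1, 4))

joined = np.concatenate((p, q), axis=1)   # shape (2, 4, 4)
doubled = np.concatenate((p, p), axis=0)  # shape (4, 3, 4)
print(joined.shape, doubled.shape)
```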

Splitting an ndarray

Splitting an array along a given axis into a given number of sub-arrays of near-equal size can be done with numpy.array_split():
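A minimal sketch (the arrays are illustrative):

```python
import numpy as np

a = np.arange(10)
parts = np.array_split(a, 3)     # sizes 4, 3, 3: uneven splits are allowed

m = np.arange(12).reshape(3, 4)
left, right = np.array_split(m, 2, axis=1)   # two (3, 2) halves
print([p.tolist() for p in parts])
```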

Broadcasting in NumPy

Refer: https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html

The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations.

During an operation on two arrays, NumPy compares their shapes element-wise. It starts with the trailing dimensions and works its way forward. Two dimensions are compatible when: 1. they are equal, or 2. one of them is 1

Note: In the R language, broadcasting (recycling) even works when the size of one array is a multiple of the other's (such as shapes (2,) and (4,)), but this doesn’t work in NumPy.

But, note that arrays do not need to have the same number of dimensions. For example, a 1D array of shape (n,) can be considered as a 2D array of shape (1,n), or a 3D array of shape (1,1,n), and so on. Similarly, a scalar value can be considered as a 1D array of shape (1,), or a 2D array of shape (1,1), or a 3D array of shape (1,1,1), and so on.

If these conditions are not met, a ValueError: operands could not be broadcast together exception is thrown, indicating that the arrays have incompatible shapes.

The size of the resulting array is the maximum size along each dimension of the input arrays.

Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python. It does it without making needless copies of data and usually leads to efficient algorithm implementations.

NumPy arithmetic operations are done on pairs of arrays on an element-by-element basis. In the simplest case, the two arrays must have exactly the same shape. But, NumPy’s broadcasting rule relaxes this constraint when the arrays’ shapes meet certain constraints. The simplest broadcasting example occurs when an array and a scalar value are combined in an operation.

For example,
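A minimal sketch comparing an explicit same-shape array with a scalar (values are illustrative):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 2.0, 2.0])

with_array = a * b      # element-by-element, shapes match exactly
with_scalar = a * 2.0   # the scalar is broadcast across a
print(with_array, with_scalar)
```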

Notice that both the results are equivalent. We can think of scalar 2 being stretched during the arithmetic operation into an array with the same shape as a. The stretching analogy is only conceptual. NumPy is smart enough to use the original scalar value without actually making copies so that broadcasting operations are as memory and computationally efficient as possible.

Some examples of compatible arrays and size of result (for arithmetic operation):

A:      256 x 256 x 3
B:                  3
---------------------
Result: 256 x 256 x 3
---------------------
A:      8 x 1 x 6 x 1
B:          7 x 1 x 5
---------------------
Result: 8 x 7 x 6 x 5
---------------------
A:         15 x 3 x 5
B:              3 x 1
---------------------
Result:    15 x 3 x 5
---------------------

For example,
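The shape rules can be checked with arrays of ones (a sketch; only the shapes matter here):

```python
import numpy as np

r1 = np.ones((256, 256, 3)) + np.ones(3)
r2 = np.ones((8, 1, 6, 1)) + np.ones((7, 1, 5))
r3 = np.ones((15, 3, 5)) + np.ones((3, 1))
print(r1.shape, r2.shape, r3.shape)

# Incompatible trailing dimensions (3 vs 4) raise a ValueError.
try:
    np.ones(3) + np.ones(4)
except ValueError as exc:
    error = str(exc)
    print(error)
```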

Broadcasting provides a convenient way of taking the outer product (or any other outer operation) of two arrays. The following example shows an outer addition operation of two 1D arrays:

import numpy as np

a = np.array([0.0, 10.0, 20.0, 30.0])
b = np.array([1.0, 2.0, 3.0])
a[:, np.newaxis] + b
# array([[ 1.,  2.,  3.],
#        [11., 12., 13.],
#        [21., 22., 23.],
#        [31., 32., 33.]])

Here the newaxis index operator inserts a new axis into a, making it a 2D 4x1 array. Combining the 4x1 array with b, which has shape (3,), yields a 4x3 array.

NumPy Array (ndarray) Methods

Refer to Array Methods in the NumPy Reference.

View (view()) and Copy (copy()) for ndarray

The view() method creates a new array object that looks at the same data.

Slicing an array returns a view of it.

The copy() method makes a complete copy of the array and its data.

Indexing using index-arrays or boolean-arrays returns a new array. For details check sections Index with Index-Arrays and Boolean-Arrays and Assigning values to indexed arrays.
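The four cases above can be compared in one sketch (names are illustrative):

```python
import numpy as np

a = np.arange(6)
v = a.view()        # new array object, same data
s = a[:3]           # slice: also a view
c = a.copy()        # independent data
f = a[[0, 1, 2]]    # index-array: a new array with its own data

v[0] = 99           # visible through a and s, but not through c or f
print(a[0], s[0], c[0], f[0])   # 99 99 0 0
```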

Shape Manipulation of ndarray

Refer: https://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html#shape-manipulation

For reshape, resize, and transpose, the single tuple argument may be replaced with n integers which will be interpreted as n-tuple.

Following are shape manipulation related attributes and methods:

Concepts as an example:
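A minimal sketch of a few shape manipulation methods (this is not the article's lost example; the array is illustrative):

```python
import numpy as np

a = np.arange(6)

b = a.reshape(2, 3)        # n integers, same as a.reshape((2, 3))
c = a.reshape(3, -1)       # -1 lets NumPy infer the remaining size
t = b.T                    # transposed view, shape (3, 2)
r = b.ravel()              # flattened; a view when the layout allows it
f = b.flatten()            # flattened; always a copy
print(b.shape, c.shape, t.shape, r.shape, f.shape)
```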

Other ndarray Methods

Refer to Array Methods in the NumPy Reference.

NumPy Universal Functions (ufunc): Math, Floating, Trigno, Bitwise, Comparison Functions

Refer to the ufunc reference in the NumPy Reference.

A universal function (or ufunc for short) is a function that operates on ndarrays in an element-by-element fashion, supporting array broadcasting, typecasting, and several other standard features. That is, a ufunc is a “vectorized” wrapper for a function that takes a fixed number of specific inputs and produces a fixed number of specific outputs.

In NumPy, universal functions are instances of numpy.ufunc class. Many of the built-in functions are implemented in compiled C code.
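A minimal sketch of a few ufuncs in use (the input array is illustrative):

```python
import numpy as np

a = np.array([1, 4, 9])

print(isinstance(np.add, np.ufunc))  # True
roots = np.sqrt(a)                   # element-wise: 1.0, 2.0, 3.0
shifted = np.add(a, 1)               # the scalar is broadcast: 2, 5, 10
flags = np.greater(a, 3)             # comparison ufunc: False, True, True
total = np.add.reduce(a)             # ufunc method: 1 + 4 + 9
```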

NumPy Datetimes and Timedeltas

Refer to the Datetimes and Timedeltas section of the NumPy Reference.

Applying Functions to NumPy ndarray

The numpy.apply_along_axis function

Its general form is:

numpy.apply_along_axis(func1d, axis, arr, *args, **kwargs)

The function apply_along_axis applies the function func1d(arr, *args, **kwargs) to 1D slices along the given axis, and returns a new array.

For example,
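A minimal sketch, where value_range is a hypothetical helper written for this example:

```python
import numpy as np

def value_range(v):
    """Hypothetical helper: spread of a 1D slice."""
    return v.max() - v.min()

m = np.array([[1, 2, 3],
              [4, 5, 6]])

per_column = np.apply_along_axis(value_range, 0, m)  # slices run down axis 0
per_row = np.apply_along_axis(value_range, 1, m)     # slices run along axis 1
print(per_column, per_row)
```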

The numpy.apply_over_axes function

Its general form is:

numpy.apply_over_axes(func, arr, axes)

The function apply_over_axes applies the function, func(arr, axis) repeatedly over multiple axes given by array/tuple axes.

For example,
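A minimal sketch (the array is illustrative). The reduced axes are kept with size 1, so the result matches a sum with keepdims=True:

```python
import numpy as np

a = np.arange(24).reshape(2, 3, 4)

result = np.apply_over_axes(np.sum, a, [0, 2])   # sum over axis 0, then axis 2
expected = a.sum(axis=(0, 2), keepdims=True)
print(result.shape)   # (1, 3, 1)
```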

Arithmetic, matrix multiplication, and comparison operations

Refer to Arithmetic, matrix multiplication, and comparison operations in the NumPy Reference.

Suppose you need to perform matrix multiplication for two matrices a and b of sizes 3x2 and 4x2. The inner dimensions do not match as they stand, so you have two options:

  • take the transpose of the second matrix and perform np.matmul(a, b.T), which gives a 3x4 result
  • take the transpose of the first matrix and perform np.matmul(b, a.T), which gives a 4x3 result

The results of the two operations hold the same data and are transposes of each other. But note that in np.matmul(A, B), if the data in A is arranged row-wise, then the data in B should be arranged column-wise (or vice versa). So, you can safely use a transpose in a matrix multiplication if the data in both of your original matrices is arranged as rows.

Let’s say you have the following two matrices, called inputs and weights:

Here, we can obtain matrix multiplication as:
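The article's original matrices are not available, so the sketch below uses hypothetical inputs and weights with the 3x2 and 4x2 shapes discussed above:

```python
import numpy as np

# Hypothetical data: both matrices store their records as rows.
inputs = np.arange(6).reshape(3, 2)    # 3x2
weights = np.arange(8).reshape(4, 2)   # 4x2

r1 = np.matmul(inputs, weights.T)      # (3x2) @ (2x4) gives 3x4
r2 = np.matmul(weights, inputs.T)      # (4x2) @ (2x3) gives 4x3
print(r1.shape, r2.shape, np.array_equal(r1, r2.T))
```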

The two answers are transpose of each other, so which multiplication you use really just depends on the shape you want for the output.

Iterating over Arrays: Using nditer Iterator

Refer to Iterating Over Arrays in the NumPy Reference.

The numpy.nditer is an efficient multi-dimensional iterator object to iterate over arrays.

Clearing “ordering” confusion

Refer to Multidimensional Array Index Ordering Issue in the NumPy Reference.

There are two conflicting conventions for indexing 2-dimensional arrays:

  • Matrix notation (used by Python programmers) uses the first index to indicate which row (y) is being selected and the second index to indicate which column (x) is selected.
  • The geometrically oriented convention for images is the opposite: people generally think of the first index as the x position (i.e., column) and the second as the y position (i.e., row).

This alone is the source of much confusion: matrix-oriented users (Python programmers) and image-oriented users expect two different things with regards to indexing.

Order in which an array is stored in memory:

  • In Fortran, the first index is the most rapidly varying index when moving through the elements of a two-dimensional array as it is stored in memory. If you adopt the matrix convention for indexing, then this means the matrix is stored one column at a time. Thus Fortran is considered a column-major language.
  • C has just the opposite convention. In C, the last index changes most rapidly as one moves through the array as stored in memory. Thus C is a row-major language. The matrix is stored by rows.

Note that in both cases (‘F’ order or ‘C’ order) it presumes that the matrix convention for indexing is being used, i.e, for both Fortran and C, the first index is the row for a 2D array.

The internal machinery of NumPy array is flexible enough to accept any ordering of indices. One can simply reorder indices by manipulating the internal stride information for arrays without reordering the data at all. NumPy will know how to map the new index order to the data without moving the data.

So, if this is true, why not choose the index order that matches what you most expect? The drawback of doing this is potential performance penalties. It’s common to access the data sequentially, either implicitly in array operations or explicitly by looping over rows of an image. When that is done, then the data will be accessed in non-optimal order. As the first index is incremented, what is actually happening is that elements spaced far apart in memory are being sequentially accessed, with usually poor memory access speeds.

The Basic Iteration

Consider the following example:
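A minimal sketch of such an example, using a small hypothetical array a (not necessarily the one from the original post):

```python
import numpy as np

a = np.arange(6).reshape(2, 3)      # [[0, 1, 2], [3, 4, 5]]

# Default iteration (order='K') follows the memory layout.
default_order = [int(x) for x in np.nditer(a)]

# The transpose is only a view on the same memory, so the default
# order still walks the elements as they are stored.
transposed_default = [int(x) for x in np.nditer(a.T)]

# Forcing C order on the transpose visits a.T row by row instead.
transposed_c = [int(x) for x in np.nditer(a.T, order='C')]

print(default_order)        # [0, 1, 2, 3, 4, 5]
print(transposed_default)   # [0, 1, 2, 3, 4, 5]
print(transposed_c)         # [0, 3, 1, 4, 2, 5]
```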

By default (order='K') the iteration order is chosen to match the memory layout of the array instead of using standard C or Fortran ordering. This is done for access efficiency, reflecting the idea that by default one simply wants to visit each element without concern for a particular ordering.

Note: For a C-contiguous array (the default layout for newly created arrays), the default order coincides with standard C order. For other layouts, such as a transposed view, it follows the memory layout instead.

By default, nditer treats the input array as a read-only object. To modify the array elements, you must specify either read-write or write-only mode. This is controlled with the op_flags argument (for example, op_flags=['readwrite']).

Regular assignment in Python simply changes a reference in the local or global variable dictionary instead of modifying an existing variable in place. This means that simply assigning to x will not place the value into the element of the array, but rather switch x from being an array element reference to being a reference to the value you assigned. To actually modify the elements of the array, x should be indexed with the ellipsis, x[...]. For example,
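Here is a small sketch of in-place modification, assuming a hypothetical array a and a NumPy version recent enough to support using nditer as a context manager:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)

# 'readwrite' lets us both read and overwrite every element.
with np.nditer(a, op_flags=['readwrite']) as it:
    for x in it:
        # x = 2 * x would only rebind the name x;
        # x[...] writes into the array element itself.
        x[...] = 2 * x

print(a.tolist())   # [[0, 2, 4], [6, 8, 10]]
```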

Using an External Loop, 'external_loop'

In all the examples so far, the elements of the ndarray are provided by the iterator one at a time, because all looping logic is internal to the iterator. While this is simple and convenient, it is not very efficient. A better approach is to move the one-dimensional innermost loop into your code, external to the iterator. This way, NumPy's vectorized operations can be used on larger chunks of the elements being visited.

The nditer will try to provide chunks that are as large as possible to the inner loop. By forcing 'C' and 'F' order, we get different external loop sizes.

In the below example, observe that with the default of keeping native memory order, the iterator is able to provide a single one-dimensional chunk, whereas when forcing Fortran order, it has to provide three chunks of two elements each (because this is not how the data is saved in memory).
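A sketch of this behavior on a hypothetical 2x3 array (each chunk is copied so that it can be safely kept after the loop):

```python
import numpy as np

a = np.arange(6).reshape(2, 3)

# Native memory order: the whole array arrives as one 1-D chunk.
chunks_k = [x.copy() for x in np.nditer(a, flags=['external_loop'])]

# Forced Fortran order: three chunks of two elements each.
chunks_f = [x.copy() for x in
            np.nditer(a, flags=['external_loop'], order='F')]

print([c.tolist() for c in chunks_k])   # [[0, 1, 2, 3, 4, 5]]
print([c.tolist() for c in chunks_f])   # [[0, 3], [1, 4], [2, 5]]
```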

Buffering, 'buffered' the Array Elements

When forcing an iteration order, we observed that the external loop option may provide the elements in smaller chunks, because the elements can't be visited in the requested order with a constant stride. When writing C code this is generally fine; in pure Python code, however, it can cause a significant reduction in performance.

By enabling buffering mode, the chunks provided by the iterator to the inner loop can be made larger, significantly reducing the overhead of the Python interpreter.

In the below example, forcing Fortran order, the inner loop gets to see all the elements in one go when buffering is enabled.
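A sketch of the buffered variant, again on a hypothetical 2x3 array:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)

# With buffering, the elements are gathered into a buffer in Fortran
# order, so the inner loop receives them all in a single chunk.
chunks = [x.copy() for x in
          np.nditer(a, flags=['external_loop', 'buffered'], order='F')]

print([c.tolist() for c in chunks])   # [[0, 3, 1, 4, 2, 5]]
```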

Tracking an index or Multi-index

During iteration, you may want to use the index of the current element in a computation. For this numpy introduces an alternative syntax for iterating with an nditer. This syntax explicitly works with the iterator object itself, so its properties are readily accessible during iteration. With this looping construct, the current value is accessible by indexing into the iterator, and the index being tracked is the property index or multi_index depending on what was requested. For example,
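A sketch of this looping construct on a hypothetical 2x3 array, tracking first a multi-index and then a flat Fortran-order index:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)

# Track the (row, column) multi-index of each visited element.
it = np.nditer(a, flags=['multi_index'])
pairs = []
while not it.finished:
    pairs.append((int(it[0]), it.multi_index))   # it[0] is the current value
    it.iternext()

# Track a flat index counted in Fortran (column-major) order.
it = np.nditer(a, flags=['f_index'])
f_indices = []
while not it.finished:
    f_indices.append((int(it[0]), it.index))
    it.iternext()

print(pairs[:4])    # [(0, (0, 0)), (1, (0, 1)), (2, (0, 2)), (3, (1, 0))]
print(f_indices)    # [(0, 0), (1, 2), (2, 4), (3, 1), (4, 3), (5, 5)]
```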

Tracking an index or multi-index is incompatible with using an external loop because it requires a different index value per element. If you try to combine these flags, the nditer object will raise an exception ValueError: Iterator flag EXTERNAL_LOOP cannot be used if an index or multi-index is being tracked.

Broadcasting Array Iteration

Refer to the topic Broadcasting Array Iteration in Iterating Over Arrays.

Masked Arrays

Refer: https://docs.scipy.org/doc/numpy/reference/maskedarray.html

Masked arrays are arrays that may have missing or invalid entries. The numpy.ma module provides a work-alike replacement for NumPy that supports data arrays with masks.

A masked array is the combination of a standard numpy.ndarray and a mask. A mask is either nomask, indicating that no value of the associated array is invalid, or an array of booleans that determine for each element of the associated array whether the value is valid or not. When an element of the mask is False, the corresponding element of the associated array is valid and is said to be unmasked. When an element of the mask is True, the corresponding element of the associated array is said to be masked (invalid). The package ensures that masked entries are not used in computations.

For example,
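A minimal sketch, assuming a hypothetical data array in which the value -1 stands for a missing measurement:

```python
import numpy as np

x = np.array([1, 2, 3, -1, 5])

# Mark the fourth entry as invalid (1/True means "masked").
mx = np.ma.masked_array(x, mask=[0, 0, 0, 1, 0])

# The masked entry is ignored: (1 + 2 + 3 + 5) / 4
print(mx.mean())   # 2.75
```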

The numpy.ma.core.MaskedArray class

The numpy.ma.core.MaskedArray is a subclass of ndarray designed to manipulate numerical arrays with missing data. An instance of MaskedArray can be thought of as the combination of several elements: data, mask, and fill_value.

Attributes and properties of masked arrays:

MaskedArray.data:

Returns the underlying data, as a view of the masked array. If the underlying data is a subclass of numpy.ndarray, it is returned as such.

MaskedArray.baseclass:

Returns the class of the underlying data.

MaskedArray.mask:

Returns the underlying mask, as an array with the same shape and structure as the data, but where all fields are atomically booleans. A value of True indicates an invalid entry.

MaskedArray.fill_value:

Returns the value used to fill the invalid entries of a masked array. The value is either a scalar (if the masked array has no named fields), or a 0-D ndarray with the same dtype as the masked array if it has named fields.

The default filling value depends on the data type of the array:
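One way to inspect these defaults is the np.ma.default_fill_value function. The values in the comments are what I would expect from current NumPy releases; treat them as an assumption and check against your own version.

```python
import numpy as np

# Default fill values for a few common data types.
fill_int = np.ma.default_fill_value(np.dtype(int))        # expected 999999
fill_float = np.ma.default_fill_value(np.dtype(float))    # expected 1e+20
fill_bool = np.ma.default_fill_value(np.dtype(bool))      # expected True
fill_str = np.ma.default_fill_value(np.dtype('U3'))       # expected 'N/A'

# A masked array picks up the default that matches its dtype.
array_fill = np.ma.array([1.0, 2.0]).fill_value

print(fill_int, fill_float, fill_bool, fill_str, array_fill)
```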

Constructing Masked Arrays

USING NUMPY.MA.MASKED_ARRAY

The numpy.ma.masked_array is an alias of numpy.ma.core.MaskedArray.
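A short sketch with hypothetical data, which also checks the alias relationship:

```python
import numpy as np

# Build a masked array; the second entry is flagged as invalid.
x = np.ma.masked_array([1, 2, 3], mask=[False, True, False])

# Both names refer to the very same class.
is_alias = np.ma.masked_array is np.ma.core.MaskedArray

print(x.mask.tolist(), is_alias)   # [False, True, False] True
```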

USING NUMPY.MA.ARRAY

It is an array class with possibly masked values.

Its most general form is:
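To the best of my knowledge, recent NumPy documentation lists the keyword arguments dtype, copy, order, mask, fill_value, keep_mask, hard_mask, shrink, subok, and ndmin after the data argument; please check the reference for your version. The sketch below uses only a few of them, with made-up values:

```python
import numpy as np

x = np.ma.array(
    [1, 2, 3],                      # data
    dtype=float,                    # convert the data to float
    mask=[False, True, False],      # True marks an invalid entry
    fill_value=-1.0,                # used by filled() for masked entries
    hard_mask=False,                # soft mask: entries can be unmasked later
)

print(x.filled().tolist())   # [1.0, -1.0, 3.0]
```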

USING VIEW OF EXISTING ARRAY

Yet another option is to take the view of an existing array. In that case, the mask of the view is set to nomask if the array has no named fields or an array of boolean with the same structure as the array otherwise.

For example,
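A sketch of both cases, using small hypothetical arrays (np.ma.getmask is used to check for nomask):

```python
import numpy as np

# Plain array without named fields: the view gets nomask.
x = np.array([1, 2, 3])
mx = x.view(np.ma.MaskedArray)
has_nomask = np.ma.getmask(mx) is np.ma.nomask

# Structured array with named fields: the view gets a boolean mask
# with the same structure as the data.
s = np.array([(1, 1.0), (2, 2.0)], dtype=[('a', int), ('b', float)])
ms = s.view(np.ma.MaskedArray)

print(has_nomask)          # True
print(ms.mask.tolist())    # [(False, False), (False, False)]
```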

Accessing data of a MaskedArray

The underlying data of a masked array can be accessed in several ways:

  • through the data attribute. The output is a view of the array as a numpy.ndarray or one of its subclasses, depending on the type of the underlying data at the masked array creation
  • through the __array__(dtype) method. It returns either a new reference to self if dtype is not given or a new array of provided data type if dtype is different from the current dtype of the array
  • by directly taking a view of the masked array as a numpy.ndarray or one of its subclasses (which is actually what using the data attribute does)
  • by using the getdata(a, subok=True) function

Where a representation of the array is required without any masked entries, it is recommended to fill the array with the filled method.
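The access paths above can be sketched as follows, on a hypothetical masked array:

```python
import numpy as np

x = np.ma.array([1, 2, 3], mask=[0, 1, 0])

raw = x.data                    # underlying ndarray, masked value included
raw_too = np.ma.getdata(x)      # same data through the function form
as_view = x.view(np.ndarray)    # explicit view as a plain ndarray

# filled() replaces masked entries, here with 0.
no_masks = x.filled(0)

print(raw.tolist())        # [1, 2, 3]
print(no_masks.tolist())   # [1, 0, 3]
```

Note that the raw data still contains whatever value sits under the mask, which is why filled is the safer choice when a plain array is needed.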

Accessing the mask

The mask of a masked array is accessible through its mask attribute. We must keep in mind that a True entry in the mask indicates invalid data.

Another possibility is to use the getmask and getmaskarray functions.
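A sketch of the three options, with one hypothetical array that has a mask and one that does not:

```python
import numpy as np

x = np.ma.array([1, 2, 3], mask=[0, 1, 0])
y = np.ma.array([1, 2, 3])          # created without a mask

mask_x = x.mask                     # True marks invalid data

# getmask returns nomask when there is no mask at all ...
y_is_nomask = np.ma.getmask(y) is np.ma.nomask
# ... whereas getmaskarray always returns a full boolean array.
full_mask_y = np.ma.getmaskarray(y)

print(mask_x.tolist())        # [False, True, False]
print(y_is_nomask)            # True
print(full_mask_y.tolist())   # [False, False, False]
```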

Accessing only the valid entries

To retrieve only the valid entries, we can use the inverse of the mask as an index. The inverse of the mask can be calculated with the numpy.logical_not function or simply with the ~ operator. For example,
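A sketch with a hypothetical 2x2 masked array x:

```python
import numpy as np

x = np.ma.array([[1, 2], [3, 4]], mask=[[0, 1], [1, 0]])

# Index with the inverted mask to keep only the valid entries.
valid = x[~x.mask]
valid_too = x[np.logical_not(x.mask)]

print(valid.tolist())       # [1, 4]
print(valid_too.tolist())   # [1, 4]
```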

Another way to retrieve the valid data is to use the compressed method, which returns a one-dimensional ndarray (or one of its sub-classes, depending on the value of baseclass attribute). For example,

x = np.ma.array([[1, 2], [3, 4]], mask=[[0, 1], [1, 0]])
x.compressed()          #=> array([1, 4])

Note that output of compressed is always 1D.

Modifying the Mask

The recommended way to mark one or several specific entries of a masked array as invalid is to assign the special value numpy.ma.masked to them.

To unmask one or several entries, we can just assign one or several new valid values to them.
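A sketch of masking and then unmasking an entry, starting from a hypothetical array that has no mask at creation:

```python
import numpy as np

x = np.ma.array([1, 2, 3])

# Mask the first entry by assigning the special constant.
x[0] = np.ma.masked
after_masking = np.ma.getmaskarray(x).tolist()

# Unmask it again by assigning a new valid value.
x[0] = 10
after_unmasking = np.ma.getmaskarray(x).tolist()

print(after_masking)      # [True, False, False]
print(after_unmasking)    # [False, False, False]
print(x.tolist())         # [10, 2, 3]
```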

Now let’s try with a masked array having mask even in the beginning:
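A sketch of the same idea, this time with a hypothetical array whose last entry is already masked at creation:

```python
import numpy as np

x = np.ma.array([1, 2, 3], mask=[0, 0, 1])
initial_mask = x.mask.tolist()

# Assigning a valid value unmasks the last entry (soft mask by default).
x[-1] = 5
mask_after_assign = x.mask.tolist()

# Masking another entry works exactly as before.
x[0] = np.ma.masked
mask_after_masking = x.mask.tolist()

print(initial_mask)          # [False, False, True]
print(mask_after_assign)     # [False, False, False]
print(mask_after_masking)    # [True, False, False]
```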

An alternative approach is to modify the mask directly, but this usage is discouraged. When creating a new masked array with a simple, non-structured datatype, the mask is initially set to the special value nomask, that corresponds roughly to the boolean False. Trying to set an element of nomask will fail with TypeError exception, as a boolean does not support item assignment.

All the entries of an array can be masked at once by assigning True to the mask (for example, x.mask = True). To unmask all masked entries of a masked array, the simplest solution is to assign the constant numpy.ma.nomask to the mask (for example, x.mask = np.ma.nomask).
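A sketch of masking and unmasking everything at once, on a hypothetical array:

```python
import numpy as np

x = np.ma.array([1, 2, 3], mask=[0, 0, 1])

x.mask = True                       # mask every entry
all_masked = np.ma.getmaskarray(x).tolist()

x.mask = np.ma.nomask               # unmask every entry
all_unmasked = np.ma.getmaskarray(x).tolist()

print(all_masked)      # [True, True, True]
print(all_unmasked)    # [False, False, False]
```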

Unmasking an entry by assigning a valid value will silently fail if the masked array has a hard mask, as shown by the hardmask attribute. This feature was introduced to prevent overwriting the mask. To force the unmasking of an entry where the array has a hard mask, the mask must first be softened using the soften_mask() method before the assignment (for example, x.soften_mask()). It can be re-hardened with the harden_mask() method (for example, x.harden_mask()).
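A sketch of the hard-mask behavior, assuming a hypothetical array created with hard_mask=True:

```python
import numpy as np

x = np.ma.array([1, 2, 3], mask=[0, 0, 1], hard_mask=True)

# With a hard mask, this assignment does not unmask the entry.
x[-1] = 5
still_masked = bool(x.mask[-1])

# Soften the mask, then the same assignment unmasks the entry.
x.soften_mask()
x[-1] = 5
mask_after_soften = x.mask.tolist()

# Re-harden the mask to protect it again.
x.harden_mask()

print(still_masked)         # True
print(mask_after_soften)    # [False, False, False]
print(x.hardmask)           # True
```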

Indexing and Slicing a MaskedArray

As a MaskedArray is a subclass of numpy.ndarray, it inherits its mechanisms for indexing and slicing.

When accessing a single entry of a masked array with no named fields, the output is either a scalar (if the corresponding entry of the mask is False) or the special value numpy.ma.masked (if the corresponding entry of the mask is True).

When accessing a slice, the output is a masked array whose data attribute is a view of original data, and whose mask is either numpy.ma.nomask (if there were no invalid entries in the original array) or a view of the corresponding slice of the original mask. The view is required to ensure the propagation of any modification of the mask to the original. Check examples in Modifying the Mask for an explanation.
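The single-entry and slice cases can be sketched like this, with hypothetical arrays x (one masked entry) and y (no mask):

```python
import numpy as np

x = np.ma.array([1, 2, 3], mask=[0, 1, 0])
y = np.ma.array([1, 2, 3])

# Single entries: a scalar if valid, the masked constant otherwise.
first = x[0]
second_is_masked = x[1] is np.ma.masked

# Slices: a masked array carrying the matching slice of the mask ...
s = x[:2]
# ... or nomask if the original had no invalid entries.
slice_has_nomask = np.ma.getmask(y[:2]) is np.ma.nomask

print(first, second_is_masked)   # 1 True
print(s.mask.tolist())           # [False, True]
print(slice_has_nomask)          # True
```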

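If the masked array has named fields, accessing a single entry returns a numpy.void object if none of the fields are masked, or a 0-D masked array with the same dtype as the initial array if at least one of the fields is masked.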

Operations on masked arrays

Arithmetic and comparison operations are supported by masked arrays. As much as possible, invalid entries of a masked array are not processed, meaning that the corresponding data entries should be the same before and after the operation. But remember that this behavior may not be systematic, that masked data may be affected by the operation in some cases and therefore users should not rely on this data remaining unchanged.

For example,
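A sketch of arithmetic, comparison, and a reduction on a hypothetical masked array:

```python
import numpy as np

x = np.ma.array([1, 2, 3], mask=[0, 1, 0])

doubled = x * 2          # the masked entry stays masked
bigger = x > 1           # comparisons propagate the mask too
total = x.sum()          # the masked entry is skipped: 1 + 3

print(doubled.filled(-1).tolist())   # [2, -1, 6]
print(bigger.mask.tolist())          # [False, True, False]
print(total)                         # 4
```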

The numpy.ma module comes with a specific implementation of most ufuncs. Unary and binary functions that have a validity domain (such as log or divide) return the numpy.ma.masked constant whenever the input is masked or falls outside the validity domain.

For example,
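A sketch using np.ma.log on hypothetical input values, two of which lie outside the domain of the logarithm:

```python
import numpy as np

# -1 and 0 are outside the validity domain of log, so they get masked.
result = np.ma.log([-1, 0, 1, 2])

print(result.mask.tolist())   # [True, True, False, False]
print(float(result[2]))       # 0.0
```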

Masked arrays also support standard numpy ufuncs. The output is then a masked array. The result of a ufunc is masked wherever the input is masked. The result of binary ufunc is masked wherever any of the input is masked. If the ufunc also returns the optional context output (a 3-element tuple containing the name of the ufunc, its arguments and its domain), the context is processed and entries of the output masked array are masked wherever the corresponding input falls outside the validity domain.
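A sketch of mask propagation through standard ufuncs, using hypothetical arrays whose values all lie inside the validity domain, so that only the input masks matter:

```python
import numpy as np

x = np.ma.array([1.0, 4.0, 9.0], mask=[0, 0, 1])
y = np.ma.array([10.0, 20.0, 30.0], mask=[1, 0, 0])

# Unary ufunc: the result is masked wherever the input is masked.
roots = np.sqrt(x)

# Binary ufunc: the result is masked wherever ANY input is masked.
sums = np.add(x, y)

print(roots.mask.tolist())   # [False, False, True]
print(sums.mask.tolist())    # [True, False, True]
```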

MaskedArray methods

Refer to MaskedArray methods and Masked array operations.
