Essential Python for Machine Learning: NumPy

The Multi-dimensional Array Engine

Dagang Wei
10 min read · Dec 30, 2023

This article is part of my book Essential Python for Machine Learning.

Introduction

In data science and numerical computing, NumPy is a cornerstone library, providing a powerful multi-dimensional array object and a large collection of functions for mathematical operations. Whether you’re a seasoned data scientist, a machine learning engineer, or a beginner in the world of programming, understanding NumPy is essential for efficient numerical computation in Python.

What is NumPy?

NumPy, short for Numerical Python, is an open-source library that adds support for large, multi-dimensional arrays and matrices, along with a collection of high-level mathematical functions to operate on these arrays. Released in 2006, NumPy has since become an indispensable tool for scientists, researchers, and engineers working in fields such as machine learning, physics, finance, and more.

NumPy is implemented in a combination of Python and C. The core numerical operations in NumPy are implemented in C and are highly optimized for performance. These C implementations are used to carry out array manipulations, linear algebra operations, and other numerical computations efficiently. The integration of C code with Python is facilitated by the Python/C API, allowing seamless interaction between the two languages. This combination of Python for high-level functionality and C for performance-critical operations makes NumPy a powerful and efficient library for numerical computing in Python.

Why NumPy?

NumPy offers several compelling reasons for its widespread adoption in the scientific computing community:

  • Performance: NumPy operations are highly optimized and executed at compiled speed, making them significantly faster than their pure Python counterparts. This is crucial for handling large datasets and computationally intensive tasks.
  • Memory Efficiency: NumPy arrays are more memory-efficient than Python lists, primarily because all elements share one data type and are stored as raw values in a single block of memory, whereas a list stores references to separate Python objects.
  • Compatibility: NumPy seamlessly integrates with other libraries and frameworks in the Python ecosystem, such as SciPy, pandas, and scikit-learn, fostering a cohesive environment for scientific computing.
  • Versatility: From basic mathematical operations to complex linear algebra and statistical functions, NumPy provides a versatile set of tools for a wide range of applications.
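To make the performance and memory claims above concrete, here is a minimal sketch that compares a plain Python list with a NumPy array on the same task. The variable names and the array size are my own choices for illustration, and the timings will vary from machine to machine, so treat the printed numbers as indicative only.

```python
import sys
import time

import numpy as np

n = 200_000  # illustrative size; larger values widen the gap
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# Pure Python: square and sum every element in an interpreter-level loop.
start = time.perf_counter()
list_total = sum(x * x for x in py_list)
list_seconds = time.perf_counter() - start

# NumPy: the same computation runs in compiled code over the whole array.
start = time.perf_counter()
array_total = int(np.sum(np_array * np_array))
array_seconds = time.perf_counter() - start

# Memory: the list stores references to separate int objects,
# while the array stores raw 8-byte integers in one block.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)
array_bytes = np_array.nbytes

print("Same result:", list_total == array_total)
print(f"List:  {list_seconds:.4f} s, about {list_bytes} bytes")
print(f"Array: {array_seconds:.4f} s, {array_bytes} bytes")
```

On typical hardware the array version should be both faster and smaller, but the exact ratio depends on the Python and NumPy versions in use.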

Place in the Ecosystem

NumPy serves as a cornerstone in the Python Data Science and Machine Learning ecosystem by providing efficient data structures and mathematical functions that are essential for numerical computing. It forms the basis for many other libraries and tools, making it a fundamental component of the Python scientific computing stack.

Core Data Structure: Multi-dimensional Array

The fundamental building block of NumPy is the ndarray (n-dimensional array), which allows efficient storage and manipulation of homogeneous data. Because operations on an ndarray run in compiled code rather than in the Python interpreter, they are typically much faster than equivalent pure-Python loops, especially when working with large datasets.

Source: https://www.oreilly.com/library/view/elegant-scipy/9781491922927/ch01.html

Key characteristics

Here are key characteristics and features of the NumPy ndarray:

  • N-dimensional: NumPy arrays can have any number of dimensions (1D, 2D, 3D, etc.), making them suitable for representing a wide range of data, including vectors, matrices, and higher-dimensional tensors.
  • Homogeneous Data Type: All elements in a NumPy array must have the same data type, which ensures uniformity and allows for optimized storage and computations.
  • Fixed Size: Once created, the total number of elements in a NumPy array is fixed. An array can be reshaped as long as the element count stays the same, but adding or removing elements means creating a new array.
  • Indexing and Slicing: NumPy arrays support advanced indexing and slicing operations, allowing for flexible extraction of subsets of data. This includes boolean indexing, integer array indexing, and more.
  • Broadcasting: NumPy arrays support broadcasting, a powerful feature that allows operations on arrays of different but compatible shapes. Conceptually, the smaller array is stretched to match the shape of the larger one, without actually copying its data.
  • Vectorized Operations: NumPy supports vectorized operations, enabling element-wise operations and mathematical computations on entire arrays without the need for explicit looping. This leads to concise and efficient code.
  • Memory Efficiency: NumPy arrays are memory-efficient due to their homogeneous nature. They store data in a contiguous block of memory, and the memory layout is optimized for quick access and efficient operations.
  • Optimized for Numerical Computations: NumPy provides a wide range of optimized mathematical functions and operations, making it particularly well-suited for numerical computing, scientific simulations, and data analysis.
  • Linear Algebra Operations: NumPy supports matrix multiplication (numpy.dot(), numpy.matmul(), and the @ operator) and includes a numpy.linalg module for further linear algebra operations, such as eigenvalue computation (numpy.linalg.eig()) and solving linear systems (numpy.linalg.solve()).
  • Data Persistence: NumPy provides functions for efficiently saving and loading arrays from disk, allowing for easy data persistence. The numpy.save() and numpy.load() functions can be used for this purpose.
  • Integration with Other Libraries: NumPy arrays integrate seamlessly with other scientific computing libraries in Python, such as SciPy, pandas, and scikit-learn. They provide a common data structure for interoperability between these libraries.
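A few of the characteristics above are easier to see in code. The sketch below illustrates the homogeneous data type, the fixed size, and data persistence; the variable names are illustrative, and the round trip goes through an in-memory buffer rather than a file on disk purely to keep the example self-contained.

```python
import io

import numpy as np

# Homogeneous data type: mixing ints and a float upcasts every element.
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)  # float64

# Writing a float into an integer array truncates it to fit the dtype.
ints = np.array([1, 2, 3], dtype=np.int64)
ints[0] = 9.7
print(ints)  # [9 2 3]

# Fixed size: np.append does not grow `ints`; it returns a new array.
grown = np.append(ints, 4)
print(ints.shape, grown.shape)  # (3,) (4,)

# Data persistence: np.save and np.load also accept file-like objects.
buffer = io.BytesIO()
np.save(buffer, grown)
buffer.seek(0)
restored = np.load(buffer)
print(np.array_equal(grown, restored))  # True
```

In real code you would normally pass a filename such as "data.npy" to np.save() and np.load() instead of a buffer.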

NDArray vs Python List

Consider the following when choosing between NumPy’s ndarray and the built-in Python list for your application:

Choose ndarrays for:

  • Numerical computations, linear algebra, and data analysis tasks.
  • Handling large datasets efficiently.
  • Operations that benefit from vectorization and speed optimization.
  • Multidimensional data representation.

Choose Python lists for:

  • General-purpose data storage and manipulation.
  • Storing mixed data types (numbers, strings, objects).
  • Frequent modifications to the data structure, including additions and removals.
  • Cases where memory efficiency is less critical than flexibility.
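One practical consequence of this choice is that the same operators behave differently on the two types. The following short sketch, with illustrative variable names, shows the contrast:

```python
import numpy as np

py_list = [1, 2, 3]
np_array = np.array([1, 2, 3])

# On a list, * repeats and + concatenates.
list_doubled = py_list * 2    # [1, 2, 3, 1, 2, 3]
list_joined = py_list + [4]   # [1, 2, 3, 4]

# On an ndarray, the same operators work element by element.
array_doubled = np_array * 2  # [2 4 6]
array_shifted = np_array + 4  # [5 6 7]

# A list can hold mixed types and grow in place.
mixed_list = [1, "two", 3.0]
mixed_list.append(None)

print(list_doubled, array_doubled)
print(list_joined, array_shifted)
print(mixed_list)
```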

Core Concepts

  • Arrays and Shapes: At the core of NumPy is the ndarray, a multi-dimensional array that can have one, two, or more dimensions. Understanding array shapes is crucial for performing operations and ensuring compatibility with other arrays.
  • Indexing and Slicing: NumPy provides efficient ways to index and slice arrays, allowing for easy extraction of subsets of data.
  • Broadcasting: Broadcasting is a powerful feature that allows NumPy to perform operations on arrays of different shapes, making code concise and readable.
  • Universal Functions (ufuncs): These are functions that operate element-wise on arrays, making it easy to perform mathematical operations without the need for explicit looping.
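As a quick illustration of universal functions, the sketch below uses two built-in ufuncs, np.sqrt and np.add, along with the reduce and accumulate methods that binary ufuncs provide. The sample values are arbitrary.

```python
import numpy as np

values = np.array([1.0, 4.0, 9.0, 16.0])

# Unary ufunc: applied to every element.
roots = np.sqrt(values)              # [1. 2. 3. 4.]

# Binary ufunc: equivalent to values + roots.
sums = np.add(values, roots)         # [ 2.  6. 12. 20.]

# Ufunc methods: reduce collapses the array, accumulate keeps running totals.
total = np.add.reduce(values)        # 30.0
running = np.add.accumulate(values)  # [ 1.  5. 14. 30.]

print(roots, sums, total, running)
```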

Key Features

Let’s delve into the key features of NumPy with some basic code examples to illustrate its power and simplicity. The examples are available in this Colab notebook.

First, import NumPy:

import numpy as np

Creating arrays

# Create a 1D array from a list, data type will be inferred from the list
array1 = np.array([1, 2, 3, 4, 5])
print("Array 1:", array1, ", data type:", array1.dtype)

# Create a 1D array with zeros, the default data type is float64
zeros_array = np.zeros(5)
print("Zeros Array:", zeros_array, ", data type:", zeros_array.dtype)

# Create a 1D array with ones
ones_array = np.ones(5)
print("Ones Array:", ones_array, ", data type:", ones_array.dtype)

# Create a 1D array with a range of values
range_array = np.arange(1, 6)
print("Range Array:", range_array)

# Create a 1D array with evenly spaced values
linspace_array = np.linspace(0, 20, 5, dtype=np.int32)
print("Linspace Array:", linspace_array)

# Create a 1D array with a specific data type
float_array = np.array([1.0, 2.0, 3.0], dtype=np.float32)
print("Float Array:", float_array)

# Create a 2D array from a nested list
matrix1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("Matrix 1:", matrix1)

# Create a 2D array with zeros
zeros_matrix = np.zeros((3, 3))
print("Zeros Matrix:", zeros_matrix)

# Create a 2D array with ones
ones_matrix = np.ones((3, 3))
print("Ones Matrix:", ones_matrix)

# Create a 2D identity matrix
identity_matrix = np.eye(3)
print("Identity Matrix:", identity_matrix)

# Create a 2D array with a range of values
range_matrix = np.arange(1, 10).reshape(3, 3)
print("Range Matrix:", range_matrix)

# Create a 2D array with random values
random_matrix = np.random.random((3, 3))
print("Random Matrix:", random_matrix)

Output:

Array 1: [1 2 3 4 5] , data type: int64
Zeros Array: [0. 0. 0. 0. 0.] , data type: float64
Ones Array: [1. 1. 1. 1. 1.] , data type: float64
Range Array: [1 2 3 4 5]
Linspace Array: [ 0  5 10 15 20]
Float Array: [1. 2. 3.]
Matrix 1: [[1 2 3]
 [4 5 6]
 [7 8 9]]
Zeros Matrix: [[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]
Ones Matrix: [[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]
Identity Matrix: [[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
Range Matrix: [[1 2 3]
 [4 5 6]
 [7 8 9]]
Random Matrix: [[0.27504503 0.79058491 0.29064163]
 [0.55128476 0.07055209 0.52732632]
 [0.6986602  0.59033596 0.10166757]]
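The random matrix above will differ on every run. If you need repeatable results, one option is a seeded generator from np.random.default_rng. The sketch below also shows two more creation helpers, np.full and np.zeros_like; the seed value and variable names are arbitrary choices for illustration.

```python
import numpy as np

# Every element set to the same value.
filled = np.full((2, 3), 7)

# Same shape and dtype as an existing array, filled with zeros.
like = np.zeros_like(filled)

# Two generators created with the same seed produce the same sequence.
rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)
sample_a = rng_a.random((3, 3))
sample_b = rng_b.random((3, 3))

print("Filled:", filled)
print("Zeros like:", like)
print("Reproducible:", np.array_equal(sample_a, sample_b))
```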

Key Attributes

# Create NumPy arrays with different shapes
array1 = np.array([1, 2, 3, 4, 5])
array2 = np.array([[1, 2, 3], [4, 5, 6]])
array3 = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]], dtype=np.float32)

# Explore key attributes:

# Shape:
print("Shape of array1:", array1.shape) # Output: (5,)
print("Shape of array2:", array2.shape) # Output: (2, 3)
print("Shape of array3:", array3.shape) # Output: (2, 2, 2)

# Number of dimensions:
print("Number of dimensions (ndim) of array1:", array1.ndim) # Output: 1
print("Number of dimensions (ndim) of array2:", array2.ndim) # Output: 2
print("Number of dimensions (ndim) of array3:", array3.ndim) # Output: 3

# Data type:
print("Data type of array1:", array1.dtype) # Output: int64 (default integer type; platform-dependent)
print("Data type of array2:", array2.dtype) # Output: int64
print("Data type of array3:", array3.dtype) # Output: float32

# Item size:
print("Item size of array1:", array1.itemsize) # Output: 8
print("Item size of array2:", array2.itemsize) # Output: 8
print("Item size of array3:", array3.itemsize) # Output: 4

# Total number of elements:
print("Total number of elements in array1:", array1.size) # Output: 5
print("Total number of elements in array2:", array2.size) # Output: 6
print("Total number of elements in array3:", array3.size) # Output: 8

# Total number of bytes:
print("Total number of bytes in array1:", array1.nbytes) # Output: 40
print("Total number of bytes in array2:", array2.nbytes) # Output: 48
print("Total number of bytes in array3:", array3.nbytes) # Output: 32
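These attributes are related: the number of dimensions is the length of the shape, and the total byte count is the element count times the item size. The sketch below checks this on an illustrative array and shows how converting to a smaller dtype with astype() reduces memory use.

```python
import numpy as np

data = np.arange(12, dtype=np.int64).reshape(3, 4)

# ndim is the length of the shape tuple.
print(data.ndim == len(data.shape))                # True

# nbytes is size * itemsize: 12 elements * 8 bytes.
print(data.nbytes, data.size * data.itemsize)      # 96 96

# astype returns a converted copy; int16 needs 2 bytes per element.
smaller = data.astype(np.int16)
print(smaller.itemsize, smaller.nbytes)            # 2 24
```

A smaller dtype only makes sense when every value fits in its range; int16, for example, holds values from -32768 to 32767.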

Indexing and Slicing

# Indexing and Slicing for 1D Array
array1d = np.array([1, 2, 3, 4, 5])
print("1D Array:", array1d)

# Indexing
element_at_index_2 = array1d[2]
print("Element at Index 2:", element_at_index_2)

# Slicing (the stop index is exclusive, so 1:3 selects indices 1 and 2)
sliced_array_1d = array1d[1:3]
print("Sliced Array 1D (Index 1 to 3):", sliced_array_1d)

# Indexing and Slicing for 2D Array
array2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("2D Array:", array2d)

# Indexing
element_at_index_1_2 = array2d[1, 2]
print("Element at Index (1, 2):", element_at_index_1_2)

# Slicing
sliced_array_2d = array2d[:2, :2]
print("Sliced Array 2D (First two rows and columns):", sliced_array_2d)

Output:

1D Array: [1 2 3 4 5]
Element at Index 2: 3
Sliced Array 1D (Index 1 to 3): [2 3]
2D Array: [[1 2 3]
 [4 5 6]
 [7 8 9]]
Element at Index (1, 2): 6
Sliced Array 2D (First two rows and columns): [[1 2]
 [4 5]]
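Beyond basic indexing and slicing, NumPy also supports the integer array indexing and boolean indexing mentioned earlier. The sketch below uses an illustrative 3x3 array and also shows that a basic slice is a view onto the original data, not a copy.

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Integer array indexing: pick rows 0 and 2.
picked_rows = grid[[0, 2]]                 # [[1 2 3], [7 8 9]]

# Paired row and column indices: elements (0, 2), (1, 1), (2, 0).
picked_cells = grid[[0, 1, 2], [2, 1, 0]]  # [3 5 7]

# Boolean indexing: keep only the even values.
evens = grid[grid % 2 == 0]                # [2 4 6 8]

# Negative index: the last column.
last_column = grid[:, -1]                  # [3 6 9]

# A basic slice is a view, so writing to it changes the original array.
view = grid[:2, :2]
view[0, 0] = 100
print(grid[0, 0])                          # 100
```

If you need an independent copy of a slice, call .copy() on it.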

Reshaping

# Create a 2D array
array2d = np.array([[1, 2, 3], [4, 5, 6]])

# Reshape to a 2D array with 3 rows and 2 columns
reshaped_array2d = np.reshape(array2d, (3, 2))
print("Original 2D Array (2x3):", array2d)
print("Reshaped 2D Array (3x2):", reshaped_array2d)

# Create a 3D array
array3d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
print("Original 3D Array (2x2x2):", array3d)

# Reshape to a 2D array with 4 rows and 2 columns
reshaped_array2d_from_3d = np.reshape(array3d, (4, 2))

print("Reshaped to 2D Array (4x2):", reshaped_array2d_from_3d)

Output:

Original 2D Array (2x3): [[1 2 3]
 [4 5 6]]
Reshaped 2D Array (3x2): [[1 2]
 [3 4]
 [5 6]]
Original 3D Array (2x2x2): [[[1 2]
  [3 4]]

 [[5 6]
  [7 8]]]
Reshaped to 2D Array (4x2): [[1 2]
 [3 4]
 [5 6]
 [7 8]]
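Two details of reshaping are worth a quick sketch: a dimension given as -1 is inferred from the others, and the new shape must contain exactly the same number of elements. The example below uses an illustrative 12-element array and also shows that reshaping a contiguous array typically returns a view of the same data.

```python
import numpy as np

flat = np.arange(12)

# -1 asks NumPy to infer that dimension: 12 elements / 3 rows = 4 columns.
table = flat.reshape(3, -1)
print(table.shape)  # (3, 4)

# The reshaped array shares memory with the original here.
table[0, 0] = 99
print(flat[0])      # 99

# A shape with a different element count is rejected.
reshape_error = None
try:
    flat.reshape(5, 3)
except ValueError as error:
    reshape_error = str(error)
print("Error:", reshape_error)

# ravel flattens any shape back to 1D.
flattened = table.ravel()
print(flattened.shape)  # (12,)
```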

Broadcasting (element-wise operations)

# Create 1D arrays
array1 = np.array([1, 2, 3])
array2 = np.array([4, 5, 6])
print("Array 1:", array1)
print("Array 2:", array2)
# Addition
addition_result = array1 + array2
print("Addition Result:", addition_result)
# Subtraction
subtraction_result = array1 - array2
print("Subtraction Result:", subtraction_result)
# Multiplication
multiplication_result = array1 * array2
print("Multiplication Result:", multiplication_result)
# Division
division_result = array1 / array2
print("Division Result:", division_result)
# Add a scalar to each element
array_plus_scalar_result = array1 + 5
print("Array + Scalar:", array_plus_scalar_result)
# Create a 2D array
array2d = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
print("2D Array:", array2d)
# Add the 1D array to each row of the 2D array
result_1d_plus_2d = array1 + array2d
print("1D Array + 2D Array:", result_1d_plus_2d)

Output:

Array 1: [1 2 3]
Array 2: [4 5 6]
Addition Result: [5 7 9]
Subtraction Result: [-3 -3 -3]
Multiplication Result: [ 4 10 18]
Division Result: [0.25 0.4  0.5 ]
Array + Scalar: [6 7 8]
2D Array: [[1 1 1]
 [2 2 2]
 [3 3 3]]
1D Array + 2D Array: [[2 3 4]
 [3 4 5]
 [4 5 6]]
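Broadcasting works when, comparing shapes from the last dimension backwards, each pair of dimensions is either equal or one of them is 1. Here is a small sketch with illustrative arrays that combines a column and a row, followed by a case where the shapes are incompatible.

```python
import numpy as np

row = np.array([10, 20, 30])        # shape (3,)
column = np.array([[1], [2], [3]])  # shape (3, 1)

# (3, 1) combined with (3,) broadcasts to (3, 3).
table = column + row
print(table)
# [[11 21 31]
#  [12 22 32]
#  [13 23 33]]

# Shapes (3,) and (2,) cannot be broadcast together.
broadcast_error = None
try:
    np.array([1, 2, 3]) + np.array([1, 2])
except ValueError as error:
    broadcast_error = str(error)
print("Error:", broadcast_error)
```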

Vectorized operations

def f(x):
    """
    Custom function to perform some operation on each element.
    This example squares each element.
    """
    return x**2

arr = np.array([1, 2, 3, 4, 5])
print('Original Array: ', arr)

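# np.vectorize applies f to each element. It is a convenience wrapper around a
# Python-level loop, not a speed optimization; here f(arr) or arr**2 gives the
# same result using NumPy's built-in element-wise arithmetic.
arr_f = np.vectorize(f)(arr)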
print('Transformed Array:', arr_f)

Output:

Original Array:  [1 2 3 4 5]
Transformed Array: [ 1  4  9 16 25]
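When a function is built from arithmetic that NumPy already applies element-wise, you can usually call it on the array directly. The sketch below compares that with np.vectorize and uses np.where for element-wise conditional logic; the threshold of 3 is an arbitrary choice for illustration.

```python
import numpy as np

def f(x):
    """Square the input; works on scalars and on whole arrays."""
    return x ** 2

arr = np.array([1, 2, 3, 4, 5])

# Direct call: ** is applied to every element in compiled code.
direct = f(arr)

# np.vectorize gives the same values but loops in Python.
via_vectorize = np.vectorize(f)(arr)
print(np.array_equal(direct, via_vectorize))  # True

# np.where picks between two values for each element without a loop.
capped = np.where(arr > 3, 3, arr)
print(capped)                                 # [1 2 3 3 3]
```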

Filters and aggregations

# Create a 1D array
array = np.array([1, 2, 3, 4, 5])

# Filter elements greater than 2
filtered_array = array[array > 2]

print("Original Array:", array)
print("Boolean Mask:", array > 2)
print("Filtered Array (elements > 2):", filtered_array)

# Calculate the sum of array elements
sum_result = np.sum(array)

# Calculate the mean of array elements
mean_result = np.mean(array)

# Find the maximum and minimum values in the array
max_value = np.max(array)
min_value = np.min(array)

print("Original Array:", array)
print("Sum:", sum_result)
print("Mean:", mean_result)
print("Max Value:", max_value)
print("Min Value:", min_value)

Output:

Original Array: [1 2 3 4 5]
Boolean Mask: [False False  True  True  True]
Filtered Array (elements > 2): [3 4 5]
Original Array: [1 2 3 4 5]
Sum: 15
Mean: 3.0
Max Value: 5
Min Value: 1
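On multi-dimensional arrays, aggregations accept an axis argument that controls which dimension is collapsed. A short sketch with an illustrative 2x3 array:

```python
import numpy as np

scores = np.array([[1, 2, 3], [4, 5, 6]])

# axis=0 collapses the rows, giving one result per column.
column_sums = scores.sum(axis=0)                # [5 7 9]

# axis=1 collapses the columns, giving one result per row.
row_means = scores.mean(axis=1)                 # [2. 5.]

# Position of the largest value in each row.
row_max_positions = scores.argmax(axis=1)       # [2 2]

# Filters and aggregations combine: count the elements greater than 2.
count_above_two = np.count_nonzero(scores > 2)  # 4

print(column_sums, row_means, row_max_positions, count_above_two)
```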

Linear Algebra operations

# Create a 3x2 matrix
matrix1 = np.array([[1, 2], [3, 4], [5, 6]])

# Create a 2x3 matrix
matrix2 = np.array([[1, 1, 1], [2, 2, 2]])

# Create a vector
vector = np.array([3, 3])

# Using np.matmul for matrix-matrix multiplication
matmul_result = np.matmul(matrix1, matrix2)

# Using @ operator for matrix-matrix multiplication
at_operator_result = matrix1 @ matrix2

# Using np.dot for matrix-vector multiplication
dot_result = np.dot(matrix1, vector)

# Transpose the 3x2 matrix into a 2x3 matrix
matrix1_transposed = np.transpose(matrix1)

print("Matrix1:")
print(matrix1)

print("Matrix2:")
print(matrix2)

print("\nVector:")
print(vector)

print("\nMatrix-Matrix Multiplication using np.matmul:")
print(matmul_result)

print("\nMatrix-Matrix Multiplication using @ operator:")
print(at_operator_result)

print("\nMatrix-Vector Multiplication using np.dot:")
print(dot_result)

print("\nMatrix Transposed:")
print(matrix1_transposed)

Output:

Matrix1:
[[1 2]
 [3 4]
 [5 6]]
Matrix2:
[[1 1 1]
 [2 2 2]]

Vector:
[3 3]

Matrix-Matrix Multiplication using np.matmul:
[[ 5  5  5]
 [11 11 11]
 [17 17 17]]

Matrix-Matrix Multiplication using @ operator:
[[ 5  5  5]
 [11 11 11]
 [17 17 17]]

Matrix-Vector Multiplication using np.dot:
[ 9 21 33]

Matrix Transposed:
[[1 3 5]
 [2 4 6]]
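The numpy.linalg functions mentioned earlier, solve() and eig(), can be sketched in the same way. The 2x2 system below is an arbitrary example chosen so that the solution is easy to check by hand.

```python
import numpy as np

# Solve the system:  2x + y = 5
#                    x + 3y = 10
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([5.0, 10.0])
solution = np.linalg.solve(A, b)
print("Solution:", solution)  # approximately [1. 3.]

# Eigenvalues and eigenvectors: each column v of `eigenvectors`
# satisfies A @ v = eigenvalue * v.
eigenvalues, eigenvectors = np.linalg.eig(A)
print("Check:", np.allclose(A @ eigenvectors, eigenvectors * eigenvalues))  # True
```

Because these routines work in floating point, compare results with np.allclose() rather than exact equality.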

Conclusion

NumPy’s significance in the realm of numerical computing cannot be overstated. Its ability to handle large datasets efficiently, perform vectorized operations, and seamlessly integrate with other Python libraries makes it an indispensable tool for researchers, scientists, and developers alike. By understanding the core concepts and harnessing the power of NumPy, you unlock a world of possibilities for numerical analysis and scientific computing in Python.
