Essential Libraries To Have In Your Toolbox For Data Science And ML — Series #1 — NumPy

Kaan Ceylan
8 min read · Feb 16, 2022


Illustration from realpython.com — https://realpython.com/numpy-tutorial/

In this series of blog posts I will introduce you to the must-have libraries if you're going to use Python for working with data and/or ML. The stack will be split into two main parts: the first part covers libraries that let you work with, manipulate and visualize data, and in the second part we will look at machine learning libraries and frameworks.

Anything that involves machine learning and data relies heavily on computational power, without a doubt. You need to do vectorization, broadcasting (we will get to those in a bit) and work with tensors (which are just matrices on steroids). NumPy, being written mostly in C, brings the computational power these operations need to Python. It is pretty much the go-to library for any computational work and data manipulation in Python. Whether your goal is broadcasting, indexing or something more ambitious like writing an ML algorithm from scratch, NumPy is the answer. Let's look at a code snippet where we use Python's built-in random module and NumPy's random module to generate 10 million random integers, see how long each takes to run, and then look at the results. I'll be using timeit to measure each approach.

Python and Numpy Random Integers Code Snippet
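The original snippet was embedded as an image, so here is a minimal reconstruction of the comparison (the exact timings, and therefore the exact speedup, will differ from machine to machine):

```python
import random
import timeit

import numpy as np

n = 10_000_000  # generate one random integer at a time, 10 million times each

py_time = timeit.timeit(lambda: random.randint(0, 100), number=n)
np_time = timeit.timeit(lambda: np.random.randint(0, 100), number=n)

print(f"random.randint:    {py_time:.2f} s")
print(f"np.random.randint: {np_time:.2f} s")
```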

As you can see, Python wins this round, coming in roughly 7 times faster than NumPy. Now let's see what happens when we generate not one value at a time but a whole array of random values. This is the part that matters the most, since you will almost always be working with arrays.
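Continuing with the same imports and the same n, the array version of the comparison might look like this (again, your exact numbers will vary):

```python
# This time produce all 10 million values in one go instead of one at a time.
py_list_time = timeit.timeit(
    lambda: [random.randint(0, 100) for _ in range(n)], number=1
)
np_array_time = timeit.timeit(
    lambda: np.random.randint(0, 100, size=n), number=1
)

print(f"Python list comprehension: {py_list_time:.2f} s")
print(f"np.random.randint(size=n): {np_array_time:.2f} s")
```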

Well well well, how the turntables… Looks like this time NumPy is about 8 times faster. The main reason is that NumPy performs the whole operation in a single call to optimized, pre-compiled C code instead of looping in Python, and some of its routines can additionally split the work across multiple threads. Put together, that makes it an indispensable tool for working with arrays.

Why does everyone use NumPy?

NumPy gives you an enormous range of fast and efficient ways of creating arrays and manipulating the numerical data inside them. While a Python list can hold different data types in a single list, all of the elements in a NumPy array must be homogeneous; the mathematical operations meant to be performed on arrays would be far less efficient otherwise. Because of that, NumPy arrays are faster and more compact than Python lists: an array consumes much less memory, and NumPy lets you specify the data type explicitly, which allows the code to be optimized even further.
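To make that concrete, here's a small sketch comparing the memory footprint of a plain Python list with an int32 NumPy array (the exact sizes depend on your Python build, and sys.getsizeof only counts the list's pointer array, so the real gap is even bigger):

```python
import sys

import numpy as np

numbers = list(range(1_000_000))
arr = np.arange(1_000_000, dtype=np.int32)  # explicitly pick a compact data type

print(sys.getsizeof(numbers))  # size of the list's pointer array, roughly 8 MB
print(arr.nbytes)              # 1_000_000 * 4 bytes = 4_000_000 bytes
```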

What Makes NumPy So Good For Working With Arrays

  • NumPy arrays contain only a single data type, which makes them really memory-friendly and easy to process.
  • NumPy's heavy calculations run in optimized, pre-compiled loops, and its linear algebra routines can divide the work between all the available cores on your machine, which cuts the runtime down significantly. NumPy itself runs on your CPU and doesn't have GPU support; if you have a CUDA-enabled GPU, separate NumPy-like libraries can run similar calculations on it. That massive parallelism is also why most (if not all) ML frameworks run on GPUs, but we'll get to those in the upcoming posts.
  • NumPy is mostly written in C, which makes it fast from the start since the C code is compiled ahead of time rather than interpreted line by line.
  • Beware though, some methods (e.g. np.mean()) are implemented as plain, single-threaded C loops, so not everything is parallelized.

Vectorization And Broadcasting Explained

Vectorization

When your code runs scalarly, each array element is processed one at a time, in a single stream of instructions on the CPU. Vectorization lets you apply an operation to the whole array at once, making use of processing power that would otherwise sit idle and giving your code a significant performance boost. Implementing vectorization is a really important step if you are writing a machine learning algorithm from scratch, as the sketch below shows.
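Here is a small sketch of the difference: the same element-wise multiplication written first as a scalar Python loop and then as a single vectorized expression (the array size is arbitrary, and the vectorized line is typically orders of magnitude faster):

```python
import numpy as np

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

# Scalar style: the CPU is fed one element at a time from a Python loop.
result_loop = np.empty_like(a)
for i in range(len(a)):
    result_loop[i] = a[i] * b[i]

# Vectorized: the whole multiplication runs in one optimized, compiled loop.
result_vec = a * b

assert np.allclose(result_loop, result_vec)
```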

Broadcasting

Broadcasting is used by NumPy when two arrays or matrices with different shapes are used in a computation. NumPy compares their shapes dimension by dimension; a dimension of size 1 (or a missing dimension) is conceptually stretched, by repeating its values, until it matches the other array, without actually copying any data. I know this is hard to put into a simple sentence, so here are some illustrations from the official NumPy documentation.

Documentation image from https://numpy.org/doc/stable/user/basics.broadcasting.html

The second array is turned into a 4 by 3 matrix with each row consisting of values [1, 2, 3]. Then the addition operation is performed.

Documentation image from https://numpy.org/doc/stable/user/basics.broadcasting.html

The trailing dimensions, 3 and 4, do not match, so we can't do what we did in the first example. If the second array had a shape of (4, 1) instead, it would have worked, as the sketch below shows.
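Here is a short sketch mirroring the documentation examples above: broadcasting a (3,) array against a (4, 3) array works, a (4,) array doesn't, and reshaping it to (4, 1) fixes it:

```python
import numpy as np

a = np.array([[ 0,  0,  0],
              [10, 10, 10],
              [20, 20, 20],
              [30, 30, 30]])   # shape (4, 3)
b = np.array([1, 2, 3])        # shape (3,)

print(a + b)                   # b is (conceptually) repeated across all 4 rows

c = np.array([1, 2, 3, 4])     # shape (4,)
# a + c                        # ValueError: operands could not be broadcast together

d = c.reshape(4, 1)            # shape (4, 1) broadcasts against (4, 3)
print(a + d)
```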

Basics Of NumPy and Some Core Functions

Creating n-dimensional arrays

You can create 1D, 2D, 3D and higher-dimensional arrays; in practice the limiting factor is your machine's hardware and its memory rather than NumPy itself. There are workarounds, such as choosing a smaller data type or using sparse matrices (provided by SciPy), which store only the non-zero values and their positions in the matrix. You end up saving a lot of space if you have a really large matrix with mostly zero entries.
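As a quick illustration, here's what arrays of different dimensions look like, plus a sparse matrix for the mostly-zeros case (sparse matrices come from SciPy rather than NumPy itself, so this assumes SciPy is installed):

```python
import numpy as np
from scipy.sparse import csr_matrix  # sparse matrices live in SciPy, not NumPy

one_d = np.array([1, 2, 3])
two_d = np.array([[1, 2], [3, 4]])
three_d = np.zeros((2, 3, 4), dtype=np.float32)  # a smaller dtype also saves memory

print(one_d.ndim, two_d.ndim, three_d.ndim)      # 1 2 3

# A mostly-zero matrix stored sparsely keeps only the non-zero entries (and their positions).
dense = np.eye(1000)                 # 1000 x 1000 identity matrix, 8 MB as float64
sparse = csr_matrix(dense)           # stores only the 1000 non-zero values
print(dense.nbytes, sparse.data.nbytes)  # 8000000 vs 8000 bytes of actual values
```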

Don’t Try This At Home

Let's do a little experiment. I'm going to run np.zeros(), which generates a matrix of the shape you specify, filled with zeros, with the data type float64 and column-major order. The order determines how the entries are laid out in memory: column-major ('F') goes column by column while row-major ('C') goes row by row. With the order set to 'C' you can expect better runtimes for row-wise operations on the array, and with 'F', column-wise operations run quicker.
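The original call and its output were screenshots, so the exact shape below is an assumption; it's just big enough to make the point. A (100,000, 100,000) float64 array needs 100,000 × 100,000 × 8 bytes, roughly 80 GB, which is far more than a 12.7 GB Colab instance has:

```python
import numpy as np

# Hypothetical shape for illustration: ~80 GB of float64 zeros in column-major order.
huge = np.zeros((100_000, 100_000), dtype=np.float64, order='F')
```

Depending on the operating system's memory settings, a request like this either fails immediately with a MemoryError (NumPy reports how many GiB it couldn't allocate) or the runtime is killed once the memory is actually touched.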

I tried to create this huge matrix on a Colab instance with 12.7 gigabytes of memory.

And I watched my session crash with an out-of-memory error.

So as you can see, there are limits to what you can do but nevertheless NumPy is a really powerful tool.

Starter Functions Of NumPy

In this section we'll take a look at some basic but frequently used functions to get you started with NumPy. Let's say that you're working with a (5, 4) matrix and you've assigned it to a variable named test_arr. The first number gives you the row count, the second the column count of your matrix.

  • np.ndim() — Gives you the number of dimensions of your matrix. test_arr.ndim will output 2 since it's a 2-dimensional matrix.
  • np.shape() — Gives the shape of your matrix, which is (5, 4) for test_arr.
  • np.size() — Outputs the size of the given matrix. The size is the product of its row and column counts, which for test_arr is 20.
  • np.append() — If a 2 or higher dimensional matrix is passed to the function, the matrix is flattened by default (unless you specify an axis to append along) and then the given values are appended at the end of the array. If the number of values you've appended matches either the column or row length of the original matrix, you can reshape the result back into a proper matrix with the reshape function, passing it (6, 4) or (5, 5) in this case, or any other shape as long as the total size checks out. You pass the values to be appended through the values parameter; note that np.append always returns a new array rather than modifying the original. You can also use the axis parameter to specify the axis to append along: axis=0 appends the values as new rows, axis=1 appends them as new columns. Keep in mind that appending along an axis requires the array you pass to match the original matrix's shape along every other axis.
  • np.insert() — Inserts the passed value or values into the array before the specified index. Let's see this in an example so that you can distinguish between append() and insert() better (a runnable sketch follows below).

We've created an array called test_arr using the zeros function, which gives you an array of the given shape filled entirely with zeros (you can also use ones).

Let's use the insert function on test_arr. The 0 determines the index the value 1 will be inserted before, and axis=1 means the insertion happens along the columns, so a 1 ends up at index 0 of each row.

Keep in mind that because we called np.insert without assigning the result to a variable, it did not overwrite the original test_arr: the function returns a new array, and without an assignment you can't reference the result later on.
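The screenshots aren't reproduced here, so the exact shape is an assumption, but a reconstruction of the idea might look like this:

```python
import numpy as np

test_arr = np.zeros((3, 4))   # assumed shape; the original screenshot isn't shown

# insert: put the value 1 before column index 0 of every row (axis=1 -> a new column)
with_column = np.insert(test_arr, 0, 1, axis=1)
print(with_column.shape)      # (3, 5)

# append for contrast: add a whole new row of ones (axis=0 -> a new row)
with_row = np.append(test_arr, np.ones((1, 4)), axis=0)
print(with_row.shape)         # (4, 4)

# Both functions return new arrays; without an assignment the original is untouched.
print(test_arr.shape)         # still (3, 4)
```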

  • np.linspace() and np.arange() — Both return a 1D array with values between the given start and stop points. The major difference is that with arange you specify the step size between two consecutive entries, while with linspace you specify how many values you want and it works out the step size so that the values are evenly spaced over the interval.
  • np.transpose() — Lastly we have the transpose function, which turns the columns of the given matrix into rows, changing the shape of a (3, 4) matrix into (4, 3), for example. A short sketch of these starter functions follows below.
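To wrap up, here's a sketch of these starter functions using a hypothetical (5, 4) test_arr like the one described above:

```python
import numpy as np

test_arr = np.random.rand(5, 4)     # a (5, 4) matrix of random values

print(test_arr.ndim)                # 2
print(test_arr.shape)               # (5, 4)
print(test_arr.size)                # 20

print(np.arange(0, 1, 0.25))        # step size given:        [0.   0.25 0.5  0.75]
print(np.linspace(0, 1, num=5))     # number of points given: [0.   0.25 0.5  0.75 1.  ]

print(np.transpose(test_arr).shape) # (4, 5)
```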

In this one I've tried to give you an idea of what's going on under the hood of NumPy, what kind of advantages that design provides and why it's so important to have it under your belt, and we've also learned some basic functions to get you started using this awesome library in your own code. I hope this was helpful, because in the upcoming posts we're going to take a look at pandas, another very popular Python library that is built on top of NumPy. Getting comfortable with matrix operations and taking the logic behind them to heart will make your life a lot easier when working with pandas. I'll see you in the next post!

