Exploring Data Arrays with NumPy in Python!

Sara karim
4 min read · Jul 18, 2021


A significant part of a data scientist’s role is to explore, analyze, and visualize data. There’s a wide range of tools and programming languages they can use to do this, and one of the most popular approaches is to use Jupyter notebooks and Python.

Python is a flexible programming language that is used in a wide range of scenarios, from web applications to device programming. It’s extremely popular in the data science and machine learning community because of the many packages it supports for data analysis and visualization.

In this article, we’ll explore some of these packages and apply basic techniques to analyze data. This is not intended to be a comprehensive Python programming exercise, or even a deep dive into data analysis.

Exploring data arrays with NumPy

Let’s start by looking at some simple data.

Suppose a college takes a sample of student grades for a data science class.

The data has been loaded into a Python list structure, which is a good data type for general data manipulation, but not optimized for numeric analysis. For that, we’re going to use the NumPy package, which includes specific data types and functions for working with numbers in Python.
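As a minimal sketch of this step, the grades could be held in a list and then converted with np.array. The original listing is not shown here, so the 22 grade values and the names data and grades are assumptions for illustration only.

```python
import numpy as np

# Assumed sample of 22 student grades (illustrative values, not from the article)
data = [50, 50, 47, 97, 49, 3, 53, 42, 26, 74, 82,
        62, 37, 15, 70, 27, 36, 35, 48, 52, 63, 64]

# Convert the general-purpose list into a NumPy array for numeric work
grades = np.array(data)

print(type(data))    # the built-in list type
print(type(grades))  # numpy.ndarray
```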

Just in case you’re wondering about the differences between a list and a NumPy array, let’s compare how these data types behave when we use them in an expression that multiplies them by 2.

Note that multiplying a list by 2 creates a new list of twice the length, with the original sequence of list elements repeated. Multiplying a NumPy array, on the other hand, performs an element-wise calculation in which the array behaves like a vector, so we end up with an array of the same size in which each element has been multiplied by 2.
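The difference is easy to see on a short example. This sketch uses a hypothetical three-element list rather than the full grade data, purely to keep the output readable.

```python
import numpy as np

data = [50, 47, 97]         # short hypothetical list of grades
grades = np.array(data)     # the same values as a NumPy array

doubled_list = data * 2     # repeats the sequence: [50, 47, 97, 50, 47, 97]
doubled_array = grades * 2  # multiplies each element: 100, 94, 194

print(doubled_list)
print(doubled_array)
```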

The key takeaway from this is that NumPy arrays are specifically designed to support mathematical operations on numeric data — which makes them more useful for data analysis than a generic list.

You might have spotted that the class type for the NumPy array above is a numpy.ndarray. The nd indicates that this is a structure that can consist of multiple dimensions (it can have n dimensions). Our specific instance has a single dimension of student grades.

The shape confirms that this array has only one dimension, which contains 22 elements (there are 22 grades in the original list). You can access the individual elements in the array by their zero-based ordinal position. Let’s get the first element (the one in position 0).
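Both checks described above can be sketched as follows. The grade values are assumed for illustration (the original listing is not shown), chosen only so that the array holds 22 elements.

```python
import numpy as np

# Assumed sample of 22 student grades (illustrative values)
grades = np.array([50, 50, 47, 97, 49, 3, 53, 42, 26, 74, 82,
                   62, 37, 15, 70, 27, 36, 35, 48, 52, 63, 64])

print(grades.shape)  # (22,): one dimension holding 22 elements
print(grades[0])     # element at position 0, which is 50 in this sample
```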

Alright, now that you know your way around a NumPy array, it’s time to perform some analysis of the grades data.

You can apply aggregations across the elements in the array, so let’s find the simple average grade (in other words, the mean grade value).
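A minimal sketch of that aggregation uses the array's mean method. The grade values below are assumed for illustration, so the exact figure depends on them.

```python
import numpy as np

# Assumed sample of 22 student grades (illustrative values)
grades = np.array([50, 50, 47, 97, 49, 3, 53, 42, 26, 74, 82,
                   62, 37, 15, 70, 27, 36, 35, 48, 52, 63, 64])

# mean() sums the elements and divides by how many there are
mean_grade = grades.mean()
print(f"Mean grade: {mean_grade:.2f}")  # 49.18 for these assumed values
```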

So the mean grade is just around 50 — more or less in the middle of the possible range from 0 to 100.

Let’s add a second set of data for the same students, this time recording the typical number of hours per week they devoted to studying.
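One way to do this is to build a second list and pass both lists to np.array together. The study-hour and grade values, and the names study_hours and grades, are assumptions for illustration; only student_data is named in the text.

```python
import numpy as np

# Assumed sample of 22 student grades (illustrative values)
grades = [50, 50, 47, 97, 49, 3, 53, 42, 26, 74, 82,
          62, 37, 15, 70, 27, 36, 35, 48, 52, 63, 64]

# Assumed weekly study hours for the same 22 students (illustrative values)
study_hours = [10.0, 11.5, 9.0, 16.0, 9.25, 1.0, 11.5, 9.0, 8.5, 14.5, 15.5,
               13.75, 9.0, 8.0, 15.5, 8.0, 9.0, 6.0, 10.0, 12.0, 12.5, 12.0]

# Stack the two lists: row 0 holds study hours, row 1 holds grades
student_data = np.array([study_hours, grades])
print(student_data)
```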

Now the data consists of a 2-dimensional array — an array of arrays. Let’s look at its shape.
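The shape attribute reports the size of each dimension. In this sketch the values are assumed for illustration; what matters is that there are two rows of 22 values each.

```python
import numpy as np

# Assumed sample data for 22 students (illustrative values)
study_hours = [10.0, 11.5, 9.0, 16.0, 9.25, 1.0, 11.5, 9.0, 8.5, 14.5, 15.5,
               13.75, 9.0, 8.0, 15.5, 8.0, 9.0, 6.0, 10.0, 12.0, 12.5, 12.0]
grades = [50, 50, 47, 97, 49, 3, 53, 42, 26, 74, 82,
          62, 37, 15, 70, 27, 36, 35, 48, 52, 63, 64]

student_data = np.array([study_hours, grades])
print(student_data.shape)  # (2, 22): two rows, 22 values in each
```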

The student_data array contains two elements, each of which is an array containing 22 elements.

To navigate this structure, you need to specify the position of each element in the hierarchy. So to find the first value in the first array (which contains the study hours data), you can use the following code.
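A sketch of that lookup, using assumed values for illustration: the first index selects the row (0 for study hours) and the second selects the position within that row.

```python
import numpy as np

# Assumed sample data for 22 students (illustrative values)
study_hours = [10.0, 11.5, 9.0, 16.0, 9.25, 1.0, 11.5, 9.0, 8.5, 14.5, 15.5,
               13.75, 9.0, 8.0, 15.5, 8.0, 9.0, 6.0, 10.0, 12.0, 12.5, 12.0]
grades = [50, 50, 47, 97, 49, 3, 53, 42, 26, 74, 82,
          62, 37, 15, 70, 27, 36, 35, 48, 52, 63, 64]

student_data = np.array([study_hours, grades])

# First value in the first array (the study hours row)
first_hours = student_data[0][0]
print(first_hours)  # 10.0 in this assumed sample
```

NumPy also accepts the equivalent form student_data[0, 0], which does the same lookup in a single indexing step.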

Now you have a multidimensional array containing both the students’ study time and grade information, which you can use to compare data. For example, how does the mean study time compare to the mean grade?
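A minimal sketch of that comparison takes the mean of each row separately. The values and the names avg_study and avg_grade are assumptions for illustration, so the printed figures apply only to this sample.

```python
import numpy as np

# Assumed sample data for 22 students (illustrative values)
study_hours = [10.0, 11.5, 9.0, 16.0, 9.25, 1.0, 11.5, 9.0, 8.5, 14.5, 15.5,
               13.75, 9.0, 8.0, 15.5, 8.0, 9.0, 6.0, 10.0, 12.0, 12.5, 12.0]
grades = [50, 50, 47, 97, 49, 3, 53, 42, 26, 74, 82,
          62, 37, 15, 70, 27, 36, 35, 48, 52, 63, 64]

student_data = np.array([study_hours, grades])

avg_study = student_data[0].mean()  # mean of the study hours row
avg_grade = student_data[1].mean()  # mean of the grades row

print(f"Average study hours: {avg_study:.2f}")  # 10.52 for these values
print(f"Average grade: {avg_grade:.2f}")        # 49.18 for these values
```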

KUDOS…!!! That’s it for now!

Now you carry on, and if you have any questions about this article, let me know in the comments section.

Good Luck!

In my next article, we’ll take a look at how to explore tabular data with Pandas and explore your data in more interesting ways.

Happy Programming!
