The Herpetologist Social Scientist guide to NumPy

Because even Social Scientists have to deal with numbers and matrices

Giulio Gabrieli
The Herpetologist Social Scientist
6 min read · Mar 15, 2020


Social Scientists deal with numbers. Daily. Behavioral data, physiological signals, brain activation patterns: all of these are quantitative measurements that we can empirically investigate. In a previous article I introduced you to the magic world of Python for the Social Sciences, and I listed NumPy as one of my go-to libraries for data science. In this article, I’ll dig a little deeper into NumPy, a Python package for numerical computation, and give you examples of how to successfully integrate it into your workflow.

Coding in Python may be challenging, that’s why we are here! (Photo by Émile Perron on Unsplash)

What is NumPy

According to the official documentation, NumPy is the fundamental Python package for scientific computing, and it really is. In fact, NumPy adds support for, among other things: n-dimensional arrays, linear algebra, and the Fourier transform. Moreover, the package simplifies operations on matrices, speeds up operations on big vectors, and reduces the memory needed to store multi-dimensional arrays, cutting down the computational cost. But most of all, for our fields, NumPy supports “vectorized operations”. If this sounds like a foreign language to you, I promise that by the end of this post you will want to use NumPy in your projects and analyses.
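To give you a quick taste of what “vectorized” means, here is a minimal sketch (with made-up numbers) comparing a plain Python loop with NumPy’s element-wise arithmetic, where a single expression operates on the whole array at once:

```python
import numpy as np

scores = [5, 7, 9, 6]                   # a plain Python list
shifted_loop = [s + 1 for s in scores]  # element-wise, with an explicit loop

arr = np.array(scores)                  # the same data as a NumPy array
shifted_vec = arr + 1                   # the same operation, vectorized

print(shifted_loop)  # [6, 8, 10, 7]
print(shifted_vec)   # [ 6  8 10  7]
```

On arrays with thousands or millions of values, the vectorized version is not only shorter but also dramatically faster, because the loop runs in optimized compiled code instead of Python.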

Installing Numpy

Before getting our hands dirty with NumPy, we need to clear up a few prerequisites; more specifically, we need a Python environment with NumPy installed. If you don’t have your Python environment up and running (yet), you may check out the instructions in one of my favourite books. To install NumPy with pip, open a terminal and run*

pip install numpy

*sudo may be required on Linux, and you may need to replace pip with conda if you are in a conda environment. But this is up to your setup, so have fun debugging it :)
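Once installed, a quick sanity check is to import the package and print its version (any reasonably recent version will do for this tutorial):

```python
import numpy as np

# If this runs without an ImportError, NumPy is ready to use
print(np.__version__)
```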

A Python, because why not? (Photo by Markéta Marcellová on Unsplash)

NumPy in action: the Digit Span Memory Task

To show you how NumPy works, let’s walk through a small example. Consider a Digit Span Memory Task, where participants are required to remember the longest sequence of digits they can. What usually happens is that individuals remember between 5 and 9 digits, or, as it is usually reported, 7±2, meaning that humans are likely to remember, on average, 7 numbers, with a standard deviation of 2 (or at least WEIRD —Western, Educated, Industrialized, Rich, Democratic— individuals do).

In this example we will try to analyze a dataset of responses to a Digit Span Memory Task, obviously using NumPy.

Getting the data

Now that you have your tools ready, it’s time to get some data to analyze. I generated a dataset for you, that you can download from the Github repository.

The dataset contains 4000 measures for the span test, conducted on two different samples. Each sample is tested twice, such that:

  • The first 1000 lines are the first trial of population A
  • Lines from 1001 to 2000 are the second trial of population A
  • Lines 2001 to 3000 are the first trial of population B
  • Last 1000 lines are the second trial of population B

Participants’ order is fixed, such that the participant in the first line of population A, trial one, is the same participant as in the first line of population A, trial two. This allows us to perform paired tests.

What we hypothesize is (a) no differences between populations A and B when comparing first vs. first and second vs. second trials, and (b) that both populations will perform significantly better during the second trial, compared to the first. Finally, we expect (c) the second trials of the two populations to have mean = 7 and standard deviation = 2.

First, we have to import NumPy, and since we don’t wanna type the full name of the package every time, we assign it a shorter name, such as np.

import numpy as np

We can now read in our dataset, with the command below (make sure to change the path):

data = np.loadtxt('/path/to/spanmemory.txt')
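If you would rather not download the file, you can generate a comparable synthetic dataset yourself. This is just a sketch with made-up parameters (not the actual data from the repository), but it produces a file in the same format that loadtxt expects:

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded, so the file is reproducible

# Four blocks of 1000 integer span scores, clipped to a plausible range;
# the means and standard deviations here are illustrative assumptions
a1 = np.clip(np.round(rng.normal(6, 2, 1000)), 1, 12)  # population A, trial 1
a2 = np.clip(np.round(rng.normal(7, 2, 1000)), 1, 12)  # population A, trial 2
b1 = np.clip(np.round(rng.normal(6, 2, 1000)), 1, 12)  # population B, trial 1
b2 = np.clip(np.round(rng.normal(7, 2, 1000)), 1, 12)  # population B, trial 2

data = np.concatenate((a1, a2, b1, b2))  # 4000 values, one per line
np.savetxt('spanmemory.txt', data)
```

Keep in mind that the exact numbers in the rest of this post come from the dataset in the repository, so a synthetic file will give you similar, but not identical, results.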

To verify that the dataset is as we expect (4000 lines), we can check the shape attribute, which returns a tuple with one entry per dimension:

data.shape
Out: (4000,)

Is this a list? We can check it out using:

type(data)
Out: numpy.ndarray

data is a numpy.ndarray, an object with specific methods that extend the capabilities of normal lists. You can read more about numpy.ndarray and all its available methods in the official documentation.

Now, let’s start dividing our dataset into the two populations. We can achieve this by slicing the dataset, as shown below. Similarly to simple lists, we can use indexes to select specific values from a NumPy array (remember that Python starts counting from 0).

data_A = data[0:2000] #lines 1 to 2000 (indexes 0 to 1999)
data_B = data[2000:] #everything from the 2001st line onward

In the same way, we can divide the lines into first and second trials:

data_A_1 = data[0:1000]
data_A_2 = data[1000:2000]
data_B_1 = data[2000:3000]
data_B_2 = data[3000:]

Are there differences between A and B?

Our first hypothesis is that there are no differences between A and B in the respective trials, which means that data_A_1 = data_B_1 and data_A_2 = data_B_2.

To test this, we rely on Student’s t-test. Unfortunately, NumPy does not come with a t-test function, but SciPy does, so we will have to import it as well. More specifically, the functions we want are ttest_ind (independent samples) and ttest_rel (paired samples), both available in scipy.stats. We can import SciPy’s stats module with

from scipy import stats

Now, we can run an independent-samples test between data_A_1 and data_B_1, and between data_A_2 and data_B_2. If the data are good, we should see no differences (p > 0.05) between the sets.

stats.ttest_ind(data_A_1, data_B_1)
Out: Ttest_indResult(statistic=7.037594040966266, pvalue=2.6776110141062265e-12)
stats.ttest_ind(data_A_2, data_B_2)
Out: Ttest_indResult(statistic=10.726971662024104, pvalue=3.856740910831655e-26)

Something is terribly wrong here. The data are completely different. How is this possible? Let’s explore the means and standard deviations.

data_A_1.mean()
Out: 5.875
data_B_1.mean()
Out: 4.915
data_A_2.mean()
Out: 6.911
data_B_2.mean()
Out: 5.953

What’s wrong here? The explanation is simple! The assistant collecting data from population B is a huge Python fan, and he started counting from zero! He just confirmed this to me over the phone, so we can safely increase all the values in data_B_1 and data_B_2 by 1. Doing it in NumPy is as easy as writing array + 1. No jokes!

data_B_1 = data_B_1 + 1
data_B_1.mean(), data_B_1.std()
Out: (5.915, 3.073072566666788)
data_B_2 = data_B_2 + 1
data_B_2.mean(), data_B_2.std()
Out: (6.953, 1.998697325759956)

Now, we can re-run our tests, and hopefully verify that the two populations are not significantly different:

stats.ttest_ind(data_A_1, data_B_1)
Out: Ttest_indResult(statistic=-0.2932330850402614, pvalue=0.7693744491661901)
stats.ttest_ind(data_A_2, data_B_2)
Out: Ttest_indResult(statistic=-0.4702847701513781, pvalue=0.6382029058792937)

Great! Our first hypothesis is therefore confirmed.

How many elements can we remember? What if we practice? (Photo by Josh Applegate on Unsplash)

Is practice making perfect?

We can now test whether participants in the two populations perform significantly better on their second trial. Similarly to the situation above, we can run a test, in this case a paired test, and look at the p-value. If our hypothesis is correct, we should see significant differences between the first and second trials for both groups:

stats.ttest_rel(data_A_1, data_A_2)
Out: Ttest_relResult(statistic=-9.001568703639066, pvalue=1.0999615317257972e-18)
stats.ttest_rel(data_B_1, data_B_2)
Out: Ttest_relResult(statistic=-9.425198396562068, pvalue=2.8916510808963836e-20)

There we are! Our second hypothesis is confirmed!

7±2?

Now that we know that the two datasets collected at the same time point (first or second trial) are comparable, we can merge the two second-trial recordings and qualitatively verify whether, overall, the memory span of our participants is 7 elements (±2).

To merge the two ndarrays, we can use NumPy’s concatenate function:

data_2 = np.concatenate((data_A_2, data_B_2))

We can check the new shape of the array using the shape attribute:

data_2.shape
Out: (2000,)

And finally, mean and standard deviation of our new array:

np.round(data_2.mean()), np.round(data_2.std())
Out: (7.0, 2.0)

Amazing! All our hypotheses have been confirmed.

To sum up

To sum up, in this post I introduced you to the usage of NumPy arrays. We have learnt how to load a dataset using NumPy, check basic descriptive statistics (mean, std), modify all the values of an ndarray at once, concatenate and slice arrays, verify the shape of an ndarray, and round numbers. Although here we focused on one-dimensional arrays, the same operations apply to multidimensional arrays.
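For instance, the same mean and shape machinery works on a two-dimensional array, where an axis argument lets you compute statistics per column or per row (a toy example, unrelated to the dataset above):

```python
import numpy as np

# Two participants (rows) x three trials (columns)
scores = np.array([[5., 7., 9.],
                   [6., 6., 8.]])

print(scores.shape)          # (2, 3): two rows, three columns
print(scores.mean())         # grand mean over all six values (~6.83)
print(scores.mean(axis=0))   # mean of each trial (column): [5.5 6.5 8.5]
print(scores.mean(axis=1))   # mean of each participant (row): 7.0 and ~6.67
```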

I hope this post helped you understand how to integrate NumPy into your workflow. Let me know in the comments if you have any doubts, topics you’d like me to discuss, or packages that you would love to learn how to use. Thank you for reading this far, and enjoy your life as a Social Scientist with a Python!
