NUMPY VS SCIPY
This article compares NumPy and SciPy and looks at which one is the better fit for a given task.
What is NumPy?
NumPy is short for Numerical Python. It is a low-level library written in C and Fortran that provides high-level mathematical functions. It offers a high-performance multidimensional array object and tools for working with these arrays, which avoids the slowness of element-by-element loops in pure Python. An algorithm expressed as operations on whole arrays can therefore run quickly.
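As a minimal sketch of what "expressing an algorithm as a function on arrays" can look like, the snippet below converts temperatures with a Python loop and then with a single array expression. The sample values and variable names are made up for illustration.

```python
import numpy as np

# Hypothetical sample data: five temperatures in degrees Celsius.
celsius = np.array([0.0, 12.5, 25.0, 37.5, 100.0])

# Loop version: convert one element at a time in Python.
fahrenheit_loop = [c * 9 / 5 + 32 for c in celsius]

# Array version: the same formula applied to the whole array at once.
fahrenheit_vec = celsius * 9 / 5 + 32

print(fahrenheit_vec)
```

Both versions produce the same numbers. The array version hands the arithmetic to NumPy's compiled routines instead of running it element by element in the interpreter.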
What is SciPy?
SciPy is short for Scientific Python. It is a library that builds on NumPy to provide more mathematical functions. SciPy uses NumPy arrays as its basic data structure and comes with modules for commonly used tasks in scientific programming, including linear algebra, integration (calculus), ordinary differential equation solving, and signal processing.
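To give a feel for two of the modules mentioned above, here is a small sketch that uses scipy.integrate for a definite integral and scipy.linalg for a linear system. The integrand and the matrix are arbitrary examples chosen so the results are easy to check by hand.

```python
import numpy as np
from scipy import integrate, linalg

# Integration: the integral of x**2 from 0 to 1 is 1/3.
# quad() returns the estimate and an estimate of the absolute error.
area, abs_error = integrate.quad(lambda x: x**2, 0, 1)

# Linear algebra: solve A @ v = b, i.e. 3a + b = 9 and a + 2b = 8.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
v = linalg.solve(A, b)

print(area)  # close to 0.3333
print(v)     # close to [2, 3]
```

Note that the inputs and the result of linalg.solve() are ordinary NumPy arrays.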
Installation of NumPy and SciPy
- Install the latest version of Python from Python.org. Or see: How to install PIP on RHEL or CentOS 8 or install Numpy or Scipy in Python 3.7 in Windows 10.
- Get the latest version of pip from the command prompt or Python console.
- Once pip is available, type the following in the command prompt:
pip install numpy
pip install scipy
- When the installation finishes, type import numpy as np and import scipy as sc in your Python IDE. The aliases np and sc are a matter of choice, but the package names must be lowercase.
- Your NumPy and SciPy libraries are now imported and ready to use.
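One quick way to confirm that both packages installed correctly is to import them and print their versions. This is just a sanity check, and the version numbers you see will depend on your environment.

```python
import numpy as np
import scipy

# Both imports succeed only if the packages are installed
# for the interpreter that runs this script.
print("NumPy version:", np.__version__)
print("SciPy version:", scipy.__version__)
```

If either import raises ModuleNotFoundError, the package was most likely installed for a different Python interpreter than the one your IDE is using.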
NumPy Correlation Calculation
NumPy has many statistics routines, including np.corrcoef(), which returns a matrix of Pearson correlation coefficients. You can start by importing NumPy and defining two NumPy arrays. These are instances of the class ndarray. Call them x and y:
>>> import numpy as np
>>> x = np.arange(10, 20)
>>> x
array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
>>> y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])
>>> y
array([ 2,  1,  4,  5,  8, 12, 18, 25, 96, 48])
Here, you use np.arange() to create an array x of integers between 10 (inclusive) and 20 (exclusive). Then you use np.array() to create a second array y containing arbitrary integers.
Once you have two arrays of the same length, you can call np.corrcoef() with both arrays as arguments:
>>> r = np.corrcoef(x, y)
>>> r
array([[1.        , 0.75864029],
       [0.75864029, 1.        ]])
>>> r[0, 1]
0.7586402890911867
>>> r[1, 0]
0.7586402890911869
corrcoef() returns the correlation matrix, which is a two-dimensional array with the correlation coefficients. Here’s a simplified version of the correlation matrix you just created:
     x     y
x  1.00  0.76
y  0.76  1.00
The values on the main diagonal of the correlation matrix (upper left and lower right) are equal to 1. The upper left value corresponds to the correlation coefficient for x and x, while the lower right value is the correlation coefficient for y and y. They are always equal to 1.
However, what you usually need are the lower left and upper right values of the correlation matrix. These values are equal and both represent the Pearson correlation coefficient for x and y. In this case, it’s approximately 0.76.
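To see where the off-diagonal value comes from, the sketch below computes the Pearson coefficient directly from its textbook definition (the sum of products of deviations divided by the square root of the product of the summed squared deviations) and compares it with np.corrcoef(). The names dx, dy, r_manual, and r_numpy are only illustrative.

```python
import numpy as np

x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

# Deviations of each value from the mean of its array.
dx = x - x.mean()
dy = y - y.mean()

# Pearson's r from the definition.
r_manual = (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

# The same coefficient taken from the correlation matrix.
r_numpy = np.corrcoef(x, y)[0, 1]

print(np.isclose(r_manual, r_numpy))
```

The two results agree up to floating-point rounding, which is why the comparison uses np.isclose() rather than ==.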
This figure shows the data points and the correlation coefficients for the above example:
The red squares are the data points. As you can see, the figure also shows the values of the three correlation coefficients.
SciPy Correlation Calculation
SciPy also has many statistics routines contained in scipy.stats. You can use the following functions to calculate the three correlation coefficients you saw earlier:
- pearsonr()
- spearmanr()
- kendalltau()
Here’s how you would use these functions in Python:
>>> import numpy as np
>>> import scipy.stats
>>> x = np.arange(10, 20)
>>> y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])
>>> scipy.stats.pearsonr(x, y) # Pearson's r
(0.7586402890911869, 0.010964341301680832)
>>> scipy.stats.spearmanr(x, y) # Spearman's rho
SpearmanrResult(correlation=0.9757575757575757, pvalue=1.4675461874042197e-06)
>>> scipy.stats.kendalltau(x, y) # Kendall's tau
KendalltauResult(correlation=0.911111111111111, pvalue=2.9761904761904762e-05)
Note that these functions return objects that contain two values:
- The correlation coefficient
- The p-value
You use the p-value in statistical methods when you’re testing a hypothesis. The p-value is an important measure that requires in-depth knowledge of probability and statistics to interpret. To learn more about them, you can read about the basics or check out a data scientist’s explanation of p-values.
You can extract the p-values and the correlation coefficients with their indices, as the items of tuples:
>>> scipy.stats.pearsonr(x, y)[0] # Pearson's r
0.7586402890911869
>>> scipy.stats.spearmanr(x, y)[0] # Spearman's rho
0.9757575757575757
>>> scipy.stats.kendalltau(x, y)[0] # Kendall's tau
0.911111111111111
You could also use dot notation for the Spearman and Kendall coefficients:
>>> scipy.stats.spearmanr(x, y).correlation # Spearman's rho
0.9757575757575757
>>> scipy.stats.kendalltau(x, y).correlation # Kendall's tau
0.911111111111111
The dot notation is longer, but it’s also more readable and more self-explanatory.
If you want to get the Pearson correlation coefficient and p-value at the same time, then you can unpack the return value:
>>> r, p = scipy.stats.pearsonr(x, y)
>>> r
0.7586402890911869
>>> p
0.010964341301680829
This approach exploits Python unpacking and the fact that pearsonr() returns a tuple with these two statistics.
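The same unpacking works for all three functions, so one possible way to collect the results is a small helper like the one sketched here. correlation_summary() is a hypothetical name used for illustration. It is not part of SciPy.

```python
import numpy as np
import scipy.stats


def correlation_summary(x, y):
    """Return {name: (coefficient, p_value)} for the three coefficients."""
    methods = {
        "pearson": scipy.stats.pearsonr,
        "spearman": scipy.stats.spearmanr,
        "kendall": scipy.stats.kendalltau,
    }
    summary = {}
    for name, func in methods.items():
        # Each result unpacks into a coefficient and a p-value.
        coefficient, p_value = func(x, y)
        summary[name] = (float(coefficient), float(p_value))
    return summary


x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])
for name, (coefficient, p_value) in correlation_summary(x, y).items():
    print(f"{name}: coefficient={coefficient:.4f}, p-value={p_value:.4g}")
```

Unpacking keeps the helper independent of the exact result type, which may differ between SciPy versions.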
Difference between NumPy and SciPy
- NumPy's array operations run in compiled code, which makes them fast compared with equivalent pure-Python loops. SciPy builds on those same arrays, and its routines tend to be higher-level algorithms that do more work per call, so they can take longer to run.
- NumPy mainly covers basic operations such as sorting, indexing, and elementary functions on the array data type. SciPy, on the other hand, provides fuller versions of the algebraic functions, some of which exist in NumPy only in a limited form.
- The elements of a NumPy array are homogeneous. The NumPy array object keeps track of the array's data type, its shape, and its dimensions. SciPy does not define a separate array type: as noted above, it uses NumPy arrays as its basic data structure, so the same type rules apply to the data it works on.
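A quick way to inspect the array properties discussed in this list is to look at dtype, shape, and ndim, and to check what type a SciPy routine hands back. The arrays below are arbitrary examples.

```python
import numpy as np
from scipy import linalg

# Mixing an int and a float: NumPy picks one dtype for every element.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)              # float64
print(mixed.shape, mixed.ndim)  # (3,) 1

# A SciPy routine called on a NumPy array.
diagonal = np.array([[2.0, 0.0], [0.0, 4.0]])
inverse = linalg.inv(diagonal)
print(type(inverse))
```

In this example, scipy.linalg.inv() returns a numpy.ndarray, the same array type that NumPy itself uses.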
Conclusion
NumPy and SciPy are two very important libraries for scientific computing and data science. They differ conceptually but overlap in functionality. A data scientist needs to know how to plot distributions, find correlations between data points, integrate and differentiate, and much more. A solid grounding in statistics and probability should be the foundation, and with the help of these libraries you can carry out these tasks with ease. So grab these tools and explore the world of data science in a smarter and easier way.