Getting Started with Python: How to Process Data with NumPy

Data Analysis Enthusiast
8 min read · Aug 16, 2019

Preface

This is the fourth in a series of data analysis tutorials. If you want to follow the upcoming tutorials, you are welcome to follow me, or you can find me on my Facebook page: Data Analysis Enthusiast.

Today, I will mainly explain how to use NumPy.

NumPy is not only one of the most widely used third-party libraries in Python, but also the foundation of data science libraries such as SciPy and Pandas. The data structure it provides is more compact and efficient than Python’s built-in ones, which is why it underpins data analysis in Python.

Last time I talked about the list, a built-in Python data structure that is essentially equivalent to an array. The key data type in NumPy is also an array, so why do we need a third-party array structure?

In standard Python, array values are stored in a list. Since the elements of a list can be any object, the list actually holds pointers to those objects. Although Python hides the concept of pointers from the programmer, a list is implemented as an array of pointers. So to store a simple array [0, 1, 2], Python needs 3 pointers and 3 integer objects, which wastes both memory and computation time.

Make your Python scientific computing more efficient with NumPy

Why use the NumPy array structure instead of Python’s own list? The objects a list refers to can be scattered across system memory, while a NumPy array is stored in a single contiguous block of memory with a uniform element type. A computation over the array can therefore walk straight through all the elements, whereas a list has to look up the memory address of each element first. This saves computational resources.
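
The difference in storage can be observed directly. The following is a minimal sketch, assuming CPython and a 64-bit integer dtype; the variable names are illustrative, and the exact list size varies by Python version and platform.

```python
import sys
import numpy as np

values = [0, 1, 2]
arr = np.array(values, dtype=np.int64)

# The list stores one pointer per element, and each int is a separate object.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# The array stores the raw values back to back: 3 elements * 8 bytes each.
print(arr.nbytes)                 # 24
print(arr.flags["C_CONTIGUOUS"])  # True
print(list_bytes > arr.nbytes)    # True on CPython
```

The nbytes attribute counts only the data buffer, which is the part that grows with the number of elements.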

Memory access patterns matter too: the CPU loads data from RAM into its cache in blocks, so contiguously stored data is fetched efficiently. Because the data is contiguous in memory, NumPy can process several consecutive floating-point numbers at once using the vectorized (SIMD) instructions of modern CPUs. In addition, matrix calculations in NumPy can be multi-threaded when NumPy is linked against a multi-threaded linear algebra library, which makes use of multi-core CPUs and greatly improves computational efficiency.

Of course, beyond simply using NumPy, a few tricks help you use memory and computing resources more efficiently. An important rule is to avoid implicit copies and to prefer in-place operations. For example, to double the values in x, write x *= 2 directly instead of y = x * 2.

The in-place version can be about twice as fast, or even more.
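
The difference between the two forms can be seen by checking whether the result reuses the original buffer. This is a small sketch with an illustrative array; it shows the memory behavior only and makes no claim about timing, which depends on array size and hardware.

```python
import numpy as np

x = np.arange(5, dtype=np.float64)              # [0. 1. 2. 3. 4.]
addr_before = x.__array_interface__["data"][0]  # address of x's data buffer

x *= 2                                          # in place: same buffer is reused
same_buffer = x.__array_interface__["data"][0] == addr_before

y = x * 2                                       # allocates a new array for the result
print(same_buffer)                              # True
print(np.shares_memory(x, y))                   # False
```

Note that x *= 2 overwrites the original values, so use it only when the old values are no longer needed.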

Since NumPy is so powerful, where do you start learning? There are two important objects in NumPy: ndarray (the N-dimensional array object), which solves the problem of multidimensional arrays, and ufunc (the universal function object), which is a function that operates on arrays. Below, I will introduce them one by one.

ndarray object

ndarray simply means a multidimensional array. In a NumPy array, the number of dimensions is called the rank: the rank of a one-dimensional array is 1, the rank of a two-dimensional array is 2, and so on. Each linear array along a dimension is called an axis, so the rank is the number of axes.
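
In code, the number of axes is available as the ndim attribute. A short sketch with two illustrative arrays:

```python
import numpy as np

a = np.array([1, 2, 3])               # one axis
b = np.array([[1, 2, 3], [4, 5, 6]])  # two axes

print(a.ndim)   # 1
print(b.ndim)   # 2
print(b.shape)  # (2, 3): length 2 along axis 0, length 3 along axis 1
```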

Below, you will see how to create arrays with ndarray objects and how to work with structured arrays.

Create array

Before creating an array, you need to import the NumPy library. You can then create an array directly with the array function. What about a multidimensional array, such as b in the example? Treat each inner array as an element and nest them: [1, 2, 3] is one element, [4, 5, 6] and [7, 8, 9] are the other two, and the three elements are placed inside an outer [] and assigned to the variable b.

Arrays also have attributes. For example, you can get the size of an array through its shape attribute and the type of its elements through dtype. To modify a value in the array, assign to it directly. Note that subscripts count from 0, so to modify the middle element of the 3x3 grid b, the subscript is [1, 1].
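
A minimal sketch of what such an example might look like, assuming a one-dimensional array a and the 3x3 array b described in the text. The replacement value 10 is an arbitrary choice for illustration.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print(a.shape)  # (3,)
print(b.shape)  # (3, 3)
print(b.dtype)  # an integer type; the exact width depends on the platform

b[1, 1] = 10    # row 1, column 1 is the middle element (originally 5)
print(b)
```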

Structured array

What if you want to record the names and ages of the students in a class, as well as their Chinese, English, and math scores? You could use array subscripts to represent the different fields, for example subscript 0 for the name, subscript 1 for the age, and so on, but this is not very readable.

In C, you can define an array of structures by declaring the structure type with struct. The fields of a structure occupy a contiguous memory space, and every structure has the same size. How do you do the same in NumPy?

In this example, the structure type is first defined with dtype. Then, when defining the array, dtype=persontype is passed to array, so the custom persontype can be used freely. For example, to get everyone’s Chinese score, use chineses = peoples[:]['chinese']. NumPy also provides math functions, such as np.mean for calculating the average.
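
A minimal sketch of such a structured array follows. The names persontype, peoples, and chineses come from the text; the field layout, the student names, and all scores are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical field layout: a text name, integer age and scores, float English score.
persontype = np.dtype([
    ("name", "U32"),
    ("age", "i4"),
    ("chinese", "i4"),
    ("math", "i4"),
    ("english", "f4"),
])

# Made-up records, one tuple per student.
peoples = np.array([
    ("Alice", 32, 75, 100, 90.0),
    ("Bob", 24, 85, 96, 88.5),
    ("Carol", 28, 85, 92, 96.5),
    ("Dave", 29, 65, 85, 100.0),
], dtype=persontype)

ages = peoples[:]["age"]          # select one field across all records
chineses = peoples[:]["chinese"]

print(np.mean(ages))      # 28.25
print(np.mean(chineses))  # 77.5
```

Indexing by field name returns an ordinary array of that field, so any NumPy function can be applied to it.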

Creating evenly spaced arrays

NumPy makes it easy to create evenly spaced arrays, for example with arange or linspace.

Both approaches produce the arithmetic progression [1, 3, 5, 7, 9].

arange() is similar to the built-in function range(). It creates a one-dimensional array holding an arithmetic progression from a start value, an end value, and a step size. The end value itself is not included.

linspace is short for linear space, meaning a linearly spaced vector. linspace() creates a one-dimensional array holding an arithmetic progression from a start value, an end value, and the number of elements. By default, the end value is included.
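
The two calls described above might look like this. It is a sketch that assumes the names x1 and x2 used later in the text; the arguments are chosen so that both produce the values 1, 3, 5, 7, 9.

```python
import numpy as np

x1 = np.arange(1, 11, 2)   # start 1, stop 11 (excluded), step 2
x2 = np.linspace(1, 9, 5)  # start 1, stop 9 (included), 5 elements

print(x1)  # [1 3 5 7 9]
print(x2)  # [1. 3. 5. 7. 9.]
```

Note that arange with integer arguments returns integers, while linspace returns floating-point numbers.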

Arithmetic operations

With NumPy, you can freely create arrays and then add, subtract, multiply, and divide them, as well as compute nth powers and remainders.

Taking the arrays x1 and x2 as an example, we can perform addition, subtraction, multiplication, division, exponentiation, and remainder operations between the two arrays. For exponentiation, the elements of x1 are the bases and the elements of x2 are the exponents.

In the remainder function, you can either use np.remainder(x1, x2) or np.mod(x1, x2), and the result is the same.
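
The operations above can be sketched as follows, assuming x1 and x2 both hold the values 1, 3, 5, 7, 9. Every operation is applied element by element.

```python
import numpy as np

x1 = np.arange(1, 11, 2)   # 1, 3, 5, 7, 9
x2 = np.linspace(1, 9, 5)  # 1.0, 3.0, 5.0, 7.0, 9.0

total = np.add(x1, x2)          # 2, 6, 10, 14, 18
diff = np.subtract(x1, x2)      # all zeros
product = np.multiply(x1, x2)   # 1, 9, 25, 49, 81
quotient = np.divide(x1, x2)    # all ones
powers = np.power(x1, x2)       # 1**1, 3**3, 5**5, 7**7, 9**9
rem = np.remainder(x1, x2)      # all zeros
mod = np.mod(x1, x2)            # same result as np.remainder

print(total, diff, product, quotient, powers, rem, mod, sep="\n")
```

Because the two arrays hold the same values here, the differences and remainders are all zero and the quotients are all one.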

Statistical function

If you want a clearer understanding of a set of data, you need to perform a descriptive statistical analysis: finding the maximum, minimum, and average values, checking whether the data follows a normal distribution, and computing the variance and standard deviation. These statistics give you a clearer picture of the data.

Let me introduce how to use these statistical functions in NumPy.

Maximum and minimum of an array or matrix: amax() and amin()

amin() calculates the minimum of the elements in an array along a specified axis. For a two-dimensional array a, amin(a) is the minimum of all elements in the array. amin(a, 0) is the minimum along axis=0: the elements are grouped as [1, 4, 7], [2, 5, 8], and [3, 6, 9], so the result is [1, 2, 3]. amin(a, 1) is the minimum along axis=1: the elements are grouped as [1, 2, 3], [4, 5, 6], and [7, 8, 9], so the result is [1, 4, 7]. Similarly, amax() is the maximum of the elements in the array along the specified axis.
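
The behavior described above can be sketched as follows, assuming a is the 3x3 array holding 1 through 9 implied by the values in the text.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print(np.amin(a))     # 1
print(np.amin(a, 0))  # [1 2 3]  minimum of each column
print(np.amin(a, 1))  # [1 4 7]  minimum of each row
print(np.amax(a))     # 9
print(np.amax(a, 0))  # [7 8 9]  maximum of each column
print(np.amax(a, 1))  # [3 6 9]  maximum of each row
```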

Difference between maximum and minimum: ptp()

For the same array a, np.ptp(a) gives the difference between the maximum and minimum values in the array, which is 9 - 1 = 8.

Similarly, ptp(a, 0) is the difference between the maximum and minimum along axis=0, i.e. 7 - 1 = 6 (likewise 8 - 2 = 6 and 9 - 3 = 6, the third row minus the first row), and ptp(a, 1) is the difference between the maximum and minimum along axis=1, i.e. 3 - 1 = 2 (likewise 6 - 4 = 2 and 9 - 7 = 2, the third column minus the first column).
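
These three results can be checked with a short sketch, again assuming a is the 3x3 array holding 1 through 9.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print(np.ptp(a))     # 8        9 - 1 over the whole array
print(np.ptp(a, 0))  # [6 6 6]  max minus min of each column
print(np.ptp(a, 1))  # [2 2 2]  max minus min of each row
```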

Percentiles of an array: percentile()
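
A minimal sketch of percentile(), assuming the same 3x3 array a used in the earlier sections. The second argument is the percentile to compute, from 0 to 100, so 0 gives the minimum, 50 the median, and 100 the maximum.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print(np.percentile(a, 0))           # 1.0  the minimum
print(np.percentile(a, 50))          # 5.0  the median
print(np.percentile(a, 100))         # 9.0  the maximum
print(np.percentile(a, 50, axis=0))  # [4. 5. 6.]  per column
print(np.percentile(a, 50, axis=1))  # [2. 5. 8.]  per row
```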

Median and mean of an array: median(), mean()
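
A short sketch of both functions. The 3x3 array a is the one assumed in the earlier sections; the array c is a hypothetical skewed example added to show a case where the median and the mean differ.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print(np.median(a))          # 5.0
print(np.median(a, axis=0))  # [4. 5. 6.]
print(np.mean(a))            # 5.0
print(np.mean(a, axis=1))    # [2. 5. 8.]

# With an outlier, the mean is pulled upward while the median stays put.
c = np.array([1, 2, 3, 100])
print(np.median(c))          # 2.5
print(np.mean(c))            # 26.5
```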

Weighted average of an array: average()

The average() function can be used to calculate a weighted average, in which each element is given its own weight. By default every element has the same weight, so for a = [1, 2, 3, 4], np.average(a) = (1 + 2 + 3 + 4) / 4 = 2.5. You can also specify a weight array wts = [1, 2, 3, 4], in which case the weighted average is np.average(a, weights=wts) = (1*1 + 2*2 + 3*3 + 4*4) / (1 + 2 + 3 + 4) = 3.0.
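
The arithmetic above can be reproduced directly, assuming a and wts both hold the values 1, 2, 3, 4 as in the text.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
wts = np.array([1, 2, 3, 4])

print(np.average(a))               # 2.5  equal weights
print(np.average(a, weights=wts))  # 3.0  (1*1 + 2*2 + 3*3 + 4*4) / (1 + 2 + 3 + 4)
```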

Conclusion

When learning NumPy, the main thing to master is the use of arrays, because this is the biggest difference between NumPy and standard Python. NumPy defines its own array type and provides both arithmetic and statistical operations on it.

If you want to learn more about how to analyze data, you are welcome to follow me and check out the previous articles, or you can find me on my Facebook page: Data Analysis Enthusiast.
