pandas and NumPy arrays explained

This blog post covers the NumPy and pandas array data objects, their main characteristics, and the differences between them.

What are NumPy and pandas?

NumPy is an open source Python library for scientific computing. It provides a host of features that let a Python programmer work with high-performance arrays and matrices. pandas is a package for data manipulation that brings DataFrame objects, modeled on those found in R (and in several R packages), to a Python environment.

NumPy and pandas are often used together: pandas relies heavily on the NumPy array to implement its own data objects, shares many of NumPy's features, and builds on the functionality NumPy provides. Both libraries belong to what is known as the SciPy stack, a set of Python libraries used for scientific computing. The Anaconda Scientific Python distribution from Continuum Analytics installs both pandas and NumPy as part of the default installation.

NumPy arrays

NumPy allows you to work with high-performance arrays and matrices. Its main data object is the ndarray, an N-dimensional array type which describes a collection of “items” of the same type. For example:

>>> import numpy as np  # import the library
>>> a1 = np.array([1, 2, 3, 4, 5])  # define the ndarray
>>> a1  # display the array
array([1, 2, 3, 4, 5])

ndarrays are stored more efficiently than Python lists and allow mathematical operations to be vectorized, which results in significantly higher performance than with looping constructs in Python.
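As a rough illustration of what vectorization means here, the sketch below squares the same five numbers twice: once with a Python loop over a list, and once with a single expression on an ndarray. The variable names are made up for this example, and the data is too small to show any speed difference; the point is only the difference in form.

```python
import numpy as np

values = [1, 2, 3, 4, 5]        # plain Python list
arr = np.array(values)          # the same data as an ndarray

# Loop-based approach: square each element one at a time
squares_loop = [v * v for v in values]

# Vectorized approach: one expression applied to the whole array
squares_vec = arr * arr

print(squares_loop)   # [1, 4, 9, 16, 25]
print(squares_vec)    # [ 1  4  9 16 25]
```

On large arrays the vectorized form is usually much faster, because the element-wise work runs in compiled code rather than in the Python interpreter.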

NumPy arrays allow for selecting array elements, logical operations, slicing, reshaping, combining (also known as “stacking”), splitting as well as a number of numerical methods (min, max, mean, standard deviation, variance and more). All these concepts can be applied to pandas objects, which extend these capabilities to provide a much richer and more expressive means of representing and manipulating data than is offered with NumPy arrays.
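The operations listed above can be sketched in a few lines. This is a minimal example on a small, made-up array; np.vstack and np.hsplit are just one of several ways to stack and split.

```python
import numpy as np

a = np.arange(1, 7)                  # array([1, 2, 3, 4, 5, 6])

first = a[0]                         # select by position -> 1
middle = a[1:4]                      # slicing -> array([2, 3, 4])
evens = a[a % 2 == 0]                # logical selection -> array([2, 4, 6])

grid = a.reshape(2, 3)               # reshape to 2 rows x 3 columns
stacked = np.vstack([grid, grid])    # stacking -> shape (4, 3)
left, right = np.hsplit(grid, [1])   # splitting -> shapes (2, 1) and (2, 2)

# A few of the built-in numerical methods
print(a.min(), a.max(), a.mean())    # 1 6 3.5
print(a.std(), a.var())              # standard deviation and variance
```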

pandas Series Object

The Series is the primary building block of pandas. A Series represents a one-dimensional, labeled, indexed array based on the NumPy ndarray. Like an array, a Series can hold zero or more values of any single data type. A Series can be created and initialized by passing a scalar value, a NumPy ndarray, a Python list, or a Python dict as the data parameter of the Series constructor.
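The four ways of creating a Series can be sketched as follows. The values and labels are invented for illustration only.

```python
import numpy as np
import pandas as pd

s_list = pd.Series([10, 20, 30])                  # from a Python list
s_array = pd.Series(np.array([1.5, 2.5, 3.5]))    # from a NumPy ndarray
s_dict = pd.Series({"a": 1, "b": 2, "c": 3})      # from a dict; keys become the labels
s_scalar = pd.Series(7, index=["x", "y", "z"])    # a scalar, repeated for each label

print(s_dict)
print(s_scalar)
```

When no index is given, as with the list and the ndarray, pandas assigns the default integer labels 0, 1, 2 and so on.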

Differences between ndarrays and Series Objects

There are some differences worth noting between ndarrays and Series objects. First, elements in NumPy arrays are accessed by their integer position, starting with zero for the first element. A pandas Series is more flexible, as you can define your own labeled index and use it to access elements. The labels can be letters instead of numbers, or numbers in descending rather than ascending order. Second, a Series aligns data by label automatically, so combining data from different Series and dealing with missing values is simpler than with ndarrays. If a label has no match during alignment, pandas returns NaN (not a number) for it so that the operation does not fail.
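Both differences can be seen in a short sketch. The two Series below are made up for this example and share only the labels "b" and "c".

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

by_label = s1["b"]          # access by label -> 2
by_position = s1.iloc[0]    # access by integer position -> 1

# Addition aligns on labels; labels present in only one Series give NaN
total = s1 + s2
print(total)
# a     NaN
# b    12.0
# c    23.0
# d     NaN
# dtype: float64
```

The result is converted to floating point because NaN is a float value.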

Source: “Learning pandas”, Michael Heydt (Packt Publishing).