An Introductory Look on NumPy and Pandas

Mehmet Can Aydın
11 min read · Jun 20, 2022


NumPy and Pandas are two of the most popular modules in Python. Both are core components of machine learning and neural network studies. This article takes these modules on board and summarizes their main features.

NumPy

Python was developed by Guido van Rossum and first released in the early 1990s as an open source programming language. With the growing interest in Python, users contributed their work to the community. NumPy is also an open source project, created by Travis Oliphant in 2005. Before its release, older libraries called “Numeric” and “Numarray” were in use for numerical and matrix operations; NumPy, however, provided high-level functions and greater flexibility for operations on large, multidimensional arrays. Today, NumPy is an irreplaceable component of artificial intelligence studies, especially in computer vision projects.

Arrays, the fundamental data type of NumPy, are simple yet useful for gathering data points together. A NumPy array supports only a single data type among its elements. In other words, an array cannot mix data types the way a Python list can. This main characteristic makes arrays cheaper in terms of memory, which is one reason they are found throughout machine learning studies. When working on massive amounts of data, a practical data structure like the NumPy array is required for fast calculations.
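The single-dtype rule can be observed directly: when mixed types are passed in, NumPy upcasts everything to one common type, and memory use is simply the element count times a fixed item size. A minimal sketch (exact dtype names may vary by platform):

```python
import numpy as np

# Mixed inputs are coerced to a single common dtype
b = np.array([1, 2.5, 3])
print(b.dtype)                           # float64: the integers were upcast

# Memory layout is compact: element count * fixed item size
a = np.arange(12)
print(a.itemsize * a.size == a.nbytes)   # True
```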

Structurally, arrays can be built in various ways, from a single line of integers to multidimensional sets of lists.

Representation of arrays with 1, 2 and 3 dimensions

To take a closer look at how arrays can be created, how their shapes can be manipulated and what operations can be done on them, small segments of codes are given below.

>>> import numpy as np
>>> arr = np.array(range(0,12))
>>> arr
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

Note that this is a one-dimensional array of integers, built with the range() function and stored in a variable called “arr”. This array can be reshaped into 2D or 3D arrays.

>>> arr.reshape((2,6))
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11]])
>>> arr.reshape((2,2,3))
array([[[ 0,  1,  2],
        [ 3,  4,  5]],

       [[ 6,  7,  8],
        [ 9, 10, 11]]])
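The current shape of an array can be inspected through its attributes at any point, which is a handy sanity check before and after a reshape; a short sketch:

```python
import numpy as np

arr = np.array(range(0, 12)).reshape((2, 2, 3))
print(arr.ndim)               # 3 dimensions
print(arr.shape)              # (2, 2, 3)
print(arr.size)               # 12 elements in total
print(arr.reshape(-1).shape)  # (12,): -1 flattens back to one dimension
```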

Arrays can be built in different ways, not only from Python lists but also with NumPy’s own built-in methods.

>>> #generating 5 random integers ranging from 0 to 10
>>> np.random.randint(0,10,5)

array([4, 4, 7, 9, 0])
>>> #generating 4 random samples from standard normal distribution
>>> np.random.randn(4)
array([0.67531072, 1.05742167, 0.1285356 , 1.36193221])
>>> #generating 5 linearly spaced values in between 0 and 10
>>> np.linspace(0,10,5)
array([ 0. , 2.5, 5. , 7.5, 10. ])
>>> #generating an array of zeros with 5 elements
>>> np.zeros(5)
array([0., 0., 0., 0., 0.])

NumPy is very popular because it contains many mathematical methods that can be applied to arrays very easily. Some examples are:

>>> arr = np.array(range(0,12))
>>> arr
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])
>>> arr.sum() #sum of the values in array
66
>>> arr.min() #minimum value of array
0
>>> arr.max() #maximum value of array
11
>>> arr.std() #standard deviation of values in array
3.452052529534663
>>> arr.var() #variance of values in array
11.916666666666666
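On multidimensional arrays, the same reductions accept an axis argument, so row-wise or column-wise statistics can be taken separately; a brief illustration:

```python
import numpy as np

m = np.array(range(0, 12)).reshape((2, 6))
print(m.sum())        # 66: over all elements
print(m.sum(axis=0))  # column sums: [ 6  8 10 12 14 16]
print(m.sum(axis=1))  # row sums: [15 51]
```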

Computer vision is one of the areas where NumPy arrays are used most commonly. Pixels are the building blocks of images, and each one carries the color information of a single point. Pixels are combinations of the primary colors Red, Green and Blue, each of which can be expressed as a numerical value. Therefore, an image can be represented as a multidimensional array in which each color layer is stored individually. Each dimension carries the information of one primary color, and when those layers are superposed, the image is displayed in its original colors. In that way, image processing applications such as blurring reduce to basic matrix operations.

Representation of the cat in a NumPy array
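As a toy illustration of this idea (the values below are random numbers, not the image above), an RGB “image” is just a 3D array of height × width × 3 color channels, and a crude blur can be written as plain array arithmetic:

```python
import numpy as np

# A hypothetical 4x4 RGB "image" with 0-255 intensities per channel
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)
print(img.shape)       # (4, 4, 3): height, width, color channels

# A crude horizontal box blur: average each pixel with its two neighbours.
# Real image processing would use a proper 2D convolution instead.
blurred = (img[:, :-2].astype(float) + img[:, 1:-1] + img[:, 2:]) / 3
print(blurred.shape)   # (4, 2, 3): edge pixels are dropped here
```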

Pandas

Pandas is another fundamental library with an important place in data science. It was created by Wes McKinney in response to the need for a high performance data analytics tool. The reason behind its popularity in data science is that it provides a large variety of practical tools for analyzing data and working with tables. The Pandas DataFrame object is a 2-dimensional data structure which stores different kinds of data points in tabular form within rows and columns. These 2D frames are built on Pandas Series, which closely resemble NumPy arrays. The main difference between Series and arrays is that Pandas stores an index right next to the data elements. Also, arrays can be multidimensional, whereas Pandas objects are built for data analysis in 2 dimensions. To take a closer look at Pandas DataFrames, the famous Iris dataset is used below.
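The index difference can be seen by building a Series directly from an array; the labels used here are made up for illustration:

```python
import numpy as np
import pandas as pd

arr = np.array([5.1, 4.9, 4.7])
s = pd.Series(arr, index=["a", "b", "c"], name="sepal_length")
print(s["b"])           # 4.9: label-based access via the stored index
print(type(s.values))   # the underlying data is still a NumPy array
```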

>>> import seaborn as sns
>>> import pandas as pd
>>> df = sns.load_dataset("iris")
>>> type(df)

<class 'pandas.core.frame.DataFrame'>

Note that the data is imported from another library called “Seaborn”, which is used for data visualization. However, Pandas supports reading data from a wide range of file formats such as .json, .csv or .xlsx.
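For instance, pd.read_csv() accepts a file path or any file-like object; a minimal sketch using an in-memory buffer in place of an actual file on disk:

```python
import pandas as pd
from io import StringIO

# A tiny CSV standing in for a file; a path like "iris.csv" works the same way
raw = "sepal_length,species\n5.1,setosa\n7.0,versicolor\n"
df = pd.read_csv(StringIO(raw))
print(df.shape)   # (2, 2)
```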

Pandas has very useful functions for taking a first look at a dataset and observing descriptive statistics.

>>> df.shape          #returns back the shape information
(150, 5)
>>> df.info() #gives information about columns and rows
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object
>>> df.describe().T #performs descriptive statistics
count mean std min 25% 50% 75% max
sepal_length 150.0 5.843333 0.828066 4.3 5.1 5.80 6.4 7.9
sepal_width 150.0 3.057333 0.435866 2.0 2.8 3.00 3.3 4.4
petal_length 150.0 3.758000 1.765298 1.0 1.6 4.35 5.1 6.9
petal_width 150.0 1.199333 0.762238 0.1 0.3 1.30 1.8 2.5
>>> df["species"].value_counts() #returns values found in a column
setosa 50
versicolor 50
virginica 50
>>> df.head(3) #returns first 3 rows of dataset
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa

With the functions above, the dataset is summarized within a few rows of output. The Iris dataset contains sepal and petal characteristics of different species of flowers. It is seen that the dataset contains 150 rows and 5 columns. Four of the columns hold the flowers’ characteristics in numeric form, while the remaining one stores which species carries those characteristics. It is also observed that there are 3 different species, setosa, versicolor and virginica, present in equal proportions. Now that the dataset is introduced, the functionalities of Pandas can be observed.

Operations on Variables

There can always be a need to perform specific operations on variables. Feature engineering is one of the main concepts in which variables are examined to extract features from the data at hand, so variables are subjected to various operations in feature engineering studies. Many other applications of data science, such as outlier detection, also require the rearrangement and transformation of variables. Pandas makes it very practical to perform operations on variables and save the results in new columns.

>>> df[["sepal_length", "sepal_width"]].head(3)
sepal_length sepal_width
0 5.1 3.5
1 4.9 3.0
2 4.7 3.2
>>> df["sepal_length"].std()
0.8280661279778629
>>> df["sepal_length"].max() - df["sepal_length"].min()
3.6000000000000005
>>> df["sepal_magnitude"] = df["sepal_length"] * 0.5 + df["sepal_width"] * 0.5
>>> df["sepal_magnitude"].head(3)
0 4.30
1 3.95
2 3.95

The code segment above focuses on the lengths and widths of sepals. DataFrame objects are made of Series, and mathematical operations on Series are easy to perform. As discussed before, statistical operations on Series resemble their use on arrays: the standard deviation and the range between the maximum and minimum sepal lengths are calculated with the same convenience here. Also, a new variable on the sepal characteristics of the plants is constructed under the name sepal magnitude. This symbolic variable contains the arithmetic mean of the length and width values of the sepals.

Subsetting and Selection

Subsetting and selection come in handy under many different circumstances, and Pandas is well equipped with features for reaching targeted data points. The iloc[] and loc[] indexers are very common in this case. The difference between the two is that iloc stands for integer location and expects row and column numbers, whereas loc stands for location and accepts column names as strings. In other words, loc[] is label based while iloc[] is index based. Both are useful in definite circumstances for accessing targeted data. Selection can also be performed with the indexing operator, and multiple conditions can be combined when selecting data.

>>> df.loc[0:2, ["sepal_length", "sepal_width"]]
sepal_length sepal_width
0 5.1 3.5
1 4.9 3.0
2 4.7 3.2
>>> df.iloc[0:3,0:2]
sepal_length sepal_width
0 5.1 3.5
1 4.9 3.0
2 4.7 3.2
>>> df[(df["sepal_length"] > 5.5) & (df["species"] == "setosa")]
sepal_length sepal_width petal_length petal_width species
14 5.8 4.0 1.2 0.2 setosa
15 5.7 4.4 1.5 0.4 setosa
18 5.7 3.8 1.7 0.3 setosa

Above, the selection of the first three rows of the sepal length and width columns is performed by both loc[] and iloc[]. Note that both accepted row numbers with index slicing, where loc[] included the upper bound that is excluded by iloc[]. This demonstrates the difference between label-based and index-based indexing. However, the main difference between the two is observed in the selection of columns, where iloc[] accepted column numbers and loc[] accepted a list of column names. At the bottom, a selection based on the criteria of a sepal length higher than 5.5 within the setosa species is performed. As seen, the criteria are expressed within parentheses and combined with an ampersand, which refers to the “and” operator. Other popular operators are the “or” operator, expressed with a vertical bar “|”, and the “not” operator, expressed with a tilde “~”.
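The “or” and “not” operators work the same way; a small sketch on a miniature stand-in table (the values below are made up, not taken from the Iris data):

```python
import pandas as pd

df = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3, 4.9],
    "species": ["setosa", "versicolor", "virginica", "setosa"],
})

# "or": sepal length below 5.0 OR above 6.5
either = df[(df["sepal_length"] < 5.0) | (df["sepal_length"] > 6.5)]
print(either["species"].tolist())   # ['versicolor', 'setosa']

# "not": every row that is not setosa
others = df[~(df["species"] == "setosa")]
print(others["species"].tolist())   # ['versicolor', 'virginica']
```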

Apply and Lambda Expressions

In major studies, it is often required to perform more complex operations on variables. In such cases, Python’s ability to generate anonymous functions with lambda expressions is used widely. A lambda expression is a one-line function with no name, meant to be used briefly. Lambda expressions enable implementing desired operations on all data listed under a variable with the apply() method. Within apply(), lambda expressions can be defined, or any other function can be used on variables. This kind of application is very common in data preprocessing, where data is rearranged and manipulated for better machine learning results.

>>> df.loc[:, ["sepal_length"]].apply(lambda x: x/10).head(3)
sepal_length
0 0.51
1 0.49
2 0.47
>>> df.loc[:,["sepal_length"]].apply(lambda x: x / x.max()).head(3)
sepal_length
0 0.645570
1 0.620253
2 0.594937

Notice here that the apply() method returns a new DataFrame after the desired operation is performed on the variables, so all DataFrame methods can be chained right after apply(), just like the head() method used above to list only the first 3 elements of the result. The first command divides all data in the sepal length column by 10. The second command demonstrates a scaling application: within apply(), a lambda expression divides all values by the maximum value found in the sepal length column. Thus, the scale of this column is reduced to [0, 1], where 0 would represent any 0 among the data and 1 represents the maximum value.
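apply() is not limited to lambda expressions; any named function that maps a Series to a Series works as well, e.g. a hypothetical min-max scaler:

```python
import pandas as pd

df = pd.DataFrame({"sepal_length": [5.1, 4.9, 4.7]})

def min_max_scale(col):
    # Map a numeric column onto the [0, 1] range
    return (col - col.min()) / (col.max() - col.min())

scaled = df[["sepal_length"]].apply(min_max_scale)
print(scaled["sepal_length"].round(2).tolist())   # [1.0, 0.5, 0.0]
```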

Grouping and Aggregation

For statistical analysis, data is grouped by categories and subjected to aggregation functions like counting or summing. In such procedures, the groupby() and agg() methods are used very often. The groupby() method splits the whole DataFrame by the categories listed under the given categorical variable, so the dataset can be analyzed in terms of categories. However, without any operation on the split data, groupby() does not return any values, since the data points are not aggregated yet. At that point, the agg() method can be used to perform any aggregation function on any numerical column. This method expects a dictionary whose keys state the target columns and whose values state the desired functions.

>>> df.groupby("species").agg({"sepal_length": ["mean", "std"]})
sepal_length
mean std
species
setosa 5.006 0.352490
versicolor 5.936 0.516171
virginica 6.588 0.635880
>>> df.groupby("species").agg({"sepal_width": ["mean", "std"]})
sepal_width
mean std
species
setosa 3.428 0.379064
versicolor 2.770 0.313798
virginica 2.974 0.322497

With the commands above, all data is grouped by species as setosa, versicolor and virginica. Within species, the mean and standard deviation of sepal lengths are calculated by the first command. The virginica species is observed to have the longest sepals, while the setosa species carries the shortest. The second command applies the same operation to sepal widths: now it is observed that setosa has the widest sepals, and the narrowest belong to versicolor.

Another thing to discuss here is the pivot-table-like output of these commands. Pivot tables are useful structures for summarizing statistical results based on categories. It can be seen that numerical columns can be split into subcolumns by applying multiple aggregation functions. Similarly, rows can be split into subgroups by selecting multiple categorical variables. However, there is only one categorical variable in this dataset, the one containing the species information. In this instance, a numerical variable can be used to generate a categorical one: the numerical variable is sliced into ranges, each data point falls into one of the generated ranges, and a new column specifying that range can be added. This process is not as hard as it sounds with the cut() method of Pandas, which expects the numerical variable to be sliced and the number of ranges desired.

>>> df["sepal_range"] = pd.cut(df["sepal_length"], 3)
>>> df["sepal_range"].value_counts()
(5.5, 6.7] 71
(4.296, 5.5] 59
(6.7, 7.9] 20
>>> df.groupby(["species", "sepal_range"]).agg({"sepal_width": ["mean", "count"]})
sepal_width
mean count
species sepal_range
setosa (4.296, 5.5] 3.387234 47
(5.5, 6.7] 4.066667 3
(6.7, 7.9] NaN 0
versicolor (4.296, 5.5] 2.463636 11
(5.5, 6.7] 2.841667 36
(6.7, 7.9] 3.033333 3
virginica (4.296, 5.5] 2.500000 1
(5.5, 6.7] 2.909375 32
(6.7, 7.9] 3.123529 17

Above, the sepal length column is sliced into 3 ranges and saved in the new sepal range column. With the value_counts() method, the ranges are observed to be (4.296, 5.5], (5.5, 6.7] and (6.7, 7.9], so the data can be categorized depending on sepal length. With the last groupby application, a better instance of a pivot table can be observed. Rows are grouped primarily by species and then by the sepal ranges generated with cut(). Note that the grouping order depends on the order of the list inside groupby(), in which species comes first. The average sepal widths and counts of the specified groups can be seen. Under the mean column there is a NaN value, since there is not a single setosa flower with a sepal length between 6.7 and 7.9. Within the setosa species, most sepals stay in the length range (4.296, 5.5], as there are 47 elements under that category. For the other species it is the (5.5, 6.7] range, with 36 versicolor and 32 virginica flowers.
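The same kind of grouped summary can also be produced with Pandas’ pivot_table() function, which spells out the pivot-table idea directly; a small sketch on made-up values:

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["setosa", "setosa", "versicolor", "versicolor"],
    "sepal_length": [5.0, 5.2, 6.0, 6.8],
})

# index -> row categories, aggfunc -> the statistics computed per group
table = pd.pivot_table(df, values="sepal_length", index="species",
                       aggfunc=["mean", "max"])
print(table)
```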

In this article, I attempted to reflect my experience with the NumPy and Pandas libraries as an introductory look at them. Thanks for reading!
