Beginner’s Guide to Data Analysis using numpy and pandas

Published in Analytics Vidhya

8 min read · Jul 26, 2020

Oftentimes, we tend to forget that the pandas library is built on top of the numpy package. In this guide, we take advantage of the fact that much of the numpy functionality carries over to pandas objects.

Incorporating the necessary packages

To be able to make full use of the power of both pandas and numpy, we must import the necessary packages. As is the well-known convention, we rename them appropriately:

pandas renamed as pd; numpy renamed as np
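As a minimal sketch, the conventional imports look like this. Printing the version numbers is an extra step added here only to confirm that both packages load:

```python
import numpy as np   # numpy under its conventional alias
import pandas as pd  # pandas under its conventional alias

# Optional sanity check: show which versions were imported
print("numpy :", np.__version__)
print("pandas:", pd.__version__)
```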

In case we do not have these packages installed, we can install them through the terminal by typing the following command(s):

pip install pandas    # try pip3 if necessary
pip install numpy     # try pip3 if necessary

Once the packages have been imported and renamed, we must refer to them as pd (for pandas) and np (for numpy). Using the full names pandas or numpy instead raises a NameError, because only the aliases are defined.

Creating DataFrame object

A DataFrame can be created from a list, a dictionary or even a numpy array. We populate a numpy array with random integers and build a DataFrame object out of it:

5 x 3 numpy array filled with random integers

Using the randint( ) function from the random module of numpy, we create a numpy array having 5 rows and 3 columns. The first and second arguments to randint( ) denote the lower bound and upper bound respectively of the range from which the numbers are drawn. The shape is passed as a tuple in the third argument. Because the upper bound is exclusive, the random integers range from 10 to 49 (50 - 1). We now pass the array as an argument to DataFrame( ), resulting in the creation of a DataFrame object:
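The two steps described here can be sketched as follows. The names arr and df are our own choices, and the seed is an assumption added only so that reruns print the same numbers:

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumption: seeded purely for repeatable output

# low=10 (inclusive), high=50 (exclusive), shape=(5 rows, 3 columns)
arr = np.random.randint(10, 50, (5, 3))
print(arr)

# Wrap the array in a DataFrame; row and column headers are auto-generated
df = pd.DataFrame(arr)
print(df)
```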

To display the content of df as the underlying numpy array, we use the values attribute of the DataFrame:

Invoking **values** attribute on df returns the numpy array
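A short sketch of this, rebuilding df so that the snippet runs on its own. The to_numpy( ) method mentioned in the comment is the spelling recommended in newer pandas documentation and returns the same kind of object:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(10, 50, (5, 3)))

arr = df.values        # newer code often writes df.to_numpy() instead
print(type(arr))       # a numpy ndarray
print(arr)
```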

The row headers (0, 1, 2, 3, 4) are auto-generated and are in the form of a sequence; so are the column headers (0, 1, 2). To get the row headers, which in this case is an auto-generated sequence, we use the index attribute:

Valid row headers range from 0 to 4 with a step size of 1

To fetch column headers, which also is an auto-generated sequence, we use the columns attribute:

Valid column headers range from 0 to 2 with a step size of 1

Mind you, a sequence goes up to but not including the stop value. Therefore, for both row and column sequences, the stop parameter is 1 more than the last value.
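To see both auto-generated sequences for ourselves, a sketch along these lines should do (df is rebuilt here so the snippet is self-contained):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(10, 50, (5, 3)))

print(df.index)     # row headers:    start=0, stop=5, step=1
print(df.columns)   # column headers: start=0, stop=3, step=1

# The stop value is excluded, so the actual labels end one short of it
print(list(df.index))    # [0, 1, 2, 3, 4]
print(list(df.columns))  # [0, 1, 2]
```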

Since the values attribute returns a numpy array, we can index and/or slice it in the same way we would index and/or slice any numpy array. The general form is:

df.values[row_index, column_index]  # indexing
df.values[row_start:row_stop, col_start:col_stop]  # slicing

Display all columns of second row (row index = 1)

Display all columns of last row (row index = 4). A single value within [ ], like the one shown above, denotes all columns of the row index passed inside [ ]

Display all rows of second column (column index = 1)

All the slices that we see above are numpy arrays:

The **type( )** function confirms our claim

We can also access a particular element of the DataFrame:

Specifying the row index as well as the column index gives the element at their intersecting point
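The indexing and slicing operations from this section can be sketched together. To keep the results predictable, this example assumes a fixed array (10 to 24) rather than random integers:

```python
import numpy as np
import pandas as pd

# Assumption: fixed illustrative values instead of random ones
df = pd.DataFrame(np.arange(10, 25).reshape(5, 3))

second_row = df.values[1]      # all columns of row index 1
last_row = df.values[4]        # all columns of row index 4
second_col = df.values[:, 1]   # all rows of column index 1
element = df.values[1, 1]      # single element at row 1, column 1

print(second_row)         # [13 14 15]
print(last_row)           # [22 23 24]
print(second_col)         # [11 14 17 20 23]
print(element)            # 14
print(type(second_col))   # every slice is a numpy ndarray
```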

Assigning manual row headers and column headers

Creating a DataFrame object from a numpy array built using random integers from 10 to 49

The row and column headers are auto-generated. We can come up with our own headers as well:

Row labels range from R1 to R5. Column labels range from C1 to C3
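One way to assign our own headers is to overwrite the index and columns attributes, as in the sketch below. Passing index= and columns= directly to DataFrame( ) would work as well:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(10, 50, (5, 3)))

# Replace the auto-generated headers with manual labels
df.index = ["R1", "R2", "R3", "R4", "R5"]
df.columns = ["C1", "C2", "C3"]
print(df)
```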

Explicit indexing works on DataFrame objects

Using row and column labels, along with the attribute loc, we can extract any element from our DataFrame:

Intersection of R2 and C2 is the element 38
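A sketch of label-based lookup with loc. The numbers below are made up for illustration, chosen so that the element at R2 and C2 is 38 as in the caption; they are not the values from the original random array:

```python
import numpy as np
import pandas as pd

# Assumption: hand-picked illustrative values (the original array was random)
data = np.array([[12, 45, 23],
                 [31, 38, 17],
                 [29, 14, 41],
                 [36, 22, 48],
                 [19, 33, 27]])
df = pd.DataFrame(data,
                  index=["R1", "R2", "R3", "R4", "R5"],
                  columns=["C1", "C2", "C3"])

value = df.loc["R2", "C2"]   # row label first, then column label
print(value)                 # 38
```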

Implicit indexing works on DataFrame objects

In addition to labels R1 to R5 and C1 to C3, there are implicit row headers 0 to 4 which correspond to R1 to R5 respectively (0 for R1, 1 for R2, 2 for R3 and so on) and implicit column headers 0 to 2 which correspond to C1 to C3 respectively (0 for C1, 1 for C2 and 2 for C3). These can be made use of in conjunction with attribute iloc to fetch any element of our DataFrame:

Row index 1 corresponds to R2, Column index 1 corresponds to C2. Their intersection corresponds to 38
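The same lookup by position can be sketched with iloc. As before, the values are made up for illustration so that the element at position (1, 1) is 38:

```python
import numpy as np
import pandas as pd

# Assumption: hand-picked illustrative values (the original array was random)
data = np.array([[12, 45, 23],
                 [31, 38, 17],
                 [29, 14, 41],
                 [36, 22, 48],
                 [19, 33, 27]])
df = pd.DataFrame(data,
                  index=["R1", "R2", "R3", "R4", "R5"],
                  columns=["C1", "C2", "C3"])

by_position = df.iloc[1, 1]      # row position 1, column position 1
by_label = df.loc["R2", "C2"]    # the same cell, addressed by labels
print(by_position, by_label)     # 38 38
```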

Subset of a DataFrame using implicit slicing

The iloc attribute helps to slice data as well. It is exclusive of the upper limit:

Fetching elements from rows R2 and R3 and columns C1 and C2
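A minimal sketch of positional slicing, assuming a fixed array (1 to 15) so that the result is easy to check by eye:

```python
import numpy as np
import pandas as pd

# Assumption: fixed illustrative values instead of random ones
df = pd.DataFrame(np.arange(1, 16).reshape(5, 3),
                  index=["R1", "R2", "R3", "R4", "R5"],
                  columns=["C1", "C2", "C3"])

# Rows at positions 1 and 2 (stop 3 excluded), columns at positions 0 and 1
subset = df.iloc[1:3, 0:2]
print(subset)
#     C1  C2
# R2   4   5
# R3   7   8
```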

Subset of a DataFrame using explicit slicing

The loc attribute helps to slice data too. Here, we use the row and column headers that are manually assigned. Unlike iloc, it is inclusive of both the lower and upper limits:

Fetching elements from first 4 rows and first 2 columns
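The label-based counterpart can be sketched the same way, again assuming a fixed array (1 to 15). Note that both "R4" and "C2" are included in the result:

```python
import numpy as np
import pandas as pd

# Assumption: fixed illustrative values instead of random ones
df = pd.DataFrame(np.arange(1, 16).reshape(5, 3),
                  index=["R1", "R2", "R3", "R4", "R5"],
                  columns=["C1", "C2", "C3"])

# Label slices include BOTH endpoints: R1..R4 and C1..C2
subset = df.loc["R1":"R4", "C1":"C2"]
print(subset)
print(subset.shape)   # (4, 2)
```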

Relationship between DataFrame and Series

A DataFrame is a collection of Series objects. Every row in a DataFrame can be thought of as a Series object with column labels. Every column in a DataFrame can be thought of as a Series object with row labels. This can be established by checking the type:

R1 is a Series object with column labels C1 to C3

C1 is a Series object with row labels R1 to R5
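A quick way to check this, sketched on a small DataFrame with assumed fixed values:

```python
import numpy as np
import pandas as pd

# Assumption: fixed illustrative values instead of random ones
df = pd.DataFrame(np.arange(1, 16).reshape(5, 3),
                  index=["R1", "R2", "R3", "R4", "R5"],
                  columns=["C1", "C2", "C3"])

row = df.loc["R1"]   # one row    -> Series labelled by C1..C3
col = df["C1"]       # one column -> Series labelled by R1..R5

print(type(row), list(row.index))
print(type(col), list(col.index))
```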

Transposing a DataFrame

We can interchange rows and columns of a DataFrame object:

The T attribute transposes a DataFrame object
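A small sketch of transposition, assuming fixed values so that one cell can be tracked before and after:

```python
import numpy as np
import pandas as pd

# Assumption: fixed illustrative values instead of random ones
df = pd.DataFrame(np.arange(1, 16).reshape(5, 3),
                  index=["R1", "R2", "R3", "R4", "R5"],
                  columns=["C1", "C2", "C3"])

transposed = df.T    # rows become columns and vice versa
print(transposed)

# The cell at (R2, C1) in df now sits at (C1, R2)
print(df.loc["R2", "C1"], transposed.loc["C1", "R2"])   # 4 4
```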

Finding the shape of a DataFrame object

The shape attribute returns a tuple containing the number of rows and number of columns present in a DataFrame object in that order:

We have 5 rows and 3 columns as reported by shape
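Since shape is an ordinary tuple, it can also be unpacked into two variables. A minimal sketch (the names n_rows and n_cols are our own):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(10, 50, (5, 3)))

print(df.shape)              # (5, 3)
n_rows, n_cols = df.shape    # unpack rows first, then columns
print(n_rows, n_cols)        # 5 3
```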

Generalizing DataFrame creation

Let us devise a function which will take in parameters such as total number of rows, total number of columns and others and return a DataFrame object:

User-defined function to create a DataFrame

Here, we use random integers to generate a numpy array, which is passed to the DataFrame constructor of the pandas library. If we invoke this function with only two arguments, upper_limit falls back to its default value of 10. Otherwise, the argument passed overrides the default.

We do not rely on auto-generated row and column headers. For row headers, we assign a list built with a list comprehension to the index attribute; for column headers, we do the same with the columns attribute. Each label concatenates the string “R” (for rows) or “C” (for columns) with the corresponding row or column number. These numbers come from a sequence that starts at 1 and stops just before 1 more than the number of rows or columns.

We accept rows and columns as user inputs and generate a DataFrame accordingly:

Random integers are generated from 0 to 10 (default upper limit)

If we pass a third argument, it behaves as the upper limit of the random sequence being produced:

The third argument 20 overrides the default value of 10
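A function matching this description might look like the sketch below. The function name create_df is hypothetical, the lower bound of 0 is taken from the caption above, and we assume upper_limit is handed straight to randint( ), which would make it exclusive. Fixed arguments stand in for the user inputs so that the snippet runs unattended:

```python
import numpy as np
import pandas as pd

def create_df(rows, columns, upper_limit=10):
    """Hypothetical helper: random-integer DataFrame with R*/C* headers."""
    # Assumption: integers are drawn from 0 up to (not including) upper_limit
    arr = np.random.randint(0, upper_limit, (rows, columns))
    df = pd.DataFrame(arr)
    # Labels R1..Rn and C1..Cm built with list comprehensions
    df.index = ["R" + str(i) for i in range(1, rows + 1)]
    df.columns = ["C" + str(j) for j in range(1, columns + 1)]
    return df

small = create_df(4, 2)       # upper_limit falls back to 10
wide = create_df(3, 5, 20)    # third argument overrides the default
print(small)
print(wide)
```

To take the sizes from the user instead, the two numbers could be read with int(input( )) before calling the function.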

Creating DataFrame from Series objects

A DataFrame can also be created from Series objects passed as key-value pairs of a dictionary:

Using Series objects matches and goals, we create a new DataFrame
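The construction can be sketched as follows. The player labels, the numbers and the column names "Matches" and "Goals" are placeholders invented for this example; they are not the article's actual data on the five players:

```python
import pandas as pd

# Assumption: placeholder players and figures, NOT the article's real data
players = ["Player A", "Player B", "Player C"]
matches = pd.Series([100, 200, 150], index=players)
goals = pd.Series([30, 80, 45], index=players)

# Dictionary keys become column names; the shared index becomes the row labels
man_utd_df = pd.DataFrame({"Matches": matches, "Goals": goals})
print(man_utd_df)
```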

Accessing individual columns by names

Every column of the man_utd_df DataFrame is a Series object, and every column name acts like a dictionary key. We can retrieve the content of an individual column by passing the key (column name) inside [ ]:
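A minimal sketch of column access, using placeholder players and figures rather than the article's real data:

```python
import pandas as pd

# Assumption: placeholder players and figures, NOT the article's real data
players = ["Player A", "Player B", "Player C"]
man_utd_df = pd.DataFrame({
    "Matches": pd.Series([100, 200, 150], index=players),
    "Goals": pd.Series([30, 80, 45], index=players),
})

goals_column = man_utd_df["Goals"]   # dictionary-style access by column name
print(type(goals_column))            # a pandas Series
print(goals_column)
```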

Adding row to a DataFrame

The data that we have on five Manchester United legends can yield useful information: the total goals scored and total matches played by them together. Using loc, we can specify the name of the new row. The sum( ) method applied along axis 0 (down the rows of each column) gives the total for every column:

Scored 530 goals and played 1358 matches together
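Adding the Total row can be sketched like this. The figures are placeholders, so the totals printed here will not match the 530 goals and 1358 matches reported above:

```python
import pandas as pd

# Assumption: placeholder players and figures, NOT the article's real data
players = ["Player A", "Player B", "Player C"]
man_utd_df = pd.DataFrame({
    "Matches": pd.Series([100, 200, 150], index=players),
    "Goals": pd.Series([30, 80, 45], index=players),
})

# axis=0 adds up the values down each column; loc creates the new row label
man_utd_df.loc["Total"] = man_utd_df.sum(axis=0)
print(man_utd_df)
```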

Dropping a row from a DataFrame

After finding the sum total of goals scored and matches played, we may want to remove the row Total to enable further computation along columns. The drop( ) function comes in handy. We set the inplace parameter to True so that changes are reflected on the original DataFrame:
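A sketch of removing the row again, built on placeholder figures. Without inplace=True, drop( ) would leave man_utd_df untouched and return a new DataFrame instead:

```python
import pandas as pd

# Assumption: placeholder players and figures, NOT the article's real data
players = ["Player A", "Player B", "Player C"]
man_utd_df = pd.DataFrame({
    "Matches": pd.Series([100, 200, 150], index=players),
    "Goals": pd.Series([30, 80, 45], index=players),
})
man_utd_df.loc["Total"] = man_utd_df.sum(axis=0)

# Rows are the default target of drop( ), so no axis argument is needed here
man_utd_df.drop("Total", inplace=True)
print(man_utd_df)
```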

Adding column to a DataFrame

An important statistic to evaluate the goal-scoring prowess of a player is the average number of goals per game. For every individual, we divide the goals scored by the number of matches played to generate a new column called Goal Rate. The element-wise (vectorized) arithmetic that pandas inherits from numpy is at work here: the division is applied to the two columns row by row, without an explicit loop:

Wayne Rooney has a better goal rate than Cristiano Ronaldo!
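The new column can be sketched as a single division of one column by another. The figures are placeholders, so the resulting rates say nothing about the real players:

```python
import pandas as pd

# Assumption: placeholder players and figures, NOT the article's real data
players = ["Player A", "Player B", "Player C"]
man_utd_df = pd.DataFrame({
    "Matches": pd.Series([100, 200, 150], index=players),
    "Goals": pd.Series([30, 80, 45], index=players),
})

# Element-wise division: each player's goals over that player's matches
man_utd_df["Goal Rate"] = man_utd_df["Goals"] / man_utd_df["Matches"]
print(man_utd_df)
```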

Dropping a column from a DataFrame

Once again the drop( ) function comes into the picture. However, the axis parameter takes a value of 1 (axis = 0 for rows and axis = 1 for columns) and inplace is set to True so that the result is reflected in the original DataFrame:

The Goal Rate column is no longer present
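Dropping the column can be sketched in the same way, again on placeholder figures:

```python
import pandas as pd

# Assumption: placeholder players and figures, NOT the article's real data
players = ["Player A", "Player B", "Player C"]
man_utd_df = pd.DataFrame({
    "Matches": pd.Series([100, 200, 150], index=players),
    "Goals": pd.Series([30, 80, 45], index=players),
})
man_utd_df["Goal Rate"] = man_utd_df["Goals"] / man_utd_df["Matches"]

# axis=1 tells drop( ) to look for the label among the columns
man_utd_df.drop("Goal Rate", axis=1, inplace=True)
print(man_utd_df.columns)
```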

Displaying all possible statistics on numerical attributes

Important stats such as mean, median, quartiles and others can be displayed all at once using the describe( ) method. Result for every column is reported:

The **describe( )** method provides the complete picture
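A short sketch of describe( ) on placeholder figures. The result is itself a DataFrame, so individual statistics can be picked out with loc:

```python
import pandas as pd

# Assumption: placeholder players and figures, NOT the article's real data
players = ["Player A", "Player B", "Player C"]
man_utd_df = pd.DataFrame({
    "Matches": pd.Series([100, 200, 150], index=players),
    "Goals": pd.Series([30, 80, 45], index=players),
})

stats = man_utd_df.describe()
print(stats)

# Rows of the summary: count, mean, std, min, 25%, 50% (median), 75%, max
print(stats.loc["mean", "Matches"])   # 150.0
```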

I would like to thank my dear friend Pritam for designing the poster.

I hope you found this post helpful. Please feel free to leave your comments, feedback, criticism, thoughts and everything else that comes along with them. See you soon!


Written by Soumyajit Pal