Beginner’s Guide to Data Analysis using numpy and pandas
Oftentimes, we tend to forget that the pandas library is built on top of the numpy package. In this guide, we take advantage of that relationship: the data held in a pandas object can be exposed as a numpy array, and many numpy operations work on it directly.
Incorporating the necessary packages
To be able to make full use of the power of both pandas and numpy, we must import the necessary packages. As is the well-known convention, we rename them appropriately:
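The conventional imports can be sketched as follows. The version printout is an optional sanity check added here for illustration; it is not part of the convention itself.

```python
import numpy as np   # numpy is conventionally aliased as np
import pandas as pd  # pandas is conventionally aliased as pd

# Optional: confirm that both packages were imported successfully.
print(np.__version__, pd.__version__)
```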
In case we do not have these packages installed, we can install them through the terminal by typing the following commands:
pip install pandas # try pip3 if necessary
pip install numpy # try pip3 if necessary
Once the packages have been imported under these aliases, we must refer to them as pd (for pandas) and np (for numpy). Using the full names pandas or numpy instead raises a NameError.
Creating DataFrame object
A DataFrame can be created from a list, a dictionary or even a numpy array. We populate a numpy array with random integers and build a DataFrame object out of it:
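A minimal sketch of the array creation step, assuming the bounds and shape described in this section (10, 50 and 5 x 3). The variable name arr is chosen here for illustration.

```python
import numpy as np

# Lower bound 10 (inclusive), upper bound 50 (exclusive), shape 5 rows x 3 columns.
arr = np.random.randint(10, 50, (5, 3))
print(arr)
```

Because the numbers are random, the printed values differ from run to run.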
Using the randint( ) function from the random module of numpy, we create a numpy array with 5 rows and 3 columns. The first and second arguments to randint( ) are the lower and upper bounds of the range from which the numbers are drawn, and the shape is passed as a tuple in the third argument. The lower bound is inclusive and the upper bound is exclusive, so the random integers lie between 10 and 49 (50 - 1). We now pass the array as an argument to DataFrame( ), resulting in the creation of a DataFrame object:
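Building the DataFrame from the array might look like this. The name arr is an assumption; df is the name used in the rest of this guide.

```python
import numpy as np
import pandas as pd

arr = np.random.randint(10, 50, (5, 3))  # 5 x 3 array of random integers in [10, 50)
df = pd.DataFrame(arr)                   # row and column headers are auto-generated
print(df)
```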
To display the content of df as a numpy array, we use the values attribute of the DataFrame:
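A small sketch of the values attribute. To keep the output reproducible, it uses a fixed illustrative array rather than random integers.

```python
import numpy as np
import pandas as pd

# Fixed, illustrative data (the guide itself uses random integers).
data = np.arange(10, 25).reshape(5, 3)
df = pd.DataFrame(data)

values = df.values   # the underlying data as a numpy array
print(type(values))  # <class 'numpy.ndarray'>
print(values)
```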
The row headers (0, 1, 2, 3, 4) are auto-generated and form a sequence; so do the column headers (0, 1, 2). To get the row headers, we use the index attribute:
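For a DataFrame built from a 5 x 3 array without explicit headers, the index attribute can be inspected like this (fixed illustrative data is used in place of random integers):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10, 25).reshape(5, 3))  # illustrative data

print(df.index)        # RangeIndex(start=0, stop=5, step=1)
print(list(df.index))  # [0, 1, 2, 3, 4]
```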
To fetch the column headers, which are also an auto-generated sequence, we use the columns attribute:
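The same check for the column headers, again on fixed illustrative data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10, 25).reshape(5, 3))  # illustrative data

print(df.columns)        # RangeIndex(start=0, stop=3, step=1)
print(list(df.columns))  # [0, 1, 2]
```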
Mind you, a sequence goes up to but not including the stop value. Therefore, for both row and column sequences, the stop parameter is 1 more than the last value.
Since the values attribute of a DataFrame is a numpy array, we can index and slice it in the same way we would index and slice any numpy array. The general form is:
df.values[row_index, column_index] # indexing
df.values[row_start:row_stop, col_start:col_stop] # slicing
Every slice obtained this way is itself a numpy array:
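A couple of slices, sketched on fixed illustrative data so the results can be checked by hand. The variable names are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10, 25).reshape(5, 3))  # illustrative data

top_left = df.values[0:2, 0:2]  # rows 0-1, columns 0-1
middle_col = df.values[:, 1]    # every row of column 1

print(type(top_left))  # <class 'numpy.ndarray'>
print(top_left)
print(middle_col)
```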
We can also access a particular element of the DataFrame:
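Accessing a single element through the values array might look like this (illustrative data; the position 2, 1 is an arbitrary choice):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10, 25).reshape(5, 3))  # illustrative data

element = df.values[2, 1]  # third row, second column
print(element)             # 17
```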
Assigning manual row headers and column headers
So far, the row and column headers have been auto-generated. We can assign our own headers as well:
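Assigning the labels R1 to R5 and C1 to C3 used in the following sections can be sketched like this, on fixed illustrative data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10, 25).reshape(5, 3))  # illustrative data

# Replace the auto-generated headers with our own labels.
df.index = ["R1", "R2", "R3", "R4", "R5"]
df.columns = ["C1", "C2", "C3"]
print(df)
```

The lists must match the number of rows and columns, otherwise pandas raises a ValueError.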
Explicit indexing works on DataFrame objects
Using row and column labels, along with the attribute loc, we can extract any element from our DataFrame:
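A minimal example of label-based access with loc, assuming the R and C labels from the previous section and fixed illustrative data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(10, 25).reshape(5, 3),  # illustrative data
    index=["R1", "R2", "R3", "R4", "R5"],
    columns=["C1", "C2", "C3"],
)

element = df.loc["R3", "C2"]  # row label first, then column label
print(element)                # 17
```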
Implicit indexing works on DataFrame objects
In addition to the labels R1 to R5 and C1 to C3, every row and column has an implicit integer position: 0 to 4 for rows R1 to R5 and 0 to 2 for columns C1 to C3. These positions can be used with the iloc attribute to fetch any element of our DataFrame:
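Position-based access with iloc, sketched on the same illustrative data. Position (2, 1) corresponds to the labels (R3, C2).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(10, 25).reshape(5, 3),  # illustrative data
    index=["R1", "R2", "R3", "R4", "R5"],
    columns=["C1", "C2", "C3"],
)

by_position = df.iloc[2, 1]      # third row, second column
by_label = df.loc["R3", "C2"]    # same cell, addressed by label
print(by_position, by_label)     # 17 17
```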
Subset of a DataFrame using implicit slicing
The iloc attribute helps to slice data as well. It is exclusive of the upper limit:
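A short sketch of positional slicing on illustrative data. The stop positions (3 and 2) are excluded, so the result has two rows and two columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(10, 25).reshape(5, 3),  # illustrative data
    index=["R1", "R2", "R3", "R4", "R5"],
    columns=["C1", "C2", "C3"],
)

subset = df.iloc[1:3, 0:2]  # rows at positions 1-2, columns at positions 0-1
print(subset)
```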
Subset of a DataFrame using explicit slicing
The loc attribute helps to slice data too. Here, we use the row and column headers that are manually assigned. Contrary to iloc, it is inclusive of both lower and upper limits:
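The label-based counterpart, on the same illustrative data. Both end labels (R4 and C2) are included, so the result has three rows and two columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(10, 25).reshape(5, 3),  # illustrative data
    index=["R1", "R2", "R3", "R4", "R5"],
    columns=["C1", "C2", "C3"],
)

subset = df.loc["R2":"R4", "C1":"C2"]  # both endpoints are included
print(subset)
```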
Relationship between DataFrame and Series
A DataFrame is a collection of Series objects. Every row in a DataFrame can be thought of as a Series object with column labels. Every column in a DataFrame can be thought of as a Series object with row labels. This can be established by checking the type:
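One way to run this type check, using illustrative data with the R and C labels from earlier sections:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(10, 25).reshape(5, 3),  # illustrative data
    index=["R1", "R2", "R3", "R4", "R5"],
    columns=["C1", "C2", "C3"],
)

row = df.loc["R1"]  # one row: a Series indexed by the column labels
col = df["C1"]      # one column: a Series indexed by the row labels

print(type(row), list(row.index))
print(type(col), list(col.index))
```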
Transposing a DataFrame
We can interchange rows and columns of a DataFrame object:
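Transposition is available through the T attribute. A small sketch on illustrative data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(10, 25).reshape(5, 3),  # illustrative data
    index=["R1", "R2", "R3", "R4", "R5"],
    columns=["C1", "C2", "C3"],
)

transposed = df.T  # rows become columns and vice versa
print(transposed)
```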
Finding the shape of a DataFrame object
The shape attribute returns a tuple containing the number of rows and number of columns present in a DataFrame object in that order:
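For example, on a 5 x 3 DataFrame of illustrative data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10, 25).reshape(5, 3))  # illustrative data

n_rows, n_cols = df.shape  # (rows, columns), in that order
print(df.shape)            # (5, 3)
```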
Generalizing DataFrame creation
Let us devise a function which will take in parameters such as total number of rows, total number of columns and others and return a DataFrame object:
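One possible shape for such a function is sketched below. The function name create_df and the lower bound of 0 are assumptions made here; the parameter upper_limit and its default of 10 follow the description in this section.

```python
import numpy as np
import pandas as pd

def create_df(rows, cols, upper_limit=10):
    """Return a rows x cols DataFrame of random integers in [0, upper_limit)."""
    # Assumption: the lower bound is 0; only the upper limit is configurable.
    data = np.random.randint(0, upper_limit, (rows, cols))
    df = pd.DataFrame(data)
    # Labels R1..Rn and C1..Cm, built with list comprehensions.
    df.index = ["R" + str(i) for i in range(1, rows + 1)]
    df.columns = ["C" + str(j) for j in range(1, cols + 1)]
    return df
```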
Here, we use random integers to generate a numpy array, which is fed to the DataFrame( ) constructor of the pandas library. If we invoke this function with only two arguments, upper_limit takes its default value of 10. Otherwise, the argument passed overrides the default.
We do not rely on auto-generated row and column headers. For row headers, we assign a list built with a list comprehension to the index attribute; for column headers, we do the same with the columns attribute. Each label concatenates the string “R” (for rows) or “C” (for columns) with the corresponding row or column number. These numbers come from a sequence that starts at 1 and stops at one more than the number of rows or columns, since the stop value is excluded.
We accept rows and columns as user inputs and generate a DataFrame accordingly:
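A call with two arguments might look like the following. The function name create_df and the lower bound of 0 are assumptions, and fixed values stand in for the user inputs; in an interactive session they could come from int(input(...)).

```python
import numpy as np
import pandas as pd

def create_df(rows, cols, upper_limit=10):
    # Hypothetical helper: random integers in [0, upper_limit) with R/C labels.
    df = pd.DataFrame(np.random.randint(0, upper_limit, (rows, cols)))
    df.index = ["R" + str(i) for i in range(1, rows + 1)]
    df.columns = ["C" + str(j) for j in range(1, cols + 1)]
    return df

n_rows, n_cols = 6, 4              # stand-ins for values typed by the user
df = create_df(n_rows, n_cols)     # upper_limit falls back to its default of 10
print(df)
```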
If we pass a third argument, it behaves as the upper limit of the random sequence being produced:
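The same hypothetical helper, called with an explicit upper limit. The value 100 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

def create_df(rows, cols, upper_limit=10):
    # Hypothetical helper: random integers in [0, upper_limit) with R/C labels.
    df = pd.DataFrame(np.random.randint(0, upper_limit, (rows, cols)))
    df.index = ["R" + str(i) for i in range(1, rows + 1)]
    df.columns = ["C" + str(j) for j in range(1, cols + 1)]
    return df

df = create_df(4, 3, 100)  # the third argument overrides the default of 10
print(df)
```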
Creating DataFrame from Series objects
A DataFrame can also be created from Series objects passed as key-value pairs of a dictionary:
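A sketch of this construction is shown below. The player names and numbers are placeholders rather than real statistics, and the column names Goals and Matches are assumptions; only the name man_utd_df comes from this guide.

```python
import pandas as pd

# Placeholder names and made-up numbers, for illustration only.
players = ["Player A", "Player B", "Player C", "Player D", "Player E"]
goals = pd.Series([100, 150, 200, 120, 180], index=players)
matches = pd.Series([200, 300, 400, 300, 450], index=players)

# Each dictionary key becomes a column name; each Series becomes a column.
man_utd_df = pd.DataFrame({"Goals": goals, "Matches": matches})
print(man_utd_df)
```

The Series are aligned on their shared index, so each player ends up on one row.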
Accessing individual columns by names
Every column of the man_utd_df DataFrame is a Series object, and every column name acts like a dictionary key. We can retrieve the content of an individual column by passing the key (column name) inside [ ]:
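Column access by name, sketched on placeholder data (the names, numbers and column names are assumptions, not real statistics):

```python
import pandas as pd

players = ["Player A", "Player B", "Player C", "Player D", "Player E"]
man_utd_df = pd.DataFrame(
    {"Goals": [100, 150, 200, 120, 180], "Matches": [200, 300, 400, 300, 450]},
    index=players,  # placeholder data, for illustration only
)

goals = man_utd_df["Goals"]  # a single column comes back as a Series
print(type(goals))
print(goals)
```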
Adding row to a DataFrame
The data that we have on five Manchester United legends can yield useful summary information: the total goals scored and total matches played by them together. Using loc, we can specify the name of the new row. The sum( ) method applied along axis 0 (down the rows of each column) gives the total for every column:
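Adding a Total row could be done as follows. The data is a placeholder and the column names are assumptions; the row name Total follows this guide.

```python
import pandas as pd

players = ["Player A", "Player B", "Player C", "Player D", "Player E"]
man_utd_df = pd.DataFrame(
    {"Goals": [100, 150, 200, 120, 180], "Matches": [200, 300, 400, 300, 450]},
    index=players,  # placeholder data, for illustration only
)

# Assigning to a label that does not exist yet appends a new row.
man_utd_df.loc["Total"] = man_utd_df.sum(axis=0)
print(man_utd_df)
```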
Dropping a row from a DataFrame
After finding the sum total of goals scored and matches played, we may want to remove the row Total to enable further computation along columns. The drop( ) function comes in handy. We set the inplace parameter to True so that changes are reflected on the original DataFrame:
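A minimal sketch of dropping the Total row in place, on placeholder data with assumed column names:

```python
import pandas as pd

players = ["Player A", "Player B", "Player C", "Player D", "Player E"]
man_utd_df = pd.DataFrame(
    {"Goals": [100, 150, 200, 120, 180], "Matches": [200, 300, 400, 300, 450]},
    index=players,  # placeholder data, for illustration only
)
man_utd_df.loc["Total"] = man_utd_df.sum(axis=0)  # add the row first

# Rows are dropped by default (axis=0); inplace=True modifies man_utd_df itself.
man_utd_df.drop("Total", inplace=True)
print(man_utd_df)
```

Without inplace=True, drop( ) returns a new DataFrame and leaves the original unchanged.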
Adding column to a DataFrame
An important statistic for evaluating the goal-scoring prowess of a player is the average number of goals per game. For every individual, we divide the goals scored by the number of matches played to generate a new column called Goal Rate. The division is vectorized: it is applied element-wise, row by row, across the two columns without an explicit loop:
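The new column can be computed in one line. The data below is a placeholder and the column names Goals and Matches are assumptions; Goal Rate is the name used in this guide.

```python
import pandas as pd

players = ["Player A", "Player B", "Player C", "Player D", "Player E"]
man_utd_df = pd.DataFrame(
    {"Goals": [100, 150, 200, 120, 180], "Matches": [200, 300, 400, 300, 450]},
    index=players,  # placeholder data, for illustration only
)

# Element-wise division of one column by another; rows are matched by label.
man_utd_df["Goal Rate"] = man_utd_df["Goals"] / man_utd_df["Matches"]
print(man_utd_df)
```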
Dropping a column from a DataFrame
Once again the drop( ) function comes into the picture. However, the axis parameter takes a value of 1 (axis = 0 for rows and axis = 1 for columns) and inplace is set to True so that the result is reflected in the original DataFrame:
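Dropping the Goal Rate column might look like this, again on placeholder data with assumed column names:

```python
import pandas as pd

players = ["Player A", "Player B", "Player C", "Player D", "Player E"]
man_utd_df = pd.DataFrame(
    {"Goals": [100, 150, 200, 120, 180], "Matches": [200, 300, 400, 300, 450]},
    index=players,  # placeholder data, for illustration only
)
man_utd_df["Goal Rate"] = man_utd_df["Goals"] / man_utd_df["Matches"]

# axis=1 targets columns; inplace=True modifies man_utd_df itself.
man_utd_df.drop("Goal Rate", axis=1, inplace=True)
print(man_utd_df)
```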
Displaying all possible statistics on numerical attributes
Important statistics such as the mean, the median (reported as the 50% quartile) and the other quartiles can be displayed all at once using the describe( ) method. The result is reported for every numerical column:
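A brief sketch of describe( ) on placeholder data (the names, numbers and column names are assumptions, not real statistics):

```python
import pandas as pd

players = ["Player A", "Player B", "Player C", "Player D", "Player E"]
man_utd_df = pd.DataFrame(
    {"Goals": [100, 150, 200, 120, 180], "Matches": [200, 300, 400, 300, 450]},
    index=players,  # placeholder data, for illustration only
)

stats = man_utd_df.describe()  # count, mean, std, min, quartiles and max per column
print(stats)
```

The result is itself a DataFrame, so a single figure can be read with loc, for example stats.loc["mean", "Goals"].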
I would like to thank my dear friend Pritam for designing the poster.
I hope you found this post helpful. Please feel free to leave your comments, feedback, criticism, thoughts and everything else that comes along with them. See you soon!