🐼 Learn to code with Pandas 🐼
A key component of our organization’s work involves data collection, so it is important that we filter and visualize data efficiently to better understand it. In this article, I will introduce you to ‘Pandas’, one of the Python libraries we use to do so.
A library is a collection of files known as ‘modules’, which contain functions that are used by other programs.
To better visualize this, let’s take a money transfer system as an example.
As can be seen in the diagram below, the conversions ‘USD to EUR’ and ‘INR to GBP’ are functions of this system, and together these conversions form a module. In the same way, ‘check sender details’ and ‘check recipient details’ are small functions in the system, which together form another module. Likewise, ‘read_csv()’, ‘head()’ and ‘tail()’ are some of the functions that together make up the library called ‘Pandas’.
What is Pandas?
Pandas is an easy-to-use, open-source library for Python, which is widely used for data analysis and manipulation.
We use Pandas to load data from different sources and file formats, such as comma-separated values (CSV), JSON, SQL databases, and Excel. The library offers a powerful and flexible two-dimensional data structure called a DataFrame (a table of rows and columns), which helps developers to manipulate or clean tabular data easily using Python scripts.
DataFrames in Pandas make operations on data smooth because of their integrated indexing. The indexing makes it easier to manipulate data using a range of functions, such as ‘append()’ and ‘drop()’, which I discuss briefly later in this article.
Pandas’ integrated indexing has two important components: the row index and the column index. Used together, they enable users to access a specific block of data, called a cell.
A cell is a single box in the table that stores one value, such as text, a number, a date, or a combination of numbers and text.
Pandas is a flexible tool that helps to reshape, clean and manipulate data. As such, it is often used to prepare data for machine learning, deep learning and neural network projects.
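As a small sketch of how the two indexes work together, the snippet below builds a tiny DataFrame and reads one cell by its row label and column name. The organization and country values are made-up placeholders, not data from our project.

```python
import pandas as pd

# Placeholder data, used only for illustration.
df = pd.DataFrame(
    {
        "Organization_Names": ["Org A", "Org B"],
        "Country": ["Country A", "Country B"],
    }
)

print(df.index)    # the row index (0 and 1 here)
print(df.columns)  # the column index (the column names)

# A row label plus a column name identifies exactly one cell.
cell = df.loc[1, "Country"]
print(cell)
```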
Basic Pandas Operations
The first step is simply to import the Pandas library so that its functions can be accessed. The abbreviation ‘pd’ is the conventional short form for Pandas, as you can see in some of the screenshots below.
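In code, this step is a single line. The version printout is only an optional check that the import worked.

```python
import pandas as pd  # 'pd' is the conventional short name for Pandas

# Optional: confirm that the library loaded by printing its version.
print(pd.__version__)
```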
Loading data in a DataFrame
The ‘read_excel()’ function reads data from an Excel file, given either its URL or its location on your computer. In the example below, we used a URL to load data from an Excel file and stored the resulting DataFrame in the variable ‘data’ referred to in the script.
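The loading step might look roughly like the sketch below. Because the original file's URL is not reproduced here, the runnable part uses a few lines of made-up CSV text with ‘read_csv()’ as a stand-in; the commented line shows the shape of the equivalent ‘read_excel()’ call, with a hypothetical file location. Reading Excel files also requires an Excel engine such as openpyxl to be installed.

```python
import io

import pandas as pd

# Made-up stand-in data; the real article loaded an Excel file from a URL.
csv_text = (
    "Organization_Names,Sectors,Country\n"
    "Org A,Health,Country A\n"
    "Org B,Education,Country B\n"
)

# read_csv accepts a path, a URL, or a file-like object such as StringIO.
data = pd.read_csv(io.StringIO(csv_text))

# With a real spreadsheet the call would look like this (hypothetical location):
# data = pd.read_excel("path/or/url/to/file.xlsx")

print(data)
```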
Printing a DataFrame
To print a section of a DataFrame, you can use ‘data.head()’ to print its top 5 rows, while ‘data.tail()’ prints the bottom 5 rows. Five is the default; both functions accept a different number of rows as an argument.
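A quick sketch of both functions on a ten-row DataFrame of made-up numbers:

```python
import pandas as pd

# Made-up data: ten rows holding the values 1 to 10.
data = pd.DataFrame({"value": range(1, 11)})

top = data.head()          # first 5 rows by default
bottom = data.tail()       # last 5 rows by default
first_three = data.head(3) # pass a number to change how many rows you get

print(top)
print(bottom)
print(first_three)
```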
Creating a DataFrame in Pandas
As you can see below, we created a DataFrame using the function ‘DataFrame()’ with 3 columns: ‘Organization_Names’, ‘Sectors’ and ‘Country’.
Inside this function, each column name is followed by a colon and then by square brackets that contain the data for that column.
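A minimal sketch of this pattern is shown here. The column names come from the article, while the values inside the square brackets are placeholders.

```python
import pandas as pd

# Each column name is followed by a colon and a list of that column's values.
# The values are placeholders, not the article's real data.
df = pd.DataFrame(
    {
        "Organization_Names": ["Org A", "Org B", "Org C"],
        "Sectors": ["Health", "Education", "Water"],
        "Country": ["Country A", "Country B", "Country C"],
    }
)

print(df)
```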
Creating a new column in an existing DataFrame
To add a new column to an existing DataFrame, first store the column’s data in a variable (a name of your choice). In our script, we created a variable named ‘sector_code’ to hold the data, and then assigned it to a new column of the existing DataFrame.
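This could be sketched as follows. The variable name ‘sector_code’ is taken from the article; the column name ‘Sector_Code’ and all values are assumptions made for illustration.

```python
import pandas as pd

# Placeholder DataFrame to add the column to.
df = pd.DataFrame(
    {
        "Organization_Names": ["Org A", "Org B", "Org C"],
        "Sectors": ["Health", "Education", "Water"],
    }
)

# Store the new column's data in a variable: one value per existing row.
sector_code = ["HLT", "EDU", "WTR"]

# Assigning to a column name that does not exist yet creates the column.
df["Sector_Code"] = sector_code

print(df)
```

The list must contain exactly as many values as the DataFrame has rows; otherwise Pandas raises an error.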
Adding a new row to a DataFrame
To add a new row to a DataFrame, we first store the row’s data in a variable and then add it to the existing DataFrame with the ‘append()’ function. As can be seen in the screenshot below, we created a variable named ‘data’ to store the row’s information and used the aforementioned function to add it to the existing DataFrame as a third row, for Oxfam. Note that ‘append()’ was removed in Pandas 2.0; in current versions, ‘pd.concat()’ does the same job.
‘ignore_index=True’ is used inside the function so that the new row is given the next index in the sequence, ‘3’, instead of keeping an index of its own.
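Since ‘append()’ is no longer available in Pandas 2.0 and later, the sketch below adds the row with ‘pd.concat()’ instead, which is a substitute for the call in the original script. It assumes an existing DataFrame with three placeholder rows (indexes 0 to 2), so the new row ends up with index 3; apart from the name Oxfam, all values are placeholders.

```python
import pandas as pd

# Existing DataFrame with three placeholder rows (indexes 0, 1, 2).
df = pd.DataFrame(
    {
        "Organization_Names": ["Org A", "Org B", "Org C"],
        "Sectors": ["Health", "Education", "Water"],
        "Country": ["Country A", "Country B", "Country C"],
    }
)

# Store the new row in a variable, as a one-row DataFrame.
data = pd.DataFrame(
    [
        {
            "Organization_Names": "Oxfam",
            "Sectors": "Placeholder sector",
            "Country": "Placeholder country",
        }
    ]
)

# ignore_index=True renumbers the rows 0, 1, 2, 3 instead of keeping
# the new row's own index (which would be 0).
df = pd.concat([df, data], ignore_index=True)

print(df)
```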
Adding data to a specific cell in a DataFrame
If at any point you need to add data to a cell, you can do so by using the ‘at’ accessor, which takes a row label and a column name. We used this to add the missing value in the third row of our DataFrame, in the ‘Sectors’ column. The accessor lets you read or set the value of any single cell, as and when needed.
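A rough sketch of filling one missing cell with ‘at’ follows. The DataFrame, the row label 3 and the value written into the cell are all placeholders chosen for illustration.

```python
import pandas as pd

# Placeholder DataFrame in which the last row has no sector yet.
df = pd.DataFrame(
    {
        "Organization_Names": ["Org A", "Org B", "Org C", "Oxfam"],
        "Sectors": ["Health", "Education", "Water", None],
    }
)

# 'at' uses square brackets: first the row label, then the column name.
df.at[3, "Sectors"] = "Placeholder sector"

print(df.at[3, "Sectors"])
```

Note that ‘at’ is written with square brackets rather than parentheses, because it is an accessor and not an ordinary function call.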
Deleting a row in a DataFrame
You can delete an entire row in one command by using Pandas’ ‘drop()’ function and specifying the row’s index label. Here, ‘axis=0’ indicates that a row is being dropped, and ‘inplace=True’ applies the change directly to the existing DataFrame instead of returning a modified copy.
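For example, dropping the row labelled 1 from a small placeholder DataFrame might look like this:

```python
import pandas as pd

# Placeholder DataFrame with rows labelled 0, 1 and 2.
df = pd.DataFrame(
    {
        "Organization_Names": ["Org A", "Org B", "Org C"],
        "Country": ["Country A", "Country B", "Country C"],
    }
)

# Drop the row whose index label is 1; inplace=True modifies df itself.
df.drop(1, axis=0, inplace=True)

print(df)
```

The remaining rows keep their original labels (0 and 2); they are not renumbered automatically.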
Deleting a column in a DataFrame
The ‘drop()’ function can also be used to delete a column in a DataFrame, just by specifying the column name. Here, ‘axis=1’ indicates that a column is being dropped.
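The column version differs only in the label and the axis. In this sketch the DataFrame and the choice of the ‘Country’ column are placeholders.

```python
import pandas as pd

# Placeholder DataFrame with three columns.
df = pd.DataFrame(
    {
        "Organization_Names": ["Org A", "Org B"],
        "Sectors": ["Health", "Education"],
        "Country": ["Country A", "Country B"],
    }
)

# axis=1 tells drop() that "Country" is a column name, not a row label.
df.drop("Country", axis=1, inplace=True)

print(df.columns.tolist())
```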
One of the main reasons why Pandas plays such an important role in the data world is that it makes it easier to work with large amounts of data, which would otherwise require hours of manual work. Now that you are more familiar with Pandas and its DataFrame, I hope you will try it out yourself!
You can do so by copy-pasting the code below into Google Colab: https://colab.research.google.com/drive/1rkOvd4Es529RTK03vEMCSsmeGM8hgAOx?usp=sharing