Pandas for Data Science: Learn the Basics of Pandas
Pandas is one of the few Python libraries that every data scientist must be familiar with. Pandas is an open-source library built on top of NumPy. It is very efficient at data analysis and data manipulation when you have to deal with large datasets. The good news is that Pandas is also very easy to use!
Why Pandas? You can carry out many simple arithmetic and statistical operations even in MS Excel, so you might wonder why you should learn Pandas at all. There is no doubt that MS Excel is an extremely powerful tool, but Pandas is even more powerful, which is why it is so popular among data scientists around the world. With just a few lines of code, you can perform several complicated operations using this library.
In this post, we’ll check out the following:
- Data Types in Pandas (Series and DataFrame)
- Converting a .csv file into a DataFrame
- Getting the top 5 and bottom 5 rows of a DataFrame (“head” and “tail” functions)
- Getting information about a DataFrame
- Getting a particular column from a DataFrame
- Getting a particular row from a DataFrame
- Getting a simple statistical summary of the DataFrame
- Getting the sum total of columns
- Plotting a simple graph in Pandas
Let’s get started now!
Data Types in Pandas:
There are 2 types of objects used in Pandas. One is a “Series”, which is used to create and store 1-dimensional values, like say, a column or a row. And the other one is “DataFrame”, which can be thought of as a table: it is 2-dimensional. A DataFrame contains rows and columns. In data science, you’ll be using Pandas DataFrames often.
You can create Series and DataFrames directly as you code in Python, but usually DataFrames are not typed out by hand: the data is imported from elsewhere, because it can be hard to type all that data in, value by value, while writing a program. This data is often in the form of comma-separated values (.csv) files. A .csv file can be converted into a DataFrame using Pandas, and you can then do lots of things with it.
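To make the two object types concrete, here is a minimal sketch that builds one of each by hand. The names and numbers are made up for illustration and are not taken from the volunteer data set; "pd" is the conventional alias for the library.

```python
import pandas as pd

# A Series holds 1-dimensional data (hypothetical values).
hours = pd.Series([12, 8, 15], name="Hours Volunteered")

# A DataFrame is a 2-dimensional table; each dictionary key becomes a column.
volunteers = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chitra"],
    "Hours Volunteered": [12, 8, 15],
})

print(hours.ndim)        # number of dimensions of the Series
print(volunteers.shape)  # (rows, columns) of the DataFrame
```

Each column of a DataFrame is itself a Series, which is why the two types tend to show up together.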
Working with Pandas:
It is common for Python programmers to import pandas as “pd”, simply to save time and effort by typing 2 characters instead of 6 each time. To begin with, install the library using the “pip install pandas” command and then import it in your file.
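As a quick sketch of that setup step, assuming pip is available on your machine:

```python
# Run once in a terminal (not inside Python): pip install pandas
import pandas as pd

# Printing the version is a simple way to confirm the import worked.
print(pd.__version__)
```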
Now, you can create and edit objects in Pandas, or import data created and stored elsewhere (usually, it would be Excel spreadsheets) and manipulate it in Pandas.
Let’s consider this example data set: one that shows the details of different volunteers who volunteer for an animal rescue program.
As a data scientist, you might want to learn several things about the data set you’re expected to work on. For now, the data set is a spreadsheet saved as a .csv file, but we can convert it into a DataFrame. How?
Converting a .csv file into a DataFrame in Pandas:
Use the read_csv() function in Pandas. Voila! You have a dataframe! I totally love ❤️ dataframes in Pandas. A dataframe presents your data in a much more beautiful and structured way, and once a .csv file has been converted into a dataframe, several Pandas operations can be performed.
Note: You need to save the dataframe under some variable name so that you can perform the many functions you want to. Here, we have saved it as “df”, but you can use any variable name that is relevant, and that would be a better option in fact, especially when there are many datasets involved.
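Here is a small sketch of that conversion. So that it runs anywhere, the CSV content is a made-up string wrapped in io.StringIO instead of a real file, and the column names are assumptions, not the actual columns of the volunteer data set.

```python
import io
import pandas as pd

# Hypothetical CSV content; with a real file you would pass its path instead,
# e.g. pd.read_csv("volunteers.csv") (the file name is an assumption).
csv_text = """Name,City,Hours Volunteered
Asha,Pune,12
Ben,Mumbai,8
Chitra,Chennai,15
"""

# read_csv accepts a file path or a file-like object and returns a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```

The first line of the CSV is used for the column names, and Pandas adds a default row index starting at 0.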
Checking out the top 5 (and bottom 5) rows of your data set:
Now, this one has only 16 items (remember, the default index begins from 0, not 1), so it’s easy to scroll through and check out the data set. But what about when you have thousands, tens of thousands, or hundreds of thousands of rows in your data set? In such cases, you can’t just scroll through and understand what your data set is about.
Someone might share a dataset with you, and ask you to work on it and share your findings, without giving you any additional information. Now, in a large data set, how do you begin to make sense of it? You can use the .head() function and the .tail() function to check out the top 5 rows and bottom 5 rows to get a basic understanding of what you’re dealing with. 5 is just the default number; you can get as many rows as you want to see by entering the number within the parentheses, for example df.head(10).
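A short sketch of both functions, using a made-up 16-row DataFrame with a single hypothetical column in place of the real data set:

```python
import pandas as pd

# 16 hypothetical rows, numbered 1 to 16.
df = pd.DataFrame({"Volunteer ID": list(range(1, 17))})

top = df.head()      # first 5 rows by default
bottom = df.tail(3)  # last 3 rows, because a number was passed

print(top)
print(bottom)
```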
Getting information about a DataFrame:
You can use the .info() function to understand the dataset or DataFrame much better. It would quickly tell you how many rows and columns there are, how many values in each column are “non-null”, the data type of each column, and how much memory the DataFrame uses.
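To see what that looks like, here is a minimal sketch on a made-up DataFrame with one missing value (the columns are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chitra"],
    "Amount Donated": [500.0, None, 1200.0],  # None is stored as a missing value (NaN)
})

# info() prints its summary directly and returns None,
# so there is no need to wrap it in print().
df.info()
```

In the printed summary, the “Amount Donated” column should show one fewer non-null value than “Name”, which is a quick way to spot missing data.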
Checking out a particular column or row:
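How do you get Python Pandas to show you just one column that you’re interested in? You can do this in two ways, and the results would be the same. You can either use df.column_name or df["column name"]. The dot form only works when the column name is a valid Python name (no spaces, for example) and does not clash with an existing DataFrame attribute or method; the bracket form always works.
You can also use .loc and .iloc to get a particular row or column by its index label or position. Note that there is a subtle difference between the two: .loc selects by label, while .iloc selects by integer position, counting from 0.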
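The sketch below shows these selection styles side by side. The DataFrame and its row labels ("v1", "v2", "v3") are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Asha", "Ben", "Chitra"], "Hours Volunteered": [12, 8, 15]},
    index=["v1", "v2", "v3"],  # hypothetical row labels
)

# Two ways to get a column.
names_attr = df.Name                 # dot access: only for simple column names
hours_col = df["Hours Volunteered"]  # bracket access: works for any column name

# Two ways to get the second row.
row_by_label = df.loc["v2"]   # .loc selects by index label
row_by_position = df.iloc[1]  # .iloc selects by integer position (0-based)

print(hours_col)
print(row_by_label)
```

With the default 0, 1, 2, ... index, labels and positions happen to coincide, which is why the difference between .loc and .iloc is easy to miss.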
Getting a simple statistical summary of the DataFrame:
You can use the .describe() function to quickly get a simple statistical summary of a DataFrame. For each numerical column, this function would give you the count of non-null entries, along with the mean, standard deviation, minimum, maximum, and percentile values (25%, 50%, and 75% by default). How cool is that? With one line of code, you can get all these values showing up for an entire DataFrame!
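As a small illustration, here is .describe() on a made-up DataFrame with one text column and one numerical column (both hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chitra", "Dev"],
    "Hours Volunteered": [10, 20, 30, 40],
})

# By default, only numerical columns are summarised, so "Name" is left out.
summary = df.describe()
print(summary)
```

The result is itself a DataFrame, so a single value can be picked out with something like summary.loc["mean", "Hours Volunteered"].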
Getting the sum total of columns:
You can use the .sum() function in Pandas to get the sum total of all the columns in a DataFrame. But if you have many columns with non-numerical data (like the one we have here), then there’s no point in calculating the sum for all these values. For columns of text, Pandas would return a string of all the values put together, as you can see here. And for very large datasets, that is totally useless and would only mean a lot of unnecessary work. In such cases, you can use .sum(numeric_only=True) to get the sum total values of only those columns with numerical values.
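A minimal sketch of the numeric-only sum, again on made-up data with hypothetical column names:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chitra"],
    "Hours Volunteered": [12, 8, 15],
    "Amount Donated": [500, 300, 1200],
})

# numeric_only=True skips the text column "Name" entirely.
totals = df.sum(numeric_only=True)
print(totals)
```

The result is a Series with one total per numerical column, so totals["Amount Donated"] gives the total for just that column.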
Plotting a simple graph in Pandas:
Lastly, let us check out how to plot a graph in Pandas. Like all these operations we have seen so far, plotting a simple graph is also pretty easy in Pandas. You can do this using a single line of code as well. This is done using the .plot() function. But when your DataFrame has more than one column, by default, Pandas would try to plot all those numerical columns in one graph and it might not make sense.
If you want to see the graph of only one column, you have to be specific and use the name of the column. You can use something like this: df["Amount Donated \nin INR"].plot() and you’d get the graph for that particular column.
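Here is a rough sketch of plotting a single column. It assumes Matplotlib is installed, since Pandas relies on it for plotting by default. The data and the simplified column name "Amount Donated" are made up, and the "Agg" backend is chosen only so that the script also runs on machines without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook

import pandas as pd

df = pd.DataFrame({
    "Hours Volunteered": [12, 8, 15],
    "Amount Donated": [500, 300, 1200],
})

# Selecting one column first gives a line plot of just that column
# against the row index.
ax = df["Amount Donated"].plot()
ax.set_ylabel("Amount Donated")
```

In a notebook the graph is shown automatically; in a plain script you would typically save it with ax.figure.savefig(...) or display it with Matplotlib's show function.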
For further information and to learn more, please check out the documentation here: https://pandas.pydata.org/docs/