Pandas for Data Science: Learn the Basics of Pandas

Lakshmi Prakash
Design and Development
7 min read · Aug 31, 2022

Pandas is one of the few Python libraries that every data scientist must be familiar with. It is an open-source library built on top of NumPy, and it is very efficient at data analysis and data manipulation, even when you have to deal with large datasets. The good news is that Pandas is also very easy to use!

Why Pandas? You can carry out many simple arithmetic and statistical operations even in MS-Excel, so why learn Pandas, you might wonder. There is no doubt that MS-Excel is an extremely powerful tool, but Pandas is even more powerful, which is why it is so popular among data scientists around the world. With just a few lines of code, you can perform several complicated operations using this library.

In this post, we’ll check out the following:

  • Data Types in Pandas (Series and DataFrame)
  • Converting a .csv file into a DataFrame
  • Getting the top 5 and bottom 5 rows of a DataFrame (“head” and “tail” functions)
  • Getting information about a DataFrame
  • Getting a particular column from a DataFrame
  • Getting a particular row from a DataFrame
  • Getting a simple statistical summary of the DataFrame
  • Getting the sum total of columns
  • Plotting a simple graph in Pandas

Let’s get started now!

Data Types in Pandas:

There are two main types of objects used in Pandas. One is a “Series”, which stores 1-dimensional data, such as a single column or a single row. The other is a “DataFrame”, which can be thought of as a table: it is 2-dimensional and contains rows and columns. In data science, you’ll be using Pandas DataFrames often.

You can create Series and DataFrames by hand as you code in Python, but usually the data is imported from elsewhere, because it can be hard to type all that data in, one value at a time, while writing a program. This data is often in the form of comma-separated values (.csv) files. A .csv file can be converted into a DataFrame using Pandas, and you can then do lots of things with it.
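To make the two object types concrete, here is a minimal sketch of creating each one by hand. The names and values are made up purely for illustration; they are not taken from the data set used later in this post.

```python
import pandas as pd

# A Series holds 1-dimensional data, such as a single column.
hours = pd.Series([4, 6, 3], name="Hours")

# A DataFrame is a 2-dimensional table of rows and columns.
# These names and values are invented for illustration only.
volunteers = pd.DataFrame(
    {
        "Name": ["Asha", "Ravi", "Meena"],
        "Hours": [4, 6, 3],
    }
)

print(hours.ndim)        # number of dimensions of the Series
print(volunteers.shape)  # (rows, columns) of the DataFrame
```

Each column of a DataFrame is itself a Series, which is why the two types work so well together.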

Pandas for Data Science

Working with Pandas:

It is common for Python programmers to import pandas as “pd”, which saves you from typing six characters each time when two will do. To begin with, install the library using the “pip install pandas” command and import it in your file.
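In practice, the setup looks something like this. The install command is run once in a terminal, not inside Python, so it appears here only as a comment.

```python
# Install once from a terminal (not inside Python):
#   pip install pandas
import pandas as pd

# Printing the version is a quick way to confirm that the import worked.
print(pd.__version__)
```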

Now, you can create and edit objects in Pandas, or import objects created and stored elsewhere (usually Excel spreadsheets) and manipulate them in Pandas.

Let’s consider this example data set: one that shows the details of different volunteers who volunteer for an animal rescue program.

A data set showing volunteer details for an animal rescue volunteer program in Excel

As a data scientist, you might want to learn several things about the data set you’re expected to work on. For now, the data set is in the form of an Excel spreadsheet, but once it is saved as a .csv file, we can convert it into a DataFrame. How?

Converting a .csv file into a DataFrame in Pandas:

Use the read_csv() function in Pandas. Voila! You have a DataFrame! I totally love ❤️ DataFrames in Pandas. A DataFrame presents your data in a much more beautiful and structured way, and once a .csv file has been converted into a DataFrame, several Pandas operations can be performed on it.

Note: You need to save the DataFrame under some variable name so that you can carry out the operations you want to. Here, we have saved it as “df”, but you can use any variable name that is relevant. A descriptive name is in fact the better option, especially when there are many datasets involved.

Creating a DataFrame using Pandas
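Here is a small sketch of the idea. To keep it runnable without a file on disk, it reads from an in-memory string; the column names and values are invented, and “volunteers.csv” in the comment is a hypothetical file name, not the one used for this post.

```python
import io

import pandas as pd

# Stand-in for a .csv file on disk; the columns and values are invented.
csv_text = """Name,City,Hours
Asha,Chennai,4
Ravi,Mumbai,6
Meena,Delhi,3
"""

# read_csv accepts a file path or a file-like object.
# With a real file this would look like: df = pd.read_csv("volunteers.csv")
df = pd.read_csv(io.StringIO(csv_text))

print(df)
```

The first line of the .csv becomes the column headers, and Pandas adds a row index starting from 0.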

Checking out the top 5 (and bottom 5) rows of your data set:

Now, this one has only 16 items (remember, the index always begins from 0, not 1), so it’s easy to scroll through and check out the data set. But what about when you have thousands, tens of thousands, or hundreds of thousands of rows in your data set? In such cases, you can’t just scroll through and understand what your data set is about.

Someone might share a dataset with you and ask you to work on it and share your findings, without giving you any additional information. Now, in a large data set, how do you begin to make sense of it? You can use the .head() and .tail() methods to check out the top 5 rows and bottom 5 rows and get a basic understanding of what you’re dealing with. 5 is just the default number; you can get as many rows as you want by entering the number within the parentheses, for example df.head(10).

The .head() function in Pandas returns the top 5 rows by default
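A quick sketch of both methods, using a made-up 16-row DataFrame (the same length as the example data set, but with invented columns and values):

```python
import pandas as pd

# A made-up DataFrame with 16 rows; the index runs from 0 to 15.
df = pd.DataFrame({"Volunteer ID": range(16), "Hours": range(10, 26)})

first_five = df.head()    # top 5 rows by default
last_five = df.tail()     # bottom 5 rows by default
first_three = df.head(3)  # pass a number to change how many rows you get

print(first_five)
print(last_five)
print(first_three)
```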

Getting information about a DataFrame:

You can use the .info() method to understand the dataset or DataFrame much better. It quickly tells you how many rows and columns there are, how many elements in each column are “non-null”, the data types of the columns, and how much memory the DataFrame uses.

Using the .info function to get an idea of the contents of a DataFrame
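Here is a minimal sketch with invented data. One value is deliberately left missing so that the “non-null” count in the output differs from the number of rows.

```python
import pandas as pd

# Invented data; the None becomes a missing value (NaN) in the "Hours" column.
df = pd.DataFrame(
    {
        "Name": ["Asha", "Ravi", "Meena"],
        "Hours": [4.0, None, 3.0],
    }
)

# info() prints its summary directly and returns nothing.
df.info()

# shape gives just the (rows, columns) pair if that is all you need.
print(df.shape)
```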

Checking out a particular column or row:

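How do you get Pandas to show you just one column that you’re interested in? You can do this in two ways, and the result is the same either way. You can use either df.column_name or df["column name"]. Note that the first form only works when the column name is a valid Python name (no spaces, for example), while the second form works for any column name.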

Getting the values of a column in a dataset
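A short sketch of both ways, again with invented data. The second column has a space in its name, so only the bracket form can reach it.

```python
import pandas as pd

# Invented data for illustration.
df = pd.DataFrame(
    {
        "Name": ["Asha", "Ravi", "Meena"],
        "Amount Donated": [500, 1200, 750],
    }
)

names_attr = df.Name      # attribute form: needs a valid Python name
names_key = df["Name"]    # bracket form: works for any column name

# The column name contains a space, so the bracket form is required here.
amounts = df["Amount Donated"]

print(names_key)
print(amounts)
```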

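You can also use .loc and .iloc to get a particular row or column by its index or position. Note that there is a subtle difference between the two: .loc selects by label (the index value or the column name), while .iloc selects by integer position, counting from 0. When the index is just the default 0, 1, 2, and so on, the two give the same rows.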

Using .loc to get the contents of a specific row by index number
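The difference is easier to see when the index is not just 0, 1, 2. In this sketch the index labels ("v1", "v2", "v3") and the data are invented for illustration.

```python
import pandas as pd

# Invented data with custom index labels instead of the default 0, 1, 2.
df = pd.DataFrame(
    {
        "Name": ["Asha", "Ravi", "Meena"],
        "Hours": [4, 6, 3],
    },
    index=["v1", "v2", "v3"],
)

by_label = df.loc["v2"]        # row whose index label is "v2"
by_position = df.iloc[1]       # second row, counting from 0

hours_v3 = df.loc["v3", "Hours"]  # a single value: row label, column name
first_column = df.iloc[:, 0]      # every row of the first column

print(by_label)
print(by_position)
```

Here both lookups land on the same row, but one finds it by its label and the other by where it sits in the table.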

Getting a simple statistical summary of the DataFrame:

You can use the .describe() method to quickly get a simple statistical summary of a DataFrame. By default, it gives you the count of non-null entries, the mean, the standard deviation, the minimum, the maximum, and the 25th, 50th, and 75th percentile values for each numerical column. How cool is that? With one line of code, you can get all these values for an entire DataFrame!

Using the .describe function in Pandas to get a statistical summary of the DataFrame
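A minimal sketch with invented data. Notice that the text column is left out of the summary by default.

```python
import pandas as pd

# Invented data: one text column and one numerical column.
df = pd.DataFrame(
    {
        "Name": ["Asha", "Ravi", "Meena", "Kiran"],
        "Hours": [4, 6, 3, 7],
    }
)

# By default, describe() summarises only the numerical columns.
summary = df.describe()

print(summary)

# The result is itself a DataFrame, so single values can be looked up.
print(summary.loc["mean", "Hours"])
```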

Getting the sum total of columns:

You can use the .sum() method in Pandas to get the sum total of all the columns in a DataFrame. But if you have many columns with non-numerical data (like the one we have here), then there’s no point in calculating the sum for all of them. For text columns, Pandas would typically return a string of all the values put together, as you can see here. For very large datasets, that is useless and only creates a lot of unnecessary work. In such cases, you can use .sum(numeric_only=True) to get the sum totals of only those columns with numerical values.

Getting the sum total of columns using the .sum function
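Here is a small sketch of the numeric-only version, with invented column names and values:

```python
import pandas as pd

# Invented data: one text column and two numerical columns.
df = pd.DataFrame(
    {
        "Name": ["Asha", "Ravi", "Meena"],
        "Hours": [4, 6, 3],
        "Amount Donated": [500, 1200, 750],
    }
)

# numeric_only=True skips the text column and totals only the numbers.
totals = df.sum(numeric_only=True)

print(totals)
```

The result is a Series with one total per numerical column, so a single total can be read with something like totals["Hours"].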

Plotting a simple graph in Pandas:

Lastly, let us check out how to plot a graph in Pandas. Like all the operations we have seen so far, plotting a simple graph is pretty easy in Pandas, and it also takes just a single line of code: the .plot() method. But when your DataFrame has more than one numerical column, Pandas will by default try to plot all of those columns in one graph, and the result might not make sense.

If you want to see the graph of only one column, you have to be specific and use the name of the column. You can use something like this: df["Amount Donated \nin INR"].plot() and you’d get the graph for that particular column.

Using the .plot() function in Pandas to plot a graph
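A minimal sketch of plotting a single column. It assumes Matplotlib is installed, since Pandas uses it for plotting by default. The values are invented, and the "Agg" backend line is only there so that the code runs without opening a window; you can leave it out in a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed

import pandas as pd

# Invented values; the column name matches the one used in this post,
# including the line break (\n) inside the header.
df = pd.DataFrame(
    {
        "Hours": [4, 6, 3, 7],
        "Amount Donated \nin INR": [500, 1200, 750, 900],
    }
)

# Selecting one column first means only that column is plotted.
ax = df["Amount Donated \nin INR"].plot()

# plot() returns a Matplotlib Axes object that can be customised further.
ax.set_xlabel("Row index")
```

In a notebook the chart shows up on its own; in a plain script you would usually call matplotlib.pyplot.show() or save the figure to a file.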

For further information and to learn more, please check out the documentation here: https://pandas.pydata.org/docs/

Pandas Library in Python for Data Science

Lakshmi Prakash
Design and Development

A conversation designer and writer interested in technology, mental health, gender equality, behavioral sciences, and more.