How to solve data tasks quickly — An introduction to Pandas

Alexander Popov · Analytics Vidhya · Mar 1, 2021

Pandas is one of the most important tools for data science. In this article you will learn its most-used functions in a short time and how to get results quickly. You start with a large dataset: how can you read it, clean it, and evaluate it properly? I'll show you the most important functions for this.

But first: What is Pandas and when is it used?
Pandas is an open-source data analysis and manipulation tool built on Python. It was created to enable practical, real-world data analysis in Python, and its main advantages are flexibility and ease of use. In this article, I will walk through the common steps of working with big datasets, present the most important functions, and explain how to use them to get started on your first projects.

Basics of Pandas:
There are mainly two types of objects used in Pandas:
1. DataFrame: a table of rows and columns whose entries can hold values of various types, such as integers or strings. This is what makes Pandas useful for a wide variety of tasks. To create a DataFrame object, we use the function pd.DataFrame().
2. Series: a Series, on the other hand, is not a table but a single column of values, and can be created directly from a plain Python list.

Example of creating a DataFrame:

First of all, we have to import pandas:

import pandas as pd
from pandas import DataFrame

Creating a DataFrame with some example values:

df = DataFrame({'ImportantValues':[5,6,7],'EvenMoreImportantValues':[1,2,3]})

An index (starting from 0) is generated automatically and shown on the left side. With print(df) we can see the output of our DataFrame:

   ImportantValues  EvenMoreImportantValues
0                5                        1
1                6                        2
2                7                        3
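A Series, by contrast, can be created straight from a plain list; pandas again generates the integer index automatically. A minimal example:

s = pd.Series([5, 6, 7])
print(s)
# 0    5
# 1    6
# 2    7
# dtype: int64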

It's good to know how the basics work; however, for real projects we won't be entering the data by hand, but rather pulling it in from external sources. To import a CSV (comma-separated values) file, we use the function pd.read_csv(). For this article, I will use the supermarket sales dataset from Kaggle as an example, renamed to the shorter filename "super.csv".

supermarkt_data = pd.read_csv("../path_to_file/super.csv")
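pd.read_csv() accepts many optional parameters. For a quick first look at a very large file, for example, the nrows parameter limits how many rows are read (the path below is the same placeholder as above):

sample = pd.read_csv("../path_to_file/super.csv", nrows=1000)  # read only the first 1000 rows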

Handling: We have imported tables. What's next?
Since the datasets we work with are very large, printing them as a whole would be overwhelming. To work with large datasets as efficiently as possible, we use functions that give us an overview of the data and then go into more detail on certain areas. For a quick first look at the data, we can use the following function, which returns the first 5 rows:

print(supermarkt_data.head())
[Output: the first 5 rows of the supermarket dataset]

To read a single value from a table (here, the first value of one column), you can index the column and then the row:

your_dataframe['TheColumnYouLikeToSee'][0]
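Chained indexing like this is fine for reading a value, but pandas generally recommends a single .loc call instead, which also avoids SettingWithCopyWarning when you later assign values:

your_dataframe.loc[0, 'TheColumnYouLikeToSee']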

Useful functions:

If we want to combine several tables into a single one (e.g. sales figures for each year, stored in separate CSV files, into one total table), we can solve this with a simple loop:

all_data = pd.DataFrame()                 # start with an empty DataFrame
for file in files:                        # files: a list of the CSV filenames
    df = pd.read_csv('./PATH/' + file)    # './PATH/' is a placeholder directory
    all_data = pd.concat([all_data, df])  # append each file to the total table
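A compact alternative (a sketch that assumes all yearly files sit in the same placeholder folder ./PATH/) collects the filenames with the standard-library glob module and concatenates everything in one call:

import glob

files = glob.glob('./PATH/*.csv')  # every CSV file in the folder
all_data = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)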

Now to save this new table as a new CSV file, we simply run this line:

all_data.to_csv("all_data.csv", index=False)

The argument index=False (note the capital F; Python is case-sensitive) disables the automatic left-hand column, which only contains the row enumeration.
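To double-check the export, the new file can simply be read back in:

pd.read_csv("all_data.csv").head()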

Data Cleaning
One of the most important tasks in data science projects is data cleaning. It can be a big part of the project and is essential for a good result: the result can only be as good as your input. It is not uncommon for datasets to be re-cleaned several times during a project, because analysis results often reveal further errors in the data. A machine learning algorithm, for example, can only be as good as the data fed to it. The first thing to check when cleaning data: are there any blank or NaN (Not a Number) entries? For this, we use dataframe.isna():

NaN_df = your_data[your_data.isna().any(axis=1)]
# check the result by running the following line:
NaN_df.head()
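It is also useful to see how many missing values each column contains:

your_data.isna().sum()  # number of NaN entries per column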

So if we have NaN values, the easiest way to get rid of them is the following:

our_data = our_data.dropna(how='all')  # drops only rows in which every value is NaN

Sidenote on dropna's arguments:

axis=0 (the default) drops rows that contain missing values.

axis=1 drops columns that contain missing values.

how='any' (the default) drops as soon as a single value is missing; how='all' drops only rows or columns in which every value is missing.
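Dropping is not the only option: depending on the task it can be better to fill the gaps instead, for example with a neutral default value (here 0; pick whatever fits your data):

our_data = our_data.fillna(0)  # replace every NaN with 0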

More information about DataFrame.isna(), DataFrame.dropna(), and DataFrame.fillna() can be found in the pandas documentation.

Analysis: Extract Information from Large Datasets
Since large datasets are hard to read and analyze at a glance, we use functions that summarize information for us. The classic example is the total revenue in a month of sales data. To get figures on sales, losses, and developments, we can group the DataFrame by a categorical column (e.g. the city or the date) and take the sum of each group:

supermarkt_data.groupby('City')['gross income'].sum()

This gives us the total gross income per city, an overview from which we may already be able to identify and graph initial trends, spot outliers, and draw interpretations.
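The grouped result is a Series, so it can be sorted directly to rank the groups:

totals = supermarkt_data.groupby('City')['gross income'].sum()
print(totals.sort_values(ascending=False))  # cities ranked by total gross income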

Filtering Data

It often happens that you need to filter information: In our case, for example, it could be the supermarket location (city), sales figures of a certain product, or time of purchase. As an example, I show how to filter the DataFrame by cities. For this, I have chosen the city ‘Yangon’.

To create a new DataFrame that contains only the information relevant to us, we can use the DataFrame's .loc indexer.

new_df = our_data.loc[(our_data['City'] == 'Yangon')]

When we print the new DataFrame, we can confirm that the filtering worked:

[Output: the filtered DataFrame, containing only rows where City is 'Yangon']

.loc is not limited to one condition: several conditions can be combined with & (and) and | (or), and numeric limits such as price > X can be set individually. As an example, here is a filter for the city Yangon, a gross income above 10, and a product rating above 9:

new_df = our_data.loc[(our_data['City'] == 'Yangon') & (our_data['gross income'] > 10) & (our_data['Rating'] > 9)]
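An equivalent and often more readable way to express the same filter is DataFrame.query(); note that column names containing spaces, like gross income, must be wrapped in backticks:

new_df = our_data.query("City == 'Yangon' and `gross income` > 10 and Rating > 9")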

With the functions presented here, you can import CSV files, inspect them, read specific values, merge multiple tables and export the result, find and remove empty entries to clean your dataset, compute group totals automatically, and create new DataFrames with specific filter criteria. With these tools you can read, interpret, and query large datasets to extract meaningful information.

As always: Thanks for reading!

Alexander Popov

Data Scientist with a penchant for solving problems with few lines of code and computing power. Business, forecasting, automation, machine learning.