Pandas 10 minute guide

Udbhav Pangotra
Published in Geek Culture
3 min read · Jan 17, 2022

This is a basic guide to getting started with pandas.

Photo by Bench Accounting on Unsplash

What is it?

pandas is an open source Python library for data analysis. Python has always been great for prepping and munging data, but before pandas it was less convenient for analysis: you would usually end up using R, or loading the data into a database and using SQL (or worse, Excel). pandas makes Python great for analysis.

import pandas as pd

Loading in Data

The first step in any ML problem is identifying what format your data is in, and then loading it into whatever framework you’re using. For Kaggle competitions, a lot of data can be found in CSV files, so that’s the example we’re going to use.

df = pd.read_csv('input.csv')
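If you don't have an input.csv at hand, here is a minimal sketch of the same call that you can run as-is. The CSV text and its column names (name, age, city) are made up for illustration; read_csv accepts a file-like object as well as a path, so an in-memory string stands in for the file.

```python
import io

import pandas as pd

# Hypothetical CSV contents, standing in for a real 'input.csv' on disk.
csv_text = """name,age,city
Ana,34,Lisbon
Ben,28,Oslo
Cy,41,Lima
"""

# read_csv takes a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df)
print(df.dtypes)  # the column types were inferred from the text
```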

The Basics

Now that we have our DataFrame in the variable df, let’s look at what it contains. We can use the head() method to see the first few rows of the DataFrame (five by default), or the tail() method to see the last few rows.

df.head()
df.tail()
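Both methods also take an optional row count. A small sketch on a made-up ten-row DataFrame (the column name n is just for illustration):

```python
import pandas as pd

# Hypothetical data: ten rows holding the numbers 0..9.
df = pd.DataFrame({"n": range(10)})

first_five = df.head()     # no argument: the first 5 rows
first_three = df.head(3)   # the first 3 rows
last_two = df.tail(2)      # the last 2 rows

print(first_three)
print(last_two)
```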

We can see the dimensions of the DataFrame using the shape attribute, which is a (rows, columns) tuple.

df.shape

To get a better idea of the data we are dealing with, we can call the describe() function to see statistics such as the count, mean, min, and max for each numeric column of the dataset.

df.describe()
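By default describe() only summarises numeric columns. The sketch below, on a made-up two-column DataFrame, shows that default next to include="all", which also covers non-numeric columns:

```python
import pandas as pd

# Hypothetical data: one numeric column and one text column.
df = pd.DataFrame({
    "price": [10.0, 20.0, 30.0],
    "label": ["a", "b", "a"],
})

numeric_summary = df.describe()             # numeric columns only
full_summary = df.describe(include="all")   # every column

print(numeric_summary)
print(full_summary)
```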

One of the most useful functions that you can call on certain columns in a dataframe is the value_counts() function. It shows how many times each item appears in the column.

df['Column'].value_counts()
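As a quick illustration (the color column is invented for this example), value_counts() sorts the result from most to least frequent, and normalize=True returns proportions instead of raw counts:

```python
import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame({"color": ["red", "blue", "red", "green", "red", "blue"]})

counts = df["color"].value_counts()                 # most frequent first
shares = df["color"].value_counts(normalize=True)   # fractions that sum to 1

print(counts)
print(shares)
```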

To list all the column names, use the columns attribute:

df.columns

Sorting

Let’s say that we want to sort the DataFrame in increasing order by the values of one column. sort_values() returns a new, sorted DataFrame and leaves df itself unchanged:

df.sort_values('Column')
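A short sketch of two common variations, using a made-up team/score table: sorting in decreasing order with ascending=False, and sorting by more than one column.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "team": ["b", "a", "b", "a"],
    "score": [3, 9, 7, 1],
})

# Highest score first.
by_score_desc = df.sort_values("score", ascending=False)

# Team A-Z, and within each team the highest score first.
by_team_then_score = df.sort_values(["team", "score"], ascending=[True, False])

print(by_score_desc)
print(by_team_then_score)
```

Neither call modifies df; assign the result to a variable if you want to keep the sorted version.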

Dataframe Iteration

To iterate through the rows of a DataFrame, we can use the iterrows() function. It yields (index, row) pairs, where each row is a Series object. The example below prints the first two rows, assuming the default integer index:

for index, row in df.iterrows():
    print(row)
    if index == 1:  # stop after the first two rows (default 0, 1, 2, ... index)
        break
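Row-by-row loops tend to be slow on large DataFrames, so it is worth knowing two alternatives. This is only a sketch on made-up price and qty columns: itertuples() is usually a faster way to loop, and plain column arithmetic often removes the loop altogether.

```python
import pandas as pd

# Hypothetical order lines.
df = pd.DataFrame({"price": [2.0, 3.0, 4.0], "qty": [1, 2, 3]})

# Option 1: loop with itertuples(); each row is a namedtuple.
totals_loop = []
for row in df.itertuples(index=False):
    totals_loop.append(row.price * row.qty)

# Option 2: vectorised arithmetic on whole columns, no explicit loop.
df["total"] = df["price"] * df["qty"]

print(totals_loop)
print(df)
```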

Data Cleaning

A big part of doing well is data cleaning. Often the data you have will contain many missing values, which you have to identify. The isnull() function below flags every missing value in the DataFrame, and sum() then adds up the total for each column.

df.isnull().sum()
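Once you know where the gaps are, two common next steps are dropping the affected rows or filling the gaps in. The sketch below uses a made-up DataFrame; filling with the column mean and the placeholder "unknown" are illustrative choices, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": ["x", "y", None],
})

missing = df.isnull().sum()   # missing values per column

dropped = df.dropna()         # keep only rows with no missing values

# Fill each column with its own replacement value.
filled = df.fillna({"a": df["a"].mean(), "b": "unknown"})

print(missing)
print(dropped)
print(filled)
```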

Other Useful Functions

  • drop(): Removes the rows or columns that you pass in (you also have to specify the axis).
  • agg(): The aggregate function lets you compute summary statistics, for example about each group after a groupby().
  • apply(): Lets you apply a function along the rows or columns of a DataFrame, or to each element of a Series.
  • get_dummies(): Helpful for turning categorical data into one-hot vectors.
  • drop_duplicates(): Lets you remove duplicate rows.
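Here is a rough sketch that touches each of these functions once. The table and its column names are invented for the example, and groupby() is brought in only so that agg() has groups to summarise.

```python
import pandas as pd

# Hypothetical sales table; the last two rows are identical.
df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Oslo", "Oslo"],
    "sales": [10, 20, 30, 30],
    "note": ["x", "y", "z", "z"],
})

deduped = df.drop_duplicates()            # drops the repeated last row
no_note = deduped.drop("note", axis=1)    # axis=1 means "drop a column"

# Summary statistics per city.
per_city = no_note.groupby("city")["sales"].agg(["sum", "mean"])

# Apply a function to every element of a Series.
doubled = no_note["sales"].apply(lambda s: s * 2)

# One indicator column per distinct city.
dummies = pd.get_dummies(no_note["city"])

print(per_city)
print(doubled)
print(dummies)
```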

This should serve as a very basic start to your pandas journey! Cheers!

Do reach out and comment if you get stuck!
