The Power of DataFrames in Pandas: A Beginner’s Guide

Pratik Barjatiya
Data And Beyond
Published in
4 min readAug 13, 2019
The Power of PANDAS

Pandas — DataFrame is two-dimensional (2-D) data structure defined in pandas which consists of rows and columns. Each column in a DataFrame is a Series object, rows consist of elements inside Series.

Library Features

  • A fast and efficient DataFrame object for data manipulation with integrated indexing.
  • Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format
  • Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form.
  • Flexible reshaping and pivoting of data sets;
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets.
  • Columns can be inserted and deleted from data structures for size mutability.
  • Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets.
  • High performance merging and joining of data sets.
  • Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure.
  • Time series-functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data.
  • Highly optimized for performance, with critical code paths written in Cython or C.
  • Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.

Import Libraries:

import pandas as pd
import numpy as np

To know the pandas version you’re using:

pd.__version__

If you also need to know the versions of pandas’ dependencies, you can use the show_versions() function:

pd.show_versions()

Data frame Creation:

df = pd.DataFrame({'col one':[100, 200], 'col two':[300, 400]})

Reading Data from CSV :

df = pd.read_csv(‘data/file.csv’)

head() and tail():

head() — Return the first 5 rows by default.
head(2) — Return the first 2 rows as specified.
tail() — Return the last 5 rows by default.
tail(2) — Return the last 2 rows as specified.

shape:

Return the dimensions of the DataFrame.

info():

Print a concise summary of a DataFrame.
This method prints information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage.

describe():

Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

value_counts():

Return a Series containing counts of unique values.

unique() and nunique():

unique() — Return unique values of column or series.
nunique() — Return the number of unique elements in the column or series.

sort_values():

Series.sort_values() — Sort by the values from series object.
DataFrame.sort_values() — Sort by the values from dataframe along with columns.

Sorting on Series.

dfSortSeries = df.sort_values()

Sorting on DataFrame.

dfSortDataFrame = df.sort_values(by=[‘col1’,’col2'])[[‘col1’,’col2’,’col3’]]

Boolean indexing:

In Boolean Indexing, Boolean Vectors can be used to filter the data. Multiple conditions can be grouped in brackets.

dfBooleanIndex.col = "value"

Replacing Null value with specific value:

# check null value 
df.isnull()
# check any null value present. Return True if null value found
df.isnull().values().any()
# Return the count of null values
df.isnull().sum()
# Return the null records
df[df.isnull()]
# Replace NaN value with 0
df.replace(np.nan,0,inplace=True)
# Fill null value with 0
df.fillna(0,inplace=True)
# Replace null value for a column with mean value
df['column_name'].fillna(df['column_name'].mean(),inplace=True)

Find Duplicate Rows based on selected columns or all columns:

# Select duplicate rows except last occurrence
dfDuplicates = df[df.duplicated(keep='first')]
# Select duplicate rows except first occurrence
dfDuplicates = df[df.duplicated(keep='last')]
# Select all duplicate rows based on one column
dfDuplicates = df[df.duplicated(['column_name'])]
# Select all duplicate rows based on two column
dfDuplicates = df[df.duplicated(['column_1','column_2'])]

String handling:

# Select all rows where the column starts with the string
df[df['column_name']].str.match('foo')
# Select all rows where the column contain a string
df[df['column_name']].str.contains('bar')
# Converting a column data type to string
df['column_name'] = df['column_name'].astype(str)
# Creating a new column, based on a column with last 4 characters
df['new_column'] = df['column_name'].str[-4:]

Applying a function on DataFrame:

# Apply a function on the column of the DataFrame
def multiply2(x):
num = x
num *= 2
return num
df[df['column_name'] < 0 ].column_name.apply(multiply2)# Apply anonymous function
df[df['column_name'] < 0 ].column_name.apply(lambda x: x * 4)

Indexing:

Indexing help us to select particular rows and columns of data from a DataFrame.

# An index is the row label in the DataFrame.
type(df.index)
# pandas.core.indexes.range.RangeIndex
df.index
# RangeIndex(start=RowNum, stop=RowNum, step=StepSize)
# Get the index key
df.index[20]
Note: We cannot assign value in index

set_index() and reset_index():

set_index(): Set the DataFrame index using existing columns.
reset_index() : Reset the index of the DataFrame, and use the default one instead.

df.set_index('Column_Name',inplace=True)df.reset_index(inplace=True)

sort_index():

Sorting DataFrame object by index column.

df.sort_index(inplace=True,ascending=False)

loc and iloc:

Use .loc[] to choose rows and columns by label.

# if column is set as index 
df.loc['foo'].sample(5)
# if the column is non-index
df.loc[column.name='foo'].sample(5)

Use .iloc[] to choose rows and columns by position.
Purely integer location based indexing for selection by position.

# fetch first 5 rows 
df.iloc[:5]
# fetch rows with index value 200, 300
df.iloc[[200,300]]
# We can play with index position using Boolean indexing
df.iloc[[True,False,True]]
# Get top6 even numbered positional index
df.iloc[lambda x: x.index % 2 == 0].head(6)

--

--

Pratik Barjatiya
Data And Beyond

Data Engineer | Big Data Analytics | Data Science Practitioner | MLE | Disciplined Investor | Fitness & Traveller