One-stop Guide to Data Manipulation in Python
This article is a one-stop reference for readers looking for common data-manipulation code and syntax in Python. It uses the 'PANDAS' library throughout. Pandas is a popular Python data-analysis tool that provides easy-to-use, highly efficient data structures for numeric or labelled data stored in the form of tables.
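As a quick warm-up before the dataset-specific examples below, here is a minimal sketch of creating a pandas DataFrame by hand (the column names mirror the housing dataset used later, but the values are illustrative only):

```python
import pandas as pd

# A tiny labelled table, analogous in shape to the housing data below
df = pd.DataFrame({
    'per_name': ['1996Q1', '1996Q2'],
    'val': [140.0, 145.5],
})
print(df.shape)  # (2, 2)
```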
Topics covered in this article:
1. Dataset
U.S. Home Sales, 1963–2016: National and regional data on the number of new single-family houses sold and for sale. Also provides national data on median and average prices, the number of houses sold and for sale by stage of construction, and other statistics. From the U.S. Census, Economic Time Series Indicators. Data provided by Enigma.io from the U.S. Department of Commerce.
Dataset has 20,404 Rows and 20 Columns.
2. Get your data into a DataFrame
Load a DataFrame from a CSV file
df = pd.read_csv('file.csv')
df = pd.read_csv('file.csv', header=0, index_col=0, quotechar='"',
                 sep=':', na_values=['na', '-', '.', ''])
3. Saving a DataFrame
Saving a DataFrame to a CSV file
df.to_csv('name.csv', encoding='utf-8')
Saving a DataFrame to a Python dictionary
dictionary = df.to_dict()
Saving a DataFrame to a Python string
string = df.to_string()
4. Working with the whole DataFrame
Peek at the DataFrame contents
df.info()
Summary of column statistics
df.describe()
Head
df.head(n)
Tail
df.tail(n)
Shape (row-count, column-count)
df.shape
5. Working with Rows — Part 1
Keeping rows
df = df[df['cat_code'] == 'SOLD']
Dropping rows
df = df[(df['dt_code'] != 'MEDIAN') & (df['dt_code'] != 'NOTSTD') & (df['dt_code'] != 'TOTAL')]
Dropping null values (NaN)
df = df[pd.notnull(df['dt_code'])]
6. Working with Columns — Part 1
Dropping columns
df = df.drop(['et_idx','et_code','et_desc','et_unit'], axis=1)
Keeping columns
df = df[['per_name','dt_code','dt_desc','dt_unit','val']]
Adding new columns to a DataFrame
df['dt_year'] = df['per_name'].str[0:4]
7. User defined functions
Creating a new column using functions
def price(x):
    # Values reported in thousands ("K") are converted to units
    if x.dt_unit == "K":
        return x.val * 1000
    return x.val

df['dt_val'] = df[['dt_unit', 'val']].apply(price, axis=1)
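Row-wise `apply` is flexible but slow on large frames. The same transformation can be expressed in vectorized form with `np.where`; a sketch assuming the same `dt_unit`/`val` column names (the sample values here are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'dt_unit': ['K', '', 'K'],
                   'val': [140.0, 5000.0, 2.5]})

# Vectorized equivalent of the row-wise price() function:
# multiply by 1000 wherever the unit is thousands ("K")
df['dt_val'] = np.where(df['dt_unit'] == 'K', df['val'] * 1000, df['val'])
print(df['dt_val'].tolist())  # [140000.0, 5000.0, 2500.0]
```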
8. Cleaning dataset
Creating three datasets
df_UNDERC = df[df['dt_code'] == 'UNDERC']
df_COMPED = df[df['dt_code'] == 'COMPED']
df_AVERAG = df[df['dt_code'] == 'AVERAG']
Keeping columns
df_UNDERC = df_UNDERC[['per_name', 'dt_year', 'dt_val']]
df_COMPED = df_COMPED[['per_name', 'dt_year', 'dt_val']]
df_AVERAG = df_AVERAG[['per_name', 'dt_year', 'dt_val']]
Renaming columns
df_UNDERC = df_UNDERC.rename(columns={'dt_val': 'UNDERC'})
df_COMPED = df_COMPED.rename(columns={'dt_val': 'COMPED'})
df_AVERAG = df_AVERAG.rename(columns={'dt_val': 'AVERAG'})
9. Joining/Combining DataFrames and Groupby
Merge on columns
df_new = pd.merge(left=df_AVERAG, right=df_COMPED, how='left',
                  left_on='per_name', right_on='per_name')
df_new = pd.merge(left=df_new, right=df_UNDERC, how='left',
                  left_on='per_name', right_on='per_name')
Grouping — Applying an aggregating function
gb = df_new.groupby('dt_year')[['AVERAG', 'COMPED', 'UNDERC']].mean()
10. Graphs and Statistics
Single line graph
gb['AVERAG'].plot.line(color='green')
Multiple lines graph
gb[['COMPED','UNDERC']].plot.line()
Correlation
gb.corr()
Quantiles
quants = [0.05, 0.25, 0.5, 0.75, 0.95]
df_new = gb.quantile(quants)
Histogram
import matplotlib.pyplot as plt
count, bins = np.histogram(gb['AVERAG'], bins=5)
plt.hist(gb['AVERAG'], bins=5, color='gray', edgecolor='white')
11. Working with Rows — Part 2
Sorting DataFrame rows values
gb = gb.sort_values(by='AVERAG',ascending=False)
Sort DataFrame by its row index
gb = gb.sort_index()
Select a slice of rows by integer position
gb = gb[32:]
12. Working with Columns — Part 2
Find index label for min/max values in column
print(gb['AVERAG'].idxmin())
print(gb['AVERAG'].idxmax())
Maths on the whole DataFrame
print(gb.count())
print(gb.mean())
Common column-wide methods
print(gb['AVERAG'].count())
print(gb['AVERAG'].mean())
print(gb['AVERAG'].describe())
Common element-wise methods
s = gb['AVERAG'].fillna(0)
s = df_new['AVERAG'].cumsum()
Apply numpy mathematical functions to columns
df_new['log_data'] = np.log(df_new['AVERAG'])
df_new['rounded'] = np.round(df_new['log_data'], 2)
13. Working with cells
Selecting a cell by row and column labels
print('AVG price in 1996: ', round(gb.loc['1996', 'AVERAG'], 2))
print('AVG price in 2006: ', round(gb.loc['2006', 'AVERAG'], 2))
print('AVG price in 2016: ', round(gb.loc['2016', 'AVERAG'], 2))
Setting a cell by row and column labels
gb.loc['1976','AVERAG'] = -1000
Selecting and slicing on labels
df_new = gb.loc['2007':'2016', 'AVERAG':'UNDERC']
Selecting row
print(gb.loc['2016'])
Selecting column
print(gb['AVERAG'])
14. Working with missing values and strings
Drop all rows with NaN
df = df.dropna()
Drop all columns with NaN
df = df.dropna(axis=1)
Drop all rows with fewer than two non-NaN values (`thresh` is the minimum number of non-null values a row must have to be kept)
df = df.dropna(thresh=2)
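The `thresh` argument counts non-null values per row, which is easy to misread; a small sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, np.nan],
                   'b': [2.0, 2.0, np.nan],
                   'c': [3.0, np.nan, np.nan]})

# thresh=2 keeps only rows with at least 2 non-NaN values,
# so only the first row (3 non-NaN values) survives here
kept = df.dropna(thresh=2)
print(len(kept))  # 1
```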
Drop all rows where NaN appears in a specific column
df = df[df['AVERAG'].notnull()]
Recoding all missing data
df.fillna(0, inplace=True)
Recoding missing data in a specific column
s = df['AVERAG'].fillna(0)
Working with strings
s = df['col'].str.lower()
s = df['col'].str.upper()
s = df['col'].str.len()
s = df['col'].str.replace('old', 'new')
15. Pivot Tables
Pivot
df_pt = pd.pivot_table(df, index='dt_year', columns='dt_code', values='dt_val', aggfunc='mean')
Head
df_pt.head()
Tail
df_pt.tail()