One-stop Guide to Data Manipulation in Python
This article is a one-stop reference for readers looking for common data-manipulation code and syntax in Python. It uses the 'PANDAS' library throughout. Pandas is a popular Python data-analysis tool that provides easy-to-use, highly efficient data structures for numeric or labelled data stored in the form of tables.
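As a quick warm-up before the dataset-specific examples below, here is a minimal sketch of creating a pandas DataFrame by hand (the column names mirror the housing dataset used later, but the values are illustrative only):

```python
import pandas as pd

# A tiny labelled table, analogous in shape to the housing data below
df = pd.DataFrame({
    'per_name': ['1996Q1', '1996Q2'],
    'val': [140.0, 145.5],
})
print(df.shape)  # (2, 2)
```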
Topics covered in this article:
1. Dataset
U.S. Home Sales, 1963–2016: National and regional data on the number of new single-family houses sold and for sale. Also provides national data on median and average prices, the number of houses sold and for sale by stage of construction, and other statistics. From the U.S. Census, Economic Time Series Indicators. Data provided by Enigma.io from the U.S. Department of Commerce.
Dataset has 20,404 Rows and 20 Columns.
2. Get your data into a DataFrame
Load a DataFrame from a CSV file
df = pd.read_csv('file.csv')
df = pd.read_csv('file.csv', header=0, index_col=0, quotechar='"',
                 sep=':', na_values=['na', '-', '.', ''])
3. Saving a DataFrame
Saving a DataFrame to a CSV file
df.to_csv('name.csv', encoding='utf-8')
Saving a DataFrame to a Python dictionary
dictionary = df.to_dict()
Saving a DataFrame to a Python string
string = df.to_string()
4. Working with the whole DataFrame
Peek at the DataFrame contents
df.info()
Summary of column statistics
df.describe()
Head
df.head(n)
Tail
df.tail(n)
Shape (row-count, column-count)
df.shape
5. Working with Rows — Part 1
Keeping rows
df = df[df['cat_code'] == 'SOLD']
Dropping rows
df = df[(df['dt_code'] != 'MEDIAN') & (df['dt_code'] != 'NOTSTD') & (df['dt_code'] != 'TOTAL')]
Dropping null values (NaN)
df = df[pd.notnull(df['dt_code'])]
6. Working with Columns — Part 1
Dropping columns
df = df.drop(['et_idx','et_code','et_desc','et_unit'], axis=1)
Keeping columns
df = df[['per_name','dt_code','dt_desc','dt_unit','val']]
Adding new columns to a DataFrame
df['dt_year'] = df['per_name'].str[0:4]
7. User defined functions
Creating a new column using functions
def price(x):
    # Values reported in thousands ("K") are converted to units
    if x.dt_unit == "K":
        return x.val * 1000
    return x.val

df['dt_val'] = df[['dt_unit', 'val']].apply(price, axis=1)
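Row-wise `apply` is flexible but slow on large frames. The same transformation can be expressed in vectorized form with `np.where`; a sketch assuming the same `dt_unit`/`val` column names (the sample values here are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'dt_unit': ['K', '', 'K'],
                   'val': [140.0, 5000.0, 2.5]})

# Vectorized equivalent of the row-wise price() function:
# multiply by 1000 wherever the unit is thousands ("K")
df['dt_val'] = np.where(df['dt_unit'] == 'K', df['val'] * 1000, df['val'])
print(df['dt_val'].tolist())  # [140000.0, 5000.0, 2500.0]
```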
8. Cleaning dataset
Creating three datasets
df_UNDERC = df[df['dt_code'] == 'UNDERC']
df_COMPED = df[df['dt_code'] == 'COMPED']
df_AVERAG = df[df['dt_code'] == 'AVERAG']
Keeping columns
df_UNDERC = df_UNDERC[['per_name', 'dt_year', 'dt_val']]
df_COMPED = df_COMPED[['per_name', 'dt_year', 'dt_val']]
df_AVERAG = df_AVERAG[['per_name', 'dt_year', 'dt_val']]
Renaming columns
df_UNDERC = df_UNDERC.rename(columns={'dt_val': 'UNDERC'})
df_COMPED = df_COMPED.rename(columns={'dt_val': 'COMPED'})
df_AVERAG = df_AVERAG.rename(columns={'dt_val': 'AVERAG'})
9. Joining/Combining DataFrames and Groupby
Merge on columns
df_new = pd.merge(left=df_AVERAG, right=df_COMPED, how='left',
                  left_on='per_name', right_on='per_name')
df_new = pd.merge(left=df_new, right=df_UNDERC, how='left',
                  left_on='per_name', right_on='per_name')
Grouping — Applying an aggregating function
gb = df_new.groupby('dt_year')[['AVERAG', 'COMPED', 'UNDERC']].mean()
10. Graphs and Statistics
Single line graph
gb['AVERAG'].plot.line(color='green')
Multiple lines graph
gb[['COMPED','UNDERC']].plot.line()
Correlation
gb.corr()
Quantiles
quants = [0.05, 0.25, 0.5, 0.75, 0.95]
df_new = gb.quantile(quants)
Histogram
import matplotlib.pyplot as plt
count, bins = np.histogram(gb['AVERAG'], bins=5)
plt.hist(gb['AVERAG'], bins=5, color='gray', edgecolor='white')
11. Working with Rows — Part 2
Sorting DataFrame rows values
gb = gb.sort_values(by='AVERAG',ascending=False)
Sort DataFrame by its row index
gb = gb.sort_index()
Select a slice of rows by integer position
gb = gb[32:]
12. Working with Columns — Part 2
Find index label for min/max values in column
print(gb['AVERAG'].idxmin())
print(gb['AVERAG'].idxmax())
Maths on the whole DataFrame
print(gb.count())
print(gb.mean())
Common column-wide methods
print(gb['AVERAG'].count())
print(gb['AVERAG'].mean())
print(gb['AVERAG'].describe())
Common element-wise methods
s = gb['AVERAG'].fillna(0)
s = df_new['AVERAG'].cumsum()
Apply numpy mathematical functions to columns
df_new['log_data'] = np.log(df_new['AVERAG'])
df_new['rounded'] = np.round(df_new['log_data'], 2)
13. Working with cells
Selecting a cell by row and column labels
print('AVG price in 1996: ', round(gb.loc['1996', 'AVERAG'], 2))
print('AVG price in 2006: ', round(gb.loc['2006', 'AVERAG'], 2))
print('AVG price in 2016: ', round(gb.loc['2016', 'AVERAG'], 2))
Setting a cell by row and column labels
gb.loc['1976','AVERAG'] = -1000
Selecting and slicing on labels
df_new = gb.loc['2007':'2016', 'AVERAG':'UNDERC']
Selecting row
print(gb.loc['2016'])
Selecting column
print(gb['AVERAG'])
14. Working with missing values and strings
Drop all rows with NaN
df = df.dropna()
Drop all columns with NaN
df = df.dropna(axis=1)
Drop all rows with fewer than two non-NaN values (`thresh` is the minimum number of non-null values a row must have to be kept)
df = df.dropna(thresh=2)
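The `thresh` argument counts non-null values per row, which is easy to misread; a small sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, np.nan],
                   'b': [2.0, 2.0, np.nan],
                   'c': [3.0, np.nan, np.nan]})

# thresh=2 keeps only rows with at least 2 non-NaN values,
# so only the first row (3 non-NaN values) survives here
kept = df.dropna(thresh=2)
print(len(kept))  # 1
```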
Drop all rows where NaN appears in a specific column
df = df[df['AVERAG'].notnull()]
Recoding all missing data
df.fillna(0, inplace=True)
Recoding missing data in a specific column
s = df['AVERAG'].fillna(0)
Working with strings
s = df['col'].str.lower()
s = df['col'].str.upper()
s = df['col'].str.len()
s = df['col'].str.replace('old', 'new')
15. Pivot Tables
Pivot
df_pt = pd.pivot_table(df, index='dt_year', columns='dt_code', values='dt_val', aggfunc='mean')
Head
df_pt.head()
Tail
df_pt.tail()