Python Pandas

Load data from a CSV file, specifying dtypes for certain fields

NeilZ
3 min read · Mar 14, 2018
import pandas as pd

traine = pd.read_csv('./train.csv',
                     dtype={'Id': 'str', 'keyword': 'str'},
                     parse_dates=['Date'])

Select a column or columns in a dataframe

For pandas objects (Series, DataFrame), the indexing operator [] only accepts:

1. column name or list of column names to select column(s)

2. slicing or Boolean array to select row(s), i.e. it only refers to one dimension of the dataframe.

df[[column_name(s)]]

The inner brackets form a list, and the outer brackets are the indexing operator, i.e. you must use double brackets to select two or more columns. With a single column name, one pair of brackets returns a Series, while double brackets return a one-column DataFrame.
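A minimal illustration of the single- versus double-bracket distinction (the column names here are made up):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

s = df['a']            # single brackets -> Series
sub = df[['a']]        # double brackets -> one-column DataFrame
both = df[['a', 'b']]  # list of names   -> two-column DataFrame

print(type(s).__name__)    # Series
print(type(sub).__name__)  # DataFrame
```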

Create a new column from the index of a dataframe

df['col_name'] = df.index

Replace a value in the index of a Series

"""
If a value is at the index of val_idx in the index of panda's series, replace it with new_val
"""
s = pd.Series()
idx_list = list(s.index)
idx_list[val_idx] = new_val
s.index = idx_list
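For example, replacing the label at position 1 (the Series contents here are made up):

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# Replace the label at position val_idx with new_val
val_idx, new_val = 1, 'B'
idx_list = list(s.index)
idx_list[val_idx] = new_val
s.index = idx_list

print(list(s.index))  # ['a', 'B', 'c']
```

If you know the old label rather than its position, `s.rename({'b': 'B'})` achieves the same result without rebuilding the list.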

Plot columns in a dataFrame

df.plot(x='epoch', y=['loss', 'val_loss'], style='*-')

Apply a function to each row in a dataframe, N.B. apply is SLLLLOOOW!

# 'datetime' is a column in dataframe 'traine' and it stores a datetime type
df['year'] = df.apply(lambda r: r['datetime'].year, axis=1)
# Alternatively, apply on the single column:
df['year'] = df.datetime.apply(lambda r: r.year)

# Custom function to apply, passing extra arguments via 'args'
def foo(row, col1, col2, maxval):
    res = row[col1] + row[col2] + maxval
    return res

df['newcol'] = df.apply(foo, args=(col1, col2, maxval), axis=1)
# N.B. Applying a function over the full dataframe (axis=1) is very slow;
# prefer apply on a single column, or a merge, when possible.
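When the column already holds datetimes, the vectorized `.dt` accessor avoids `apply` entirely and is much faster; a sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'datetime': pd.to_datetime(['2018-03-14', '2017-01-01'])})

# Vectorized: no Python-level loop over rows
df['year'] = df['datetime'].dt.year

print(df['year'].tolist())  # [2018, 2017]
```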

Select rows by condition

The recommended methods of indexing are:
  • .loc if you want to index by label
  • .iloc if you want to index by position.
# 1. Select rows with equal condition, use '==' or '!='
df.loc[df['adContent']=='Logos', ['price', 'date']]
# 2. Select rows whose column value is in an iterable, use isin:
df.loc[df['adContent'].isin(['Swag','Merch']), ['price', 'date']]
# Negate a condition with '~'
# 3. Inequality conditions: use '>' or '<'. Returns a DataFrame or Series.
#    Combine multiple conditions with '&' and '|'; you MUST wrap each
#    condition in '()'.
df.loc[(df['price']>0)&(df['val']>5), ['keyword', 'name']]
# Alternatively:
df[df['price']>0][df['name']!='0'][['keyword', 'name']]
df[(df['price']>0)&(df['value']>2)][['keyword', 'name']]
This is equivalent to:
a = df['price'] > 0
b = df['value'] > 2
c = a&b
df.loc[c, ['keyword', 'name']]
# 4. Check for None/NaN: use isnull() instead of '== None' or 'is None'
df.loc[(df['is_test']==True) & (~df['value'].isnull()), :]
# 5. Select a row in a Series by label
s.loc[index_label]
# 6. Select multiple rows in data frame
df.loc[[label1, label2,...,labelN]]
or use slice
df.loc[label1:labelN:step]
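A runnable sketch of the combined-condition selection, with made-up data; note the parentheses around each condition when combining with `&` or `|`:

```python
import pandas as pd

df = pd.DataFrame({'price': [0, 5, 10], 'value': [1, 3, 7],
                   'keyword': ['x', 'y', 'z'], 'name': ['a', 'b', 'c']})

# Keep rows where both conditions hold, and only two columns
sel = df.loc[(df['price'] > 0) & (df['value'] > 2), ['keyword', 'name']]
print(sel['keyword'].tolist())  # ['y', 'z']
```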

Set cell values in data frame with conditions in both row and column

>>> df
   a  b  c
0  1  4  7
1  2  5  8
2  3  6  9
>>> df.iloc[0, 2]
7
>>> df.iloc[df.index[df.iloc[:, 0] == 2], [1, 2]] = [11, 12]
>>> df
   a   b   c
0  1   4   7
1  2  11  12
2  3   6   9
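The same update can be written with `.loc` and a boolean mask, which is usually clearer than nesting `iloc` and `index`:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

# Set b and c on the rows where a == 2
df.loc[df['a'] == 2, ['b', 'c']] = [11, 12]

print(df.loc[1, 'b'], df.loc[1, 'c'])  # 11 12
```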

Group rows in dataframe and aggregate

# 1. Compute the sum of column 'value' for each 'id', and return the maximum sum among all ids.
maxv = df.groupby('id').sum().sort_values('value', ascending=False)['value'].iloc[0]
Alternatively, to get the per-id sum column:
dfg = df.groupby('id').agg(['sum'])['value']['sum']
# 2. Compute the cumulative sum within a group along a column
df = pd.DataFrame({'name': ['j', 'j', 'j', 'm', 'm'],
                   'day': ['M', 'Tu', 'We', 'M', 'We'],
                   'v': [10, 10, 20, 15, 25]})
df['cum'] = df.groupby(['name'])['v'].cumsum()
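A runnable sketch of #1 with made-up data; `groupby('id').sum()` keeps one row per id, and sorting by the summed column puts the maximum first:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2, 2], 'value': [3, 4, 1, 2, 2]})

# Sum 'value' per id, then take the largest per-id total
maxv = df.groupby('id').sum().sort_values('value', ascending=False)['value'].iloc[0]
print(maxv)  # 7
```

Equivalently, `df.groupby('id')['value'].sum().max()` returns the same number without the sort.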

Drop column in dataframe

df = df.drop(columns=['shop_id', 'item_id'])

Useful Functions for Data Exploration

df.dtypes
df.info()
df.x.value_counts(dropna=False)
df.x.isnull()
pd.set_option('display.max_columns', 100)
df.head().T #transpose head() output
df.nunique(dropna=False).sort_values(ascending=True) #count number of unique values in each column
df[columns].drop_duplicates()
df.describe()
pd.plotting.scatter_matrix(df)
plt.matshow(df.corr())  # visualize the correlation matrix
df.select_dtypes(include=['object'])
