Python Pandas
import pandas as pd

traine = pd.read_csv('./train.csv',
                     dtype={'Id': 'str', 'keyword': 'str'},
                     parse_dates=['Date'])
Select a column or columns in a dataframe
For pandas objects (Series, DataFrame), the indexing operator [] only accepts:
1. column name or list of column names to select column(s)
2. slicing or Boolean array to select row(s), i.e. it only refers to one dimension of the dataframe.
df[[column_name(s)]]
The inner brackets form a list, and the outer brackets are the indexing operator, i.e. you must use double brackets to select two or more columns. With a single column name, one pair of brackets returns a Series, while double brackets return a DataFrame.
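A minimal sketch of the Series-vs-DataFrame distinction (the column names 'a' and 'b' are illustrative assumptions, not from the notes above):

```python
import pandas as pd

# toy dataframe; column names are assumptions for illustration
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

s = df['a']            # single brackets with one name -> Series
sub = df[['a']]        # double brackets (a one-element list) -> DataFrame
both = df[['a', 'b']]  # two or more columns always need the list form
```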
Create a new column from the index of a dataframe
df['col_name'] = df.index
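A small sketch of the idiom above, using an assumed labeled index and column name:

```python
import pandas as pd

# toy dataframe with a labeled index (data and names are assumptions)
df = pd.DataFrame({'v': [10, 20]}, index=['x', 'y'])
df['label'] = df.index  # copy the index labels into a regular column
```

df.reset_index() achieves something similar while also replacing the index with a default integer one.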
Replace a value in the index of a Series
"""
If a value sits at position val_idx in the index of a pandas Series, replace it with new_val
"""
s = pd.Series()
idx_list = list(s.index)
idx_list[val_idx] = new_val
s.index = idx_list
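A concrete version of the recipe above, with assumed toy data (Index objects are immutable, hence the round-trip through a list):

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])  # toy series (assumed data)
val_idx, new_val = 1, 'B'  # replace the label at position 1

idx_list = list(s.index)   # Index is immutable, so edit a plain list copy
idx_list[val_idx] = new_val
s.index = idx_list         # reassign the whole index
```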
Plot columns in a dataFrame
df.plot(x='epoch', y=['loss', 'val_loss'], style='*-')
Apply a function to each row in a dataframe, N.B. apply is SLLLLOOOW!
# 'datetime' is a column in dataframe 'traine' and it stores a datetime type
df['year'] = df.apply(lambda r: r['datetime'].year, axis=1)
Alternatively
df['year'] = df.datetime.apply(lambda r: r.year)
Customize the function to apply
def foo(row, col1, col2, maxval):
    res = row[col1] + row[col2] + maxval
    return res

df['newcol'] = df.apply(foo, args=(col1, col2, maxval), axis=1)
## N.B. Using 'apply' on the full dataframe is very slow, don't do it. Try to use apply on a single column instead, or merge.
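Since apply runs a Python call per row, a vectorized accessor is usually much faster when one exists. A sketch of the year-extraction example using the .dt accessor (toy data assumed):

```python
import pandas as pd

# toy data; the column name 'datetime' mirrors the snippet above
df = pd.DataFrame({'datetime': pd.to_datetime(['2020-01-15', '2021-06-30'])})
df['year'] = df['datetime'].dt.year  # vectorized; no per-row Python call
```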
Select rows by condition
The recommended methods of indexing are:
.loc if you want to index by label
.iloc if you want to index by position
# 1. Select rows with an equality condition, use '==' or '!='
df.loc[df['adContent']=='Logos', ['price', 'date']]
# 2. Select rows whose column value is in an iterable, use isin (negate the condition with '~')
df.loc[df['adContent'].isin(['Swag','Merch']), ['price', 'date']]
# 3. Inequality condition, use '>' or '<'. Returns a DataFrame or Series.
# Combine multiple conditions with '&' and '|'; you MUST wrap each condition in '()'.
df.loc[(df['price']>0)&(df['val']>5), ['keyword', 'name']]
Alternatively:
df[df['price']>0][df['name']!='0'][['keyword', 'name']]
df[(df['price']>0)&(df['value']>2)][['keyword', 'name']]
This is equivalent to:
a = df['price'] > 0
b = df['value'] > 2
c = a & b
df.loc[c, ['keyword', 'name']]
# 4. Check for None or null, use isnull() instead of '== None' or 'is None'
df.loc[(df['is_test']==True) & (~df['value'].isnull()), :]
# 5. Select a row in a Series
s.loc[index_label]
# 6. Select multiple rows in a dataframe
df.loc[[label1, label2, ..., labelN]]
or use a slice
df.loc[label1:labelN:step]
# 7. Set cell values in a dataframe with conditions on both rows and columns
>>> df
a b c
0 1 4 7
1 2 5 8
2 3 6 9
>>> df.iloc[0,2]
7
>>> df.iloc[df.index[df.iloc[:,0]==2],[1,2]] = [11,12]
>>> df
a b c
0 1 4 7
1 2 11 12
2 3 6 9
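The same update can be written with .loc and a boolean mask, which reads more directly than the iloc chain above. A sketch on the same toy frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
# set 'b' and 'c' on the rows where 'a' == 2
df.loc[df['a'] == 2, ['b', 'c']] = [11, 12]
```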
Group rows in dataframe and aggregate
# 1. Compute the sum of column 'value' for each 'id', and return the maximum sum among all IDs.
maxv = df.groupby('id').sum().sort_values('value', ascending=False)['value'].iloc[0]
Alternatively, to get the resulting sum column:
dfg = df.groupby('id').agg(['sum'])['value']['sum']
# 2. Compute the cumulative sum within a group along a column
df = pd.DataFrame({'name':['j','j','j','m','m'],
                   'day':['M','Tu','We','M','We'],
                   'v':[10,10,20,15,25]})
df['cum'] = df.groupby(['name'])['v'].cumsum()
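The sort-then-take-first chain above can be shortened: summing the column per group and taking .max() gives the same number. A sketch with assumed toy data:

```python
import pandas as pd

# toy data; 'id' and 'value' mirror the column names used above
df = pd.DataFrame({'id': [1, 1, 2, 2], 'value': [3, 4, 10, 1]})
maxv = df.groupby('id')['value'].sum().max()  # per-id sums are 7 and 11
```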
Drop column in dataframe
df = df.drop(columns=['shop_id', 'item_id'])
Useful Functions for Data Exploration
df.dtypes
df.info()
df.x.value_counts(dropna=False)
df.x.isnull()
pd.set_option('display.max_columns', 100)
df.head().T #transpose head() output
df.nunique(dropna=False).sort_values(ascending=True) #count number of unique values in each column
df[columns].drop_duplicates()
df.describe()
pd.plotting.scatter_matrix(df)  # pd.scatter_matrix was moved to pandas.plotting
df.corr()
plt.matshow(df.corr())  # visualize the correlation matrix
df.select_dtypes(include=['object'])
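A quick sketch exercising a few of the exploration calls above on toy data (column names and values are assumptions):

```python
import pandas as pd

# toy frame with one object column and one numeric column
df = pd.DataFrame({'cat': ['a', 'a', None], 'num': [1.0, 2.0, 2.0]})

counts = df['cat'].value_counts(dropna=False)  # None/NaN counted as its own bucket
objs = df.select_dtypes(include=['object'])    # keeps only the 'cat' column
uniq = df.nunique(dropna=False)                # distinct values per column
```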