Python: Getting Started with Pandas II
Modifying DataFrames
one way we can add a new column to a DataFrame is by giving a list of the same length as the existing DataFrame. For example:
df['New Column Name'] = [4, 6, 8, 10, 2, 4, 12]We can also add a n ew column that is holds the same value in all rows:
df['Available Immediately'] = TrueFinally, we can add a new column by preforming a function on the other existing columns :
df['Sales Tax'] = df.Price*0.075we can use the apply function to apply function to every value in a particular column. For example, if we wanted to make all the the names in a list of names capitalised:
df['Name'] = df.Name.apply(upper)We can use (lower) for all lowercase letters.
Reviewing Lambda
Lambda functions work like regular functions but in a more concise form.
lambda1 = lambda x: ((x*2)+3)print(lambda1(5))
>>> 13
The general way of writing a lambda if function is as follows:
lambda x: OUTCOME IF TRUE if CONDITIONAL else OUTCOME IF FALSESo how can we apply a lambda function to a column in Pandas?
For example, if we wanted to get the e-mail provider from a list of e-mails, we could use the .split() method.
df['New Column Name']= df.Email.apply(lambda x: x.split('@')[-1])So from a list of e-mails for example, ana@hotmail.com, the result would be hotmail.com.
The string function will split the string into a list of strings at each space.
print('This is a string'.split())>>> ['This', 'is', 'a', 'string']
Now let’s look at how we can apply a lambda function to a row.
If we use apply without specifying a single column and add the argument “axis = 1”, the input to our lambda function will be an entire row, for this we use the following syntax:
row.column_nameorrow['column_name']
Let’s get a closer look at an example. If we have a DataFrame with 3 columns, Item, Price and Is Taxed? and we would like to add a new column with shows the final price of the items in the DataFrame after tax, assuming the item is taxed, we can do it using the following code:
df['Price With Tax'] = df.apply(lambda. row:
row['Price'] * 1.075
if row ['Is Taxed'] == 'yes'
else row ['Price'],
axis = 1)Renaming Columns
We can change the name of all the columns in our DataFrame at once by using .columns.
df = pd.DataFrame({
'name' : ['Josh', 'Debby'],
'age' : [28, 26]})df.columns = ['First Name', ' Age']
Using this command, the existing DataFrame is edited.
You can also rename individual columns by using .rename.
First let’s take a look at the overall syntax to use .rename
{'Original Column1' : 'New Name1',
'Original Column2' : 'New Name2'}So let’s try to re-do our first example using .rename.
df.rename(columns = {
'name' : 'First Name',
'age' : 'age' },
inplace = True)We have to use inplace = True, to make sure the current DataFrame is updated rather than creating a new one.
If you misspell one of the original column names, the command won’t fail, but it simply won’t change anything.
Aggregates in Pandas
An aggregate statistic is a way of creating a single number that describes a group of numbers.
To combine all of the values from a column for a single calculation we can use the following:
print(customers.age)
>>> [23, 25, 35, 35, 46, 62]print(customers.age.median())
>>> 35
The general syntax for these calculations is:
df.column_name.command()List of Common Commands
- mean: average of all values
- std: standard deviation
- median: median
- max: max value in column
- min: minimum value in a column
- count: count number of variables in a column
- nunique: number of unique values in a column
- unique: List of unique values in a column
The general syntax for calculating aggregate functions in combination with groupby:
df.groupby('column1').column2.measurement()column 1: column to group by
column 2 : column for measurement
measurement: function we want to apply.
After using a groupby we often need to clean up our results, because when we use it a Series is created. If we want those indices as a column, we use reset_index(). This will transform our series into a DataFrame and move the indices into their own column.
df.groupby('column1').column2.measurement().reset_index()If the operation we want to preform is more complicated than mean or count, we can use apply and lambda functions, just like we did for individual column operations. The input for our lambda function will always be a list of values.
high_earners =
df.groupby('category').wage.apply(lambda x: np.percentile(x,75)).reset_index())Pivot Tables
If we had a table that looked like:
Location Day Sales
Manchester M 300
Manchester Tu 310
Manchester W 320
...but think it would be more useful to view the able differently, as seen below:
Location M Tu W
Manchester 300 310 320
West Village 400 430 378We can do this by creating a pivot table. In pandas the command for pivot is:
df.pivot(columns = 'ColumToPivot',
Index = 'Column To Be Rows',
Values = 'Column To Be Values')Like we learned about groupby, the output of a pivot command is a new DataFrame, so we need to use .reset_index() to prevent our indexing from being weird.
Working with Multiple Tables in Pandas
Inner Merge
The .merge method looks for columns that are common between two DataFrames and then rows where those column’s values are the same. It combines the matching rows into a single row in a new table.
new_df = pd.merge(df1, df2)In addition using pd.merge, each DataFrame has it’s own merge method used as below:
new_df = df1.merge(df2)We use this when we are joining more than two DataFrames together because we can “cain” the commands:
new_df = df1.merge(df2).merge(df3).merge(df4)Merge on Specific Columns
In the previous example the function “knew” how to combine tables based on the columns that were the same between two tables. If this is not the case with the tables we are working with, one way we could address this is using .rename to rename the columns for our merges.
pd.merge(df1, df2.rename(columns{'id' :'df2_id'}))If we don’t want to use this method, we can also use the keywords left_on or right_on. The ‘left’ table is the one that comes first and the ‘right’ table is the one that comes second.
pd.merge(df1, df2, left on = 'df2_id', right_on = 'id', suffixes = [_df1', '_df2'] ) We use suffixes so that the merge doesn’t automatically rename the columns with _x and _y.
This is helpful when you merge two DataFrames together that have a column name that is the same name (for example, “id”) but the info does not correspond (for example, user id vs product id).
If there are any values which do not match using the pervious method, they will be left out when merging the DataFrames.
Inner Merge is the type of merge where two DataFrames whose rows don’t match perfectly, will result in the loss of the unmatched rows.
There are other types of merges that we can use when we want to keep information from the unmatched rows. An Outer Join would include all rows from both tables, even if they don’t match. Any missing values will be filled with ‘None’ or ‘nan’.
pd.merge(df1, df2, how = 'outer')Left or Right Merge
A left merge includes all the rows from the first (left) table, but only rows from the second (right) table that match the first.
pd.merge(company_a, company_b, how = 'left')In the code above, we would get all customers from company_a and only customers from company_b that are also customers of company_a.
Right merge is the exact opposite of left merge. Here the merged table will include all include all rows from the second (right) table but only rows from the first(left) that match the second.
pd.merge(company_a, company_b, how = 'right')Concatenate DataFrame
A DataFrame might be broken into multiple tables. For example if we have multiple CSV files for the same set of information. When we need to reconstruct a single DataFrame from multiple ones, we can use concat:
pd.concat([df1, df2, df3)]This method only works if all the columns are the same in all the DataFrames.





