Pandas by example: columns

Rodrigo Pombo
Hexacta Engineering
5 min readFeb 27, 2018

--

Let’s review the many ways to do the most common operations over dataframe columns using pandas.

import pandas as pd

Adding columns to a dataframe

The three most popular ways to add a new column are: indexing, loc and assign:

df = pd.DataFrame({"A": [1,2,3],
"B": [2,4,8]})

df["C"] = [1,2,3]
df.loc[:, "D"] = [1,2,3]
df = df.assign(E=[1,2,3])
df

Indexing is usually the simplest method for adding new columns, but it gets trickier to use together with chained indexing. It may add the column to a copy of the dataframe instead of adding it to the original. When this happens pandas will show a warning:

df = pd.DataFrame({"A": [1,2,3],
"B": [2,4,8]})
df[df["A"] < 3]["C"] = 100
df

SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
after removing the cwd from sys.path.

To avoid those cases, it’s better to use loc:

df = pd.DataFrame({"A": [1,2,3],
"B": [2,4,8]})
df.loc[df["A"] < 3, "C"] = 100
df

loc has two limitations: it mutates the dataframe in-place and it can't be used with method chaining. If that's a problem for you, use assign:

df = pd.DataFrame({"A": [1,2,3],
"B": [2,4,8]})
df = df.assign(C=[1,2,3]).assign(D=4)
df

assign is particularly useful when you want to create a new column based on a column from an intermediate dataframe. You can pass a lambda to assign to get the intermediate dataframe:

df = pd.DataFrame({"A": [1,2,3],
"B": [2,4,8]})
df = df.assign(C=[1,2,3]).assign(D=lambda idf: idf["C"] * 2)
df

In the previous examples the column name is fixed, but you can also use variable column names:

df = pd.DataFrame({"A": [1,2,3],
"B": [2,4,8]})
my_column_name = "C"
another_name = "D"
df = df.assign(**{my_column_name: [1,2,3], another_name: 100})
df

Another option, when you need to insert a column in a specific location, you can use insert:

df = pd.DataFrame({"A": [1,2,3],
"B": [2,4,8]})
df.insert(loc=1, column="C", value=[1,2,3])
df

Finally, you can also use concat to add a new column:

df = pd.DataFrame({"A": [1,2,3],
"B": [2,4,8]})
new_column = pd.Series([1,2,3])
df = pd.concat([df, new_column.rename("C")], axis=1)
df

No matter what method you use, a common mistake is adding a column with a different index:

df = pd.DataFrame({"A": [1,2,3],
"B": [2,4,8]},
index=[0,1,2])
new_column = pd.Series([1,2,3],index=[2,3,4])
df["C"] = new_column
df

If you don’t care about the indexes and just want to add the column using the current items order, you can use the values array:

df = pd.DataFrame({"A": [1,2,3],
"B": [2,4,8]},
index=[0,1,2])
new_column = pd.Series([1,2,3],index=[2,3,4])
df["C"] = new_column.values
df

Renaming columns

The easiest way to rename columns is:

df = pd.DataFrame({"A": [1,2,3],
"B": [2,4,8]})

df = df.rename(columns={"A":"X", "B":"Y"})
df

If you need to do something more complex with the name you can pass a lambda to rename:

df = pd.DataFrame({"A": [1,2,3],
"B": [2,4,8]})
df = df.rename(columns=lambda cname: cname + "_" + cname)
df

You can also manipulate columns directly:

df = pd.DataFrame({"A": [1,2,3],
"B": [2,4,8]})
df.columns = "column-" + df.columns.str.lower()
df

Changing columns order

You can change the order of columns by explicitly listing each column:

df = pd.DataFrame({"A": [1,2,3],
"B": [2,4,8],
"C": [5,5,5]})

df = df[["A", "C", "B"]]
df

For larger dataframes is easier to use list operations to reorder the columns:

df = pd.DataFrame({"A": [1,2,3],
"B": [2,4,8],
"C": [5,5,5]})
cols = df.columns.tolist()
column_to_move = "C"
new_position = 1
cols.insert(new_position, cols.pop(cols.index(column_to_move)))
df = df[cols]
df

Deleting columns

You can use dict operations, like del and pop, to remove columns:

df = pd.DataFrame({"A": [1,2,3],
"B": [2,4,8],
"C": [5,5,5]})
del df["B"]
C = df.pop("C")

df

For multiple columns (or for keeping the original dataframe intact) you can use drop:

df = pd.DataFrame({"A": [1,2,3],
"B": [2,4,8],
"C": [5,5,5]})
df = df.drop(["B", "C"], axis=1)
df

You can also use columns to select the columns to drop:

df = pd.DataFrame({"A1": [1,2,3],
"B2": [2,4,8],
"C2": [5,5,5]})
cols_to_drop = [cname for cname in df.columns if cname.endswith("2")]
df = df.drop(cols_to_drop, axis=1)
df

Sometimes is easier to select the columns you want to keep:

df = pd.DataFrame({"A1": [1,2,3],
"B2": [2,4,8],
"C2": [5,5,5]})
cols_to_keep = [cname for cname in df.columns if cname.endswith("2")]
df = df[cols_to_keep]
df

Thanks for reading.

--

--