Pandas by example: columns
Let’s review the many ways to do the most common operations over dataframe columns using pandas.
import pandas as pd
Adding columns to a dataframe
The three most popular ways to add a new column are: indexing, loc
and assign
:
df = pd.DataFrame({"A": [1,2,3],
"B": [2,4,8]})
df["C"] = [1,2,3]
df.loc[:, "D"] = [1,2,3]
df = df.assign(E=[1,2,3])
df
Indexing is usually the simplest method for adding new columns, but it gets trickier to use together with chained indexing. It may add the column to a copy of the dataframe instead of adding it to the original. When this happens pandas will show a warning:
df = pd.DataFrame({"A": [1,2,3],
"B": [2,4,8]})df[df["A"] < 3]["C"] = 100
df
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
after removing the cwd from sys.path.
To avoid those cases, it’s better to use loc
:
df = pd.DataFrame({"A": [1,2,3],
"B": [2,4,8]})df.loc[df["A"] < 3, "C"] = 100
df
loc
has two limitations: it mutates the dataframe in-place and it can't be used with method chaining. If that's a problem for you, use assign
:
df = pd.DataFrame({"A": [1,2,3],
"B": [2,4,8]})df = df.assign(C=[1,2,3]).assign(D=4)
df
assign
is particularly useful when you want to create a new column based on a column from an intermediate dataframe. You can pass a lambda
to assign
to get the intermediate dataframe:
df = pd.DataFrame({"A": [1,2,3],
"B": [2,4,8]})df = df.assign(C=[1,2,3]).assign(D=lambda idf: idf["C"] * 2)
df
In the previous examples the column name is fixed, but you can also use variable column names:
df = pd.DataFrame({"A": [1,2,3],
"B": [2,4,8]})my_column_name = "C"
another_name = "D"
df = df.assign(**{my_column_name: [1,2,3], another_name: 100})
df
Another option, when you need to insert a column in a specific location, you can use insert
:
df = pd.DataFrame({"A": [1,2,3],
"B": [2,4,8]})df.insert(loc=1, column="C", value=[1,2,3])
df
Finally, you can also use concat
to add a new column:
df = pd.DataFrame({"A": [1,2,3],
"B": [2,4,8]})new_column = pd.Series([1,2,3])
df = pd.concat([df, new_column.rename("C")], axis=1)
df
No matter what method you use, a common mistake is adding a column with a different index:
df = pd.DataFrame({"A": [1,2,3],
"B": [2,4,8]},
index=[0,1,2])new_column = pd.Series([1,2,3],index=[2,3,4])
df["C"] = new_column
df
If you don’t care about the indexes and just want to add the column using the current items order, you can use the values array:
df = pd.DataFrame({"A": [1,2,3],
"B": [2,4,8]},
index=[0,1,2])new_column = pd.Series([1,2,3],index=[2,3,4])
df["C"] = new_column.values
df
Renaming columns
The easiest way to rename columns is:
df = pd.DataFrame({"A": [1,2,3],
"B": [2,4,8]})
df = df.rename(columns={"A":"X", "B":"Y"})
df
If you need to do something more complex with the name you can pass a lambda
to rename
:
df = pd.DataFrame({"A": [1,2,3],
"B": [2,4,8]})df = df.rename(columns=lambda cname: cname + "_" + cname)
df
You can also manipulate columns
directly:
df = pd.DataFrame({"A": [1,2,3],
"B": [2,4,8]})df.columns = "column-" + df.columns.str.lower()
df
Changing columns order
You can change the order of columns by explicitly listing each column:
df = pd.DataFrame({"A": [1,2,3],
"B": [2,4,8],
"C": [5,5,5]})
df = df[["A", "C", "B"]]
df
For larger dataframes is easier to use list operations to reorder the columns:
df = pd.DataFrame({"A": [1,2,3],
"B": [2,4,8],
"C": [5,5,5]})cols = df.columns.tolist()
column_to_move = "C"
new_position = 1cols.insert(new_position, cols.pop(cols.index(column_to_move)))
df = df[cols]
df
Deleting columns
You can use dict
operations, like del
and pop
, to remove columns:
df = pd.DataFrame({"A": [1,2,3],
"B": [2,4,8],
"C": [5,5,5]})
del df["B"]
C = df.pop("C")
df
For multiple columns (or for keeping the original dataframe intact) you can use drop
:
df = pd.DataFrame({"A": [1,2,3],
"B": [2,4,8],
"C": [5,5,5]})df = df.drop(["B", "C"], axis=1)
df
You can also use columns
to select the columns to drop:
df = pd.DataFrame({"A1": [1,2,3],
"B2": [2,4,8],
"C2": [5,5,5]})cols_to_drop = [cname for cname in df.columns if cname.endswith("2")]
df = df.drop(cols_to_drop, axis=1)
df
Sometimes is easier to select the columns you want to keep:
df = pd.DataFrame({"A1": [1,2,3],
"B2": [2,4,8],
"C2": [5,5,5]})cols_to_keep = [cname for cname in df.columns if cname.endswith("2")]
df = df[cols_to_keep]
df
Thanks for reading.