Pandas Pipes

Romain
Oct 28, 2018

I love being able to use pipes in R (with the dedicated %>% operator), introduced by the magrittr package. We have been using them for many years on *nix systems, with the good old |. They let you write data-wrangling sequences in a very readable way, very close to natural language. See it in action.

babynames %>%
  filter(name == "Eva") %>%
  group_by(year) %>%
  summarise(n = sum(n)) %>%
  ggplot(aes(x = year, y = n)) + geom_line()
Number of babies (born in the USA) named “Eva” over the years, plotted in R

Reading that, and knowing the content of the babynames package, we can guess that we are trying to plot how the number of babies (born in the USA) named “Eva” has evolved over the years.

With Python's pandas there is no dedicated operator, but calls to DataFrame methods can be chained using the standard . operator, which in this case acts as a kind of pipeline operator. This works well for DataFrame methods, and the pipe method makes it possible to wrap any other function call too. One drawback is that Python's syntax does not normally allow a chain to be split across lines with one call per line. Fortunately, surrounding the pipeline with parentheses gets around this limitation. Here is what it looks like.

(
    babynames
    .query('name == "Eva"')
    .groupby('year')
    .agg({'n': 'sum'})
    .plot(kind='line')
)
Number of babies (born in the USA) named “Eva” over the years, plotted in Python

Not so bad!
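
To try the parenthesized chain without the real dataset, here is a minimal sketch. The tiny DataFrame below is a hypothetical stand-in for babynames: the counts are made up, and only the column names (year, sex, name, n) are borrowed from the R package. The plotting step is left out so that the example depends on pandas alone.

```python
import pandas as pd

# Hypothetical stand-in for the babynames data (made-up counts).
babynames = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001, 2001],
    "sex": ["F", "F", "F", "M", "F"],
    "name": ["Eva", "Emma", "Eva", "Eva", "Emma"],
    "n": [100, 250, 120, 5, 260],
})

# The surrounding parentheses let each method call sit on its own line.
eva_by_year = (
    babynames
    .query('name == "Eva"')   # keep only the rows for "Eva"
    .groupby("year")          # one group per year
    .agg({"n": "sum"})        # total count per year (both sexes)
)

print(eva_by_year)
```

The result is a DataFrame indexed by year with a single column n, which is the shape that .plot(kind='line') expects in the pipeline above.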

Edit on piping any function

The pipe method can be used to call any function and integrate that call into the pipeline. Imagine you want to create a function (called number_by_year) to gather all your preparation code.

def number_by_year(df, b_name):
    """From the babynames dataset,
    count the number of babies by year bearing the given name.
    """
    # @b_name refers to the b_name argument inside the query string
    return (
        df
        .query('name == @b_name')
        .groupby('year')
        .agg({'n': 'sum'})
    )

# This is how to use the function just defined
# in the pipeline thanks to the pipe method
(
    babynames
    .pipe(number_by_year, b_name="Eva")
    .plot(kind='line')
)
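
As a quick check of what pipe does, the sketch below compares the piped call with the plain function call. It assumes a small made-up DataFrame in place of the real babynames data, and it repeats number_by_year so that it runs on its own.

```python
import pandas as pd

def number_by_year(df, b_name):
    """Count the number of babies by year bearing the given name."""
    return (
        df
        .query("name == @b_name")
        .groupby("year")
        .agg({"n": "sum"})
    )

# Hypothetical stand-in for the babynames data (made-up counts).
babynames = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001],
    "name": ["Eva", "Emma", "Eva", "Emma"],
    "n": [100, 250, 120, 260],
})

# df.pipe(func, **kwargs) passes df as the first argument of func...
piped = babynames.pipe(number_by_year, b_name="Eva")
# ...so it gives the same result as calling the function directly.
direct = number_by_year(babynames, b_name="Eva")

print(piped.equals(direct))
```

The benefit of pipe is purely about reading order: the custom step stays in the chain, top to bottom, instead of wrapping the whole expression in a function call.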

And yes, it is possible to do the same thing in R.

# From the babynames dataset,
# count the number of babies by year bearing the given name.
number_by_year <- function(.data, b_name) {
  .data %>%
    filter(name == b_name) %>%
    group_by(year) %>%
    summarise(n = sum(n))
}

babynames %>%
  number_by_year(b_name = "Eva") %>%
  ggplot(aes(year, n)) + geom_line()

Originally published at back2code.svbtle.com.
