EXCERPT

Data Frames in Pandas

From Pandas Workout by Reuven Lerner

Manning Publications

Published in

CodeX

6 min readOct 17, 2022

This article discusses using Data Frames in Pandas.

Take 35% off Pandas Workout by entering fcclerner2 into the discount code box at checkout at manning.com.

The main data structure that people use, and want to use, in pandas is the data frame. Data frames are two-dimensional tables that look and work similar to an Excel spreadsheet. The rows are accessible via an index. So long as you use .loc and .iloc to retrieve elements via the index, you’ll be fine.

But of course, data frames also have columns, each of which has a name. Each column is effectively its own series, which means that it has an independent dtype from other columns.

Figure 1. A simple data frame with five rows (`'row0'` through `'row4'`) and five columns (`'col0'` through `'col4'`)

In a typical data frame, each column represents a feature, or attribute, of our data, while each row represents one sample. So in a data frame describing company employees, there would be one row per employee, and there would be columns for first name, last name, ID number, e-mail address, and salary.

In this article, we’ll practice working with data frames in a variety of settings. We’ll practice creating, modifying, selecting from, and updating data frames. We’ll also see how just about every series method will also work on a data frame, returning one value per data frame column.

Useful references

s.locaccess elements of a series by labels or a boolean arrays.loc['a']https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.loc.html

s.iloc

access elements of a series by position

s.iloc[0]

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.iloc.html

s.quantile

Get the value at a particular percentage of the values

s.quantile(0.25)

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.quantile.html

pd.concat

join together two data frames

df = pd.concat([df, new_products])

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html

df.query

Write an SQL-like query

df.query('v > 300')

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html

pd.read_csv

returns a new series based on a single-column file

s = pd.read_csv('filename.csv', squeeze=True)

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

interpolate

returns a new data frame with NaN values interpolated

df = df.interpolate()

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html

Brackets or dots?

When we’re working with a series, we can retrieve values in several different ways: Using the index (and loc), using the position (and iloc), and also using plain ol’ square brackets, which is essentially equivalent to loc.

When we work with data frames, we must use loc or iloc. That’s because square brackets refer to the columns. If you try to retrieve a row, via the index, using square brackets, you’ll get an error message saying that no such column exists.

It’s thus no surprise that many people are accustomed to using square brackets to retrieve columns. Typically, we’ll pass the column name as a string inside of the square brackets. For example:

df['a']  ❶
 df['first name']  ❷
 df[['a', 'b']] ❸
 df['c':'d'] ❹

❶ returns a series, the column a.

❷ returns a series, the column 'first name'. Notice that the column name contains a space.

❸ returns a two-column data frame, with columns a and b from df.

❹ returns the rows 'c' through 'd', inclusive.

Notice the final example from above: Square brackets always refer to columns, and never to rows. Except, that is, when you pass them a slice, in which case they look at the rows. If you want to retrieve multiple columns, then you must use fancy indexing. You cannot use a slice.

All of this is well and good, but it turns out that there’s another way to work with columns, namely “dot notation.” That is, if you want to retrieve the column colname from data frame df, you can say df.colname.

This syntax appeals to many people, for a variety of reasons: It’s easier to type, it has fewer characters and is thus easier to read, and it just seems to flow a bit more naturally.

But there are reasons to dislike it, as well: Columns with spaces and other illegal-in-Python-identifier characters won’t work. And I personally find that it gets confusing to remember whether df.whatever is a column named whatever or a pandas method named whatever. There are so many pandas methods to remember, I’ll take any help I can get.

I personally use bracket notation. If you prefer dot notation, you’re in good company — but do realize that there are some places in which you won’t be able to use it.

Exercise 8: Net revenue

For many people who use pandas at work, it’s rare to create a new data frame from scratch. You’ll create it from a CSV file, or you’ll perform some transformations on an existing data frame (or several existing series). But there are times when you’ll need to create a new data frame, and knowing how to do it is can be quite useful.

For this exercise, I want you to create a data frame that represents five different products sold by a company. For each product, we’ll want to know the product ID number (any unique two-digit integer will do), the product name, the wholesale price, the retail price, and the number of sales of that product in the last month. We’re just making it up here, so if you’ve always wanted to be a profitable starship dealer, this is your chance!

The task for this exercise is then to calculate how much net revenue you received from all of these sales.

Discussion

The first part of this task involved creating a new data frame. There are a number of ways to do this, including:

list of lists/series, in which each inner list represents one row, and the column names are taken positionally
list of dicts, in which the dict keys indicate which columns are set to each row
dict of lists/series, in which the dict keys determine the column names, and the values are then assigned vertically
2-dimensional NumPy array

Which of these is most appropriate depends on the task at hand. In this case, since I want to create and describe individual products, I decided to use a list of dicts.

Notice that thanks to the dictionary keys, I didn’t have to define or pass any column names. And the index was the default positional index, so I didn’t have to set that.

With my data frame in place, how can I calculate the total revenue? That’s going to require that for each product, we subtract the wholesale price from the retail price, a.k.a. the net revenue:

df['retail_price'] - df['wholesale_price']

Here, we are retrieving the series df['retail_price'] and subtracting from it the series df['wholesale_price']. Because these two series are parallel to one another, with identical indexes, the subtraction will take place for each row, and will return a new series with the same index, but with the difference between them.

Once we have that series, we’ll multiply it by the number of sales we had for each product:

(df['retail_price'] - df['wholesale_price']) * df['sales'] ❶

❶ Without parentheses, the * operator would have had precedence, messing up the calculation

This then result in a new series, one which shares an index with df, but whose values represent the total sales for each product. We can sum this together with the sum method:

((df['retail_price'] - df['wholesale_price']) * df['sales']).sum() ❶

❶ Now I’m using parentheses to indicate that I want to call sum on the series I get back from this set of operations, rather than direcly on df[‘sales’].

Solution

df = DataFrame([{'product_id':23, 'name':'computer', 'wholesale_price': 500,
                  'retail_price':1000, 'sales':100},
                {'product_id':96, 'name':'Python Workout', 'wholesale_price': 35,
                 'retail_price':75, 'sales':1000},
                {'product_id':97, 'name':'Pandas Workout', 'wholesale_price': 35,
                 'retail_price':75, 'sales':500},
                {'product_id':15, 'name':'banana', 'wholesale_price': 0.5,
                 'retail_price':1, 'sales':200},
                {'product_id':87, 'name':'sandwich', 'wholesale_price': 3,
                 'retail_price':5, 'sales':300},
                ])
  
 ((df['retail_price'] - df['wholesale_price']) * df['sales']).sum() ❶
 (df['price'] * df['sales']).sum()❶

❶ Returns 110700

Beyond the exercise

On what products is our retail price more than twice the wholesale price?
How much did the store make from food vs. computers vs. books? (You can just retrieve based on the index values, not anything more sophisticated.)
Because your store is doing so well, you’re able to negotiate a 30% discount on the wholesale price of goods. Calculate the new net income.

That’s all for this article. If you want to see more, check out the book on Manning’s liveBook platform here.

EXCERPT

Data Frames in Pandas

From Pandas Workout by Reuven Lerner

Written by Manning Publications