First Steps in Pandas for Excel Users
If you’re using your own computer, you’ll have to install it doing pip install pandas
(assuming you're using pip
).
Pandas’ main data structure, the DataFrame, is a 2-dimensional(*) object which resembles our Excel tables. In this lesson we’ll analyze the parallelism between the two.
We'll start by reading the Excel file of this lesson.
Reading the Orders
sheet into a DataFrame
>>df = excel_file.parse('Orders')
The variable df
contains a Pandas DataFrame
, which includes all the information of the Orders
sheet in our excel file.
A DataFrame is similar to an Excel table:
>>df.head(10)
The main difference is that, when we work with DataFrames, we don’t always have a visual reference of the data. We know it’s in memory, and we know what each column contains, but we don’t have a view of the entire DataFrame. The reason for this is that Pandas is designed to work with millions of rows of data.
The Big Picture
Working with large volumes of data can be problematic. So, instead of always looking at the table of data directly, we like focus on the big picture: what does my DataFrame contain?. As you’ve seen before, the methods .head()
and .tail()
are very useful to have an idea of the data we're handling:
But we can also employ a few other methods or attributes to complement this reference. For example, we can inspect the shape
of our DataFrame, to understand how our data is structured:
In this case, shape
tells you that this DataFrame contains 292
rows/records and 11
columns. We can also check the detail of our columns
Important: Note that
shape
andcolumns
were invoked without parenthesis: justdf.shape
. That's because it's an attribute and not a method.
We also use a few methods that aggregate the whole data to give us a bigger picture. One of them is .describe()
, we'll also see .info()
when discussing column types:
As you can see, .describe()
provides a summary of all the "numeric" columns including count
(how many rows) and some other statistical methods: mean
, std
, etc.
Column Selection
To “zoom in” on your DataFrame, and select a specific column, the syntax we use is:
Column types
Each column in a Pandas DataFrame has a specific type. This is VERY important. The column has a fixed, pre-defined type, which means that we can’t mix types in the same column. This is different from Excel, in which we set a “formatting” type to the column, but we can store any value we want.
We can use the .info()
method to see the data type of each column:
Column types in Pandas
In Excel there are just three data types:
- Numeric (Number, Currency, Accounting, Date)
- Text
- Formula
Data types in Pandas are known as dtypes, and the main ones are:
- int and float: specific for integer and float numbers
- datetime and timedelta: specific for dates and offsets
- object: Just
strings
, they're used to store text, but pandas refers to them asobject
s. - bool: specific for True/False values
Numbers: int
and float
These are numeric columns and support all regular arithmetic operations. In our current orders DataFrame, these are numeric columns:
order_quantity
: intcost_price
: floatretail_price
: floatshipping_price
: float
Remember when we used the .describe()
method, only the numeric columns were included.
Numeric operations and methods
Numeric columns support multiple operations (as we’ll see in the following section) as well as some very useful methods, for example:
mean
: the average value of the column
>df['cost_price'].mean()25.202705479452057
std
: the standard deviation of the column
>>df['cost_price'].std()62.11414058290448
And a few others. But we can even run a .describe
in the column itself:
>>df['cost_price'].describe()count 292.000000
mean 25.202705
std 62.114141
min 0.240000
25% 1.870000
50% 3.750000
75% 13.640000
max 377.990000
Name: cost_price, dtype: float64
Arithmetic Operations
Numeric columns allow us to perform regular arithmetic operations with them. For example:
>>df['cost_price'].head()0 75.00
1 1.31
2 2.50
3 1.82
4 2.29
Name: cost_price, dtype: float64
We can increase the values by 10%:
>>df['cost_price'].head() * 1.100 82.500
1 1.441
2 2.750
3 2.002
4 2.519
Name: cost_price, dtype: float64
Please note that the previous operation didn’t update the underlying DataFrame, it just showed a result. Let’s look at cost_price
again:
>>df['cost_price'].head()0 75.00
1 1.31
2 2.50
3 1.82
4 2.29
Name: cost_price, dtype: float64
If you want to override the contents of a column, you’ll have to set it equals to that value:
>>df['cost_price'] = df['cost_price'] * 1.08
Now the changes are permanent:
>>df['cost_price'].head()
We could have also written:
# df['cost_price'] *= 1.08
We could also create new columns in our DataFrame. We do that by just setting the value of the column equals to a certain value. For example, I’m going to create the column Dummy Column
with only values -1
.
I can delete the column by using the del
keyword:
Let’s see a more useful example of a new column. Let’s say we want to calculate a new column total_price
, which is calculated using the following formula:
total_price=retail_price∗order_quantity+shipping_pricetotal_price=retail_price∗order_quantity+shipping_price
The way to do it is as simple as:
Datetimes (and timedeltas)
The following columns in our orders DataFrame are of type datetime
or also called timestamp:
order_date
ship_date
We’ll calculate the shipping_delay
by checking how many days we have between order_date
and ship_date
:
df['ship_date'].head()0 2013-08-16
1 2013-08-16
2 2014-09-05
3 2013-03-26
4 2013-06-21
Name: ship_date, dtype: datetime64[ns]
Datetime columns also support arithmetic operations, and the results are timedeltas (or offsets). For example:
7 days
is a "relative time", or a delta in time (a timedelta
). We can also use timedeltas in combination with our columns:
Strings (type object
)
Any type of text (long, short, containing special characters or not) will be of type object
in a DataFrame. Although in Python we know them as just strings. The following columns are strings:
city
product_name
product_category
Some operations are supported with strings, for example, a simple concatenation:
There are more advanced operations we could perform, using the .str
attribute: