First Steps in Pandas for Excel Users

Published in

Analytics Vidhya

6 min readNov 12, 2019

If you’re using your own computer, you’ll have to install it doing pip install pandas (assuming you're using pip).

Pandas’ main data structure, the DataFrame, is a 2-dimensional(*) object which resembles our Excel tables. In this lesson we’ll analyze the parallelism between the two.

We'll start by reading the Excel file of this lesson.

Reading the `Orders` sheet into a DataFrame

>>df = excel_file.parse('Orders')

The variable df contains a Pandas DataFrame, which includes all the information of the Orders sheet in our excel file.

A DataFrame is similar to an Excel table:

>>df.head(10)

The main difference is that, when we work with DataFrames, we don’t always have a visual reference of the data. We know it’s in memory, and we know what each column contains, but we don’t have a view of the entire DataFrame. The reason for this is that Pandas is designed to work with millions of rows of data.

The Big Picture

Working with large volumes of data can be problematic. So, instead of always looking at the table of data directly, we like focus on the big picture: what does my DataFrame contain?. As you’ve seen before, the methods .head() and .tail() are very useful to have an idea of the data we're handling:

But we can also employ a few other methods or attributes to complement this reference. For example, we can inspect the shape of our DataFrame, to understand how our data is structured:

In this case, shape tells you that this DataFrame contains 292 rows/records and 11 columns. We can also check the detail of our columns

Important: Note that shape and columns were invoked without parenthesis: just df.shape. That's because it's an attribute and not a method.

We also use a few methods that aggregate the whole data to give us a bigger picture. One of them is .describe(), we'll also see .info() when discussing column types:

As you can see, .describe() provides a summary of all the "numeric" columns including count (how many rows) and some other statistical methods: mean, std, etc.

Column Selection

To “zoom in” on your DataFrame, and select a specific column, the syntax we use is:

Column types

Each column in a Pandas DataFrame has a specific type. This is VERY important. The column has a fixed, pre-defined type, which means that we can’t mix types in the same column. This is different from Excel, in which we set a “formatting” type to the column, but we can store any value we want.

We can use the .info() method to see the data type of each column:

Column types in Pandas

In Excel there are just three data types:

Numeric (Number, Currency, Accounting, Date)
Text
Formula

Data types in Pandas are known as dtypes, and the main ones are:

int and float: specific for integer and float numbers
datetime and timedelta: specific for dates and offsets
object: Just strings, they're used to store text, but pandas refers to them as objects.
bool: specific for True/False values

Numbers: `int` and `float`

These are numeric columns and support all regular arithmetic operations. In our current orders DataFrame, these are numeric columns:

order_quantity: int
cost_price: float
retail_price: float
shipping_price: float

Remember when we used the .describe() method, only the numeric columns were included.

Numeric operations and methods

Numeric columns support multiple operations (as we’ll see in the following section) as well as some very useful methods, for example:

mean: the average value of the column

>df['cost_price'].mean()25.202705479452057

std: the standard deviation of the column

>>df['cost_price'].std()62.11414058290448

And a few others. But we can even run a .describe in the column itself:

>>df['cost_price'].describe()count    292.000000
mean      25.202705
std       62.114141
min        0.240000
25%        1.870000
50%        3.750000
75%       13.640000
max      377.990000
Name: cost_price, dtype: float64

Arithmetic Operations

Numeric columns allow us to perform regular arithmetic operations with them. For example:

>>df['cost_price'].head()0    75.00
1     1.31
2     2.50
3     1.82
4     2.29
Name: cost_price, dtype: float64

We can increase the values by 10%:

>>df['cost_price'].head() * 1.100    82.500
1     1.441
2     2.750
3     2.002
4     2.519
Name: cost_price, dtype: float64

Please note that the previous operation didn’t update the underlying DataFrame, it just showed a result. Let’s look at cost_price again:

>>df['cost_price'].head()0    75.00
1     1.31
2     2.50
3     1.82
4     2.29
Name: cost_price, dtype: float64

If you want to override the contents of a column, you’ll have to set it equals to that value:

>>df['cost_price'] = df['cost_price'] * 1.08

Now the changes are permanent:

>>df['cost_price'].head()

We could have also written:

# df['cost_price'] *= 1.08

We could also create new columns in our DataFrame. We do that by just setting the value of the column equals to a certain value. For example, I’m going to create the column Dummy Column with only values -1.

I can delete the column by using the del keyword:

Let’s see a more useful example of a new column. Let’s say we want to calculate a new column total_price, which is calculated using the following formula:

total_price=retail_price∗order_quantity+shipping_pricetotal_price=retail_price∗order_quantity+shipping_price

The way to do it is as simple as:

Datetimes (and timedeltas)

The following columns in our orders DataFrame are of type datetime or also called timestamp:

order_date
ship_date

We’ll calculate the shipping_delay by checking how many days we have between order_date and ship_date:

df['ship_date'].head()0   2013-08-16
1   2013-08-16
2   2014-09-05
3   2013-03-26
4   2013-06-21
Name: ship_date, dtype: datetime64[ns]

Datetime columns also support arithmetic operations, and the results are timedeltas (or offsets). For example:

7 days is a "relative time", or a delta in time (a timedelta). We can also use timedeltas in combination with our columns:

Strings (type `object`)

Any type of text (long, short, containing special characters or not) will be of type object in a DataFrame. Although in Python we know them as just strings. The following columns are strings: