Fluent Pandas

Munish Goyal
Published in Analytics Vidhya
19 min read · Aug 6, 2020

Let’s uncover the practical details of Pandas’ Series, DataFrame, and Panel


Note to readers: Paying attention to the comments in the examples will be more helpful than going through the theory alone.

· Series (1D data structure: Column-vector of DataTable)
· DataFrame (2D data structure: Table)
· Panel (3D data structure)

Pandas is a column-oriented data analysis API. It’s a great tool for handling and analyzing input data, and many ML frameworks support pandas data structures as inputs.

Pandas Data Structures

Refer to Intro to Data Structures in the pandas docs.

The primary data structures in pandas are implemented as two classes: DataFrame and Series.

Import numpy and pandas into your namespace:

import numpy as np
import pandas as pd
import matplotlib as mpl
np.__version__
pd.__version__
mpl.__version__

Series (1D data structure: Column-vector of DataTable)

CREATING SERIES

A Series is a one-dimensional array of labeled elements, capable of holding any data type; the labels need not be unique. The axis labels are collectively referred to as the index. The general way to create a Series is to call:

pd.Series(data, index=index)

Here, data can be a NumPy ndarray, a Python dict, or a scalar value (like 5). The passed index is a list of axis labels.

Note: pandas supports non-unique index values. If an operation that does not support duplicate index values is attempted, an exception will be raised at that time.

If data is a list or ndarray (preferred way):

If data is an ndarray or list, then index must be of the same length as data. If no index is passed, one will be created having values [0, 1, 2, ... len(data) - 1].
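For instance, a minimal sketch (values are illustrative):

```python
import numpy as np
import pandas as pd

# explicit index: must be the same length as data
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# no index passed: a default RangeIndex 0..len(data)-1 is created
s2 = pd.Series(np.random.randn(4))
s2.index   # RangeIndex(start=0, stop=4, step=1)
```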

If data is a scalar value:

If data is a scalar value, an index must be provided. The value will be repeated to match the length of index.
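A quick sketch of the scalar case:

```python
import pandas as pd

# the scalar 5.0 is repeated once per index label
s = pd.Series(5.0, index=['a', 'b', 'c'])
# a    5.0
# b    5.0
# c    5.0
# dtype: float64
```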

If data is dict:

If data is a dict and index is passed, the values in data corresponding to the labels in the index will be pulled out. Otherwise, an index will be constructed from the dict’s keys (in insertion order on recent pandas; older versions sorted the keys).
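A minimal sketch of both cases (values are illustrative):

```python
import pandas as pd

d = {'b': 1, 'a': 0, 'c': 2}

pd.Series(d)                         # index built from the dict's keys
pd.Series(d, index=['c', 'a', 'x'])  # 'x' is not in d, so its value is NaN
```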

SERIES IS LIKE NDARRAY AND DICT COMBINED

Series acts very similarly to an ndarray and is a valid argument to most NumPy functions. However, operations such as slicing also slice the index. A Series can be passed to most NumPy methods expecting a 1D ndarray.

A key difference between Series and ndarray is the automatic alignment of data based on labels during Series operations. Thus, you can write computations without considering whether the Series involved have the same labels. For example,
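A minimal sketch of label alignment (values are illustrative; note the duplicated label 'b' in s):

```python
import pandas as pd

s = pd.Series([1., 2., 3., 4.], index=['a', 'b', 'b', 'c'])
other = pd.Series([10., 20., 30.], index=['b', 'c', 'd'])

total = s + other   # values are aligned by label before adding
total['a']          # NaN: 'a' is missing from other
s['b']              # 'b' is duplicated, so this returns a Series, not a scalar
```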

The result of an operation between unaligned Series will have the union of the indexes involved. If a label is not found in one Series or the other, the corresponding result will be marked as missing (NaN).

Also note that in the above example the label 'b' was duplicated, so s['b'] returns a pandas.core.series.Series rather than a scalar.

A Series is also like a fixed-size dict on which you can get and set values by index label. Reading a value for a label that is not contained raises a KeyError. Using the get method, a missing label instead returns None or a specified default.
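A short sketch of the dict-like behavior (values are illustrative):

```python
import pandas as pd

s = pd.Series([1., 2.], index=['a', 'b'])
s['a']             # 1.0
s['c'] = 3.0       # set a new label, like a dict
s.get('z')         # None: no KeyError when using get()
s.get('z', 0.0)    # 0.0: explicit default
```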

SERIES NAME ATTRIBUTE

Series can also have a name attribute (like DataFrame can have columns attribute). This is important as a DataFrame can be seen as dict of Series objects.

The Series name will be assigned automatically in many cases, in particular when taking a 1D slice of a DataFrame.

For example,

d = {'one' : [1., 2., 3., 4.], 'two' : [4., 3., 2., 1.]}
d = pd.DataFrame(d)
d
# one two
# 0 1.0 4.0
# 1 2.0 3.0
# 2 3.0 2.0
# 3 4.0 1.0
type(d) #=> pandas.core.frame.DataFrame
d.columns #=> Index(['one', 'two'], dtype='object')
d.index #=> RangeIndex(start=0, stop=4, step=1)
s = d['one']
s
# 0 1.0
# 1 2.0
# 2 3.0
# 3 4.0
# Name: one, dtype: float64
type(s) #=> pandas.core.series.Series
s.name #=> 'one'
s.index #=> RangeIndex(start=0, stop=4, step=1)

You can rename a Series with the pandas.Series.rename() method or by simply assigning a new name to its name attribute.

s = pd.Series(np.random.randn(5), name='something')
id(s) #=> 4331187280
s.name #=> 'something'
s.name = 'new_name'
id(s) #=> 4331187280
s.name #=> 'new_name'
s.rename("yet_another_name", inplace=True)  # without inplace=True, rename() returns a renamed copy
id(s) #=> 4331187280
s.name #=> 'yet_another_name'

COMPLEX TRANSFORMATIONS ON SERIES USING SERIES.APPLY

NumPy is a popular toolkit for scientific computing, and a pandas Series can be used as an argument to most NumPy functions.

For complex single-column transformations, you can use Series.apply. Like Python’s map function, Series.apply accepts a function (often a lambda), which is applied to each value.
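A small sketch (the population figures are made up for illustration):

```python
import pandas as pd

population = pd.Series([852469, 1015785, 485199])
population.apply(lambda val: val > 1000000)
# 0    False
# 1     True
# 2    False
# dtype: bool
```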

DataFrame (2D data structure: Table)

Refer: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

DataFrame is a 2D labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or a SQL table, or a dict of Series objects. Like Series, DataFrame accepts many different kinds of inputs.

Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. Note that the index can have non-unique elements (like that of Series). Similarly, column names can also be non-unique.

If you pass an index and/or columns, you are guaranteeing the index and / or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching up to the passed index (similar to passing a dict as data to Series).

If axis labels (index) are not passed, they will be constructed from the input data based on common sense rules.

CREATING DATAFRAME

From a dict of ndarrays/lists:

The ndarrays/lists must all be of the same length. If an index is passed, it must clearly also be the same length as the data ndarrays. If no index is passed, then the implicit index will be range(n), where n is the array length.

For example,
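A minimal sketch (values are illustrative):

```python
import pandas as pd

d = {'one': [1., 2., 3., 4.], 'two': [4., 3., 2., 1.]}
pd.DataFrame(d)                              # implicit index 0..3
pd.DataFrame(d, index=['a', 'b', 'c', 'd'])  # explicit index of the same length
```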

From a dict of Series (preferred way):

The resultant index will be the union of the indexes of the various Series (each Series may be of a different length and may have a different index). Any nested dicts will first be converted to Series. If no columns are passed, the columns will be the list of dict keys. For example,
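A sketch with Series of different lengths (values are illustrative):

```python
import pandas as pd

d = {
    'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
    'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])
}
df = pd.DataFrame(d)   # index is the union of both Series' indexes
#    one  two
# a  1.0  1.0
# b  2.0  2.0
# c  3.0  3.0
# d  NaN  4.0
```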

The row and column labels can be accessed respectively by accessing the index and columns attributes.

From a list of dicts:

For example,

data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
pd.DataFrame(data2)
# a b c
# 0 1 2 NaN
# 1 5 10 20.0
pd.DataFrame(data2, index=['first', 'second'])
# a b c
# first 1 2 NaN
# second 5 10 20.0
pd.DataFrame(data2, columns=['a', 'b'])
# a b
# 0 1 2
# 1 5 10

From a Series:

The result will be a DataFrame with the same index as the input Series, and with one column whose name is the original name of the Series (only if no other column name provided).

For example,

s = pd.Series([1., 2., 3.], index=['a', 'b', 'c'])
type(s) #=> pandas.core.series.Series
df2 = pd.DataFrame(s)
df2
# 0
# a 1.0
# b 2.0
# c 3.0
type(df2) #=> pandas.core.frame.DataFrame
df2.columns #=> RangeIndex(start=0, stop=1, step=1)
df2.index #=> Index(['a', 'b', 'c'], dtype='object')

From a Flat File

The pandas.read_csv (preferred way):

You can read CSV files into a DataFrame using pandas.read_csv() method. Refer to the official docs for its signature.

For example,
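A minimal sketch; here an inline CSV string (via StringIO) stands in for a file path such as 'data.csv', and the column names are made up for illustration:

```python
import pandas as pd
from io import StringIO

csv_text = "city,population\nDelhi,190\nMumbai,184\n"
df = pd.read_csv(StringIO(csv_text))   # a real path would work the same way
df.shape   # (2, 2)
```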

CONSOLE DISPLAY AND SUMMARY

Some helpful methods and attributes:

Wide DataFrames will be printed (via print) across multiple rows by default. You can change how much to print on a single row by setting the display.width option. You can adjust the max width of the individual columns by setting display.max_colwidth.

pd.set_option('display.width', 40)     # default is 80
pd.set_option('display.max_colwidth', 30)

You can also disable this line-wrapping via the expand_frame_repr option; setting it to False prints the table in one block.

INDEXING ROWS AND SELECTING COLUMNS

The basics of DataFrame indexing and selecting are as follows:

For example,

d = {
    'one' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'a']),
    'two' : pd.Series(['A', 'B', 'C', 'D'], index=['a', 'b', 'c', 'a'])
}
df = pd.DataFrame(d)
df
# one two
# a 1.0 A
# b 2.0 B
# c 3.0 C
# a 4.0 D
type(df['one']) #=> pandas.core.series.Series
df['one']
# a 1.0
# b 2.0
# c 3.0
# a 4.0
# Name: one, dtype: float64
type(df[['one']]) #=> pandas.core.frame.DataFrame
df[['one']]
# one
# a 1.0
# b 2.0
# c 3.0
# a 4.0
type(df[['one', 'two']]) #=> pandas.core.frame.DataFrame
df[['one', 'two']]
# one two
# a 1.0 A
# b 2.0 B
# c 3.0 C
# a 4.0 D
type(df.loc['a']) #=> pandas.core.frame.DataFrame
df.loc['a']
# one two
# a 1.0 A
# a 4.0 D
type(df.loc['b']) #=> pandas.core.series.Series
df.loc['b']
# one 2.0
# two B
# Name: b, dtype: object
type(df.loc[['a', 'c']]) #=> pandas.core.frame.DataFrame
df.loc[['a', 'c']]
# one two
# a 1.0 A
# a 4.0 D
# c 3.0 C
type(df.iloc[0]) #=> pandas.core.series.Series
df.iloc[0]
# one 1.0
# two A
# Name: a, dtype: object
df.iloc[1:3]
# one two
# b 2.0 B
# c 3.0 C
df.iloc[[1, 2]]
# one two
# b 2.0 B
# c 3.0 C
df.iloc[[1, 0, 1, 0]]
# one two
# b 2.0 B
# a 1.0 A
# b 2.0 B
# a 1.0 A
df.iloc[[True, False, True, False]]
# one two
# a 1.0 A
# c 3.0 C

COLUMN ADDITION AND DELETION

You can treat a DataFrame semantically like a dict of like-indexed Series objects. Getting, setting, and deleting columns works with the same syntax as the analogous dict operations.

When inserting a Series that doesn’t have the same index as the DataFrame, it will be conformed to the DataFrame’s index: only values whose labels match the DataFrame’s existing index are kept, and missing positions get NaN as the value.

When inserting a column with a scalar value, the value will naturally be propagated to fill the column.

When you insert an ndarray or list of the same length as the DataFrame, it simply uses the existing index of the DataFrame. But try not to use ndarrays or lists directly with DataFrames; instead, first convert them to Series as follows:

df['yet_another_col'] = array_of_same_length_as_df
# is same as
df['yet_another_col'] = pd.Series(array_of_same_length_as_df, index=df.index)

For example,
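A sketch of the behaviors above (values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c'])})
df['flag'] = df['one'] > 2    # derived column
df['const'] = 'bar'           # scalar is propagated to every row
df['two'] = pd.Series([10., 20.], index=['a', 'x'])  # conformed: only 'a' matches
#    one   flag const   two
# a  1.0  False   bar  10.0
# b  2.0  False   bar   NaN
# c  3.0   True   bar   NaN
```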

By default, columns get inserted at the end. The insert() method is available to insert at a particular location in the columns.

Columns can be deleted using del, like keys of dict.
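A short sketch of insert() and del (values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'one': [1., 2.], 'two': [3., 4.]})
df.insert(1, 'new_col', [5., 6.])  # insert at position 1
list(df.columns)                   # ['one', 'new_col', 'two']
del df['new_col']                  # delete like a dict key
list(df.columns)                   # ['one', 'two']
```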

DATA ALIGNMENT AND ARITHMETIC

Arithmetic between DataFrame objects:

Data between DataFrame objects automatically align on both the columns and the index (row labels). Again, the resulting object will have the union of the column and row labels. For example,
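A minimal sketch of two-DataFrame alignment (shapes chosen for illustration):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((3, 2)), columns=['A', 'B'])
df2 = pd.DataFrame(np.ones((2, 3)), columns=['A', 'B', 'C'])
df1 + df2   # union of columns and rows; non-overlapping cells become NaN
#      A    B   C
# 0  2.0  2.0 NaN
# 1  2.0  2.0 NaN
# 2  NaN  NaN NaN
```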

Important: You might like to try the above example with duplicate column names and index values in each individual data frame.

Boolean operators (for example, df1 & df2) work as well.

Arithmetic between DataFrame and Series:

When doing an operation between DataFrame and Series, the default behavior is to broadcast Series row-wise to match rows in DataFrame and then arithmetics is performed. For example,
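A sketch of the default row-wise broadcasting (values are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=['A', 'B', 'C'])
row = df.iloc[0]   # Series with index ['A', 'B', 'C']
df - row           # row is broadcast against every row of df
#    A  B  C
# 0  0  0  0
# 1  3  3  3
# 2  6  6  6
# 3  9  9  9
```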

In the special case of working with time series data, where the DataFrame index also contains dates, the broadcasting will be column-wise (note that this special case has been deprecated in recent pandas versions; prefer explicit methods such as df.sub(series, axis=0)). For example,

Here pd.date_range() is used to create a fixed-frequency DatetimeIndex, which is then used as the index (rather than the default index of 0, 1, 2, ...) for a DataFrame.

For explicit control over the matching and broadcasting behavior, see the section on flexible binary operations.

Arithmetic between DataFrame and scalars:

Operations with scalars are just as you would expect: broadcasted to each cell (that is, to all columns and rows).
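A one-liner sketch (values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df * 5 + 2   # applied to every cell
#     A   B
# 0   7  17
# 1  12  22
```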

DATAFRAME METHODS AND FUNCTIONS

Evaluating string describing operations using eval() method

Note: Rather use assign() method.

The eval() evaluates a string describing operations on DataFrame columns. It operates on columns only, not specific rows or elements. This allows eval() to run arbitrary code, which can make you vulnerable to code injection if you pass user input into this function.

df = pd.DataFrame({'A': range(1, 6), 'B': range(10, 0, -2)})
df
# A B
# 0 1 10
# 1 2 8
# 2 3 6
# 3 4 4
# 4 5 2
df.eval('2*A + B')
# 0 12
# 1 12
# 2 12
# 3 12
# 4 12
# dtype: int64

Assignment is allowed though by default the original DataFrame is not modified. Use inplace=True to modify the original DataFrame. For example,

df.eval('C = A + 2*B', inplace=True)
df
# A B C
# 0 1 10 21
# 1 2 8 18
# 2 3 6 15
# 3 4 4 12
# 4 5 2 9

Assigning new columns to the copies in method chains — assign() method

Inspired by dplyr’s mutate verb, DataFrame has an assign() method that allows you to easily create new columns that are potentially derived from existing columns.

The assign() method always returns a copy of data, leaving the original DataFrame untouched.

Note: Also check pipe() method.

df2 = df.assign(one_ratio = df['one']/df['out_of'])
df2
# one two one_trunc out_of const one_ratio
# a 1.0 1.0 1.0 100 1 0.01
# b 2.0 2.0 2.0 100 1 0.02
# c 3.0 3.0 NaN 100 1 0.03
# d NaN 4.0 NaN 100 1 NaN
id(df) #=> 4436438040
id(df2) #=> 4566906360

Above was an example of inserting a precomputed value. We can also pass in a function of one argument to be evaluated on the DataFrame being assigned to.

df3 = df.assign(one_ratio = lambda x: (x['one']/x['out_of']))
df3
# one two one_trunc out_of const one_ratio
# a 1.0 1.0 1.0 100 1 0.01
# b 2.0 2.0 2.0 100 1 0.02
# c 3.0 3.0 NaN 100 1 0.03
# d NaN 4.0 NaN 100 1 NaN
id(df) #=> 4436438040
id(df3) #=> 4514692848

This way you can remove a dependency on the DataFrame’s variable name.

Appending rows with append() method

The append() method appends rows of other_data_frame DataFrame to the end of current DataFrame, returning a new object. The columns not in the current DataFrame are added as new columns.

Its most useful syntax is:

<data_frame>.append(other_data_frame, ignore_index=False)

Here,

  • other_data_frame: Data to be appended in the form of DataFrame or Series/dict-like object, or a list of these.
  • ignore_index: By default it is False. If it is True, then the index labels of other_data_frame are ignored and a fresh 0..n-1 index is generated.

Note: Also check the concat() function. (In recent pandas versions, append() has been deprecated in favor of concat().)

For example,

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df
# A B
# 0 1 2
# 1 3 4
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
df2
# A B
# 0 5 6
# 1 7 8
df.append(df2)
# A B
# 0 1 2
# 1 3 4
# 0 5 6
# 1 7 8
df.append(df2, ignore_index=True)
# A B
# 0 1 2
# 1 3 4
# 2 5 6
# 3 7 8

The drop() method

Note: Rather use del as stated in Column Addition and Deletion section, and indexing + re-assignment for keeping specific rows.

The drop() function removes rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names. When using a multi-index, labels on different levels can be removed by specifying the level.

The values attribute and copy() method

The values attribute

The values attribute returns NumPy representation of a DataFrame‘s data. Only the values in the DataFrame will be returned, the axes labels will be removed. A DataFrame with mixed type columns (e.g. str/object, int64, float32) results in an ndarray of the broadest type that accommodates these mixed types.

Check Console Display section for an example.

The copy() method

The copy() method makes a copy of the DataFrame object’s indices and data, as by default deep is True. So, modifications to the data or indices of the copy will not be reflected in the original object.

If deep=False, a new object will be created without copying the calling object’s data or index (only references to the data and index are copied). Any changes to the data of the original will be reflected in the shallow copy (and vice versa).

Its syntax is:

df.copy(deep=True)

Transposing using T attribute or transpose() method

Refer section Arithmetic, matrix multiplication, and comparison operations.

To transpose, you can call the transpose() method, or use the T attribute, which is an accessor to the transpose() method.

The result is a DataFrame as a reflection of original DataFrame over its main diagonal by writing rows as columns and vice-versa. Transposing a DataFrame with mixed dtypes will result in a homogeneous DataFrame with the object dtype. In such a case, a copy of the data is always made.

For example,
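A minimal sketch (values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'one': [1., 2.], 'two': [3., 4.]}, index=['a', 'b'])
df.T   # rows become columns and vice versa
#       a    b
# one  1.0  2.0
# two  3.0  4.0
```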

Sorting (sort_values(), sort_index()), Grouping (groupby()), and Filtering (filter())

The sort_values() method

A DataFrame can be sorted by a column (or by multiple columns) using the sort_values() method.

For example,
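A small sketch (values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'B', 'A'], 'col2': [2, 1, 0]})
df.sort_values(by='col2')   # by a single column
df.sort_values(by=['col1', 'col2'], ascending=[True, False])  # by multiple columns
```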

The sort_index() method

The sort_index() method can be used to sort by index.

For example,
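A small sketch (values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'val': [1, 2, 3]}, index=['c', 'a', 'b'])
df.sort_index()                 # rows ordered a, b, c
df.sort_index(ascending=False)  # rows ordered c, b, a
```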

The groupby() method

The groupby() method is used to group by a function, label, or a list of labels.

For example,
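A minimal sketch of grouping plus an aggregation (values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'animal': ['cat', 'dog', 'cat'], 'weight': [4.0, 20.0, 6.0]})
df.groupby('animal')['weight'].sum()
# animal
# cat     10.0
# dog     20.0
# Name: weight, dtype: float64
```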

The filter() method

The filter() method returns a subset of rows or columns of a DataFrame according to labels in the specified index. Note that this method does not filter a DataFrame on its contents; the filter is applied to the labels of the index or to the column names.

You can use the items, like, and regex parameters, but note that they are mutually exclusive. The axis parameter defaults to the info axis, which is the axis used when indexing with [].

For example,
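A sketch showing all three mutually exclusive parameters (labels are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2], 'two': [3, 4], 'three': [5, 6]},
                  index=['mouse', 'rabbit'])
df.filter(items=['one', 'three'])  # select columns by exact name
df.filter(regex='e$', axis=1)      # column names ending in 'e'
df.filter(like='bbi', axis=0)      # row labels containing 'bbi'
```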

Melting and Pivoting using melt() and pivot() methods

The idea of melt() is to keep a few given columns as id-columns and convert the rest of the columns (called variable-columns) into variable and value columns, where variable tells you the original column name and value holds the corresponding value from that column.

If there are n variable-columns being melted, then the information from each row of the original format is now spread across n rows.

The idea of pivot() is to do just the reverse.

For example,
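A round-trip sketch of melt() and pivot() (values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'name': ['x', 'y'], 'a': [1, 2], 'b': [3, 4]})
melted = df.melt(id_vars=['name'], var_name='variable', value_name='value')
#   name variable  value
# 0    x        a      1
# 1    y        a      2
# 2    x        b      3
# 3    y        b      4
restored = melted.pivot(index='name', columns='variable', values='value')
```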

Piping (chaining) Functions using pipe() method

Suppose you want to apply a function to a DataFrame, Series, or groupby object, then apply another function to its output, and so on. One way would be to perform this operation in a “sandwich”-like fashion:

Note: Also check assign() method.

df = foo3(foo2(foo1(df, arg1=1), arg2=2), arg3=3)

In the long run, this notation becomes fairly messy. What you want here is pipe(), which can be thought of as function chaining. This is how you would perform the same task with pipe():

(df.pipe(foo1, arg1=1)
   .pipe(foo2, arg2=2)
   .pipe(foo3, arg3=3))

This is cleaner and helps keep track of the order in which the functions and their corresponding arguments are applied.

Rolling Windows using rolling() method

Use DataFrame.rolling() for rolling window calculation.
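A minimal sketch using a rolling window of size 2 (values are illustrative):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])
s.rolling(window=2).sum()   # first value is NaN: the window is incomplete
# 0    NaN
# 1    3.0
# 2    5.0
# 3    7.0
# 4    9.0
# dtype: float64
```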

Other DataFrame Methods

Refer Methods section in pd.DataFrame.

Refer Computational Tools User Guide.

Refer the categorical listing at Pandas API.

APPLYING FUNCTIONS

The apply() method: apply on columns/rows

The apply() method applies the given function along an axis (by default on columns) of the DataFrame.

Its most useful form is:

df.apply(func, axis=0, args=(), **kwds)

Here:

  • func: The function to apply to each column or row. Note that it can be an element-wise function (in which case axis=0 or axis=1 makes no difference) or an aggregate function.
  • axis: Its value can be 0 (default, columns) or 1. 0 means applying the function to each column, and 1 means applying it to each row. Note that this axis is defined as in NumPy: for a 2D ndarray, 0 means along the columns.
  • args: It is a tuple and represents the positional arguments to pass to func in addition to the array/series.
  • **kwds: It represents the additional keyword arguments to pass as keywords arguments to func.

It returns a Series or a DataFrame.

For example,
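A sketch of both axes plus an element-wise function (values are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})
df.apply(np.sum)           # aggregate each column: A -> 6, B -> 60
df.apply(np.sum, axis=1)   # aggregate each row: 11, 22, 33
df.apply(np.sqrt)          # element-wise func: axis makes no difference
```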

The applymap() method: apply element-wise

The applymap() method applies the given function element-wise, so the given func must accept a scalar and return a scalar; it is applied to every element of the DataFrame.

Its general syntax is:

df.applymap(func)

For example,
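A small sketch (values are illustrative; in newer pandas this method is being renamed to DataFrame.map):

```python
import pandas as pd

df = pd.DataFrame({'A': [1.23, 4.56], 'B': [7.89, 0.12]})
df.applymap(lambda x: round(x))   # the function gets (and returns) one scalar
#    A  B
# 0  1  8
# 1  5  0
```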

When you need to apply a function element-wise, first check whether a vectorized version is available, as it will often be much faster. You could square each number element-wise using df.applymap(lambda x: x**2), but the vectorized version df**2 is better.

WORKING WITH MISSING DATA

Refer SciKit-Learn’s Data Cleaning section.

Refer Missing Data Guide and API Reference for Missing Data Handling: dropna, fillna, replace, interpolate.

Also check Data Cleaning section of The tf.feature_column API on other options.

Also go through https://www.analyticsvidhya.com/blog/2016/01/12-pandas-techniques-python-data-manipulation/

NORMALIZING DATA

One way is to perform df / df.iloc[0], which is particularly useful while analyzing stock prices over a period of time for multiple companies.
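A sketch with made-up prices for two hypothetical tickers, 'XYZ' and 'ABC':

```python
import pandas as pd

prices = pd.DataFrame({'XYZ': [100., 110., 120.], 'ABC': [50., 45., 55.]})
prices / prices.iloc[0]   # every series now starts at 1.0, easing comparison
#    XYZ  ABC
# 0  1.0  1.0
# 1  1.1  0.9
# 2  1.2  1.1
```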

THE CONCAT() FUNCTION

The concat() function performs concatenation operations along an axis while performing optional set logic (union or intersection) of the indexes (if any) on the other axis.

The default axis of concatenation is axis=0, but you can choose to concatenate data frames sideways by choosing axis=1.

Note: Also check append() method.

For example,

df1 = pd.DataFrame(
    {
        'A': ['A0', 'A1', 'A2', 'A3'],
        'B': ['B0', 'B1', 'B2', 'B3'],
        'C': ['C0', 'C1', 'C2', 'C3'],
        'D': ['D0', 'D1', 'D2', 'D3']
    }, index=[0, 1, 2, 3]
)
df2 = pd.DataFrame(
    {
        'A': ['A4', 'A5', 'A6', 'A7'],
        'B': ['B4', 'B5', 'B6', 'B7'],
        'C': ['C4', 'C5', 'C6', 'C7'],
        'D': ['D4', 'D5', 'D6', 'D7']
    }, index=[4, 5, 6, 7]
)
df3 = pd.DataFrame(
    {
        'A': ['A8', 'A9', 'A10', 'A11'],
        'B': ['B8', 'B9', 'B10', 'B11'],
        'C': ['C8', 'C9', 'C10', 'C11'],
        'D': ['D8', 'D9', 'D10', 'D11']
    }, index=[1, 2, 3, 4]
)
frames = [df1, df2, df3]
df4 = pd.concat(frames)
df4
# A B C D
# 0 A0 B0 C0 D0
# 1 A1 B1 C1 D1
# 2 A2 B2 C2 D2
# 3 A3 B3 C3 D3
# 4 A4 B4 C4 D4
# 5 A5 B5 C5 D5
# 6 A6 B6 C6 D6
# 7 A7 B7 C7 D7
# 1 A8 B8 C8 D8
# 2 A9 B9 C9 D9
# 3 A10 B10 C10 D10
# 4 A11 B11 C11 D11
df5 = pd.concat(frames, ignore_index=True)
df5
# A B C D
# 0 A0 B0 C0 D0
# 1 A1 B1 C1 D1
# 2 A2 B2 C2 D2
# 3 A3 B3 C3 D3
# 4 A4 B4 C4 D4
# 5 A5 B5 C5 D5
# 6 A6 B6 C6 D6
# 7 A7 B7 C7 D7
# 8 A8 B8 C8 D8
# 9 A9 B9 C9 D9
# 10 A10 B10 C10 D10
# 11 A11 B11 C11 D11
df5 = pd.concat(frames, keys=['s1', 's2', 's3'])
df5
# A B C D
# s1 0 A0 B0 C0 D0
# 1 A1 B1 C1 D1
# 2 A2 B2 C2 D2
# 3 A3 B3 C3 D3
# s2 4 A4 B4 C4 D4
# 5 A5 B5 C5 D5
# 6 A6 B6 C6 D6
# 7 A7 B7 C7 D7
# s3 1 A8 B8 C8 D8
# 2 A9 B9 C9 D9
# 3 A10 B10 C10 D10
# 4 A11 B11 C11 D11
df5.index
# MultiIndex(levels=[['s1', 's2', 's3'], [0, 1, 2, 3, 4, 5, 6, 7]],
# labels=[[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2], [0, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4]])

Like its sibling function on ndarrays, numpy.concatenate(), the pandas.concat() takes a list or dict of homogeneously-typed objects and concatenates them with some configurable handling of “what to do with other axes”.

MERGING AND JOINING USING MERGE() AND JOIN() FUNCTIONS

Refer to the Merge, Join, and Concatenate official guide.

The merge() function merges DataFrame objects by performing a database-style join operation by columns or indexes.

The join() method joins columns of another DataFrame, either on the index or on a key column.
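A minimal sketch of both (keys and values are illustrative):

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1'], 'A': ['A0', 'A1']})
right = pd.DataFrame({'key': ['K0', 'K1'], 'B': ['B0', 'B1']})

pd.merge(left, right, on='key')   # database-style join on a column
#   key   A   B
# 0  K0  A0  B0
# 1  K1  A1  B1

left.set_index('key').join(right.set_index('key'))   # join on index
```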

BINARY DUMMY VARIABLES FOR CATEGORICAL VARIABLES USING GET_DUMMIES() FUNCTION

A categorical variable can be converted into a “dummy” (one-hot) DataFrame using get_dummies():

df = pd.DataFrame({'char': list('bbacab'), 'data1': range(6)})
df
# char data1
# 0 b 0
# 1 b 1
# 2 a 2
# 3 c 3
# 4 a 4
# 5 b 5
dummies = pd.get_dummies(df['char'], prefix='key')
dummies
# key_a key_b key_c
# 0 0 1 0
# 1 0 1 0
# 2 1 0 0
# 3 0 0 1
# 4 1 0 0
# 5 0 1 0

PLOTTING DATAFRAME USING PLOT() FUNCTION

The plot() function makes plots of DataFrame using matplotlib/pylab.
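A minimal sketch (the data and column names are made up; the Agg backend is used so no display window is needed):

```python
import matplotlib
matplotlib.use('Agg')   # non-interactive backend
import pandas as pd

df = pd.DataFrame({'x': range(5), 'y': [v ** 2 for v in range(5)]})
ax = df.plot(x='x', y='y', kind='line')   # returns a matplotlib Axes
```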

Panel (3D data structure)

Panel is a container for 3D data. The term panel data is derived from econometrics and is partially responsible for the name: pan(el)-da(ta)-s.

The 3D structure of a Panel is much less common for many types of data analysis than the 1D Series or the 2D DataFrame. Oftentimes, one can simply use a MultiIndex DataFrame to work with higher-dimensional data. Refer to Deprecate Panel. (Panel has since been removed from pandas.)
