ON PANDAS

Become a pro at pandas, Python’s data manipulation library

Julien Kervizic
May 9, 2019

The pandas library is the most popular data manipulation library for Python. It provides an easy way to manipulate data through its data-frame API, inspired by R’s data frames.

Photo by Damian Patkowski on Unsplash

Understanding the pandas library

The core internal model of the data frame is a collection of NumPy arrays, and some pandas functions, such as the now-deprecated as_matrix, return results in NumPy’s internal representation.

Pandas leverages other libraries to get data in and out of data frames. SQLAlchemy, for instance, is used by the read_sql and to_sql functions, while openpyxl and XlsxWriter back the read_excel and to_excel functions.

Matplotlib and Seaborn, in turn, provide an easy interface for plotting the information available within a data frame, through commands such as df.plot().

NumPy’s Pandas — Efficient pandas

APPLY & VECTORIZED OPERATIONS

The apply interface recovers some efficiency by handling the looping at a lower level than a plain Python for loop:

df.apply(lambda x: x['col_a'] * x['col_b'], axis=1)

But most of the performance gain comes from using vectorized operations themselves, either directly in pandas or by calling its internal NumPy arrays directly.
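As a minimal sketch of the difference (the data and column names are illustrative), the same multiplication can be written row by row with apply or as a single vectorized expression:

```python
import pandas as pd

# illustrative data frame with the column names used above
df = pd.DataFrame({"col_a": [1, 2, 3], "col_b": [10, 20, 30]})

# row-by-row with apply: a Python-level call per row
slow = df.apply(lambda x: x["col_a"] * x["col_b"], axis=1)

# vectorized: one operation over whole columns at once
fast = df["col_a"] * df["col_b"]

print(fast.tolist())  # [10, 40, 90]
```

Both produce the same Series; on large frames the vectorized form is typically orders of magnitude faster.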


The difference in performance can be drastic: in one benchmark, an addition took 3.53 ms as a vectorized operation versus 27.8 s when looping with apply. Additional efficiencies can be obtained by directly invoking NumPy’s arrays and API.

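A sketch of dropping down to the underlying NumPy arrays (the data is illustrative); operating on the raw arrays skips pandas’ index-alignment overhead:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col_a": [1.0, 2.0, 3.0], "col_b": [10.0, 20.0, 30.0]})

# vectorized at the pandas level (still carries index alignment)
product_pd = df["col_a"] * df["col_b"]

# operating on the raw NumPy arrays avoids that overhead entirely
product_np = np.multiply(df["col_a"].to_numpy(), df["col_b"].to_numpy())

print(product_np.tolist())  # [10.0, 40.0, 90.0]
```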

Swifter: swifter is a Python library that makes it easy to speed up different types of operations on a data frame; its API is similar to that of the apply function.

EFFICIENT DATA STORING THROUGH DTYPES

df.astype({'testColumn': str, 'testCountCol': float})

Dtypes are native NumPy objects that allow you to define the exact type and number of bits used to store a given piece of information.

NumPy’s np.dtype('int32'), for instance, represents a 32-bit integer. Pandas defaults to 64-bit integers, so we could save half the space by using 32 bits.


memory_usage() shows the number of bytes used by each column; with only one entry (row) per column, each int64 column takes 8 bytes and each int32 column 4 bytes.
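A minimal sketch of the comparison, reusing the testCountCol name from the snippet above (the data is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"testCountCol": [1]})       # pandas defaults to int64
df32 = df.astype({"testCountCol": np.int32})   # half the storage per value

print(df.memory_usage(index=False)["testCountCol"])    # 8 bytes
print(df32.memory_usage(index=False)["testCountCol"])  # 4 bytes
```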

Pandas also introduces the categorical dtype, which allows for efficient memory utilization of frequently occurring values. In one example, converting the posting_date field to a categorical value decreased its memory utilization 28-fold.

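A hedged sketch of the effect with synthetic repeating dates (the exact ratio depends on the data; deep=True makes memory_usage count the actual string storage rather than just pointers):

```python
import pandas as pd

# a synthetic column where a few values repeat many times
posting_date = pd.Series(["2019-05-01", "2019-05-02"] * 50_000)
as_category = posting_date.astype("category")

# the categorical version stores each distinct string once,
# plus a small integer code per row
print(posting_date.memory_usage(deep=True))
print(as_category.memory_usage(deep=True))
```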

In our example, the overall size of the data frame dropped by more than 3x just by changing this one data type.


Using the right dtypes not only allows you to handle larger datasets in memory, it also makes some computations more efficient. In the same example, the categorical type brought a 3x speed improvement to a groupby/sum operation.

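A sketch of such a groupby on a categorical column (synthetic data); the speedup comes from pandas grouping on small integer codes rather than on strings:

```python
import pandas as pd

df = pd.DataFrame({
    "store": ["A", "B", "C", "A"] * 25_000,
    "sales": [1, 2, 3, 4] * 25_000,
})
df_cat = df.astype({"store": "category"})

# groups on the categorical's integer codes under the hood
totals = df_cat.groupby("store", observed=True)["sales"].sum()
print(totals.to_dict())  # {'A': 125000, 'B': 50000, 'C': 75000}
```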

Within pandas, you can define dtypes either during the data load (the read_ family of functions) or as a type conversion (astype).
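A minimal sketch of declaring dtypes at read time, using an in-memory CSV and the column names from the astype snippet above (both are illustrative):

```python
import io
import numpy as np
import pandas as pd

csv_data = io.StringIO("testColumn,testCountCol\nfoo,3\nbar,5\n")

# declare dtypes up front rather than converting after the load
df = pd.read_csv(csv_data, dtype={"testColumn": str, "testCountCol": np.float32})
print(df.dtypes)
```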

CyberPandas: cyberpandas is one of several extension libraries that enable a wider variety of data types; it adds support for IPv4 and IPv6 address types and stores them efficiently.

HANDLING LARGE DATASETS WITH CHUNKS


Defining a chunk size when reading a data source, combined with the get_chunk method, allows pandas to process data as an iterator; a large file can, for instance, be read two rows at a time. We can then iterate through these chunks:

import pandas as pd

df_iter = pd.read_csv("large_file.csv", chunksize=2)  # illustrative file name
for i, chunk in enumerate(df_iter):
    new_chunk = chunk.apply(lambda x: do_something(x), axis=1)
    new_chunk.to_csv("chunk_output_%i.csv" % i)

Each chunk’s output can then be fed to a CSV file, pickled, exported to a database, and so on.

Processing data chunk by chunk also allows certain operations to be performed with multiprocessing.
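A hedged sketch of fanning chunks out to a worker pool; for portability it uses the thread-backed Pool from multiprocessing.dummy (same API as multiprocessing.Pool, which you would swap in for true multi-process execution), and the data and process_chunk function are illustrative:

```python
import io
from multiprocessing.dummy import Pool  # thread-backed, same API as multiprocessing.Pool
import pandas as pd

def process_chunk(chunk):
    # stand-in for real per-chunk work
    return chunk["value"].sum()

csv_data = io.StringIO("value\n1\n2\n3\n4\n5\n6\n")
chunks = list(pd.read_csv(csv_data, chunksize=2))

# each chunk is handed to a worker; results come back in order
with Pool(2) as pool:
    partial_sums = pool.map(process_chunk, chunks)

total = sum(partial_sums)
print(partial_sums, total)  # [3, 7, 11] 21
```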

Dask: a framework built on top of pandas with multiprocessing and distributed processing in mind. It makes use of collections of pandas data-frame chunks both in memory and on disk.

SQLAlchemy’s Pandas — Database pandas

SQL TRANSACTIONS

with engine.begin() as conn:
    df.to_sql(
        tableName,
        con=conn,
        ...
    )

The advantage of using a SQL transaction is that the transaction rolls back should the data load fail.
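A minimal runnable sketch of the same idea using the stdlib sqlite3 driver, which pandas also accepts; sqlite3’s connection context manager commits on success and rolls back if the block raises, mirroring the engine.begin() behavior above (table and data are illustrative):

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})
conn = sqlite3.connect(":memory:")

# commits if the block succeeds, rolls back if it raises
with conn:
    df.to_sql("test_table", conn, index=False)

loaded = pd.read_sql("SELECT * FROM test_table", conn)
print(len(loaded))  # 2
```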

SQL extension

Pandas has a few SQL extensions, such as pandasql, a library that allows SQL queries to be run on top of data frames. Through pandasql, data-frame objects can be queried as if they were database tables.


SQL UPSERTs

Pandas doesn’t natively support upsert exports to SQL, even on databases that support the operation. Patches to pandas exist to add this feature.
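One common workaround, sketched here with stdlib sqlite3 and an illustrative table, is to bypass to_sql and issue the database’s own upsert form directly (SQLite’s INSERT OR REPLACE; other databases use ON CONFLICT, ON DUPLICATE KEY, or MERGE):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v INTEGER)")
conn.execute("INSERT INTO kv VALUES ('a', 1)")

df = pd.DataFrame({"k": ["a", "b"], "v": [10, 20]})

# keyed on the primary key: 'a' is updated, 'b' is inserted
conn.executemany(
    "INSERT OR REPLACE INTO kv VALUES (?, ?)",
    df.itertuples(index=False, name=None),
)
conn.commit()

result = pd.read_sql("SELECT * FROM kv ORDER BY k", conn)
print(result.to_dict("records"))  # [{'k': 'a', 'v': 10}, {'k': 'b', 'v': 20}]
```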

Matplotlib/Seaborn — Visual pandas

Extensions: Various extensions, such as Bokeh and Plotly, provide interactive visualization within Jupyter notebooks, and matplotlib can also be extended to handle 3D graphs.
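A minimal sketch of the default matplotlib-backed plotting interface mentioned earlier (the data is illustrative; the Agg backend is forced so the snippet runs outside a notebook):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe outside notebooks
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [1, 4, 9, 16]})

# df.plot delegates to matplotlib and returns the Axes it drew on,
# which can be customized further with the regular matplotlib API
ax = df.plot(x="x", y="y", title="Quadratic growth")
```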

Other Extensions

tqdm

tqdm is a progress-bar extension for Python that integrates with pandas. Once registered, it lets the user see the progress of map and apply operations on a data frame through the progress_map and progress_apply functions.

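A minimal sketch of the registration step and a progress_apply call (the data is illustrative, and tqdm must be installed):

```python
import pandas as pd
from tqdm import tqdm

tqdm.pandas()  # registers progress_apply / progress_map on pandas objects

df = pd.DataFrame({"n": range(10_000)})

# behaves like apply, but renders a progress bar as it runs
squares = df["n"].progress_apply(lambda v: v * v)
```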

PrettyPandas

PrettyPandas is a library that provides an easy way to format data frames and to add table summaries to them.


Hacking Analytics

All around data & analytics topics

Written by Julien Kervizic
Living at the interstice of business, data and technology | Solution Architect & Head of Data | Heineken, Facebook and Amazon | linkedin: https://bit.ly/2XbDffo
