Pandas: Under the hood and tips/tricks for data science and data wrangling

Manikandan Jeeva
12 min read · Mar 1, 2020


Photo by chuttersnap on Unsplash

Visualizing data in a table format is more intuitive than a multidimensional numpy array, a key-value pair, or an entity-level document structure (similar to NoSQL). The visualization community may ignore tables as a core visualization element, but tables still occupy 30% to 40% of any business performance dashboard.

“Above all else show the data” ― Edward R. Tufte, The Visual Display of Quantitative Information

As data scientists and data engineers, we are accustomed to seeing data in an Excel worksheet or SQL table with columns and rows. Normalized database design (storing data across multiple tables with relations as primary and foreign keys) works well in a transaction-based application that prioritizes storage space over the operations performed on top of the data. In the data science world, it is always preferable to have a single de-normalized data table with all data elements in it so that both labels and independent variables are in a single row.

I was a core SAS programmer/architect (in that sequence) for almost a decade, and when I switched to Python five years back, the very first thing I dearly missed was working with a SAS dataset. In SAS, data is available in a tabular format. It can be accessed one row at a time (the data step handles the iteration across the entire dataset). We can also use a procedure to aggregate the SAS dataset across multiple rows (similar to group by in SQL) and calculate aggregation metrics like sum, count, or average. In Python, Pandas came to my rescue. It empowered me to work with tabular data instantly, but it also kept me away from the quintessential numpy arrays for an extended period.

In this blog post, let’s discuss the following topics:

  1. Pandas — Under the hood (Internal mechanics)
  2. Scenarios in which pandas wins hands-down
  3. Methods to empower and stretch pandas (top 10 tips and tricks I use in my data analysis projects)
  4. Finally, on pandas alternatives when pandas leaves us all alone with an out-of-memory error
  5. What’s new in Pandas 1.0.0

Pandas: Under the hood

Pandas was authored by Wes McKinney in 2008, and it became a NumFOCUS-sponsored project in 2015. Its codebase is written mostly in Python (~92%), with tight integration with numpy (for scientific computing) and matplotlib (for data visualization). It is maintained by an active and supportive community of ~1.8k contributors, which makes it a great place to start your Python open-source contributions. With the vision of providing accessible data analysis and manipulation software to everyone, Pandas has put Python on the radar of data scientists, as it makes it much easier to work with structured, labeled datasets — to explore, shape, analyze, and visualize data. Python statistical packages like Statsmodels, scikit-learn, and TensorFlow integrate well with Pandas data structures.

To understand the basics of any package or language, I believe the user has to understand two essential items

a. Details about the basic and derived data types

b. Fundamental data structure supported by the package

Pandas offers the following dtypes (data types), and using the right dtype for a given input data column is crucial for production-grade data analysis routines:

  • Object: used for text and mixed numeric/non-numeric values. If the given data does not fit any of the dtypes below, then the object type is assigned to it. It is both a blessing and a curse in the package; there is a reason why objects are not the way to go for performance-tuned codebases
  • int64: integer numbers — covers both signed and unsigned integers with varying lengths — 8, 16, 32, and 64 bits
  • float64: floating-point numbers — covers the lengths 16, 32, and 64 bits
  • bool: True and False values
  • datetime64: covers the date and time values
  • timedelta[ns]: used to capture the difference between two datetime values. This dtype is helpful when the user is working with time-series data
  • category: used to cover a finite list of text values

Pandas 0.2*.* does not have a separate dtype to cover strings. The Pandas 1.0.0 release has a dedicated string dtype.
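
As a quick illustration of these dtypes, here is a minimal sketch with made-up values; the column names are purely for illustration:

```python
import pandas as pd

# One column per common dtype (values and column names are illustrative)
df = pd.DataFrame({
    "city": ["Bengaluru", "Chennai", "Mumbai"],                                # object (text)
    "units_sold": [120, 85, 240],                                              # int64
    "price_in_lakhs": [75.5, 62.0, 110.25],                                    # float64
    "is_ready_to_move": [True, False, True],                                   # bool
    "listed_on": pd.to_datetime(["2020-01-05", "2020-01-20", "2020-02-11"]),   # datetime64[ns]
})
df["days_on_market"] = pd.Timestamp("2020-03-01") - df["listed_on"]            # timedelta64[ns]
df["city"] = df["city"].astype("category")                                     # category

print(df.dtypes)
```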

With respect to data structures, Pandas keeps it very simple, limiting itself to two primary data structures: 1. Series and 2. Dataframes. An index (or label) is part of both data structures, and data alignment is intrinsic until the link between labels and data is broken manually (numpy mostly assumes index == label). The length of the label vector should match that of the data vector. If there is no label vector available during loading, Pandas generates one for the data structure.

  • Series: a 1-dimensional array-like structure of homogeneous (same dtype) elements. A Series is similar to holding a single row of records, with the index acting as the column names
  • Dataframe: a 2-dimensional tabular structure with columns of different data types. A Dataframe has two indexes — one to refer to a row and one to refer to a column. The data structure is mutable, so aligning rows or new columns can be added to the existing data element

The best way to make the connections between data structures is to think of a higher data structure as a combination of a lower one. A series is like an array and contains only one dtype across the entire data structure. A dataframe is like a combination of multiple series to hold different types of data.

For dataframes, behind the scenes, Pandas manages similarly typed columns as blocks and uses a Block Manager to translate between dataframes and blocks. The Block Manager handles the indexes and gives the user the sense of working with a tabular data structure.
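
A minimal sketch of the two data structures and the label-based alignment described above (labels and values are made up):

```python
import pandas as pd

# A Series is a 1-D labeled array; the index is the label vector
area = pd.Series([1200, 950, 1800], index=["prop_a", "prop_b", "prop_c"], name="sqft")

# A Dataframe is effectively a collection of Series sharing a row index
price = pd.Series([60.0, 75.0, 120.0], index=["prop_b", "prop_a", "prop_c"], name="price_lakhs")
df = pd.DataFrame({"sqft": area, "price_lakhs": price})

# Alignment is intrinsic: values are matched on labels, not on position,
# so prop_a still gets 75.0 even though the two Series are ordered differently
print(df)
```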

Scenarios in which pandas wins hands-down

  1. Loading data from multiple data sources is a breeze in Pandas and is done in a few lines of code. IO tools are available for loading flat files (both CSV and delimited), Excel files, and databases. Users can also employ HDF5 to perform ultrafast saving/loading of data from disk
  2. When the user is more comfortable with the SQL language and knows how to use group by statements properly. The group by functionality in Pandas performs the split-apply-combine operations on a dataset in an optimized way:
     • Splitting the data into groups based on given criteria
     • Applying a function to each group independently
     • Combining the results into a data structure
  3. Pandas is a must-have package when the user is dealing with a time-series dataset (financial and IoT-generated). Date range generation, frequency conversion, moving window statistics, moving window linear regressions, and date shifting and lagging are a few features to mention (a small sketch follows this list)
  4. It is easier to pivot and reshape (stack and unstack) data sets, along with label-based slicing, indexing, and subsetting
  5. Hierarchical indexing of axes makes it possible to retrieve information in a much faster manner
  6. Pandas, powered by matplotlib, has the right set of visualization tools for quick data exploration via simple data visualizations. It’s much easier to apply a data function on top of a Pandas dataframe and use the subset for a quick scatter plot
  7. Function chaining can help users write shorter code to perform multiple sequential steps — each one dependent on the previous step’s result. Proper comments should be added as part of function chaining to explain the logic in plain English, as it can be confusing for someone looking at the code for the first time

Methods to empower and stretch pandas (top 10 tips and tricks I use in my regular data analysis projects)

We are using the housing price dataset (not Boston, but Bengaluru) for this exercise, along with a Pandas 0.2*.* version. The entire Python code is available at the following GitHub location: https://github.com/manikandanj2207/dataibreathe/blob/master/07.Blog_05_Pandas_Tips_and_Tricks.py

  • In Python, most data structures come with an iterator. Iterating through a Python data structure element by element via a for or while loop is second nature to a Python programmer handling data elements. Seeing a Pandas column as a list of elements attached to a positional index is obvious, and it is common to loop through the elements one at a time. Looping across elements may get the job done, but there is a more efficient way of handling Pandas data structures. Think of a Pandas dataframe column or series as a single vector, and structure operations to handle the entire vector instead of each element one at a time. In the first sketch after this list, there are three ways in which we calculate the carpet area from the total area in square feet. Function carpet_area_f1 uses an iterator to loop through the dataframe using .iloc and calculate the carpet area. Function carpet_area_f2 uses the built-in Pandas iterator .iterrows to iterate across column elements. The third method employs .apply to calculate the carpet area in a single shot. The third method is much better since the execution time is in milliseconds, compared to seconds for the first two functions
  • Pandas dataframes and Python objects exist in memory only temporarily, so it is better to load only the columns that are required and to define the column dtypes properly, since the default dtype is object. The object type demands more memory compared to float or int. In the second sketch below, loading the entire dataset takes close to 5MB of memory. If we load only the required columns, the dataframe uses 3MB, and when we define proper dtypes, the dataframe size reduces to 2MB. We are playing with a small dataset here; think about the efficiency and performance gain with a massive dataset
  • Pandas has a great method to append similarly structured dataframes together. We can use it to read multiple data files of the same structure from a folder with a few lines of code. In the sketch below, we use glob to list the data files and Pandas concat to join them into a single dataframe. Pandas uses either a C or a Python engine to read CSV files
  • Splitting a dataframe into a train and a test dataframe is a common operation. The scikit-learn package comes with a function to split and shuffle a given dataset, but if we do not want to load another package and want to do the splitting using Pandas, we can use the Pandas sample function. In the sketch below, we sample 80% of the dataframe into a train dataframe and use the rest as the test dataframe
  • Splitting a single column into multiple columns based on a given delimiter is a common data manipulation task. In the housing price data, the size column contains the number of bedrooms in the format {Number of bedrooms} + space + {String ‘BHK’ or String ‘Bedroom’}. If we need to split this column and pick the first value into another column, we use the str.split function. In the sketch below, we rename the size column to property_size and then split it and pick the first value
  • The num_of_bedrooms column we got from point #5 uses the object dtype, and it should be converted to a numeric dtype before we can apply any aggregation. In the sketch below, we use the Pandas to_numeric function along with fillna to set values with conversion errors to zero
  • The objective of storing data in a tabular structure is the ease of aggregating numeric columns based on some categorical columns. Pandas enables the user with an array of aggregate functions along with function chaining, applying a powerful split-apply-combine logic behind the scenes to make the aggregate functions effective. In the sketch below, Pandas groups rows by location, splits them into smaller chunks based on the location value, applies an aggregate function on top of the chunks, and then combines the final results
  • There are often scenarios in data science projects where a continuous feature must be converted into a categorical feature. The Pandas cut function helps the user achieve this in a few lines. In the sketch below, we convert num_of_bedrooms from a continuous feature to a categorical one. The cut function takes a continuous column along with bins and labels; the bins define the boundaries of the labels, i.e., bin 0 to 1 defines the label A
  • Pandas comes with strong pivot functionality. The Pandas pivot_table function demands an index — the row detail of the pivot, columns — the columns of the pivot, values — the numeric measures used within the pivot cells, and aggfunc — the aggregate function used when multiple rows share the same index and column. Pandas also comes with stack and unstack functions (see the pivot sketch below)
  • Memory is not persistent, and a Pandas dataframe is not available once the Python session ends. To make a Pandas dataframe persistent, and also to work on data files that do not fit into memory, we can use PyTables or HDF5. HDFStore provides a high-level API to work with HDF5 storage. A large data file can be broken into smaller chunks and loaded into HDF5 as multiple small dataframes with a dataframe-level index. The user then looks up a dataframe based on an index/path and pulls the required data from it. HDF5 allows working with massive data files and also persists the dataframe on disk (avoiding the data load step every time one has to use the data). In the last sketch below, we create an HDF5 store named Bengaluru_House_Data.h5 and store the bengaluru_housing_pd dataframe under the identifier bengaluru_data_T1. Compression can be applied and works well when all of the dataframe’s columns have a numeric dtype
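
Sketch for the first tip — three ways to derive a carpet area column. The actual code is in the GitHub file linked above; here the CSV file name, the total_sqft column, and the 70% conversion factor are assumptions for illustration:

```python
import pandas as pd

bengaluru_housing_pd = pd.read_csv("Bengaluru_House_Data.csv")
bengaluru_housing_pd["total_sqft_num"] = pd.to_numeric(bengaluru_housing_pd["total_sqft"],
                                                       errors="coerce")

def carpet_area_f1(df):
    # Positional loop with .iloc -- slowest of the three
    return [df["total_sqft_num"].iloc[i] * 0.7 for i in range(len(df))]

def carpet_area_f2(df):
    # Built-in row iterator .iterrows -- still element by element
    return [row["total_sqft_num"] * 0.7 for _, row in df.iterrows()]

def carpet_area_f3(df):
    # .apply works on the whole column in one shot -- milliseconds instead of seconds
    return df["total_sqft_num"].apply(lambda x: x * 0.7)

bengaluru_housing_pd["carpet_area"] = carpet_area_f3(bengaluru_housing_pd)
```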
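
Sketch for the second tip — load only the columns you need and give them explicit dtypes. The column list and dtype map are assumptions about the dataset:

```python
import pandas as pd

cols = ["location", "size", "total_sqft", "bath", "price"]
dtypes = {"location": "category", "size": "category", "bath": "float64", "price": "float64"}

slim_pd = pd.read_csv("Bengaluru_House_Data.csv", usecols=cols, dtype=dtypes)

# deep=True also counts the memory used by object/category values
print(round(slim_pd.memory_usage(deep=True).sum() / 1024 ** 2, 2), "MB")
```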
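
Sketch for the third tip — stacking same-structured files from a folder; the folder and file pattern are hypothetical:

```python
import glob
import pandas as pd

files = glob.glob("data/housing_part_*.csv")   # list all files that share the same layout
combined_pd = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
```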
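
Sketch for the fourth tip — an 80/20 split with the Pandas sample function:

```python
import pandas as pd

bengaluru_housing_pd = pd.read_csv("Bengaluru_House_Data.csv")

train_pd = bengaluru_housing_pd.sample(frac=0.8, random_state=42)   # 80% shuffled sample
test_pd = bengaluru_housing_pd.drop(train_pd.index)                 # everything not sampled
```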
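
Sketch for the fifth and sixth tips — rename size, split out the leading number, and coerce it to a numeric dtype:

```python
import pandas as pd

bengaluru_housing_pd = pd.read_csv("Bengaluru_House_Data.csv")

# Tip 5: '2 BHK' / '4 Bedroom' -> keep the leading number in a new column
bengaluru_housing_pd = bengaluru_housing_pd.rename(columns={"size": "property_size"})
bengaluru_housing_pd["num_of_bedrooms"] = bengaluru_housing_pd["property_size"].str.split(" ").str[0]

# Tip 6: the new column is still object dtype; coerce it and zero-fill conversion errors
bengaluru_housing_pd["num_of_bedrooms"] = pd.to_numeric(
    bengaluru_housing_pd["num_of_bedrooms"], errors="coerce"
).fillna(0)
```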
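
Sketch for the seventh tip — split-apply-combine by location, written as a small chain:

```python
import pandas as pd

bengaluru_housing_pd = pd.read_csv("Bengaluru_House_Data.csv")

location_summary = (
    bengaluru_housing_pd
    .groupby("location")["price"]             # split rows into per-location groups
    .agg(["mean", "count"])                   # apply aggregate functions to each group
    .sort_values("mean", ascending=False)     # combined result, ranked by average price
)
print(location_summary.head())
```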
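
Sketch for the eighth tip — binning the bedroom count with pd.cut. Bin 0 to 1 maps to label A as in the text; the remaining bin edges and labels are assumptions:

```python
import pandas as pd

bengaluru_housing_pd = pd.read_csv("Bengaluru_House_Data.csv")
bengaluru_housing_pd["num_of_bedrooms"] = pd.to_numeric(
    bengaluru_housing_pd["size"].str.split(" ").str[0], errors="coerce"
).fillna(0)

bengaluru_housing_pd["bedroom_bucket"] = pd.cut(
    bengaluru_housing_pd["num_of_bedrooms"],
    bins=[0, 1, 2, 3, 20],            # bin 0-1 -> label A, 1-2 -> B, and so on
    labels=["A", "B", "C", "D"],
)
```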
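
Sketch for the ninth tip — a pivot with index, columns, values, and aggfunc (area_type is an assumed column):

```python
import pandas as pd

bengaluru_housing_pd = pd.read_csv("Bengaluru_House_Data.csv")

price_pivot = pd.pivot_table(
    bengaluru_housing_pd,
    index="location",        # row detail of the pivot
    columns="area_type",     # columns of the pivot
    values="price",          # numeric measure inside the cells
    aggfunc="mean",          # how rows sharing the same index/column are combined
)
print(price_pivot.head())
```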
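
Sketch for the last tip — persisting the dataframe in an HDF5 store. The store and key names follow the text; the compression settings are just an example:

```python
import pandas as pd

bengaluru_housing_pd = pd.read_csv("Bengaluru_House_Data.csv")

# Write the dataframe into an HDF5 store on disk (compression is optional)
with pd.HDFStore("Bengaluru_House_Data.h5", complevel=9, complib="blosc") as store:
    store.put("bengaluru_data_T1", bengaluru_housing_pd, format="table")

# A later session can pull the data back without repeating the CSV load
with pd.HDFStore("Bengaluru_House_Data.h5") as store:
    restored_pd = store.get("bengaluru_data_T1")
```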

Finally, on pandas alternatives when pandas leaves us all alone with an out-of-memory error

In Python, an object carries header information and requires more computation and memory space. That’s precisely why a Python list (an object referring to a list of object elements) is comparatively slower than a numpy array (where numeric values are stored as-is in memory without much overhead). Jake VanderPlas has written a great article on why Python is slow and mentions the bottleneck of object types and their overhead. As we have seen, Pandas’ default dtype is object, and that dtype comes with its own performance issues.

When the user has to work on a large data file and Pandas fails even after all the optimization techniques listed above, it is time to look at the Pandas alternatives listed below. Dask is my favorite as it is syntactically closest to Pandas (a minimal sketch follows the list).

  • Dask
  • SFrame
  • Modin
  • Spark
  • Swifter
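
To give a sense of how close Dask’s dataframe API stays to Pandas, a minimal sketch (the file pattern and column names are hypothetical, and dask must be installed separately):

```python
import dask.dataframe as dd

# Lazily read a set of CSVs that would not fit into memory as a single Pandas dataframe
ddf = dd.read_csv("housing_part_*.csv")

# Same groupby syntax as Pandas; .compute() triggers the actual, chunked work
avg_price_by_location = ddf.groupby("location")["price"].mean().compute()
```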

Also, check for Pandas extensions to define custom data types and how to extend Pandas functionality.

What’s new in Pandas 1.0.0?

On Jan 29th, 2020, the Pandas stable version 1.0.0 was released. It comes with significant performance upgrades over previous versions.

  • apply() (for rolling and expanding windows) now has the option to use Numba as an engine instead of Cython. The Numba engine should significantly outperform the Cython engine
  • convert_dtypes() can scan the dataframe columns and assign the best possible dtypes
  • Converting dataframes to markdown is much easier. Use to_markdown()
  • Experimental (may change in future releases): dedicated string and boolean dtypes are available in this version
  • Experimental (may change in future releases): a new pandas.NA value (singleton) represents scalar missing values. In previous versions, a missing value was represented by None or numpy.nan for the object dtype and by pandas.NaT for datetime. Now pandas.NA can represent missing values consistently across dtypes
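
A quick sketch of a few of these additions (requires pandas >= 1.0.0; to_markdown also needs the optional tabulate package):

```python
import pandas as pd

df = pd.DataFrame({"location": ["Whitefield", None, "Hebbal"],
                   "bedrooms": [2, 3, None]})

converted = df.convert_dtypes()
print(converted.dtypes)                        # location -> string, bedrooms -> Int64

# pd.NA is the new scalar missing value
print(converted.loc[1, "location"] is pd.NA)   # True

# One-line markdown export of a dataframe
print(df.to_markdown())
```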

Happy coding, and keep learning.

