Pandas: Data Transformation

Ethan Guyant
Inquisitive Nature
11 min read · Aug 18, 2022

Photo by Suzanne D. Williams on Unsplash

Pandas is a data analysis and manipulation library which provides three fundamental data structures: Index, Series, and DataFrame. For more details on the data structures and an overview of the pandas library see a previous article - Pandas: Introduction to the Library.

The focus of this article is to serve as an introduction and guide to the basic functions for modifying and manipulating the core data structures, including handling missing data, transforming DataFrames, aggregating data, and joining DataFrames.

Handling of Missing Data

Datasets are typically not clean and ready for use; they often contain some level of missing data. How missing data is indicated can vary depending on the data source, which can add an additional level of complication when handling it.

Generally, there is no single optimal choice for indicating the presence of missing data. Approaches typically center around either a mask that flags missing values or a sentinel value that stands in for a missing value.

Missing Data in Pandas

Within pandas, missing data is indicated by one of two sentinel values: NaN (Not a Number), a special floating-point value, or the Python None object. Pandas converts between the two when needed.

Missing values propagate through arithmetic operations between pandas objects. Descriptive statistics and certain computational methods of Series and DataFrame account for missing data (e.g. when summing data, missing values are treated as zero).
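
As a quick illustration, the sketch below uses a small made-up Series to show how a missing value propagates through arithmetic but is skipped by aggregations:

    import numpy as np
    import pandas as pd

    # Small illustrative Series containing a missing value
    s = pd.Series([1.0, np.nan, 3.0])

    print(s + 1)                 # the missing value stays NaN after arithmetic
    print(s.sum())               # 4.0, missing values are skipped (treated as zero)
    print(s.sum(skipna=False))   # NaN when missing values are not skipped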

Null Value Operations

Pandas offers various methods for identifying, removing, and replacing missing values within the data structures.

Identifying Null or Missing Values

pandas.isna(obj) (documentation)

  • A function that takes a scalar or array-like object and is used to indicate whether there are missing values.
  • Returns a scalar or array of boolean values indicating whether each element is missing

Series.isna() (documentation)
DataFrame.isna() (documentation)

  • A function used to identify missing values within a Series or DataFrame
  • Returns a boolean same-sized object indicating whether an element is a missing value. None or NaN get mapped to True and everything else is mapped to False
  • .isnull() is an alias for .isna()
  • .notna() performs the opposite operation of .isna() — Non-missing values get mapped to True
  • Boolean masks, like the object returned by isna(), can be used directly as a Series or DataFrame index. For more detail see: Pandas: Indexing and Slicing

Values such as the empty string '' or np.inf are not considered missing values unless pandas.options.mode.use_inf_as_na = True is set.
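
A minimal sketch of these identification methods, using a small made-up DataFrame:

    import numpy as np
    import pandas as pd

    # Small illustrative DataFrame with missing values
    df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [None, "x", "y"]})

    print(df.isna())            # True where a value is None or NaN
    print(df["a"].notna())      # the opposite of .isna()

    # A boolean mask can be used directly to filter rows
    print(df[df["a"].notna()])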

Removing Null or Missing Values

Series.dropna(axis=0, inplace=False, how=None) (documentation)

A method that returns a new Series with missing values removed

  • axis (0 or ‘index’): a Series has only one axis to drop values from
  • inplace (bool): if True the operation of removing missing values is done inplace and .dropna() returns None
  • how (str): kept for compatibility, not currently used

DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False) (documentation)

  • axis {0/’index’ or 1/’column’}: determine if missing values are removed from rows or columns
  • how {‘any’, ‘all’}: determines if a row or column is removed when there is at least one missing value or all missing values
  • thresh (int): require that many non-missing values to keep a row or column; rows or columns with fewer non-missing values are dropped
  • subset (column or sequence labels): labels along another axis (e.g. when dropping rows a list of columns to consider)
  • inplace (bool): boolean value, if True the operation of removing missing values is done inplace and returns None

Dropping missing values from a DataFrame operates on an entire row or column; a single missing value cannot be dropped on its own.
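
The following sketch, again using a small made-up DataFrame, shows how the axis, how, subset, and thresh parameters change what gets dropped:

    import numpy as np
    import pandas as pd

    # Small illustrative DataFrame with missing values
    df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                       "b": [np.nan, np.nan, 6.0],
                       "c": [7.0, 8.0, 9.0]})

    print(df.dropna())                               # drop rows with any missing value
    print(df.dropna(axis=1))                         # drop columns with any missing value
    print(df.dropna(how="all", subset=["a", "b"]))   # drop rows where both a and b are missing
    print(df.dropna(thresh=2))                       # keep rows with at least 2 non-missing values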

Filling Null or Missing Values

Series.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None) (documentation)

DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None) (documentation)

  • value (scalar, dict, Series, DataFrame): the value used to replace missing values. A dict/Series/DataFrame can be passed to specify which value to use for each index (for a Series) or column (for a DataFrame); values not in the dict/Series/DataFrame will not be filled
  • method {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}: the method used for replacing missing values in the reindexed series
    pad/ffill: propagate the last valid observation forward to the next
    backfill/bfill: uses the next valid observation to replace missing values
  • axis: the axis to replace missing values
  • inplace (bool): if True the operation of filling missing values is done in place and returns None
  • limit (int):
    If method is specified, limit sets the maximum number of consecutive missing values to forward or backwards fill
    If method is not specified, limit sets the maximum number of missing values along the entire axis that will be filled
  • downcast (dict): a dictionary of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate type
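
The sketch below, built on a small made-up DataFrame, illustrates filling with a scalar, with per-column values, and with the forward/backward fill methods:

    import numpy as np
    import pandas as pd

    # Small illustrative DataFrame with missing values
    df = pd.DataFrame({"a": [1.0, np.nan, np.nan, 4.0],
                       "b": [np.nan, 2.0, 3.0, np.nan]})

    print(df.fillna(0))                              # replace all missing values with a scalar
    print(df.fillna({"a": 0, "b": df["b"].mean()}))  # per-column fill values
    print(df.fillna(method="ffill"))                 # propagate the last valid observation forward
    print(df.fillna(method="bfill", limit=1))        # backfill at most one consecutive missing value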

Transforming Data

Sorting DataFrame Values

The two methods covered in this section, .sort_values() and .sort_index(), allow for efficient sorting of the Pandas DataFrame data structure.

DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None) (documentation)

  • by (string or list of strings): Name or list of names to sort by
  • axis(0/’index’ or 1/’columns’): axis to be sorted
  • ascending (bool or list of bools), default True: If True sort values in ascending order, if False sort in descending order
  • inplace (bool): if True perform the operation in-place
  • kind (‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’): sorting algorithm used (see numpy.sort() for details)
  • na_position (‘first’ or ‘last’): argument specifying where NaNs should appear
  • ignore_index (bool): If True the result axis will be labeled 0, 1, ..., n-1
  • key (callable): when passed in the key function is applied to the series values before sorting

DataFrame.sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, ignore_index=False, key=None) (documentation)

  • axis(0/’index’ or 1/’columns’): axis to be sorted
  • level (int or level name or lists of ints or names): sort on values in specified index level(s) when not None
  • ascending (bool or list of bools): If True sort values in ascending order, if False sort in descending order
  • inplace (bool): if True perform the operation in-place
  • kind (‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’): sorting algorithm used (see numpy.sort() for details)
  • na_position (‘first’ or ‘last’): argument specifying where NaNs should appear
  • sort_remaining (bool): when True, sorting by level with a multilevel index will sort by the other levels as well, in order, after sorting by the specified level
  • ignore_index (bool): if True the result axis will be labeled 0, 1, ..., n-1
  • key (callable): when passed in the key function is applied to the series values before sorting
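
As a brief sketch, using a made-up DataFrame, of sorting by column values, by multiple columns, by the index, and with a key function:

    import pandas as pd

    # Small illustrative DataFrame with an unsorted index
    df = pd.DataFrame({"name": ["b", "a", "c"], "score": [2, 3, 1]},
                      index=[10, 5, 7])

    print(df.sort_values(by="score", ascending=False))             # sort rows by one column
    print(df.sort_values(by=["name", "score"]))                    # sort by multiple columns
    print(df.sort_index())                                         # sort rows by index labels
    print(df.sort_values(by="name", key=lambda s: s.str.upper()))  # key applied before sorting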

Add Columns / Rows

Add Columns to a DataFrame

Pandas offers a number of methods for adding columns of data to a DataFrame. The values of the new column can be given as an array or list of the same size as the DataFrame and then assigned to the new column by providing the name.

A common approach when the new column can be added to the end of the DataFrame (i.e. the last column), is to use the indexing operator []

DataFrame.insert(loc, column, value, allow_duplicates=False) (documentation)

A method that provides the flexibility to add a column in any position as well as providing options for inserting the column values

  • loc (int): the insertion index, must be greater than or equal to 0 and less than or equal to the number of columns in the DataFrame
  • column (string, number, or hashable object): label of the inserted column
  • value (scalar, series, or array-like): the values to be inserted
  • allow_duplicates (bool): boolean value indicating whether to allow duplicates or not

DataFrame.assign(**kwargs) (documentation)

A method that assigns new columns to a DataFrame and returns a new object with the original columns in addition to the newly defined columns.

  • **kwargs (dict of {str: callable or Series}): the column names are the keywords of the dictionary. If a value is callable, pandas computes it on the DataFrame and assigns the result to the new column

Multiple columns can be assigned using .assign() and later items in **kwargs can reference previously created new columns
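
The sketch below, using a small made-up DataFrame, shows all three approaches: the indexing operator, .insert(), and .assign():

    import pandas as pd

    # Small illustrative DataFrame
    df = pd.DataFrame({"a": [1, 2, 3]})

    # Indexing operator: the new column is appended as the last column
    df["b"] = [10, 20, 30]

    # .insert(): place a column at a specific position
    df.insert(loc=1, column="c", value=[7, 8, 9])

    # .assign(): returns a new DataFrame; later keywords can reference earlier ones
    df2 = df.assign(total=df["a"] + df["b"],
                    half_total=lambda d: d["total"] / 2)
    print(df2)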

Add Rows to a DataFrame

DataFrame.loc[] (documentation)

The .loc[] property allows access to a group of rows or columns and can be utilized to add rows to a DataFrame

For more details and examples of selecting and subsetting a DataFrame see the article Pandas: Indexing and Slicing

pandas.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True) (documentation)

A function that provides the ability to concatenate pandas objects along a particular axis

  • objs (sequence or mapping of Series or DataFrame objects): If mapping is passed, the sorted keys will be used as the keys argument unless it is passed in.
  • axis (0/’index’ or 1/’columns’): the axis to concatenate along
  • join (‘inner’, ‘outer’): how to handle indexes on other axes
  • ignore_index (bool): If True do not use the index value along the concatenation axis, typically helpful when the concatenation axis does not have meaningful indexing information
  • keys (sequence): when multiple levels are passed, they should contain tuples; hierarchical indexes are constructed using the passed keys as the outermost level
  • levels (list of sequences): unique values to use for constructing a multi-index
  • names (list): names for the levels in the hierarchical index
  • verify_integrity (bool): If True pandas will check whether the new concatenated axis contains duplicates
  • sort (bool): sort non-concatenation axis if it is not aligned when join is ‘outer’.
  • copy (bool): If False data is not copied unnecessarily

For more joining operations see Joining DataFrames
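
A minimal sketch of both approaches to adding rows, using small made-up DataFrames:

    import pandas as pd

    # Small illustrative DataFrame
    df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

    # .loc[] with a new index label appends a row
    df.loc[2] = [5, 6]

    # pd.concat() stacks DataFrames along an axis
    extra = pd.DataFrame({"a": [7], "b": [8]})
    combined = pd.concat([df, extra], ignore_index=True)
    print(combined)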

Remove Columns / Rows

.drop() is a method which removes rows or columns by providing label names and the corresponding axis or by specifying the index or column names.

DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise') (documentation)

  • labels (single label or list-like): the index or column label to drop
  • axis (0/’index’ or 1/’column’): drop labels from the index or columns
  • index (single label or list-like): can be used as an alternative to specifying the axis parameter. .drop(labels, axis=0) is equivalent to .drop(index=labels)
  • columns (single label or list-like): can be used as an alternative to specifying the axis parameter. .drop(labels, axis=1) is equivalent to .drop(columns=labels)
  • level (int or level name): the level from which the labels will be removed
  • inplace (bool): if True the drop operation occurs in place and .drop() returns None, otherwise a copy of the DataFrame is returned
  • errors (‘ignore’, ‘raise’): if ‘ignore’ errors will be suppressed and only existing labels are dropped from the DataFrame
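
A short sketch of the common ways to call .drop(), using a made-up DataFrame:

    import pandas as pd

    # Small illustrative DataFrame
    df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

    print(df.drop(columns=["b"]))                          # drop a column by name
    print(df.drop(index=[0, 2]))                           # drop rows by index label
    print(df.drop(labels="c", axis=1))                     # equivalent to columns=["c"]
    print(df.drop(columns=["missing"], errors="ignore"))   # suppress errors for absent labels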

Aggregating Data Frames

An essential step to gaining insights into a dataset is the ability to effectively summarize the data. Pandas offers a variety of methods ranging from simple calculations (e.g. .sum()) to more complex operations using groupby

Descriptive Statistics

Descriptive statistics are operations that summarize the central tendency and shape of a dataset

Basic summary methods include .count(), .sum(), .mode(), .min(), .max(), .var(), .std(), .mean(), .quantile(), .median() (DataFrame descriptive statistics)

A few general parameters explored here include:

  • axis(0, 1): specify the axis (index or columns) the function should be applied on
  • skipna (bool): if True exclude null values when computing
  • level (int or level name): if the axis is hierarchical, count along a specified level, collapsing into a scalar
  • numeric_only (bool): include only float, int, and boolean columns; when None (default) pandas attempts to use everything, then falls back to only numeric data
  • **kwargs: any additional keyword arguments
  • min_count (int): specify the required number of valid values needed to perform the operation (a parameter of .sum())
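
The sketch below applies a few of these methods and parameters to a small made-up DataFrame:

    import numpy as np
    import pandas as pd

    # Small illustrative DataFrame with a missing value
    df = pd.DataFrame({"a": [1.0, 2.0, np.nan], "b": [4.0, 5.0, 6.0]})

    print(df.sum())                    # column sums, missing values skipped by default
    print(df.sum(axis=1))              # row sums
    print(df.mean(skipna=False))       # NaN for any column containing a missing value
    print(df["a"].sum(min_count=3))    # NaN because fewer than 3 valid values exist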

Method: .describe()

A helpful method for generating descriptive statistics is .describe()

DataFrame.describe(percentiles=None, include=None, exclude=None, datetime_is_numeric=False) (documentation)

  • percentiles (list-like): the percentiles to include in the output, and should be between 0 and 1, default = [.25, .5, .75]
  • include (‘all’, list-like, or None): list of data types to include in the result
  • exclude (list-like of dtypes or None): data types to omit from the result
  • datetime_is_numeric (bool): boolean value indicating whether datetime data types should be treated as numeric
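
A quick sketch of .describe(), using a made-up DataFrame with one numeric and one text column:

    import pandas as pd

    # Small illustrative DataFrame
    df = pd.DataFrame({"score": [1, 2, 3, 4, 5],
                       "group": ["x", "x", "y", "y", "y"]})

    print(df.describe())                         # numeric columns only by default
    print(df.describe(include="all"))            # include non-numeric columns as well
    print(df.describe(percentiles=[0.1, 0.9]))   # custom percentiles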

Method: .agg()

The .agg() (alias of .aggregate()) method aggregates values using one or more operations over a specified axis.

DataFrame.agg(func=None, axis=0, *args, **kwargs) (documentation)

  • func (function, str, list, or dict): the function(s) to be used for aggregating the data
  • axis (0/’index’, 1/’columns’): the axis the operation is performed on
  • *args: positional arguments passed to func
  • **kwargs: keyword arguments passed to func
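
A brief sketch of .agg() with a single function, a list of functions, and a per-column dict, again on a made-up DataFrame:

    import pandas as pd

    # Small illustrative DataFrame
    df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

    print(df.agg("sum"))                               # one function over each column
    print(df.agg(["min", "max"]))                      # multiple functions
    print(df.agg({"a": "sum", "b": ["mean", "std"]}))  # per-column functions via a dict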

Pandas offers many other computation and descriptive statistic methods. For more information and examples see the Pandas Documentation

Grouped Summary Statistics

The basic descriptive statistics can provide quick insights into the dataset, however conditionally aggregating by a label or index requires the groupby operation.

DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=NoDefault.no_default, observed=False, dropna=True) (documentation)

Parameters explored here include:

  • by (mapping, function, label, or list of labels): determines the grouping
  • level (int, level name, sequence of level names): specify a particular level(s) to group by if the axis is hierarchical
  • dropna (bool): when True missing values will be dropped, when False missing values will be treated as the key in groups
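
The sketch below groups a small made-up DataFrame, including the effect of the dropna parameter and a combination with .agg():

    import numpy as np
    import pandas as pd

    # Small illustrative DataFrame with a missing group key
    df = pd.DataFrame({"team": ["x", "x", "y", np.nan],
                       "points": [1, 2, 3, 4]})

    print(df.groupby("team")["points"].sum())                   # missing keys dropped by default
    print(df.groupby("team", dropna=False)["points"].sum())     # keep NaN as its own group
    print(df.groupby("team").agg({"points": ["mean", "max"]}))  # combine with .agg()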

Pivot Table

Similar to .groupby(), the .pivot_table() method provides a multidimensional summary of the dataset. While .groupby() splits/combines data across a 1D index, .pivot_table() splits/combines the data across a 2D grid.

pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False, sort=True) (documentation)

Parameters explored here include:

  • values: the columns to aggregate — the column to summarize
  • index (column, Grouper, array, or list of them): the keys to group by on the pivot table index. If an array is passed, it must be the same length as the data
  • columns (column, Grouper, array, or list of them): the keys to group by on the pivot table columns. If an array is passed, it must be the same length as the data
  • aggfunc (function, list of functions, dict): if a list of functions the pivot table will contain a hierarchical column whose top level will be the function names, if a dict is passed the key is the column to summarize and value is the function or list of functions
  • fill_value (scalar): specify a value to replace missing values in the resulting pivot table
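
A minimal sketch of a pivot table built from a small made-up sales DataFrame, shown with both the function and the method form:

    import pandas as pd

    # Small illustrative DataFrame; the (west, b) combination is missing
    df = pd.DataFrame({"region": ["east", "east", "west"],
                       "product": ["a", "b", "a"],
                       "sales": [10, 20, 30]})

    print(pd.pivot_table(df, values="sales", index="region",
                         columns="product", aggfunc="sum", fill_value=0))

    # The DataFrame method form is equivalent
    print(df.pivot_table(values="sales", index="region", columns="product"))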

Joining DataFrames

A key feature offered by Pandas is the ability to conduct in-memory join and merge operations. This feature is implemented through the pd.merge() function and the related .join() method of Series and DataFrame objects.

The pd.merge() function implements 3 types of joins

  • one-to-one: each row in one table is related to at most one row in the other (the join key is unique in both tables)
  • many-to-one: multiple rows in one table are related to a single row in the other (the join key contains duplicates in only one of the tables)
  • many-to-many: multiple rows in each table are related to multiple rows in the other (the join key contains duplicates in both tables)

All three are performed by calling the function pd.merge(); the type of join performed depends on the form of the input data

If both key columns contain rows where the key is a null value, those rows will be matched against each other; this differs from the usual SQL join behavior

pandas.merge() is a function that can be used to perform database-style joins on Series and DataFrame objects.

pandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None) (documentation)

  • left: DataFrame
  • right: DataFrame or named Series
  • how (‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’): the type of join to perform
  • on (label or list): column or index level names to join on, must be found in both DataFrames.
  • left_on (label or list, or array-like): column or index level names to join on in the left DataFrame
  • right_on (label or list, or array-like): column or index level names to join on in the right DataFrame
  • left_index (bool): use the index from the left DataFrame as the join key(s)
  • right_index (bool): use the index from the right DataFrame as the join key(s)
  • sort (bool): sort the join key in the resulting DataFrame
  • suffixes (list-like): a length-2 sequence with each element indicating the suffix to add to overlapping column names in the left and right DataFrames. None can be passed in for either the right or left to leave the corresponding column name as-is.
  • copy (bool): if False avoid copy when possible
  • indicator (bool or str): if True add a column named _merge to the resulting DataFrame containing information on the source of each row
  • validate (str): if specified checks if the merge is of the specified type (“one_to_one”, “one_to_many”, or “many_to_one”)

DataFrame.merge() (documentation) is a method that can be called on a DataFrame object and provides the same join functionality with similar arguments. The calling DataFrame is treated as the left object, so only the right DataFrame has to be specified.
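
The sketch below joins two small made-up DataFrames with the function form and the method form, including the indicator and suffixes parameters:

    import pandas as pd

    # Small illustrative DataFrames sharing a 'key' column and an overlapping 'val' column
    left = pd.DataFrame({"key": ["a", "b", "c"], "val": [1, 2, 3]})
    right = pd.DataFrame({"key": ["a", "b", "d"], "val": [4, 5, 6]})

    print(pd.merge(left, right, on="key"))            # inner join; overlapping 'val' gets _x/_y suffixes
    print(pd.merge(left, right, on="key", how="outer",
                   indicator=True))                   # the _merge column shows each row's source
    print(left.merge(right, on="key", how="left",
                     suffixes=("_l", "_r")))          # method form; the caller is the left DataFrame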

If you enjoyed this article and found it helpful don’t forget to give it a clap, follow and subscribe to the INQUISITIVE NATURE publication!

Originally published at https://ethanguyant.com on June 6, 2022.
