Pandas: Data Transformation
Pandas is a data analysis and manipulation library that provides three fundamental data structures: Index, Series, and DataFrame. For more details on the data structures and an overview of the pandas library see a previous article - Pandas: Introduction to the Library.
The focus of this article is to serve as an introduction and guide to the basic functions for modifying and manipulating the core data structures, including handling missing data, transforming DataFrames, aggregating data, and joining DataFrames.
Handling of Missing Data
Datasets are typically not clean and ready for use; in particular, they may contain some level of missing data. How missing data is indicated can vary depending on the data source, which adds a further complication when handling it.
Generally, there is no single optimal choice for indicating the presence of missing data. Approaches typically center around either a mask that flags missing values or a sentinel value that stands in for a missing value.
Missing Data in Pandas
Within pandas, missing data is indicated by one of two Python null values: NaN (Not a Number), indicating a missing floating-point value, or the None object; pandas can convert between the two when needed.
Missing values propagate through arithmetic operations between pandas objects. Descriptive statistics and certain computational methods of Series and DataFrame all account for missing data (e.g. when summing data, missing values are treated as zero).
Null Value Operations
Pandas offers various methods for identifying, removing, and replacing missing values within the data structures.
Identifying Null or Missing Values
pandas.isna(obj) (documentation)
- A function that takes a scalar or array-like object and indicates whether values are missing
- Returns a scalar or array of boolean values indicating whether each element is missing
Series.isna() (documentation), DataFrame.isna() (documentation)
- A method used to identify missing values within a Series or DataFrame
- Returns a boolean same-sized object indicating whether each element is a missing value. None or NaN get mapped to True and everything else is mapped to False
- .isnull() is an alias for .isna(), and .notna() performs the opposite operation of .isna(): non-missing values get mapped to True
- Boolean masks, like the object returned by .isna(), can be used directly as a Series or DataFrame index. For more detail see: Pandas: Indexing and Slicing
Values such as an empty string '' or np.inf are not considered missing values; np.inf can be treated as missing by setting pandas.options.mode.use_inf_as_na = True.
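A minimal sketch of how .isna() and .notna() behave (the Series here is made up for illustration):

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0, None])

# Both NaN and None are reported as missing
mask = s.isna()
print(mask.tolist())            # [False, True, False, True]

# The boolean mask can index the Series directly
present = s[s.notna()]
print(present.tolist())         # [1.0, 3.0]
```

Note that None inserted into a float Series is stored as NaN, which is why both show up as missing.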
Removing Null or Missing Values
Series.dropna(axis=0, inplace=False, how=None)
(documentation)
A method that returns a new Series with missing values removed
- axis (0 or 'index'): a Series has only one axis to drop values from
- inplace (bool): if True the operation of removing missing values is done in place and .dropna() returns None
- how (str): kept for compatibility, not currently used
DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
(documentation)
- axis (0/'index' or 1/'columns'): determines whether missing values are removed from rows or columns
- how ('any', 'all'): determines if a row or column is removed when it has at least one missing value ('any') or all missing values ('all')
- thresh (int): the minimum number of non-missing values required to keep a row or column
- subset (column or sequence of labels): labels along the other axis to consider (e.g. when dropping rows, a list of columns to check)
- inplace (bool): if True the operation of removing missing values is done in place and returns None
Dropping missing values from a DataFrame operates on an entire row or column; a single missing value cannot be dropped on its own.
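A short sketch of the row/column/thresh behaviors (the DataFrame is made up for illustration):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                   "b": [np.nan, np.nan, 6.0],
                   "c": [7.0, 8.0, 9.0]})

# Drop any row containing at least one missing value
complete_rows = df.dropna()

# Keep only rows with at least two non-missing values
thresh_rows = df.dropna(thresh=2)

# Drop columns instead of rows
complete_cols = df.dropna(axis=1)

print(complete_rows.index.tolist())    # [2]
print(complete_cols.columns.tolist())  # ['c']
```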
Filling Null or Missing Values
Series.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)
(documentation)
DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)
(documentation)
- value (scalar, dict, Series, DataFrame): the value used to replace missing values. Passing in a dict/Series/DataFrame specifies which value to use for each index (for a Series) or column (for a DataFrame); values not covered by the dict/Series/DataFrame will not be replaced
- method {'backfill', 'bfill', 'pad', 'ffill', None}: the method used for replacing missing values
pad/ffill: propagate the last valid observation forward to the next
backfill/bfill: use the next valid observation to replace missing values
- axis: the axis along which to replace missing values
- inplace (bool): if True the filling operation is done in place and returns None
- limit (int): if method is specified, the maximum number of consecutive missing values to forward or backward fill; if method is not specified, the maximum number of missing values along the entire axis that will be filled
- downcast (dict): a dictionary of item->dtype specifying what to downcast, or the string 'infer', which will try to downcast to an appropriate type
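A small sketch of scalar fills and forward filling (newer pandas versions expose the fill methods as .ffill()/.bfill(), used here in place of the method parameter; the Series is made up):

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Replace every missing value with a scalar
filled = s.fillna(0)

# Forward fill: propagate the last valid observation
forward = s.ffill()

# limit caps the number of consecutive fills
limited = s.ffill(limit=1)

print(filled.tolist())   # [1.0, 0.0, 0.0, 4.0]
print(forward.tolist())  # [1.0, 1.0, 1.0, 4.0]
```

With limit=1 only the first of the two consecutive gaps is filled; the second remains NaN.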
Transforming Data
Sorting DataFrame Values
The two methods covered in this section, .sort_values() and .sort_index(), allow for efficient sorting of the pandas DataFrame data structure.
DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None)
(documentation)
- by (string or list of strings): name or list of names to sort by
- axis (0/'index' or 1/'columns'): axis to be sorted
- ascending (bool or list of bools), default True: if True sort values in ascending order, if False sort in descending order
- inplace (bool): if True perform the operation in place
- kind ('quicksort', 'mergesort', 'heapsort', 'stable'): sorting algorithm used (see numpy.sort() for details)
- na_position ('first' or 'last'): argument specifying where NaNs should appear
- ignore_index (bool): if True the result axis will be labeled 0, 1, ..., n-1
- key (callable): when passed in, the key function is applied to the values before sorting
DataFrame.sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, ignore_index=False, key=None)
(documentation)
- axis (0/'index' or 1/'columns'): axis to be sorted
- level (int, level name, or list of ints or names): sort on values in the specified index level(s) when not None
- ascending (bool or list of bools): if True sort values in ascending order, if False sort in descending order
- inplace (bool): if True perform the operation in place
- kind ('quicksort', 'mergesort', 'heapsort', 'stable'): sorting algorithm used (see numpy.sort() for details)
- na_position ('first' or 'last'): argument specifying where NaNs should appear
- sort_remaining (bool): when True, sorting by level, and the index is multilevel, sort by the other levels as well, in order, after sorting by the specified level
- ignore_index (bool): if True the result axis will be labeled 0, 1, ..., n-1
- key (callable): when passed in, the key function is applied to the values before sorting
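The two sorting methods can be sketched as follows (the DataFrame and its index labels are made up):

```python
import pandas as pd

df = pd.DataFrame({"name": ["c", "a", "b"], "score": [3, 1, 2]},
                  index=[10, 30, 20])

# Sort rows by a column's values
by_score = df.sort_values(by="score")

# Sort by the index labels instead
by_index = df.sort_index()

# Descending sort, relabeling the result 0..n-1
desc = df.sort_values(by="score", ascending=False, ignore_index=True)

print(by_score["name"].tolist())  # ['a', 'b', 'c']
print(by_index["name"].tolist())  # ['c', 'b', 'a']
```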
Add Columns / Rows
Add Columns to a DataFrame
Pandas offers a number of methods for adding columns of data to a DataFrame. The values of the new column can be given as an array or list of the same size as the DataFrame and then assigned to the new column by providing the name.
A common approach, when the new column can be added to the end of the DataFrame (i.e. as the last column), is to use the indexing operator [].
DataFrame.insert(loc, column, value, allow_duplicates=False)
(documentation)
A method that provides the flexibility to add a column in any position as well as providing options for inserting the column values
- loc (int): the insertion index, must be greater than or equal to 0 and less than or equal to the length of the DataFrame
- column (string, number, or hashable object): label of the inserted column
- value (scalar, series, or array-like): the values to be inserted
- allow_duplicates (bool): boolean value indicating whether to allow duplicates or not
DataFrame.assign(**kwargs)
(documentation)
A method that assigns new columns to a DataFrame and returns a new object with the original columns in addition to the newly defined columns.
- **kwargs (dict of {str: callable or Series}): the column names are the keywords of the dictionary. If values are callable, pandas computes them on the DataFrame and assigns the results to the new columns
Multiple columns can be assigned using .assign(), and later items in **kwargs can reference previously created new columns.
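The three column-adding approaches can be sketched as follows (the DataFrame and column names are made up):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Indexing operator: appends a column at the end
df["b"] = [10, 20, 30]

# .insert(): places a column at a specific position
df.insert(loc=1, column="mid", value=0)

# .assign(): returns a new DataFrame; later kwargs may
# reference columns created by earlier ones
df2 = df.assign(c=df["a"] * 2, d=lambda x: x["c"] + 1)
print(df2.columns.tolist())  # ['a', 'mid', 'b', 'c', 'd']
```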
Add Rows to a DataFrame
DataFrame.loc[]
(documentation)
The .loc[] property allows access to a group of rows or columns and can be utilized to add rows to a DataFrame.
For more details and examples of selecting and subsetting a DataFrame see the article Pandas: Indexing and Slicing
pandas.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True)
(documentation)
A function that provides the ability to concatenate pandas objects along a particular axis
- objs (sequence or mapping of Series or DataFrame objects): if a mapping is passed, its sorted keys will be used as the keys argument unless keys is passed explicitly
- axis (0/'index' or 1/'columns'): the axis to concatenate along
- join ('inner', 'outer'): how to handle indexes on the other axes
- ignore_index (bool): if True do not use the index values along the concatenation axis; typically helpful when the concatenation axis does not have meaningful indexing information
- keys (sequence): when multiple levels are passed in they should contain tuples; hierarchical indexes are constructed using the passed keys as the outermost level
- levels (list of sequences): unique values to use for constructing a MultiIndex
- names (list): names for the levels in the hierarchical index
- verify_integrity (bool): if True pandas will check whether the new concatenated axis contains duplicates
- sort (bool): sort the non-concatenation axis if it is not aligned when join is 'outer'
- copy (bool): if False data is not copied unnecessarily
For more joining operations see Joining DataFrames
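The two row-adding approaches can be sketched as follows (the DataFrames here are made up):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# .loc[] with a new label appends a single row
df.loc[2] = [5, 6]

# pd.concat() stacks DataFrames along an axis;
# ignore_index relabels the result 0..n-1
extra = pd.DataFrame({"a": [7], "b": [8]})
combined = pd.concat([df, extra], ignore_index=True)
print(combined["a"].tolist())  # [1, 2, 5, 7]
```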
Remove Columns / Rows
.drop() is a method that removes rows or columns by providing label names and the corresponding axis, or by specifying the index or column names directly.
DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
(documentation)
- labels (single label or list-like): the index or column labels to drop
- axis (0/'index' or 1/'columns'): drop labels from the index or columns
- index (single label or list-like): an alternative to specifying the axis parameter; .drop(labels, axis=0) is equivalent to .drop(index=labels)
- columns (single label or list-like): an alternative to specifying the axis parameter; .drop(labels, axis=1) is equivalent to .drop(columns=labels)
- level (int or level name): the level from which the labels will be removed
- inplace (bool): if True the drop operation occurs in place and .drop() returns None; otherwise a copy of the DataFrame is returned
- errors ('ignore', 'raise'): if 'ignore', errors are suppressed and only existing labels are dropped from the DataFrame
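The equivalent spellings above can be sketched as follows (the DataFrame is made up):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

# Drop a column: two equivalent spellings
no_b_axis = df.drop("b", axis=1)
no_b_kw = df.drop(columns="b")

# Drop rows by index label
middle = df.drop(index=[0, 2])

print(no_b_axis.columns.tolist())  # ['a', 'c']
print(middle["a"].tolist())        # [2]
```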
Aggregating Data Frames
An essential step to gaining insights into a dataset is the ability to effectively summarize the data. Pandas offers a variety of methods ranging from simple calculations (e.g. .sum()) to more complex operations using groupby.
Descriptive Statistics
Descriptive statistics are operations that summarize the central tendency and shape of a dataset.
Basic summary methods include .count(), .sum(), .mode(), .min(), .max(), .var(), .std(), .mean(), .quantile(), and .median() (DataFrame descriptive statistics).
A few general parameters explored here include:
- axis (0, 1): the axis (index or columns) the function should be applied on
- skipna (bool): if True exclude null values when computing
- level (int or level name): if the axis is hierarchical, count along the specified level, collapsing into a scalar
- numeric_only (bool): include only float, int, and boolean columns; when None (default) pandas attempts to use everything, then falls back to numeric data only
- **kwargs: any additional keyword arguments
- min_count (int): the number of valid values required to perform the operation (a parameter for .sum())
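A quick sketch of these parameters in action (the DataFrame is made up):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [1.0, 2.0, np.nan], "b": [4.0, 5.0, 6.0]})

# Column-wise sums; missing values are skipped by default
col_sums = df.sum()

# min_count: require at least 3 valid values, otherwise NaN
strict = df["a"].sum(min_count=3)

# Row-wise means across columns
row_means = df.mean(axis=1)

print(col_sums.tolist())   # [3.0, 15.0]
print(row_means.tolist())  # [2.5, 3.5, 6.0]
```

Column "a" has only two valid values, so the min_count=3 sum comes back as NaN instead of 3.0.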
Method: .describe()
A helpful method for generating descriptive statistics is .describe()
DataFrame.describe(percentiles=None, include=None, exclude=None, datetime_is_numeric=False)
(documentation)
- percentiles (list-like): the percentiles to include in the output, and should be between 0 and 1, default = [.25, .5, .75]
- include (‘all’, list-like, or None): list of data types to include in the result
- exclude (list-like of dtypes or None): data types to omit from the result
- datetime_is_numeric (bool): boolean value indicating whether datetime data types should be treated as numeric
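A minimal sketch of .describe() (the column is made up; row labels such as "mean" and "50%" index the summary):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4]})

# Restrict the output to the median percentile
stats = df.describe(percentiles=[0.5])

print(stats.loc["mean", "x"])  # 2.5
print(stats.loc["50%", "x"])   # 2.5
```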
Method: .agg()
The .agg() method (alias of .aggregate()) aggregates values using one or more operations over a specified axis.
DataFrame.agg(func=None, axis=0, *args, **kwargs)
(documentation)
- func (function, str, list, or dict): the function(s) to be used for aggregating the data
- axis (0/’index’, 1/’columns’): the axis the operation is performed on
- *args: positional arguments passed to func
- **kwargs: keyword arguments passed to func
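The func parameter's flexibility can be sketched as follows (the DataFrame is made up):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# A dict maps each column to its own aggregation
per_col = df.agg({"a": "sum", "b": "mean"})

# A list applies several functions to every column
both = df.agg(["min", "max"])

print(per_col["a"], per_col["b"])  # 6 5.0
print(both.loc["max", "b"])        # 6
```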
Pandas offers many other computation and descriptive statistic methods. For more information and examples see the Pandas Documentation
Grouped Summary Statistics
The basic descriptive statistics can provide quick insights into the dataset; however, to conditionally aggregate by a label or index requires the use of the groupby operation.
DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=NoDefault.no_default, observed=False, dropna=True)
(documentation)
Parameters explored here include:
- by (mapping, function, label, or list of labels): determines the grouping
- level (int, level name, or sequence of level names): specify particular level(s) to group by if the axis is hierarchical
- dropna (bool): when True missing values are dropped; when False missing values are treated as a key in groups
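The split-apply-combine pattern can be sketched as follows (the DataFrame and labels are made up):

```python
import pandas as pd

df = pd.DataFrame({"team": ["x", "x", "y"], "pts": [1, 2, 10]})

# Split rows by the "team" label, then sum each group
totals = df.groupby("team")["pts"].sum()

print(totals["x"], totals["y"])  # 3 10
```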
Pivot Table
Similar to .groupby(), the .pivot_table() method provides a multidimensional summary of the dataset. While .groupby() splits/combines data across a 1D index, pivot_table splits/combines the data across a 2D grid.
pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False, sort=True)
(documentation)
Parameters explored here include:
- values: the columns to aggregate (the columns to summarize)
- index (column, Grouper, array, or list of them): keys to group by on the pivot table rows; if an array, it must be the same length as the DataFrame
- columns (column, Grouper, array, or list of them): keys to group by on the pivot table columns; if an array, it must be the same length as the DataFrame
- aggfunc (function, list of functions, dict): if a list of functions the pivot table will contain hierarchical columns whose top level is the function names; if a dict is passed, the key is the column to summarize and the value is the function or list of functions
- fill_value (scalar): a value to replace missing values in the resulting pivot table
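The 2D grid produced by a pivot table can be sketched as follows (the sales data is made up):

```python
import pandas as pd

df = pd.DataFrame({"city": ["A", "A", "B", "B"],
                   "year": [2021, 2022, 2021, 2022],
                   "sales": [10, 20, 30, 40]})

# Cities down the rows, years across the columns
table = pd.pivot_table(df, values="sales", index="city",
                       columns="year", aggfunc="sum")

print(table.loc["B", 2022])  # 40
```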
Joining DataFrames
A key feature offered by pandas is the ability to conduct in-memory join and merge operations. This feature is implemented through the pd.merge() function and the related .join() method of Series and DataFrame objects.
The pd.merge() function implements three types of joins:
- one-to-one: every row in the left table is related to only one row in the right table
- many-to-one: multiple rows in the left table are related to a single row in the right table
- many-to-many: multiple rows in the left table are related to multiple rows in the right table
All three are performed by calling pd.merge(); the type of join performed depends on the form of the input data.
If both key columns contain rows where the key is a null value, those rows will be matched against each other; this differs from usual SQL join behavior.
pandas.merge() is a function that can be used to perform database-style joins on Series and DataFrame objects.
pandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
(documentation)
- left: DataFrame
- right: DataFrame or named Series
- how ('left', 'right', 'outer', 'inner', 'cross'): the type of join to perform
- on (label or list): column or index level names to join on; must be found in both DataFrames
- left_on (label or list, or array-like): column or index level names to join on in the left DataFrame
- right_on (label or list, or array-like): column or index level names to join on in the right DataFrame
- left_index (bool): use the index from the left DataFrame as the join key(s)
- right_index (bool): use the index from the right DataFrame as the join key(s)
- sort (bool): sort the join keys in the resulting DataFrame
- suffixes (list-like): a length-2 sequence with each element indicating the suffix to add to overlapping column names in the left and right DataFrames; None can be passed for either element to leave the corresponding column names as-is
- copy (bool): if False avoid copying when possible
- indicator (bool or str): if True add a column named _merge to the resulting DataFrame containing information on the source of each row
- validate (str): if specified, checks whether the merge is of the given type ('one_to_one', 'one_to_many', 'many_to_one', or 'many_to_many')
DataFrame.merge() (documentation) is a method that can be called on a DataFrame object and provides the same join functionality with similar arguments. Using this method, only the right DataFrame has to be specified.
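The join types and the indicator column can be sketched as follows (the key/value frames are made up):

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "val_l": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "val_r": [20, 30, 40]})

# Inner join keeps only keys present in both frames
inner = pd.merge(left, right, on="key")
print(inner["key"].tolist())  # ['b', 'c']

# Outer join keeps everything; indicator=True adds a _merge
# column marking each row's source
outer = left.merge(right, on="key", how="outer", indicator=True)
print(outer[["key", "_merge"]])
```

Key "a" appears only in the left frame and "d" only in the right, so their _merge values are 'left_only' and 'right_only' respectively.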
Originally published at https://ethanguyant.com on June 6, 2022.