Pandas: Introduction to the Library
Overview
Pandas is a fast, flexible, and easy-to-use open source data analysis and manipulation tool built on top of the Python programming language (pandas). Pandas excels at working with tabular data, which can be stored in pandas DataFrames. DataFrames are two-dimensional labeled data structures with row and column labels that can store various data types, including missing data. In addition to providing DataFrames as a data storage object, pandas also offers numerous operations that will be familiar to users of databases and spreadsheets.
Pandas Series and DataFrame objects build upon the NumPy array structure and provide flexibility (e.g. attaching data labels), data cleaning operations (e.g. working with missing data), and data transformation operations (e.g. groupings).
Tasks Pandas Excels At
- Handling of missing data (NaN or None)
- The addition or deletion of columns from DataFrames (size mutable)
- Group by functionality for aggregating and transforming data
- Label-based slicing, indexing, and subsetting of datasets
- Merging and Joining datasets
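The tasks listed above can be sketched in a few lines. This is a minimal illustration rather than a complete workflow; the sales and regions tables and all of their values are hypothetical and made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, invented for illustration only
sales = pd.DataFrame(
    {
        "store": ["A", "A", "B", "B"],
        "units": [10, np.nan, 7, 3],
    }
)
regions = pd.DataFrame({"store": ["A", "B"], "region": ["East", "West"]})

# Handle missing data: replace NaN with 0
sales["units"] = sales["units"].fillna(0)

# Add a column, then delete it again (DataFrames are size mutable)
sales["double_units"] = sales["units"] * 2
sales = sales.drop(columns="double_units")

# Group by store and aggregate the units sold
units_per_store = sales.groupby("store")["units"].sum()

# Label-based slicing: rows labeled 0 through 2 (inclusive) of one column
first_rows = sales.loc[0:2, "units"]

# Merge with a second dataset on the shared "store" column
merged = sales.merge(regions, on="store")
print(merged)
```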
A data science project typically includes an iterative cycle of data cleaning, data modeling, data analysis, and formatting of results; pandas is a tool that can support each of these tasks.
For more details on how pandas can be used for these tasks see:
Installing and Using Pandas
Install from PyPI
Pandas can be installed via pip from PyPI using:
$ pip install pandas
Import Pandas
Once pandas is installed it can be imported. By convention, pandas is imported under the alias pd.
import pandas as pd
Pandas Data Structures
There are three fundamental Pandas data structures: Series, DataFrame, and Index. Pandas offers numerous useful tools, methods, and functionality on top of these fundamental structures.
- Series: 1D labeled homogeneously-typed array of indexed data
- DataFrame: 2D labeled, size-mutable tabular structure which can accommodate heterogeneously-typed columns
- Index: Can be viewed as an immutable array or an ordered set
Conceptually it can be helpful to think of Pandas data structures as flexible containers of lower dimensional data (e.g. a DataFrame is a container for Series), and objects can be added to or removed from the containers.
Pandas Index
An Index object is the basic object for storing axis (row and column) labels for all pandas objects. The Index object is an immutable sequence.
Both the Series and DataFrame objects contain an explicit index which can be utilized to reference and modify the data.
The general syntax for constructing an Index is (documentation):
pd.Index(data=None, dtype=None, copy=False, name=None, tupleize_cols=True, **kwargs)
Parameters:
- data: array-like (1D): Can contain a Series, arrays, constants, dataclass or list-like objects
- dtype: NumPy dtype, default of object: If dtype is None pandas identifies the best fit, and when a dtype is provided pandas coerces to the specified dtype
- copy: boolean value to copy input data
- name: name to be stored in the index
- tupleize_cols: boolean value, when True pandas attempts to create a multi-index
Construct an Index object of integers
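A minimal sketch of constructing an integer Index follows. The values are arbitrary and chosen only for illustration. The last few lines also show the immutability mentioned earlier: item assignment on an Index raises a TypeError.

```python
import pandas as pd

# Build an Index from a list of integers (arbitrary example values)
idx = pd.Index([2, 3, 5, 7, 11])

print(idx)                 # the Index along with its inferred integer dtype
print(idx[1])              # positional access, similar to a list
print(idx[::2])            # slicing returns a new Index
print(idx.size, idx.ndim)  # number of elements and number of dimensions

# An Index is immutable, so item assignment raises a TypeError
try:
    idx[0] = 1
    assignment_failed = False
except TypeError:
    assignment_failed = True

print(assignment_failed)
```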
Pandas objects are designed to facilitate cross-dataset operations (e.g. joins). The Index object aids in this by following the conventions of Python's built-in set data structure, so unions, intersections, differences, and other combinations can be computed in a familiar way.
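The set-like behavior described above can be sketched with the Index methods intersection, union, difference, and symmetric_difference. The two indexes below are arbitrary examples.

```python
import pandas as pd

idx_a = pd.Index([1, 3, 5, 7, 9])
idx_b = pd.Index([2, 3, 5, 7, 11])

common = idx_a.intersection(idx_b)               # labels present in both
combined = idx_a.union(idx_b)                    # labels present in either
only_in_a = idx_a.difference(idx_b)              # labels in idx_a but not idx_b
in_one_only = idx_a.symmetric_difference(idx_b)  # labels in exactly one index

print(common)
print(combined)
print(only_in_a)
print(in_one_only)
```

The method form is used here on purpose: in recent pandas versions the & and | operators on an Index perform element-wise logical operations rather than set operations.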
For more information on Index attributes (e.g. .size), methods (e.g. .fillna()), and examples visit Pandas Documentation.
Pandas Series
A Series object contains two main components, the sequence of indices and the sequence of values, which can be accessed with the .index and .values attributes. A Series supports integer and label indexing and has a wide variety of methods for performing index-based operations.
The general syntax for constructing a Series is (documentation):
pd.Series(data=None, index=None, dtype=None, name=None, copy=False)
Parameters:
- data: array-like, iterable, dict or scalar value
- index: array-like or index (1d), will default to RangeIndex (0, 1, …, n). If data is dict-like and no index is provided, the keys of the data will be used for the index
- dtype: data type of the output series, if not specified data will be used to infer dtype
- copy: boolean value to copy input data
A Series can be created from a list or an array
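As a small sketch, the snippet below builds one Series from a Python list and another from a NumPy array, then inspects the .index and .values attributes. The values and the name "counts" are arbitrary examples.

```python
import numpy as np
import pandas as pd

# From a list: the index defaults to a RangeIndex
from_list = pd.Series([0.25, 0.5, 0.75, 1.0])

# From a NumPy array, with an optional name
from_array = pd.Series(np.arange(3), name="counts")

print(from_list.index)   # RangeIndex(start=0, stop=4, step=1)
print(from_list.values)  # the underlying NumPy array of values
print(from_list[1])      # access by the default integer label
print(from_array)
```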
The pandas Series object has an explicitly defined index for each element of the series. This index does not have to be an integer and does not need to be contiguous or sequential. The explicitly defined index provides flexibility and is an important difference between a NumPy array (implicit index) and a pandas Series (explicit index).
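To illustrate the explicit index, the sketch below uses string labels in one Series and non-contiguous, unordered integer labels in another. The labels and values are arbitrary. It uses .loc for label-based access and .iloc for position-based access to keep the two apart.

```python
import pandas as pd

values = [0.25, 0.5, 0.75, 1.0]

# String labels as the index
labeled = pd.Series(values, index=["a", "b", "c", "d"])

# Integer labels that are neither contiguous nor sorted
gapped = pd.Series(values, index=[2, 5, 3, 7])

by_label = labeled.loc["b"]    # look up by the label "b"
by_int_label = gapped.loc[5]   # look up by the label 5, not position 5
by_position = gapped.iloc[0]   # look up by position, ignoring labels

print(by_label, by_int_label, by_position)
```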
The explicit index of the Series makes it similar to a dictionary data structure, where keys are mapped to values. A Pandas Series can be created from a Python dictionary; the index is constructed from the dictionary's keys, which current pandas versions keep in insertion order (older versions sorted them). As with a dictionary, an element can be selected using square bracket [ ] notation; unlike a dictionary, however, the Series also supports slicing operations.
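A short sketch of the dictionary analogy follows, assuming a recent pandas version in which dictionary keys keep their insertion order. The stock dictionary is hypothetical example data.

```python
import pandas as pd

# Hypothetical example data: the dictionary keys become the index labels
stock = pd.Series({"pears": 4, "apples": 3, "cherries": 7})

index_order = list(stock.index)   # keys in insertion order
apples = stock["apples"]          # dictionary-style selection

# Unlike a dictionary, a Series supports slicing
label_slice = stock["pears":"apples"]  # label slice, both end labels included
position_slice = stock.iloc[0:2]       # positional slice, end position excluded

print(index_order)
print(apples)
print(label_slice)
print(position_slice)
```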
For more information on Series attributes (e.g. .index), methods (e.g. .count()), and examples visit Pandas Documentation.
Pandas DataFrame
A DataFrame is a two-dimensional, size-mutable, tabular data object. The DataFrame is a dictionary-like container for Series objects, with labeled rows and columns.
Conceptually, a DataFrame is a sequence of Series objects which share the same index.
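The container view can be sketched as follows: each column of a DataFrame is a Series that shares the DataFrame's index, and columns can be added or removed. The column names and values are arbitrary examples.

```python
import pandas as pd

df = pd.DataFrame(
    {"x": [1, 2, 3], "y": [10.0, 20.0, 30.0]},
    index=["a", "b", "c"],
)

# Selecting a column returns a Series that shares the DataFrame's index
x_column = df["x"]
same_index = x_column.index.equals(df.index)

# Add a new column to the container, then remove it again
df["z"] = df["x"] * df["y"]
removed = df.pop("z")  # pop removes the column and returns it as a Series

print(type(x_column).__name__, same_index)
print(removed)
print(df)
```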
The general syntax for constructing a DataFrame is (documentation):
pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
Parameters:
- data: ndarray, iterable, dict or DataFrame
- index: array-like or index, defaults to RangeIndex if there is no index information in the input data and no index is provided
- columns: labels used in the resulting DataFrame, which default to RangeIndex (0, 1, ..., n)
- dtype: data type of the output, only a single dtype is allowed
- copy: boolean value to copy input data
Creating a DataFrame from two Series
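One way to do this, sketched below, is to pass a dictionary that maps column names to Series objects. The price and quantity Series are hypothetical example data that share the same labels.

```python
import pandas as pd

# Two hypothetical Series with the same index labels
price = pd.Series({"pears": 1.2, "apples": 0.8, "cherries": 4.5})
quantity = pd.Series({"pears": 4, "apples": 3, "cherries": 7})

# The dictionary keys become column names; rows are aligned on the index labels
inventory = pd.DataFrame({"price": price, "quantity": quantity})

print(inventory)
print(inventory.loc["apples", "quantity"])  # select by row and column label
```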
A DataFrame can also be constructed from a list of dictionary objects
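In the sketch below each dictionary becomes one row and the dictionary keys become the columns. The records are hypothetical; the second record has an extra key to show that keys missing from a row are filled with NaN.

```python
import pandas as pd

# Hypothetical records: one dictionary per row
records = [
    {"name": "pears", "quantity": 4},
    {"name": "apples", "quantity": 3, "organic": True},
]

frame = pd.DataFrame(records)

print(frame)
print(frame.columns.tolist())
print(pd.isna(frame.loc[0, "organic"]))  # the first record had no "organic" key
```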
For more information on DataFrame attributes (e.g. .index), methods (e.g. .count()), and examples visit Pandas Documentation.
Summary
Pandas offers a powerful and flexible data analytics tool which excels at the various stages of a data science project including data cleaning, data modeling, data analysis, and formatting of final results for visualizations. The pandas library offers three fundamental data structures: Index, Series, and DataFrame. Each of these data structures comes with numerous tools, attributes, and methods. In general, the pandas data structures can be thought of as flexible containers of lower dimensional data that allow for objects to be added to or removed from the container.
More details and examples can be found in the Pandas Documentation.
Originally published at https://ethanguyant.com on June 5, 2022.