Pandas: Introduction to the Library

Ethan Guyant
Inquisitive Nature
5 min read · Jun 5, 2022


Overview

Pandas is a fast, flexible, and easy-to-use open source data analysis and manipulation tool, built on top of the Python programming language (pandas). Pandas excels at working with tabular data, which is stored in pandas DataFrames. A DataFrame is a two-dimensional structure with row and column labels whose columns can hold different data types, including missing data. In addition to providing DataFrames as a data storage object, pandas also offers numerous operations that will be familiar to users of databases and spreadsheets.

Pandas Series and DataFrame objects build upon the NumPy array structure and provide flexibility (e.g. attaching data labels), data cleaning operations (e.g. working with missing data), and data transformation operations (e.g. groupings).

Tasks Pandas Excels At

  • Handling of missing data (NaN or None)
  • The addition or deletion of columns from DataFrames (size mutable)
  • Group by functionality for aggregating and transforming data
  • Label-based slicing, indexing, and subsetting of datasets
  • Merging and Joining datasets
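The tasks listed above can be sketched in a few lines. This is a minimal illustration, not a recommended workflow; the sales and regions tables and their column names are hypothetical and exist only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented purely for illustration.
sales = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "units": [10.0, np.nan, 7.0, 5.0],
})

# Missing data: replace NaN with 0.
sales["units"] = sales["units"].fillna(0)

# Size mutability: add a column, then delete it again.
sales["double_units"] = sales["units"] * 2
sales = sales.drop(columns="double_units")

# Group by: total units per store.
totals = sales.groupby("store")["units"].sum()

# Label-based selection on the grouped result.
units_a = totals.loc["A"]

# Merging: attach a region to each row using a lookup table.
regions = pd.DataFrame({"store": ["A", "B"], "region": ["East", "West"]})
merged = sales.merge(regions, on="store", how="left")

print(totals)
print(merged)
```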

A data science project typically includes an iterative cycle of data cleaning, data modeling, data analysis, and formatting of results; pandas is a tool that can support each of these tasks.


Installing and Using Pandas

Install from PyPI

Pandas can be installed via pip from PyPI using:

$ pip install pandas

Import Pandas

After installation, Pandas can be imported. By convention, it is imported under the alias pd.

import pandas as pd

Pandas Data Structures

There are three fundamental Pandas data structures: Series, DataFrame, and Index. Pandas offers numerous useful tools, methods, and functionality on top of these fundamental structures.

  • Series: 1D labeled homogeneously-typed array of indexed data
  • DataFrame: 2D labeled, size-mutable tabular structure which can accommodate heterogeneously-typed columns
  • Index: Can be viewed as an immutable array or an ordered set

Conceptually it can be helpful to think of Pandas data structures as flexible containers of lower dimensional data (e.g. a DataFrame is a container for Series), and objects can be added to or removed from the containers.

Pandas Index

An Index object is the basic object for storing axis (row and column) labels for all pandas objects. The Index object is an immutable sequence.

Both the Series and DataFrame objects contain an explicit index which can be utilized to reference and modify the data.

The general syntax for constructing an Index is (see the pandas documentation):
pd.Index(data=None, dtype=None, copy=False, name=None, tupleize_cols=True, **kwargs)

Parameters:

  • data: array-like (1D), such as a list, a NumPy array, or a Series
  • dtype: NumPy dtype, default of object: If dtype is None, pandas infers the best fit; when a dtype is provided, pandas coerces the data to the specified dtype
  • copy: boolean value indicating whether to copy the input data
  • name: name to be stored in the index
  • tupleize_cols: boolean value; when True, pandas attempts to create a MultiIndex if possible

Construct an Index object of integers
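As a minimal sketch, an Index of integers can be built from a plain Python list. The values and the name "primes" are arbitrary choices for illustration. The last step shows the immutability mentioned earlier: item assignment on an Index raises a TypeError.

```python
import pandas as pd

# Build an Index from a list of integers; the name is optional.
idx = pd.Index([2, 3, 5, 7, 11], name="primes")

print(idx)
print(idx.size)   # number of elements: 5
print(idx[1])     # positional access works like an array: 3
print(idx[::2])   # slicing returns a new Index holding 2, 5, 11

# An Index is immutable, so item assignment is rejected.
try:
    idx[0] = 1
    mutated = True
except TypeError as err:
    mutated = False
    print("Index cannot be modified in place:", err)
```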

Pandas objects are designed to facilitate cross-dataset operations (e.g. joins). The Index object aids in this by following many of the conventions of Python's built-in set data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way.
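The set-like behavior can be sketched with two small Index objects. The values are arbitrary; the methods used (intersection, union, difference, symmetric_difference) are part of the pandas Index API.

```python
import pandas as pd

ind_a = pd.Index([1, 3, 5, 7, 9])
ind_b = pd.Index([2, 3, 5, 7, 11])

common = ind_a.intersection(ind_b)             # labels in both: 3, 5, 7
combined = ind_a.union(ind_b)                  # labels in either: 1, 2, 3, 5, 7, 9, 11
only_a = ind_a.difference(ind_b)               # labels only in ind_a: 1, 9
either = ind_a.symmetric_difference(ind_b)     # labels in exactly one: 1, 2, 9, 11

print(common, combined, only_a, either, sep="\n")
```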

For more information on Index attributes (e.g. .size), methods (e.g. .fillna()), and examples, visit the pandas documentation.

Pandas Series

A Series object contains two main components, a sequence of index labels and a sequence of values, which can be accessed with the .index and .values attributes. A Series supports integer and label indexing and has a wide variety of methods for performing index-based operations.

The general syntax for constructing a Series is (see the pandas documentation):
pd.Series(data=None, index=None, dtype=None, name=None, copy=False)

Parameters:

  • data: array-like, iterable, dict or scalar value
  • index: array-like or Index (1D); defaults to RangeIndex (0, 1, …, n-1). If data is dict-like and no index is provided, the keys of the data are used as the index
  • dtype: data type of the output Series; if not specified, the dtype is inferred from data
  • copy: boolean value to copy input data

A Series can be created from a list or an array
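A minimal sketch of both cases follows; the values and the name "tens" are arbitrary. Because no index is passed, pandas assigns a default RangeIndex.

```python
import numpy as np
import pandas as pd

# From a Python list.
from_list = pd.Series([0.25, 0.5, 0.75, 1.0])

# From a NumPy array, with an optional name.
from_array = pd.Series(np.array([10, 20, 30]), name="tens")

print(from_list.index)    # RangeIndex(start=0, stop=4, step=1)
print(from_list.values)   # the underlying NumPy array of values
print(from_list[1])       # element with index label 1: 0.5
print(from_array)
```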

The pandas Series object has an explicitly defined index for each element of the series. This index does not have to be an integer and does not need to be contiguous or sequential. The explicitly defined index provides flexibility and is an important difference between a NumPy array (implicit index) and a pandas Series (explicit index).
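To illustrate the explicit index, the sketch below builds one Series with string labels and one with non-sequential integer labels. The labels and values are arbitrary. Using .loc (label-based) and .iloc (position-based) makes the distinction between the two kinds of access explicit.

```python
import pandas as pd

# String labels instead of integers.
s = pd.Series([0.25, 0.5, 0.75, 1.0], index=["a", "b", "c", "d"])
print(s["b"])        # 0.5

# Integer labels that are neither contiguous nor sorted.
t = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])
print(t.loc[5])      # label 5 -> 0.5
print(t.iloc[0])     # first position (label 2) -> 0.25
```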

The explicit index of the Series provides similarities to a dictionary data structure, where keys are mapped to values. A Pandas Series can be created from a Python dictionary; by default the index is constructed from the dictionary's keys (older pandas releases sorted the keys, while recent releases keep the dictionary's insertion order). Similar to a dictionary, an element can be selected using square bracket [ ] notation; unlike a dictionary, however, the Series also supports slicing operations.
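The dictionary analogy can be sketched as follows. The fruit counts are hypothetical values chosen for illustration. Note that a label-based slice with .loc includes both endpoints, while a positional slice with .iloc excludes the stop position.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
stock = pd.Series({"apples": 3, "bananas": 5, "cherries": 7})

print(stock.index)                      # keys become the index labels
print(stock["bananas"])                 # dictionary-style lookup: 5

# Slicing, which a plain dict does not support.
print(stock.loc["apples":"bananas"])    # label slice, both endpoints included
print(stock.iloc[0:2])                  # positional slice, stop excluded
```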

For more information on Series attributes (e.g. .index), methods (e.g. .count()), and examples, visit the pandas documentation.

Pandas DataFrame

A DataFrame is a two-dimensional, size-mutable, tabular data object. The DataFrame is a dictionary-like container for Series objects, with labeled rows and columns.

Conceptually, a DataFrame is a sequence of Series objects which share the same index.

The general syntax for constructing a DataFrame is (see the pandas documentation):
pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)

Parameters:

  • data: ndarray, iterable, dict or DataFrame
  • index: array-like or Index; defaults to RangeIndex if there is no index information in the input data and no index is provided
  • columns: column labels for the resulting DataFrame; defaults to RangeIndex (0, 1, ..., n-1) when no column labels are provided
  • dtype: data type to force on the output; only a single dtype is allowed, and if not specified the dtype is inferred from data
  • copy: boolean value to copy input data

Creating a DataFrame from two Series
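A minimal sketch: two Series that share the same index labels are passed as a dictionary, and the dictionary keys become the column names. The price and stock values are hypothetical.

```python
import pandas as pd

# Two hypothetical Series sharing the same index labels.
price = pd.Series({"apples": 0.5, "bananas": 0.25, "cherries": 3.0})
stock = pd.Series({"apples": 3, "bananas": 5, "cherries": 7})

# Dictionary keys become column names; rows are aligned on the shared index.
df = pd.DataFrame({"price": price, "stock": stock})

print(df)
print(df.index)                      # row labels
print(df.columns)                    # column labels
print(df["stock"])                   # a single column is a Series
print(df.loc["bananas", "price"])    # 0.25
```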

A DataFrame can also be constructed from a list of dictionary objects
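In the sketch below, each dictionary becomes one row and the keys become column names. The keys and values are arbitrary. Where a dictionary lacks a key, pandas fills the corresponding cell with NaN.

```python
import pandas as pd

# Each dict is one row; keys do not have to match across rows.
records = [{"a": 1, "b": 2}, {"b": 3, "c": 4}]
df = pd.DataFrame(records)

print(df)
print(df.columns)   # columns a, b, c
print(df.isna())    # True where a row had no value for that column
```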

For more information on DataFrame attributes (e.g. .index), methods (e.g. .count()), and examples, visit the pandas documentation.

Summary

Pandas offers a powerful and flexible data analytics tool which excels at the various stages of a data science project, including data cleaning, data modeling, data analysis, and formatting of final results for visualizations. The pandas library offers three fundamental data structures: Index, Series, and DataFrame. Each of these data structures comes with numerous tools, attributes, and methods. In general, the pandas data structures can be thought of as flexible containers of lower dimensional data that allow for objects to be added to or removed from the container.

More details and examples can be found in the Pandas Documentation.

If you enjoyed this article and found it helpful don’t forget to give it a clap, follow and subscribe to the INQUISITIVE NATURE publication!

Originally published at https://ethanguyant.com on June 5, 2022.
