pandas for Data Science: Part 1

Data Structures in pandas

Rukshan Pramoditha
Data Science 365
10 min read · May 12, 2020

Hello! Welcome to the first tutorial in this pandas series: Data Structures in pandas. In this tutorial, I discuss the following topics with examples.

Topics discussed

  • Introduction to pandas: What is the pandas library? Definition of data structure, Overview of data structures in pandas, Import convention
  • The pandas Series: Introduction, Series creation in different ways, Indexing and slicing, Mutability of Series, Operations and mathematical functions
  • The pandas DataFrame: Introduction, DataFrame creation in different ways, pandas read_*, Value and size mutability of DataFrame, Column as a Series
  • The pandas Panel

pandas is a fast, powerful, flexible, and easy-to-use data analysis library. It is built on top of NumPy and provides features that NumPy does not. pandas is designed for tabular, heterogeneous data. By contrast, NumPy is best suited for working with homogeneous numerical data.

The key to learning pandas is to understand its data structures. There are three main data structures in pandas:

  • Series — 1D
  • DataFrame — 2D
  • Panel — 3D

The most widely used pandas data structures are the Series and the DataFrame. Simply put, a Series is similar to a single column of data, while a DataFrame is similar to a sheet with rows and columns. Likewise, a Panel can hold many DataFrames.

Import convention

Throughout this tutorial, I use the following import convention for pandas.
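The convention looks like this. It is a minimal sketch: the NumPy import is an addition here, included only because later examples build Series and DataFrames from ndarrays, and the sample values are illustrative.

```python
import numpy as np
import pandas as pd

# Everything in pandas is reached through the pd alias.
s = pd.Series([1, 2, 3])
df = pd.DataFrame({"a": [1, 2, 3]})
print(type(s).__name__, type(df).__name__)
```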

By convention, the pandas library is imported under the alias pd. By doing this, you can access the functions, classes, and sub-packages in the pandas library through the pd. prefix (for example, pd.Series(), pd.DataFrame(), pd.DataFrame.index).

The pandas Series

The Series object is the one-dimensional data structure in pandas. A series consists of two components.

  • One-dimensional data (Values)
  • Index

A series is composed of two arrays associated with each other. The main array (the values) holds the one-dimensional data, and each of its elements is associated with a label held in the other array (the labels), called the index. To see the two arrays separately, use the index and values attributes of the series. Because a series is one-dimensional, it has a single axis (dimension), the index, and the index values (0, 1, 2, 3 by default) are called axis labels.

Series creation: Introduction

The general construct for creating a Series data structure is:
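A minimal sketch of that construct, assuming the keyword names of the current pd.Series constructor (data, index, dtype, name); the sample values and the name "example" are illustrative only:

```python
import pandas as pd

# General shape of the call: pd.Series(data, index=..., dtype=..., name=...)
s = pd.Series(data=[10, 20, 30], index=None, dtype="float64", name="example")
print(s)
print(list(s.index))  # no index given, so the labels default to 0..n-1
```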

To create a series, you simply call the Series() class constructor and pass the data to be included in it as an argument. Here, data can be one of the following:

  • A one-dimensional ndarray
  • A Python list
  • A Python dictionary
  • A scalar value

If an index is not specified, the default index [0,… n-1] will be created, where n is the length of the data. A series can be created in a variety of ways.

Series creation: Using a one-dimensional ndarray

The following example creates a Series of the first five odd numbers.
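A sketch of such an example (the variable names are assumptions; the original listing is not reproduced here):

```python
import numpy as np
import pandas as pd

an_array = np.array([1, 3, 5, 7, 9])  # the first five odd numbers
odd = pd.Series(an_array)
print(odd)  # the left column shows the default labels 0..4
```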

If you do not specify an index, pandas assigns integer labels increasing from 0 by default. In this case, the labels coincide with the positions of the elements in the series. If you want to create the series with meaningful labels, pass them to the index parameter during series creation. The labels go in a list of the same length as the data array.
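For instance, assuming the letters "a" to "e" as illustrative labels:

```python
import numpy as np
import pandas as pd

labels = ["a", "b", "c", "d", "e"]  # one label per element
odd = pd.Series(np.array([1, 3, 5, 7, 9]), index=labels)
print(odd)
print(odd["c"])  # look up a value by its label
```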

If you want to individually see the two arrays that make up this series, you can call index and values attributes of the series.
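A short sketch of the two attributes, reusing the same illustrative data:

```python
import numpy as np
import pandas as pd

odd = pd.Series(np.array([1, 3, 5, 7, 9]), index=["a", "b", "c", "d", "e"])

print(odd.index)   # the array of labels (a pandas Index object)
print(odd.values)  # the array of values (a NumPy ndarray)
```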

Caution: When you create a Series using an ndarray, the ndarray must be one-dimensional. If you pass a multi-dimensional array as the data parameter, pandas raises an exception.
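The sketch below illustrates the caution. It catches a broad Exception because the exact exception class has varied between pandas versions (recent versions raise ValueError); flattening with ravel() is one possible workaround, not the only one.

```python
import numpy as np
import pandas as pd

two_d = np.array([[1, 2], [3, 4]])

try:
    pd.Series(two_d)  # a 2D array is not valid Series data
    failed = False
except Exception as exc:
    failed = True
    print(type(exc).__name__, exc)

# Flattening first gives a valid one-dimensional input.
flat = pd.Series(two_d.ravel())
print(flat)
```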

Note 1: The values of a Series can be strings, integers, floats, or booleans.

Note 2: You can also specify the data type of the Series values when it is created by passing a valid data type to the dtype parameter.
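A small sketch of both notes, with illustrative values:

```python
import pandas as pd

# Note 2: integers stored as floats because dtype is given explicitly.
as_float = pd.Series([1, 2, 3], dtype="float64")
print(as_float)

# Note 1: a series of booleans gets the bool dtype.
flags = pd.Series([True, False, True])
print(flags.dtype)
```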

Series creation: Using a Python list

To create a series using a Python list, you can just pass a list to the data parameter of the Series() class constructor.
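For example (the list contents and labels are illustrative):

```python
import pandas as pd

a_list = [2, 4, 6, 8, 10]
even = pd.Series(a_list, index=["a", "b", "c", "d", "e"])
print(even)
```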

Series creation: Using a Python dictionary

To create a series using a Python dictionary, you can just pass a dictionary to the data parameter of the Series() class constructor. This time, the arrays of the index and values are filled with the corresponding keys and values of the dictionary.
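For example, with an illustrative dictionary:

```python
import pandas as pd

a_dict = {"a": 1, "b": 2, "c": 3}
s = pd.Series(a_dict)
print(s)  # keys become the index, values become the data
```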

Series creation: Using a scalar value

We can also create a Series from a scalar value. If you do not specify the index argument, the series holds a single element with the default label 0. If you specify the index, the value is repeated for each of the specified index labels.
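Both cases, sketched with the illustrative scalar 5:

```python
import pandas as pd

single = pd.Series(5)                             # one element, label 0
repeated = pd.Series(5, index=["a", "b", "c"])    # value repeated per label
print(single)
print(repeated)
```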

Selecting elements from a Series

The indexing and slicing that apply to NumPy arrays also extend to series because the pandas library is built on top of NumPy. To learn more about indexing and slicing, read this tutorial.

  • You can select a single element using the index number or index label of the series.
  • You can use the slice : notation with index numbers to select a range of elements from a series.
  • You can also use a list of index numbers or index labels to select multiple elements from a series.
  • You can also use conditions and Boolean operators to select elements from a series.
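The four kinds of selection above can be sketched as follows. Using .iloc for positions and .loc for labels is a choice made here for explicitness: recent pandas versions warn when a plain integer such as s[0] is used as a position on a series with non-integer labels.

```python
import pandas as pd

s = pd.Series([1, 3, 5, 7, 9], index=["a", "b", "c", "d", "e"])

first = s.iloc[0]                # single element by index number (position)
by_label = s.loc["b"]            # single element by index label
sliced = s.iloc[1:4]             # slice of positions 1, 2, 3
picked = s.loc[["a", "e"]]       # list of labels
filtered = s[(s > 3) & (s < 9)]  # condition combined with a Boolean operator

print(first, by_label)
print(sliced)
print(picked)
print(filtered)
```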

Assigning values to the elements

Series are mutable, which means that you can change the value of an element in the series after it has been initialized.
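A minimal sketch of changing values in place, with illustrative data:

```python
import pandas as pd

s = pd.Series([1, 3, 5], index=["a", "b", "c"])

s.loc["b"] = 30   # assign by label
s.iloc[0] = 10    # assign by position
print(s)
```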

Operations and mathematical functions on series

Arithmetic operators (+, -, *, /) and mathematical functions that apply to NumPy arrays also apply to series.

You can simply write the arithmetic expressions for series.

For other operations like getting the mean, you can use the methods of a series object or the NumPy mathematical functions. However, with the NumPy mathematical functions, you must pass an instance of the series as an argument.
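A short sketch of both styles, with illustrative values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

doubled = s * 2          # arithmetic is applied element by element
shifted = s + 10

mean_method = s.mean()   # method of the series object
mean_numpy = np.mean(s)  # NumPy function, with the series passed in
roots = np.sqrt(s)       # element-wise NumPy function, returns a Series

print(doubled.tolist(), shifted.tolist())
print(mean_method, mean_numpy)
print(roots)
```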

The pandas DataFrame

A DataFrame is a two-dimensional data structure composed of rows and columns, much like a simple spreadsheet or a SQL table. Each column of a DataFrame is a pandas Series. These columns must be of the same length, but they can be of different data types (float, int, bool, and so on). DataFrames are both value-mutable and size-mutable. This lets us perform operations that alter values held within the DataFrame or add/delete columns to/from the DataFrame. A Series, by contrast, is only value-mutable: its values can be changed, but its length cannot.

A DataFrame consists of three components.

  • Two-dimensional data (Values)
  • Row index
  • Column index

The DataFrame has two index arrays. Each label in the first array is associated with all the values in the row. The labels of the second array are associated with a particular column. There are two axes (dimensions) for a DataFrame which are commonly referred to as axis 0 and 1, or the row/index axis and the column axis respectively.

DataFrame creation: Introduction

The general construct for creating a DataFrame data structure is:
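A minimal sketch of that construct, assuming the keyword names of the current pd.DataFrame constructor (data, index, columns, dtype); the labels and values are illustrative only:

```python
import pandas as pd

# General shape of the call: pd.DataFrame(data, index=..., columns=..., dtype=...)
df = pd.DataFrame(
    data=[[1, 2], [3, 4]],
    index=["r1", "r2"],      # row labels
    columns=["c1", "c2"],    # column labels
    dtype="float64",
)
print(df)
```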

A DataFrame is the most commonly used data structure in pandas. The DataFrame() class constructor accepts many different types of arguments:

  • A two-dimensional ndarray
  • A dictionary of dictionaries
  • A dictionary of lists
  • A dictionary of series

Row labels and column labels can be specified along with the data. If they are not specified, pandas generates them from the input data (for example, default integer labels starting at 0 for an ndarray, or the dictionary keys for a dictionary). A DataFrame can be created in a variety of ways.

DataFrame creation: Using a two-dimensional ndarray

If you want to see the individual components which make up the DataFrame, you can call values, index and columns attributes of the DataFrame.
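A sketch with an illustrative 3 x 2 array and made-up labels, followed by the three attributes:

```python
import numpy as np
import pandas as pd

data = np.arange(1, 7).reshape(3, 2)  # [[1, 2], [3, 4], [5, 6]]
df = pd.DataFrame(data, index=["a", "b", "c"], columns=["x", "y"])
print(df)

print(df.values)   # two-dimensional data as an ndarray
print(df.index)    # row index
print(df.columns)  # column index
```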

DataFrame creation: Using a dictionary of dictionaries

Column names are created from the keys of the main dictionary, and the row index is created from the keys of the sub dictionaries.
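For example, with hypothetical subjects as the outer keys and hypothetical student names as the inner keys:

```python
import pandas as pd

data = {
    "math": {"alice": 90, "bob": 75},
    "physics": {"alice": 85, "bob": 80},
}
df = pd.DataFrame(data)
print(df)  # columns: math, physics; rows: alice, bob
```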

DataFrame creation: Using a dictionary of lists

If you want to see the individual components which make up the DataFrame, you can call values, index and columns attributes of the DataFrame.
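A sketch with hypothetical names and marks. Each key becomes a column, and the lists must all have the same length:

```python
import pandas as pd

data = {
    "name": ["Alice", "Bob", "Carol"],
    "marks": [90, 75, 82],
}
df = pd.DataFrame(data)
print(df)
print(df.index)    # default row labels 0, 1, 2
print(df.columns)  # column labels taken from the dictionary keys
```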

DataFrame creation: Using a dictionary of series
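As a hedged sketch with hypothetical data: each dictionary key becomes a column, and the series are aligned on their index labels. A label missing from one series appears as NaN in that column.

```python
import pandas as pd

data = {
    "math": pd.Series([90, 75], index=["alice", "bob"]),
    "physics": pd.Series([85, 70], index=["alice", "carol"]),
}
df = pd.DataFrame(data)
print(df)  # rows cover alice, bob and carol; gaps are filled with NaN
```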

DataFrame creation: Using pandas read_*

pandas supports many different file formats and data sources, such as CSV, Excel, SQL, and JSON. Each has a reader function whose name starts with the prefix read_.

Image copyright: pandas official website
  • pandas read_csv() function: Reads a comma-separated values (csv) file or a text file into a pandas DataFrame.
  • pandas read_excel() function: Reads an Excel file into a pandas DataFrame.
  • pandas read_html() function: Reads HTML tables into a list of DataFrames.
  • pandas read_sql() function: Reads a SQL query or database table into a pandas DataFrame.

Now, I discuss an example of reading a text (.txt) file into a pandas DataFrame. For this, I use the pandas read_csv() function. CSV stands for comma-separated values, and the comma is the default delimiter. However, read_csv() accepts other delimiters, such as the tab character, as well.

Often, data is stored in .txt files with different kinds of delimiters. The sep parameter can be used to specify the delimiter of a particular text file.

Text data with Tab delimiter
Using pandas read_csv() to read text (.txt) data
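A self-contained sketch of reading tab-delimited text. The data is hypothetical, and io.StringIO stands in for a file on disk so the example runs anywhere; with a real file you would pass its path as the first argument instead.

```python
import io

import pandas as pd

# Tab-separated text, as it might appear in a .txt file.
text = "name\tmarks\nAlice\t90\nBob\t75\n"

df = pd.read_csv(io.StringIO(text), sep="\t")  # sep sets the delimiter
print(df)
```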

Selecting elements from a DataFrame

I will discuss this topic in a separate article.

Assigning values to the elements and adding new columns

DataFrames are both value-mutable and size-mutable. This means that you can change values held within the DataFrame or add/delete columns to/from the DataFrame.

Value mutability (changing values)
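A minimal sketch of changing a single value, using hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "marks": [90, 75]})

df.loc[1, "marks"] = 80  # row label 1, column "marks"
print(df)
```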

Size mutability (adding new rows and columns)

Adding a new column
Adding a new row
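Both operations, sketched on hypothetical data. Assigning to a new column name adds a column, and assigning to a row label that does not exist yet through .loc adds a row; this is one of several ways to add rows.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "marks": [90, 75]})

# Adding a new column, computed from an existing one.
df["passed"] = df["marks"] >= 80

# Adding a new row under the next unused integer label.
df.loc[len(df)] = ["Carol", 82, True]
print(df)
```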

I will discuss these operations in detail later.

Each column in a DataFrame is a Series

If you’re just interested in working with the data in the column marks, you can extract it as a series.
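For example, with a hypothetical DataFrame that has a marks column:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob", "Carol"], "marks": [90, 75, 82]})

marks = df["marks"]          # selecting one column returns a Series
print(type(marks).__name__)
print(marks.name)            # the series keeps the column name
print(marks.mean())
```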

The pandas Panel

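A Panel is a 3D array. It is not commonly used, and it is not as easily displayed on screen or visualized as the other two because of its 3D nature. It is generally used for 3D time-series data. Note that Panel was deprecated in pandas 0.20.0 and removed in pandas 0.25.0, so it is only available in older versions; for 3D data, the pandas documentation recommends a DataFrame with a MultiIndex or the xarray package. The three axis names are as follows: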

  • items: This is axis 0. Each item corresponds to a DataFrame structure.
  • major_axis: This is axis 1. Each item corresponds to the rows of the DataFrame structure.
  • minor_axis: This is axis 2. Each item corresponds to the columns of each DataFrame structure.

As with Series and DataFrames, there are different ways to create Panel objects.

Panel creation: Using a 3D NumPy array
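Because pd.Panel no longer exists in pandas 0.25.0 and later, a Panel call cannot run on current versions. The sketch below is therefore an alternative rather than the original Panel example: it holds the same kind of 3D NumPy array in a DataFrame with a MultiIndex. The labels and the array contents are hypothetical.

```python
import numpy as np
import pandas as pd

# 3D array shaped (items, major_axis, minor_axis) = (2, 3, 4).
data = np.arange(24).reshape(2, 3, 4)

items = ["item1", "item2"]
major = ["r1", "r2", "r3"]
minor = ["c1", "c2", "c3", "c4"]

# Stack the items on top of each other and label each row with (item, row).
index = pd.MultiIndex.from_product([items, major], names=["items", "major_axis"])
df = pd.DataFrame(data.reshape(-1, 4), index=index, columns=minor)
print(df)

# Selecting one item gives back an ordinary 2D DataFrame.
print(df.loc["item2"])
```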

Data Science 365

This tutorial was designed and created by Rukshan Pramoditha, the Author of Data Science 365 Blog.

Technologies used in this tutorial

  • Python
  • NumPy
  • pandas
  • Jupyter Notebook

2020–05–12
