5 Fundamental Operations on a Pandas DataFrame
Tips and Tricks for Data Science
Pandas is a powerful and easy-to-use software library written in the Python programming language, and is used for data manipulation and analysis.
Installing pandas: https://pypi.org/project/pandas/
pip install pandas
What is a Pandas DataFrame?
A pandas DataFrame is a two-dimensional data structure that stores data in tabular form. Every row and column is labeled, and each column can hold data of any type.
Here is an example:
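The snippet below is a minimal sketch of such a table. The names, ages, and row labels are made up for illustration.

```python
import pandas as pd

# Illustrative data: each key becomes a column label, each list a column.
people = pd.DataFrame(
    {"Name": ["John", "Adam", "Jane"], "Age": [25, 18, 30]},
    index=["a", "b", "c"],  # custom row labels, chosen for this sketch
)

print(people)         # the table with its row and column labels
print(people.dtypes)  # one data type per column
```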
1. Creating a pandas DataFrame
The pandas.DataFrame constructor:
pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
data
This parameter serves as the input used to build the DataFrame. It can be a NumPy ndarray, an iterable, a dict, or another DataFrame. An ndarray is a multidimensional container of items of the same type and size. An iterable is any Python object capable of returning its members one at a time, which allows it to be iterated over in a for loop. Some examples of iterables are lists, tuples and sets. A dict can contain pandas Series, arrays, constants, or list-like objects as its values.
index
This parameter can be an Index or an array-like object and provides the row labels of the resulting DataFrame. If the input data carries no index information and no index is provided, the row labels default to a RangeIndex.
columns
This parameter can be an Index or an array-like object and provides the column labels of the resulting DataFrame. If the input data carries no column labels and none are provided, the column labels default to a RangeIndex.
dtype
Each column in a DataFrame has a single data type. This parameter forces one data type on the whole DataFrame, and only a single dtype is allowed. By default, the data type of each column is inferred from the data.
copy
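When this parameter is set to True and the input data is a DataFrame or a 2D ndarray, the data is copied into the resulting DataFrame. In older pandas releases copy defaults to False. Since pandas 1.3 it defaults to None, which behaves like False for DataFrame or 2D ndarray input and like True for dict input.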
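The constructor parameters described above can be sketched together as follows. The array contents and the labels (row_1, a, b, c) are hypothetical, chosen only to show the effect of each argument.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x3 array of integers used as the input data.
values = np.array([[1, 2, 3], [4, 5, 6]])

# No index/columns given: both default to a RangeIndex (0, 1, 2, ...).
default_df = pd.DataFrame(values)

# Explicit row labels, column labels, and a forced data type.
labeled_df = pd.DataFrame(
    values,
    index=["row_1", "row_2"],
    columns=["a", "b", "c"],
    dtype="float64",
)

# copy=True guarantees the DataFrame does not share memory with `values`.
copied_df = pd.DataFrame(values, copy=True)
values[0, 0] = 99  # modifying the array afterwards leaves copied_df unchanged

print(default_df)
print(labeled_df.dtypes)
print(copied_df.iloc[0, 0])
```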
Creating a Pandas DataFrame from a Python Dictionary
import pandas as pd
d = {'Name' : ['John', 'Adam', 'Jane'], 'Age' : [25, 18, 30]}
pd.DataFrame(d)
The index parameter can be used to change the default row index and the columns parameter can be used to change the order of the keys:
d = {'Name' : ['John', 'Adam', 'Jane'], 'Age' : [25, 18, 30]}
pd.DataFrame(d, index=[10, 20, 30], columns=['Age', 'Name'])
Creating a Pandas DataFrame from a List
l = [['John', 25], ['Adam', 18], ['Jane', 30]]
pd.DataFrame(l, columns=['Name', 'Age'])
Creating a Pandas DataFrame from a File
In most data science workflows, the dataset is stored in a file, commonly in a format such as CSV (Comma Separated Values). The function pandas.read_csv() loads the data from a CSV file, along with its labels, into a DataFrame.
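As a minimal sketch, the call looks like this. To keep the example self-contained, an in-memory buffer stands in for a file on disk; the CSV contents and the file name mentioned in the comment are hypothetical.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk; a path such as "people.csv" (hypothetical)
# could be passed to read_csv instead of this in-memory buffer.
csv_text = "Name,Age\nJohn,25\nAdam,18\nJane,30\n"

# By default, the first line of the CSV is used as the column labels.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```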
2. Selecting Rows and Columns from a Pandas DataFrame
Selecting Columns from a Pandas DataFrame
Columns can be selected using their column names.
df[['column_1', 'column_2']]
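A short sketch of column selection, assuming a small sample DataFrame (the City column and its values are illustrative): a single label returns a Series, while a list of labels returns a DataFrame.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Adam", "Jane"],
    "Age": [25, 18, 30],
    "City": ["Paris", "Oslo", "Rome"],  # illustrative extra column
})

ages = df["Age"]              # single label -> Series
subset = df[["Name", "Age"]]  # list of labels -> DataFrame

print(ages)
print(subset)
```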
Selecting Rows from a Pandas DataFrame
Pandas provides two attributes for selecting rows from a DataFrame: loc and iloc. loc is label-based, which means that the row label has to be specified, and iloc is integer-based, which means that the integer position has to be specified.
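The difference is easiest to see when the row labels are not simply 0, 1, 2. The sketch below assumes a sample DataFrame with the row labels 10, 20 and 30.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["John", "Adam", "Jane"], "Age": [25, 18, 30]},
    index=[10, 20, 30],
)

by_label = df.loc[20]     # the row whose label is 20
by_position = df.iloc[1]  # the second row (positions start at 0)

label_slice = df.loc[10:20]    # label slices include both endpoints
position_slice = df.iloc[0:2]  # position slices exclude the stop position

print(by_label)
print(label_slice)
```

Both by_label and by_position refer to the same row here, because the label 20 sits at position 1.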
3. Inserting Rows and Columns to a Pandas DataFrame
Inserting Rows to a Pandas DataFrame
One method of inserting a row into a DataFrame is to create a pandas.Series object and add it at the end of the DataFrame. Older code uses the pandas.DataFrame.append() method for this, but that method was deprecated in pandas 1.4 and removed in pandas 2.0, where pandas.concat() takes its place. The column labels of the DataFrame serve as the index of the Series object.
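Because DataFrame.append() is not available in pandas 2.0 and later, the following sketch uses pandas.concat() instead. The new row ("Mary", 22) is made-up data.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Adam", "Jane"], "Age": [25, 18, 30]})

# The index of the Series matches the column labels of the DataFrame.
new_row = pd.Series(["Mary", 22], index=df.columns)

# Turn the Series into a one-row DataFrame and concatenate it at the end.
# ignore_index=True renumbers the rows from 0.
# Note: the one-row frame holds generic objects, so the dtype of "Age"
# may change to object after concatenation.
df = pd.concat([df, new_row.to_frame().T], ignore_index=True)
print(df)
```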
Inserting Columns to a Pandas DataFrame
One easy method of adding a column to a DataFrame is by just referring to the new column and assigning values.
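A minimal sketch of this, assuming a sample DataFrame and illustrative column names:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Adam", "Jane"], "Age": [25, 18, 30]})

# Assigning to a label that does not exist yet creates the column.
df["City"] = ["Paris", "Oslo", "Rome"]  # illustrative values

# A new column can also be derived from existing ones.
df["Age in 5 Years"] = df["Age"] + 5

print(df)
```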
4. Deleting Rows and Columns from a Pandas DataFrame
Deleting Rows from a Pandas DataFrame
A row can be deleted using the method pandas.DataFrame.drop() with its row label.
To delete rows based on a column value, the labels of the matching rows are obtained using the DataFrame.index attribute, and those rows are then deleted using the pandas.DataFrame.drop() method.
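Both approaches can be sketched as follows, assuming a sample DataFrame with the row labels 10, 20 and 30. Note that drop() returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["John", "Adam", "Jane"], "Age": [25, 18, 30]},
    index=[10, 20, 30],
)

# Drop by row label.
without_20 = df.drop(20)

# Drop by column value: find the labels of the matching rows, then drop them.
jane_index = df[df["Name"] == "Jane"].index
without_jane = df.drop(jane_index)

print(without_20)
print(without_jane)
```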
Deleting Columns from a Pandas DataFrame
A column can be deleted from a DataFrame using the method pandas.DataFrame.drop(), either by its label or by its position, in which case the label is first looked up in DataFrame.columns.
The axis argument is set to 1 when dropping columns, and 0 when dropping rows.
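A short sketch of dropping columns, assuming a sample DataFrame with an illustrative City column. To drop by position, the label is looked up through df.columns first.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Adam", "Jane"],
    "Age": [25, 18, 30],
    "City": ["Paris", "Oslo", "Rome"],  # illustrative extra column
})

# Drop by label; axis=1 tells drop to look among the columns.
without_city = df.drop("City", axis=1)

# Drop by position: position 1 holds the "Age" column.
without_second = df.drop(df.columns[1], axis=1)

# The columns keyword is an equivalent alternative to axis=1.
also_without_city = df.drop(columns="City")

print(without_city)
print(without_second)
```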
5. Sorting a Pandas DataFrame
A Pandas DataFrame can be sorted using the pandas.DataFrame.sort_values() method. The by parameter takes the label of the column to sort by, and ascending is set to True for sorting in ascending order and to False for sorting in descending order.
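As a minimal sketch, assuming a sample DataFrame of names and ages:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Adam", "Jane"], "Age": [25, 18, 30]})

# Ascending order is the default.
youngest_first = df.sort_values(by="Age")

# ascending=False reverses the order.
oldest_first = df.sort_values(by="Age", ascending=False)

print(youngest_first)
print(oldest_first)
```

sort_values() returns a new, sorted DataFrame. The original row labels are kept, so they appear out of order in the result.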