Pandas Foundations for Data Science

Pandas is a python data analysis library built on numpy package.It is mainly used for data munging and preparation. They provide high-performance, easy to use data structures and data analysis tools. Data structures with lot of functionality such as providing meaningful labels, handle missing data, time series function, relational operations.

Data Ingestion

To use the pandas library, it must first be imported. pd is the most common alias used for pandas.

import pandas as pd

The two main data structures in pandas are Series, and DataFrame.

Series is a 1-D labeled array with index capable of holding data of any type (integer, string, float, python objects, etc.). Indexes are sequence of labels which are immutable(like dictionary keys). Syntax is pd.Series(data, index)

pd.Series([2, 4, 6, 8],  index=['a',  'b',  'c',  'd'])

DataFrame can be called as a bunch of series objects put together to share same index. DataFrames are 2D array with Series as their columns. It can be created using pd.DataFrame() with various inputs such as lists, dictionary, ndarray, other files (.csv, .xml, .sas etc). Consider the following example with dataframe ‘df’. Here it’s created from scratch by building a dictionary and creating the DataFrame.

However, creating from scratch is not required as we can get the data sets from files. To read the csv files, we shall need the method pd.read_csv().

The method pd.read_csv() is not just used for .csv files but, we can also read from url, .tsv files etc by changing the parameters accordingly. As the by default it is set to csv format. You can also read other formats such as excel, html, sas, stata, sql, table etc.

For CSV,

df = pd.read_csv(‘example.csv’) #to read 
df.to_csv(‘example’,index = False) #to write into a CSV file.

For Excel,

pd.read_excel(‘Excel_Sample.xlsx’, sheet_name=’Sheet1') #to read df.to_excel(‘Excel_Sample.xlsx’, sheet_name=’Sheet1') #to write into an Excel file.

For SQL

from sqlalchemy import create_engine
engine = create_engine('sqlite:///:memory:')
df.to_sql('data', engine)
sql_df = pd.read_sql('data',con=engine)
sql_df.head()

The key functions are:

Read SQL database table into a DataFrame.

  • read_sql_table(table_name, con[, schema, …])

Read SQL query into a DataFrame.

  • read_sql_query(sql, con[, index_col, …])

Read SQL query or database table into a DataFrame.

  • read_sql(sql, con[, index_col, …])

Write records stored in a DataFrame to a SQL database.

  • DataFrame.to_sql(name, con[, flavor, …])

Data Exploration

The fundamental step before you proceed with analysis is inspect your data. We do this to see how the raw data looks like and how many observations are present and what their features are.

Preview Data

head() fetches only the first few records of your DataFrame which will be very helpful when you have tons of observations but want to check the data roughly.

tail() similar to head() fetches the last few records of the data frame.

The following are a few methods commonly used to get basic information of the dataset.

Create Columns

Notice that the column names are not present. So we shall create new column names with the help of df.columns and assigning the list of labels for columns.

Descriptive Statistics

If you want to look at the general descriptive statistics of your data, usingdf.describe() will provide them including count, mean, standard deviation etc.

Rename Columns

We have two rows, ‘Age’ and ‘score’ in our DataFrame. score is in lower case. If we want to change the names of our DataFrame columns, rename() method helps with it. using inplace = True will save the changes made to DataFrame. columns parameter, takes old_column_name and new_column_name as input in dictionary format.

Selecting Subset of Data

Selection of data can be made by using [].loc.iloc .

First let us see how we can select the columns using only indexing operator. The main purpose of this is to select the columns by column names. To select a single column as a Series pass the column name directly to df['col_name'] .

To select multiple columns as a DataFrame pass the column names as list to it: df[['col_name1', 'col_name2']]

Select by Label: using .loc method. It accepts only the row and column labels as input. You can select a single value or multiple rows or columns depending on the requirement.

Select by position

Given a pair of label-based indices, sometimes it’s necessary to find the corresponding positions. The .iloc indexer, as the name says is integer location indexer. It behaves much like .loc but, uses only integer locations to make selections.

Handling Missing Data

Missing data, occur commonly in data and can seriously effect the conclusions we draw from the inconsistent data.

GroupBy(Aggregate)

GroupBy is often used for aggregating(min, max, sum, count etc) the result by one or more columns of dataframe using keyword .groupby .

Useful Links

  1. Pandas official documentation
  2. Collection of Pandas Exercises
  3. Pandas CheatSheet
  4. Pandas DataFrame Notes

What I have discussed till now is just the tip of an iceberg. I hope you have found useful information here. If you liked what you read please press the clap icon. Feel free to share your thoughts and opinions in the response section below. Thank you :)