Pandas Foundations for Data Science
Pandas is a Python data analysis library built on top of the NumPy package. It is mainly used for data munging and preparation, and it provides high-performance, easy-to-use data structures and data analysis tools. These data structures come with a lot of functionality, such as meaningful axis labels, missing-data handling, time series operations, and relational operations.
To use the pandas library, it must first be imported.
pd is the most common alias used for pandas.
import pandas as pd
The two main data structures in pandas are Series and DataFrame.
A Series is a 1-D labeled array capable of holding data of any type (integers, strings, floats, Python objects, etc.). Its index is a sequence of labels and is immutable (like dictionary keys). The syntax is:
pd.Series([2, 4, 6, 8], index=['a', 'b', 'c', 'd'])
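For instance (the values and labels here are illustrative), the elements of a Series can be accessed by their labels:

```python
import pandas as pd

# Build a Series with custom string labels as the index
s = pd.Series([2, 4, 6, 8], index=['a', 'b', 'c', 'd'])

print(s['c'])             # access a value by its label -> 6
print(s.index.tolist())   # the labels: ['a', 'b', 'c', 'd']
```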
A DataFrame can be thought of as a bunch of Series objects put together to share the same index: a 2-D structure with Series as its columns. It can be created using pd.DataFrame() with various inputs such as lists, dictionaries, ndarrays, or files (.csv, .xml, .sas, etc.). Consider the following example with a DataFrame 'df', created from scratch by building a dictionary and passing it to the DataFrame constructor.
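A minimal sketch of that dictionary-to-DataFrame step (the column names and values here are made up for illustration, chosen to match the 'Age' and 'score' columns discussed later):

```python
import pandas as pd

# A hypothetical dataset built as a dictionary of column -> values
data = {'Name': ['Alice', 'Bob', 'Carol'],
        'Age': [25, 32, 47],
        'score': [88, 92, 79]}

df = pd.DataFrame(data)
print(df.shape)  # (3, 3): three rows, three columns
```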
However, creating a DataFrame from scratch is usually unnecessary, since we can load data sets from files. To read CSV files we use pd.read_csv(). It is not limited to local .csv files: by changing its parameters we can also read from a URL, from .tsv files, and so on (by default it expects comma-separated values). Pandas can read other formats too, such as Excel, HTML, SAS, Stata, SQL, and fixed-width tables.
df = pd.read_csv('example.csv')                              # to read a CSV file
df.to_csv('example.csv', index=False)                        # to write into a CSV file
pd.read_excel('Excel_Sample.xlsx', sheet_name='Sheet1')      # to read an Excel sheet
df.to_excel('Excel_Sample.xlsx', sheet_name='Sheet1')        # to write into an Excel file
from sqlalchemy import create_engine
engine = create_engine('sqlite:///:memory:')
df.to_sql('data', con=engine)                # write df into a table named 'data' first
sql_df = pd.read_sql('data', con=engine)     # read that table back into a DataFrame
The key functions are:
- read_sql_table(table_name, con[, schema, …]): read a SQL database table into a DataFrame.
- read_sql_query(sql, con[, index_col, …]): read a SQL query into a DataFrame.
- read_sql(sql, con[, index_col, …]): read a SQL query or database table into a DataFrame.
- DataFrame.to_sql(name, con[, schema, …]): write records stored in a DataFrame to a SQL database.
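A minimal round-trip sketch of writing and reading SQL, using the standard-library sqlite3 module (instead of SQLAlchemy) so the example is self-contained; the table name and data are illustrative:

```python
import sqlite3
import pandas as pd

# An in-memory SQLite database, so no file is created
conn = sqlite3.connect(':memory:')

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
df.to_sql('data', con=conn, index=False)    # write the DataFrame to table 'data'

# With a DBAPI connection, read back using a SQL query
query_df = pd.read_sql_query('SELECT a FROM data WHERE a > 1', con=conn)
```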
A fundamental step before you proceed with analysis is to inspect your data: to see what the raw data looks like, how many observations are present, and what their features are.
head() fetches only the first few records of your DataFrame, which is very helpful when you have tons of observations but only want a rough look at the data.
tail(), similar to head(), fetches the last few records of the DataFrame.
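For example (with a small illustrative DataFrame), both methods return 5 rows by default and accept an n argument:

```python
import pandas as pd

# Illustrative DataFrame with ten rows
df = pd.DataFrame({'value': range(10)})

print(df.head())    # first 5 rows by default
print(df.tail(3))   # pass n to change how many rows are returned
```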
The following are a few methods commonly used to get basic information of the dataset.
Notice that the column names are not present. So we shall create new column names by assigning a list of labels to df.columns.
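A quick sketch of that assignment (the labels here are illustrative):

```python
import pandas as pd

# A DataFrame read without a header row gets default integer column names
df = pd.DataFrame([[25, 88], [32, 92]])

df.columns = ['Age', 'score']   # assign a list of labels as the new column names
print(df.columns.tolist())
```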
If you want to look at the general descriptive statistics of your data, df.describe() will provide them, including count, mean, standard deviation, etc.
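For instance (on an illustrative numeric DataFrame), the result is itself a DataFrame of statistics indexed by their names:

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, 32, 47], 'score': [88, 92, 79]})

stats = df.describe()   # count, mean, std, min, quartiles, max per numeric column
print(stats.loc['mean', 'Age'])   # the mean is looked up by its row label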
We have two columns, 'Age' and 'score', in our DataFrame, and 'score' is in lower case. If we want to change the names of our DataFrame columns, the rename() method helps with that. Its columns parameter takes a dictionary mapping old_column_name to new_column_name, and passing inplace=True saves the changes to the DataFrame instead of returning a new one.
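Sketched on the same illustrative columns:

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, 32], 'score': [88, 92]})

# columns takes an {old_column_name: new_column_name} dictionary;
# inplace=True modifies df itself rather than returning a new DataFrame
df.rename(columns={'score': 'Score'}, inplace=True)
print(df.columns.tolist())
```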
Selecting Subset of Data
Selection of data can be made using the indexing operator [], label-based selection with .loc, or position-based selection with .iloc.
First let us see how we can select columns using only the indexing operator. Its main purpose is to select columns by column name. To select a single column as a Series, pass the column name directly to the indexing operator; to select multiple columns as a DataFrame, pass the column names as a list.
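Both forms, sketched on an illustrative DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, 32], 'score': [88, 92]})

ages = df['Age']                # a single name -> a Series
subset = df[['Age', 'score']]   # a list of names -> a DataFrame
```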
Select by label: the .loc indexer accepts only row and column labels as input. You can select a single value, or multiple rows and columns, depending on the requirement.
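For example (the row labels here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, 32, 47], 'score': [88, 92, 79]},
                  index=['r1', 'r2', 'r3'])

print(df.loc['r2', 'Age'])            # single value by row and column label -> 32
print(df.loc[['r1', 'r3'], 'score'])  # multiple row labels, one column -> a Series
```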
Select by position
Sometimes it's the integer positions, rather than the labels, that you want to select by. The .iloc indexer, as the name says, is the integer-location indexer. It behaves much like .loc but uses only integer positions to make selections.
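The same illustrative DataFrame, selected by position instead of label:

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, 32, 47], 'score': [88, 92, 79]},
                  index=['r1', 'r2', 'r3'])

print(df.iloc[0, 1])    # first row, second column -> 88
print(df.iloc[1:, 0])   # second row onward, first column -> a Series
```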
Handling Missing Data
Missing data occur commonly in data sets and can seriously affect the conclusions we draw from inconsistent data.
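A minimal sketch of the usual first steps (the data and the fill value 0 are illustrative; the right treatment depends on your data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [25, np.nan, 47], 'score': [88, 92, np.nan]})

print(df.isnull().sum())   # count missing values per column
dropped = df.dropna()      # drop every row containing any NaN
filled = df.fillna(0)      # or replace NaNs with a chosen value
```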
GroupBy is often used for aggregating (min, max, sum, count, etc.) results by one or more columns of the DataFrame, using the groupby() method.
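A short sketch of grouping by one column and aggregating another (column names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'team': ['A', 'A', 'B'], 'score': [88, 92, 79]})

# Mean score per team: group rows by 'team', then aggregate 'score'
means = df.groupby('team')['score'].mean()
print(means['A'])   # (88 + 92) / 2 = 90.0
```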
What I have discussed till now is just the tip of the iceberg. I hope you have found useful information here. If you liked what you read, please press the clap icon. Feel free to share your thoughts and opinions in the response section below. Thank you :)