How to Start Your Notebook
Getting Started with Data Quality in Pandas
Essential Steps to Prepare Your Data for Analysis
When you first receive a dataset, the journey begins with understanding and verifying the quality of the data before diving into deeper analysis. This initial exploration can save time and prevent issues later in your analysis.
As you probably know, Pandas is a pivotal tool in the world of data science, widely recognized for its robust capabilities in data manipulation and analysis.
In this article, we will delve into how we can leverage Pandas effectively at the outset of our exploratory data analysis (EDA). We’ll explore how to initially assess your dataset’s integrity and quality using Pandas and discuss the corrective actions you can take if you encounter any data discrepancies or issues.
Step 1: Loading Your Data
When beginning a data analysis project, the first step is to import your data, and pandas provides a variety of functions to make this task straightforward. Here are a few of the most common methods:
Reading from Excel Files:
- Use pd.read_excel() with the parse_dates option to automatically interpret columns as dates.
import pandas as pd
# Replace 'filename.xlsx' with the path to your data file
df_excel = pd.read_excel('filename.xlsx', parse_dates=['date_column'])
Reading from HTML:
- Use pd.read_html() to extract tables from an HTML document. Dates may need to be converted manually, as read_html does not directly support parse_dates.
list_of_dfs = pd.read_html('http://example.com', flavor='bs4')
df_html = list_of_dfs[0] # if the desired table is the first one
df_html['date_column'] = pd.to_datetime(df_html['date_column'])
Reading from CSV Files:
- Use pd.read_csv() with parse_dates to import CSV files and parse columns as dates.
df_csv = pd.read_csv('filename.csv', parse_dates=['date_column'])
Reading from JSON Files:
- Use pd.read_json() and convert dates manually after import, as read_json lacks direct support for parse_dates.
df_json = pd.read_json('filename.json')
df_json['date_column'] = pd.to_datetime(df_json['date_column'])
Reading from SQL Database:
- Use pd.read_sql(), pd.read_sql_table(), or pd.read_sql_query() to read data from a SQL database. Date parsing depends on the SQL driver and the database schema.
from sqlalchemy import create_engine
engine = create_engine('sqlite:///database.db')
df_sql = pd.read_sql('SELECT * FROM table_name', con=engine)
df_sql['date_column'] = pd.to_datetime(df_sql['date_column'])
Reading from Parquet Files:
- Use pd.read_parquet(), which preserves date types if they were saved correctly in the first place.
df_parquet = pd.read_parquet('filename.parquet')
Step 2: Examining the Data Structure
Once your data is loaded into a DataFrame (Pandas’ core data structure), you should first get familiar with its structure using the head() and info() methods.
- head() displays the first few rows of the DataFrame. This gives you a quick snapshot of the data and any obvious issues like strange values or formats.
df.head()
- info() provides a concise summary of the DataFrame, including the number of entries, the count of non-null values, and the data type of each column. This helps identify columns with missing values and potential type mismatches.
df.info()
To view the distinct values of a column in a pandas DataFrame, use the .unique() method, which returns an array of all unique values in the specified column:
# Viewing unique values in the column
unique_values = df['Column_Name'].unique()
print(unique_values)
If you’re working with a larger DataFrame and want to see how often each unique value occurs, use the .value_counts() method:
value_counts = df['Column_Name'].value_counts()
print(value_counts)
Step 3: Checking for Missing Values
Missing data can skew analysis and lead to inaccurate conclusions. You can quickly check for missing values by chaining isnull() with sum().
print(df.isnull().sum())
This command provides a column-wise count of missing values. Based on the output, you have a few options:
- Fill missing values using fillna(), with strategies such as the mean, median, mode, or a specific value.
- Remove rows or columns with missing values using dropna(), if the missing data is extensive or not crucial for your analysis.
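As a quick illustration of these fill strategies, here is a minimal sketch on a small hypothetical DataFrame (the column names and values are invented for the example):

```python
import pandas as pd
import numpy as np

# Hypothetical DataFrame with missing values
df = pd.DataFrame({
    'age': [25, np.nan, 32, 41],
    'city': ['Athens', 'Nicosia', None, 'Athens'],
})

# Fill a numeric column with its median
df['age'] = df['age'].fillna(df['age'].median())

# Fill a categorical column with its mode (most frequent value)
df['city'] = df['city'].fillna(df['city'].mode()[0])

print(df)
```

The right fill value is a judgment call: the median resists outliers better than the mean, while the mode is the usual fallback for categorical columns.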
To view only the columns in a pandas DataFrame that contain null values and exclude columns that have zero null values:
columns_with_null = df.columns[df.isnull().any()]
df[columns_with_null].isnull().sum()
Step 4: Descriptive Statistics
To understand the range, central tendency, and dispersion of your data, use the describe() method. It reports the count, mean, standard deviation, minimum, maximum, and quartile values for numerical columns.
print(df.describe())
This statistical summary is crucial for identifying outliers, errors, or inconsistencies in your dataset.
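One common way to act on this summary is the interquartile range (IQR) rule of thumb for flagging outliers. The sketch below uses a standalone numeric Series with invented values; the 1.5 × IQR fences are a convention, not a pandas feature:

```python
import pandas as pd

# Hypothetical numeric column with one extreme value
s = pd.Series([10, 12, 11, 13, 12, 95])

# IQR fences: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
print(outliers)
```

Whether a flagged value is an error or a genuine extreme still requires domain judgment; the fences only tell you where to look.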
Step 5: Data Type Verification
Ensure each column’s data type aligns with what you expect based on the content. For instance, dates should be in datetime format, and identifiers like phone numbers are often better stored as strings than as numerical values.
print(df.dtypes)
If corrections are needed, use astype() or a conversion function such as pd.to_datetime() to change data types:
df['Date'] = pd.to_datetime(df['Date'])
df['Category'] = df['Category'].astype('category')
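If a column contains malformed entries, a direct conversion can raise an error. One option is pd.to_datetime() with errors='coerce', which turns unparseable values into NaT so you can inspect them afterwards; the sample values below are invented:

```python
import pandas as pd

# Hypothetical column with one malformed date entry
raw = pd.Series(['2024-01-05', '2024-02-10', 'not a date'])

# errors='coerce' replaces unparseable values with NaT instead of raising
dates = pd.to_datetime(raw, errors='coerce')

# Count how many values failed to parse, then review them
print(dates.isna().sum())  # 1 value could not be parsed
```

Coercion is convenient, but silently dropping bad values can hide data-quality problems, so count and review the NaT entries before moving on.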
Step 6: Duplicate Data
Duplicates can skew results by giving undue weight to repeated entries. Check for and remove duplicates:
print(df.duplicated().sum())
df.drop_duplicates(inplace=True)
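For finer control, both duplicated() and drop_duplicates() accept subset and keep parameters. A small sketch with hypothetical columns:

```python
import pandas as pd

# Hypothetical DataFrame where one customer appears twice
df = pd.DataFrame({
    'customer_id': [1, 2, 2, 3],
    'amount': [100, 250, 250, 80],
})

# keep=False marks every copy of a duplicate, useful for inspection
dupes = df[df.duplicated(subset=['customer_id'], keep=False)]
print(dupes)

# Keep only the first occurrence of each customer_id
deduped = df.drop_duplicates(subset=['customer_id'], keep='first')
print(len(deduped))  # 3 rows remain
```

Inspecting duplicates with keep=False before dropping them helps confirm the repeats really are redundant rather than legitimate repeated events.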
Step 7: Dropping Null Values
Handling missing data is crucial in data preparation, and sometimes the best approach is to remove these entries entirely. Pandas provides several options for this:
# Drop rows with any null values
df.dropna(inplace=True)
# Drop rows where all columns are null
df.dropna(how='all', inplace=True)
# Drop rows where null values appear in specific columns only
df.dropna(subset=['Column1', 'Column2'], inplace=True)
# Drop columns with any null values
df.dropna(axis=1, inplace=True)
# Drop columns where all entries are null
df.dropna(axis=1, how='all', inplace=True)
Explanation:
The dropna() method offers flexibility through its parameters:
- how='any' (default) removes a row or column if any of its values are null.
- how='all' removes a row or column only if all of its values are null.
- subset specifies which columns to check for null values, helpful for dropping rows based on nulls in columns crucial to your analysis.
- axis=0 (default) operates on rows, while axis=1 operates on columns.
Choosing the Right Option: Decide which option to use based on the significance of missing data in your context:
- If the dataset is large and the missing data is minimal, removing affected rows might be simplest.
- If crucial columns must have complete data for accurate analysis, consider removing rows based on specific columns.
- If a column has many missing values and it’s not critical, consider removing that column entirely.
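For the middle ground between how='any' and how='all', dropna() also accepts a thresh parameter, which keeps rows that have at least a given number of non-null values. A minimal sketch with invented data:

```python
import pandas as pd
import numpy as np

# Hypothetical DataFrame with varying amounts of missing data per row
df = pd.DataFrame({
    'a': [1, np.nan, np.nan],
    'b': [4, 5, np.nan],
    'c': [np.nan, 6, np.nan],
})

# thresh=2 keeps rows with at least 2 non-null values;
# the fully empty last row is dropped
kept = df.dropna(thresh=2)
print(len(kept))  # 2 rows remain
```

This is a pragmatic compromise: rows with a little missing data survive for imputation later, while mostly empty rows are discarded.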
Conclusion:
In conclusion, starting your exploratory data analysis with Pandas equips you with a robust toolkit for understanding and preparing your dataset for deeper investigation. By following the steps outlined — from loading and examining the data to handling missing values and verifying data types — you set a solid foundation for your analysis. Remember, the quality of your insights is directly linked to the quality of your data preparation. Whether you’re a seasoned data scientist or a newcomer to the field, mastering these initial steps in Pandas can dramatically streamline your data analysis process and enhance the accuracy of your results. Harness the full potential of your data by making Pandas an integral part of your EDA workflow.
Final Words:
Thank you for taking the time to read my article.
Hey There! I’m Charilaos Alkiviades Savoullis, a data engineer who loves crafting end-to-end solutions. I write articles about Python, databases, SQL, AI, data engineering, lifestyle and more!
Join me as we explore the exciting world of tech, data and beyond!
For similar articles and updates, feel free to explore my Medium profile https://medium.com/@casavoullis
If you enjoyed this article, consider liking and following for future updates.
Don’t hesitate to connect with me on my socials:
LinkedIn: https://bit.ly/3UK8KNk
GitHub: https://bit.ly/3WrMzgm
Interested in more Python content and tips? Click here to check out my list on Medium.
Let me know in the comments below … or above, depending on your device 🙃