How to Start Your Notebook

Getting Started with Data Quality in Pandas

Essential Steps to Prepare Your Data for Analysis

CyCoderX
The Pythoneers



When you first receive a dataset, the journey begins with understanding and verifying the quality of the data before diving into deeper analysis. This initial exploration can save time and prevent issues later in your analysis.

As you probably know, Pandas is a pivotal tool in the world of data science, widely recognized for its robust capabilities in data manipulation and analysis.

In this article, we’ll explore how to leverage Pandas effectively at the outset of exploratory data analysis (EDA): how to assess your dataset’s integrity and quality, and what corrective actions you can take if you encounter data discrepancies or issues.

Step 1: Loading Your Data

When beginning a data analysis project, the first step is to load your data, and pandas provides a variety of functions to make this task straightforward. Here are a few of the most common methods:

Reading from Excel Files:

  • Utilize pd.read_excel() with the option parse_dates to automatically interpret columns as dates.
import pandas as pd

# Replace 'filename.xlsx' with the path to your data file
df_excel = pd.read_excel('filename.xlsx', parse_dates=['date_column'])

Reading from HTML:

  • Employ pd.read_html() to extract tables from an HTML document. Depending on how the tables are formatted, date columns may still need to be converted manually with pd.to_datetime() after extraction.
list_of_dfs = pd.read_html('http://example.com', flavor='bs4')
df_html = list_of_dfs[0] # if the desired table is the first one
df_html['date_column'] = pd.to_datetime(df_html['date_column'])

Reading from CSV Files:

  • Use pd.read_csv() with parse_dates to import CSV files and parse columns as dates.
df_csv = pd.read_csv('filename.csv', parse_dates=['date_column'])

Reading from JSON Files:

  • Utilize pd.read_json() and convert dates manually post-import, as read_json does not accept a parse_dates argument (it offers a convert_dates option instead).
df_json = pd.read_json('filename.json')
df_json['date_column'] = pd.to_datetime(df_json['date_column'])

Reading from SQL Database:

  • Use pd.read_sql(), pd.read_sql_table(), or pd.read_sql_query() to read data from a SQL database. Date parsing will depend on the SQL driver and database schema.
from sqlalchemy import create_engine
engine = create_engine('sqlite:///database.db')
df_sql = pd.read_sql('SELECT * FROM table_name', con=engine)
df_sql['date_column'] = pd.to_datetime(df_sql['date_column'])
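
Depending on your setup, pd.read_sql() also accepts a parse_dates argument, so date columns can be parsed at load time; the table and column names below are placeholders:

# Parse 'date_column' while reading from the database
df_sql = pd.read_sql('SELECT * FROM table_name', con=engine, parse_dates=['date_column'])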

Reading from Parquet Files:

  • Use pd.read_parquet(), which preserves date types if they were originally saved correctly.
df_parquet = pd.read_parquet('filename.parquet')

Step 2: Examining the Data Structure

Once your data is loaded into a DataFrame (Pandas’ primary data structure), start by getting familiar with its structure using the head() and info() methods.

head() - Displays the first few rows of the DataFrame. This gives you a quick snapshot of the data and any obvious issues like strange values or formats.

df.head()

info() - Provides a concise summary of the DataFrame, including the number of entries, the number of non-null entries and the data type of each column. This helps in identifying columns with missing values and potential type mismatches.

df.info()

To view the distinct values of a column in a pandas DataFrame, you can use the .unique() method. This method returns an array of all unique values in the specified column. Here’s how you can do it:

# Viewing unique values in the column
unique_values = df['Column_Name'].unique()
print(unique_values)

If you’re working with a larger DataFrame and want to see how often each unique value occurs, you can use the .value_counts() method:

value_counts = df['Column_Name'].value_counts()
print(value_counts)
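
If you also want missing values included in the counts, or proportions instead of raw counts, value_counts() accepts a couple of useful options (the column name is a placeholder):

# Include NaN in the counts
print(df['Column_Name'].value_counts(dropna=False))

# Show relative frequencies instead of raw counts
print(df['Column_Name'].value_counts(normalize=True))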

Step 3: Checking for Missing Values

Missing data can skew analysis and lead to inaccurate conclusions. You can quickly check for missing values using isnull() combined with sum().

print(df.isnull().sum())

This command provides a column-wise count of missing values. Based on the output, you have a few options:

  • Fill missing values using fillna(), which could involve strategies like using the mean, median, mode, or a specific value (a short example follows this list).
  • Remove rows or columns with missing values using dropna(), if the missing data is extensive or not crucial for your analysis.
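
As a rough sketch, those fillna() strategies might look like this, assuming hypothetical 'Price' and 'Category' columns:

# Fill numeric gaps with the column mean (the median works the same way)
df['Price'] = df['Price'].fillna(df['Price'].mean())

# Fill categorical gaps with the most frequent value (mode)
df['Category'] = df['Category'].fillna(df['Category'].mode()[0])

# Or fill with a fixed placeholder value
df['Category'] = df['Category'].fillna('Unknown')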

To view only the columns that contain null values (and exclude those with none):

columns_with_null = df.columns[df.isnull().any()]

df[columns_with_null].isnull().sum()
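
It can also help to look at missingness as a share of each column rather than a raw count; a quick sketch of that (not part of the snippet above) is:

# Percentage of missing values per column, highest first
missing_pct = df.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct[missing_pct > 0])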

Step 4: Descriptive Statistics

To understand the range, central tendency and dispersion of your data, use the describe() method. It provides count, mean, standard deviation, minimum, maximum and quantile values for numerical data.

print(df.describe())

This statistical summary is crucial for identifying outliers, errors, or inconsistencies in your dataset.
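
By default, describe() summarizes only numeric columns. If your DataFrame also has text or categorical columns, you can ask for them explicitly:

# Summary of text columns: count, unique values, top value and its frequency
print(df.describe(include='object'))

# Or summarize every column at once
print(df.describe(include='all'))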

Step 5: Data Type Verification

Ensure each column’s data type aligns with what you expect based on its content. For instance, dates should be stored as datetime values, and identifiers like phone numbers are often better kept as strings than as numbers.

print(df.dtypes)

If corrections are needed, use astype() or pd.to_datetime() to convert data types:

df['Date'] = pd.to_datetime(df['Date'])
df['Category'] = df['Category'].astype('category')
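
As a sketch of the phone-number point above (column names are hypothetical): identifiers can be cast to strings, and columns that should be numeric but arrive as text can be coerced with pd.to_numeric():

# Keep identifiers as text so leading zeros are not lost
df['Phone'] = df['Phone'].astype(str)

# Coerce a numeric-looking text column; unparseable entries become NaN
df['Amount'] = pd.to_numeric(df['Amount'], errors='coerce')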

Step 6: Duplicate Data

Duplicates can skew results by giving undue weight to repeated entries. Check for and remove duplicates:

print(df.duplicated().sum())
df.drop_duplicates(inplace=True)
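
If only certain columns define what counts as a duplicate (an ID column, for example), both methods accept a subset argument; the column name below is a placeholder:

# Count and drop duplicates based on a key column only, keeping the first occurrence
print(df.duplicated(subset=['ID_Column']).sum())
df.drop_duplicates(subset=['ID_Column'], keep='first', inplace=True)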

Step 7: Dropping Null Values

Handling missing data is crucial in data preparation, and sometimes the best approach is to remove incomplete entries entirely. Pandas provides several options for this; each call below is an independent option rather than a sequence to run in order:

# Drop rows with any null values
df.dropna(inplace=True)

# Drop rows where all columns are null
df.dropna(how='all', inplace=True)

# Drop rows where null values appear in specific columns only
df.dropna(subset=['Column1', 'Column2'], inplace=True)

# Drop columns with any null values
df.dropna(axis=1, inplace=True)

# Drop columns where all entries are null
df.dropna(axis=1, how='all', inplace=True)

Explanation:

The dropna() method offers several options:

  • how='any' (default) removes rows/columns if any null values are present.
  • how='all' removes rows/columns if all values are null.
  • subset specifies in which columns to look for null values, helpful for dropping rows based on nulls in specific columns crucial to your analysis.
  • axis=0 (default) operates on rows, while axis=1 operates on columns.

Choosing the Right Option: Decide which option to use based on the significance of missing data in your context:

  • If the dataset is large and the missing data is minimal, removing affected rows might be simplest.
  • If crucial columns must have complete data for accurate analysis, consider removing rows based on specific columns.
  • If a column has many missing values and it’s not critical, consider removing that column entirely (a short sketch follows this list).
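
A minimal sketch of that last option, assuming you are willing to drop any column missing more than half of its values (the 50% threshold is purely illustrative):

# Share of missing values per column
missing_share = df.isnull().mean()

# Drop columns missing more than half of their values
cols_to_drop = missing_share[missing_share > 0.5].index
df = df.drop(columns=cols_to_drop)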

Conclusion:

In conclusion, starting your exploratory data analysis with Pandas equips you with a robust toolkit for understanding and preparing your dataset for deeper investigation. By following the steps outlined — from loading and examining the data to handling missing values and verifying data types — you set a solid foundation for your analysis. Remember, the quality of your insights is directly linked to the quality of your data preparation. Whether you’re a seasoned data scientist or a newcomer to the field, mastering these initial steps in Pandas can dramatically streamline your data analysis process and enhance the accuracy of your results. Harness the full potential of your data by making Pandas an integral part of your EDA workflow.

Final Words:

Thank you for taking the time to read my article.

Hey there! I’m Charilaos Alkiviades Savoullis, a data engineer who loves crafting end-to-end solutions. I write articles about Python, Databases, SQL, AI, Data Engineering, lifestyle and more!

Join me as we explore the exciting world of tech, data and beyond!

For similar articles and updates, feel free to explore my Medium profile https://medium.com/@casavoullis

If you enjoyed this article, consider liking and following for future updates.

Don’t hesitate to connect with me on my socials:

LinkedIn: https://bit.ly/3UK8KNk
GitHub: https://bit.ly/3WrMzgm

Interested in more Python content and tips? Click here to check out my list on Medium.

Let me know in the comments below … or above, depending on your device 🙃
