Dataframe Validation with Pandera

Seckin Dinc
8 min read · Mar 27, 2023


A lightweight dataframe validation library for data analysts, data scientists, data engineers, and machine learning engineers.


Managing data projects and products is getting more complex every day. It is not because of building state-of-the-art algorithms, real-time data pipelines, or a lack of experimentation frameworks. It is due to the increasingly low quality of data in our organizations.

Today, software engineering teams use “agile development” as an excuse for not writing tests; product managers who don’t invest enough time and focus in product discovery end up with unstable features that require continuous change right after deployment; and dumping more and more data into modern data lakes is treated as best practice.

If ensuring data quality is not a company-wide strategy, data teams need to come up with alternative solutions for their projects. To validate the functions and classes in their code, they can implement unit testing in their processes. But for the quality of the upstream data they receive and the downstream data they generate, they need data validations in place.

What is Data Validation?

Data validation is the process of ensuring that data is accurate, complete, and consistent under predefined rules and criteria. It is an important part of data management and is crucial for ensuring data accuracy and reliability.

Data validation can include various methods such as the following (a plain-pandas sketch of each appears after the list):

  • Data type validation: Ensuring that the data is of the correct data type; e.g. numbers, dates, or text.
  • Range validation: Checking whether the data falls within a predetermined range of values.
  • Format validation: Checking whether the data is in the correct format, such as a phone number or email address.
  • Completeness validation: Ensuring that all required data fields are filled in and that there are no missing values.
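
As a minimal illustration, assuming a toy dataframe with made-up columns, these four checks might look like this in plain pandas:

import pandas as pd

# toy data; the column names here are made up for illustration
df = pd.DataFrame({"year": [1999, 2005], "email": ["a@b.com", "c@d.org"]})

# data type validation: "year" should be an integer column
assert pd.api.types.is_integer_dtype(df["year"])

# range validation: values should fall within a predetermined range
assert df["year"].between(1900, 2023).all()

# format validation: values should match an expected pattern
assert df["email"].str.match(r"[^@]+@[^@]+\.[^@]+").all()

# completeness validation: no missing values anywhere
assert df.notna().all().all()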

Pandera

Screenshot from https://pandera.readthedocs.io/en/latest/index.html#

Pandera is a lightweight data validation Python library. It is mainly used to validate dataframe objects against certain rules and criteria. As Niels Bantilan mentioned during a video conference, the library was initially built to support Pandas dataframes, but in recent years it has added support for Dask, Modin, and Spark.

Pandera is not a built-in Python package. You need to install it from your terminal with the pip install pandera command.

How should we position Pandera in the modern data stack?

Pandera is a lightweight library. It doesn’t offer the more complex components you can get from Great Expectations or commercial products. On the other hand, compared to those, Pandera is quite easy to use: less than an hour after installation, you can define your data quality checks and share your findings with your team.

In this regard, we can consider Pandera as a solution in various scenarios:

  • MVP data products: From dashboard building to machine learning models, data products need to be fast and accurate at their MVP stage. When we are dealing with tons of SQL functions to capture the data we need, Pandera can be the foundation of the required data quality checks.
  • When setting up data quality foundations: Creating a data quality culture takes time and hard effort. Sometimes it can be hard to roll out complex solutions just to show basic problems in the pipelines to cross-functional team members outside the data organization. At this stage, Pandera is very handy compared to SQL queries, which may require thousands of lines of code just to run 5–10 checks.
  • Start-up companies: When your company is a start-up and there is only a single product you need to pay attention to, you should expect tons of data quality issues due to a lack of unit tests, continuous product pivots, missing documentation, etc. In these kinds of scenarios, you can use Pandera to pick off the low-hanging fruit in your pipelines.

End-to-end Dataframe Validation with Pandera

To demonstrate the capabilities of Pandera, I will use the movies.csv data set and load it into a Pandas dataframe. A quick look into the data set:
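
A minimal way to take that look, assuming the four columns used throughout this post (title, rating, year, runtime):

import pandas as pd

# load the data set and peek at the first rows and the inferred dtypes
df = pd.read_csv("movies.csv")
print(df.head())
print(df.dtypes)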

Dataframe schema validation

Dataframe schemas let users define a specification to verify the columns and indexes of a dataframe. Below I am defining my schema with expectations for the columns, such as “title” being a string and “runtime” being an integer.

import pandas as pd
import pandera as pa

from pandera import Column, DataFrameSchema

# loading csv file into a Pandas dataframe
df = pd.read_csv("movies.csv")

# defining the expected data frame schema specifications
schema = pa.DataFrameSchema(
    {
        "title": Column(str),
        "rating": Column(str),
        "year": Column(int),
        "runtime": Column(int),
    }
)

# validating the data frame with the expected schema
schema.validate(df)
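
When validation passes, schema.validate returns the validated dataframe, so the call can be dropped straight into a pipeline; when it fails, it raises a SchemaError. A minimal sketch of handling that:

# validate returns the dataframe on success and raises on failure
try:
    validated_df = schema.validate(df)
except pa.errors.SchemaError as err:
    # the exception message points at the failing column and check
    print(err)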

Dataframe column validations

After validating our dataframe against the expected schema, we can extend our schema definition with column-wise checks and controls. Pandera allows users to define custom rules and to apply predefined, built-in checks at the same time. In order to apply checks, we need to import the “Check” class.

Below I am extending our schema with further checks:

  • Empty string
  • Null value
  • Categories are in a defined list
  • Integer range
  • Composite uniqueness (title + year)

from pandera import Check

# valid rating categories, taken from the data itself
rating_list = list(df.rating.unique())

schema = pa.DataFrameSchema(
    {
        "title": Column(
            str,
            pa.Check(lambda x: x.strip() != "", element_wise=True),  # empty string check
        ),
        "rating": Column(
            str,
            [
                pa.Check(lambda x: x.strip() != "", element_wise=True),  # empty string check
                pa.Check.isin(rating_list),  # value must be in the defined list
            ],
        ),
        "year": Column(
            int,
            [
                Check(lambda x: x > 0),  # positive value check (also fails on nulls)
                Check.in_range(1900, 2023),  # specifying the movie dates range
            ],
        ),
        "runtime": Column(
            int,
            Check(lambda x: x > 0),  # positive value check (also fails on nulls)
        ),
    },
    unique=["title", "year"],  # title and year combined create a uniqueness control
)

schema.validate(df)
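
Note that some of these lambda checks can also be expressed with Pandera’s built-in column options; a minimal alternative sketch for two of the columns, assuming the same dataframe:

# nullable=False (the default) rejects missing values outright, and
# Check.str_length(min_value=1) rejects empty strings
schema_alt = pa.DataFrameSchema(
    {
        "title": Column(str, Check.str_length(min_value=1), nullable=False),
        "year": Column(int, Check.in_range(1900, 2023), nullable=False),
    },
    unique=["title", "year"],
)

schema_alt.validate(df)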

Reporting issues

Pandera throws an exception whenever it finds an issue in the dataframe according to the defined checks. In theory this is the expected behavior, but in real-life use cases we have hundreds of different checks applied over millions of rows. We need Pandera to report all the issues at once so that we can act on them. To do that, we apply “lazy validation”. Below I inject various wrong data points into the dataframe to trigger the errors we expect:

import numpy as np

df = pd.read_csv("movies.csv")

# wrong data injection
df.loc[len(df)] = ["Almost Famous", "R", 2000, 122]
df.loc[len(df)] = ["", "", 2000, 122]
df.loc[len(df)] = ["Almost Famous", "R", np.nan, np.nan]

rating_list = list(df.rating.unique())

schema = pa.DataFrameSchema(
    {
        "title": Column(
            str,
            pa.Check(
                lambda x: x.strip() != "", element_wise=True, error="title can't be empty"
            ),  # empty string check
        ),
        "rating": Column(
            str,
            [
                pa.Check(
                    lambda x: x.strip() != "", element_wise=True, error="rating can't be empty"
                ),  # empty string check
                pa.Check.isin(rating_list, error="rating should be in the given list"),  # value must be in the defined list
            ],
        ),
        "year": Column(
            int,
            [
                Check(lambda x: x > 0, error="year can't be empty"),  # positive value check
                Check.in_range(1900, 2023, error="years should be in the given range"),  # movie dates range
            ],
        ),
        "runtime": Column(
            int,
            Check(lambda x: x > 0, error="runtime can't be empty"),  # positive value check
        ),
    },
    unique=["title", "year"],  # composite uniqueness on title + year
)

try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    print("Schema errors and failure cases:")
    print(err)
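
The SchemaErrors exception also exposes the failures programmatically through its failure_cases attribute, which is itself a dataframe; a minimal sketch:

try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    # one row per failed check: the offending column, the check name,
    # and the failing value, ready to be filtered or exported
    print(err.failure_cases)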

Evaluating function input and output arguments

The real advantage of Pandera is its decorator capabilities, which let us check entire upstream and downstream ETL pipelines against predefined schema controls.

We can import the check_input, check_output, or check_io decorators to check function arguments or the variables returned from existing functions. These operations are really handy if we want to integrate Pandera with our data preparation functions.
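
For example, check_input on its own validates the first dataframe argument of a function before the body runs; a minimal sketch reusing the schema defined above:

from pandera import check_input

# check_input validates the first dataframe argument against the given
# schema before the function body runs; drop_short_movies is a
# hypothetical helper for illustration
@check_input(schema)
def drop_short_movies(df):
    return df[df["runtime"] >= 60]

drop_short_movies(df).head()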

In the example below, I will create a function that takes a Pandas dataframe, applies a basic ETL transformation to produce a new Pandas dataframe, and returns it. To test the input and output dataframes, I will integrate Pandera decorators.

import pandas as pd
import pandera as pa
import numpy as np

from pandera import Column, DataFrameSchema, Check, check_io

df = pd.read_csv("movies.csv")

rating_list = list(df.rating.unique())

input_schema = pa.DataFrameSchema(
    {
        "title": Column(
            str,
            pa.Check(
                lambda x: x.strip() != "",
                element_wise=True,
                error="title can't be empty",
            ),  # empty string check
        ),
        "rating": Column(
            str,
            [
                pa.Check(
                    lambda x: x.strip() != "",
                    element_wise=True,
                    error="rating can't be empty",
                ),  # empty string check
                pa.Check.isin(rating_list, error="rating should be in the given list"),  # value must be in the defined list
            ],
        ),
        "year": Column(
            int,
            [
                Check(lambda x: x > 0, error="year can't be empty"),  # positive value check
                Check.in_range(1900, 2023, error="years should be in the given range"),  # movie dates range
            ],
        ),
        "runtime": Column(
            int,
            Check(lambda x: x > 0, error="runtime can't be empty"),  # positive value check
        ),
    },
    unique=["title", "year"],
)

output_schema = pa.DataFrameSchema(
    {
        "title": Column(
            str,
            pa.Check(
                lambda x: x.strip() != "",
                element_wise=True,
                error="title can't be empty",
            ),  # empty string check
        ),
        "year": Column(
            int,
            [
                Check(lambda x: x > 0, error="year can't be empty"),  # positive value check
                Check.in_range(1900, 2023, error="years should be in the given range"),  # movie dates range
            ],
        ),
        "runtime": Column(
            int,
            Check(lambda x: x > 0, error="runtime can't be empty"),  # positive value check
        ),
    },
    unique=["title", "year"],
)


@check_io(df=input_schema, out=output_schema, lazy=True)
def only_R_rating(df):
    return df[df["rating"] == "R"][["title", "year", "runtime"]]


only_R_rating(df).head()

The function generates a new dataframe that ran through all the validation steps successfully.
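
If only the return value needs validation, check_output works symmetrically; a minimal sketch with a hypothetical variant of the function:

from pandera import check_output

# check_output validates only the returned dataframe;
# only_R_rating_out is a hypothetical variant for illustration
@check_output(output_schema)
def only_R_rating_out(df):
    return df[df["rating"] == "R"][["title", "year", "runtime"]]

only_R_rating_out(df).head()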


Conclusion

Data validation is a critical step in the data quality life cycle. In complex organizations, data producers may not act as data owners and fulfill their duties as expected. This can result in low-quality data that impacts various data products such as ETL pipelines, dashboards, and machine learning models. To prevent our data products from being impacted by low data quality, we need to validate the data before we use it.

Pandera is a great solution for catching known-unknown data quality issues with user-friendly, easy-to-use, and easy-to-implement capabilities. Before diving into more heavyweight libraries or commercial products, I recommend taking a look at Pandera to pick off the low-hanging fruit in your pipelines.

Thanks a lot for reading 🙏

If you liked the article, check out my other articles.

If you want to get in touch, you can find me on Linkedin and Mentoring Club!
