Python Great Expectations — Does it take great responsibility?

Shitij Goyal
Polar Tropics
3 min read · Jan 24, 2022


Recently, I got the opportunity to use the data validation library Great Expectations. Like its literary counterpart, it has the potential to become a classic in the world of data.

From their docs:

Great Expectations helps data teams eliminate pipeline debt, through data testing, documentation, and profiling.

Developers write expectations (e.g. column_values_are_less_than_9), which the library then runs against various datasources using in-memory processing (Pandas or Spark) or in-database processing (RDBMS, BigQuery, etc.). It’s an excellent library for solving the age-old problem of ensuring data quality.
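To make that concrete, here is a minimal sketch using the classic Pandas-backed API (the DataFrame and its values are made up, and method names follow the v2-style API, so they may differ in newer GE versions):

```python
import great_expectations as ge
import pandas as pd

# wrap a plain DataFrame so expectation methods become available on it
df = ge.from_pandas(pd.DataFrame({"price": [3, 5, 8, 12]}))

# an expectation is just a method call; it returns a structured result
result = df.expect_column_values_to_be_between("price", min_value=0, max_value=9)
print(result.success)  # False: 12 is out of range
```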

The Good

1. Has a large number of pre-built tests/expectations

The library offers a number of expectations out of the box, which are more than sufficient to begin the initial analysis of your data: e.g. unique values, matching column values against a regex, checking for non-null values in a column, row counts in a table, etc.
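Each of those maps to a built-in expectation method; a quick illustration (the column names are invented for the example):

```python
import great_expectations as ge
import pandas as pd

users = ge.from_pandas(pd.DataFrame({
    "id": [1, 2, 3],
    "email": ["a@x.com", "b@x.com", None],
}))

users.expect_column_values_to_be_unique("id")                 # unique values
users.expect_column_values_to_match_regex("email", r".+@.+")  # regex match
users.expect_column_values_to_not_be_null("email")            # non-null check
users.expect_table_row_count_to_be_between(min_value=1)       # row count
```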

2. Data docs — Running documentation on the profile of your data

A great feature of the library! Once your expectations have run and results have been obtained, the library generates static HTML pages which you can host anywhere as a static website. Data democratisation, anyone?
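In a project initialised with great_expectations init, building the docs is roughly two calls (a sketch of the classic DataContext API; exact calls vary across versions):

```python
import great_expectations as ge

# load the project's data context (expects a great_expectations/ directory)
context = ge.data_context.DataContext()

# render static HTML data docs from stored validation results
context.build_data_docs()
context.open_data_docs()  # opens the generated site in your browser
```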

3. Comparison between two different datasources

Personally, my favourite. We are all big fans of the maker-checker model, but how do we compare data across databases? Writing custom code is buggy, with too many edge cases. Great Expectations handles this by profiling one datasource, auto-generating expectations, and then applying them to the second datasource. Any differences are highlighted.
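The flow looks roughly like this (a sketch: the CSV paths are placeholders, and BasicDatasetProfiler is one of the profilers shipped with the classic API):

```python
import great_expectations as ge
import pandas as pd
from great_expectations.profile.basic_dataset_profiler import BasicDatasetProfiler

# the two datasources to compare (placeholder files)
source = ge.from_pandas(pd.read_csv("source.csv"))
target = ge.from_pandas(pd.read_csv("target.csv"))

# profile the first datasource to auto-generate an expectation suite
suite, _ = BasicDatasetProfiler.profile(source)

# failures when validating the second datasource highlight the differences
results = target.validate(expectation_suite=suite)
print(results.success)
```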

4. In-database calculations

GE utilises SQLAlchemy connectors to transform your expectations into SQL queries and run them inside the database, rather than pulling data from the database into Pandas and running calculations on your machine. Compute where the storage is! Modern Data Stack, baby!
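A sketch with the classic SqlAlchemyDataset (the connection string and table name here are made up):

```python
import sqlalchemy as sa
from great_expectations.dataset import SqlAlchemyDataset

# hypothetical connection; point this at your own database
engine = sa.create_engine("postgresql://user:pass@host:5432/shop")

# expectations on this dataset compile to SQL and run inside the database
orders = SqlAlchemyDataset(table_name="orders", engine=engine)
result = orders.expect_column_values_to_not_be_null("customer_id")
print(result.success)
```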

The Bad

  1. Profiling is slow. Like, really slow. GE runs each expectation on each column individually. If you have a large number of expectations or your compute store is slow, best to Netflix and chill and watch that episode of The Witcher before coming back to check on the progress.
  2. Creating a custom expectation beyond the basic ones is not easy (a minimal Pandas-side sketch follows this list). You need a very good grasp of SQLAlchemy fundamentals and of how the library works its magic under the hood.
  3. Documentation is patchy. The docs have version mismatches, and it can be a pain to get started with even simple use cases. Additionally, the concepts and terminology are not really beginner friendly, though the community has made quite some progress on that front recently.
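For reference, the basic (Pandas-side) route looks like this; going beyond it, e.g. to the SQL backends, is where the SQLAlchemy knowledge comes in. A minimal sketch (the even-number check is invented for illustration):

```python
import pandas as pd
from great_expectations.dataset import PandasDataset, MetaPandasDataset

class CustomPandasDataset(PandasDataset):
    # the decorator wires up result counting, mostly/null handling, etc.
    @MetaPandasDataset.column_map_expectation
    def expect_column_values_to_be_even(self, column):
        return column % 2 == 0

ds = CustomPandasDataset(pd.DataFrame({"n": [2, 4, 7]}))
print(ds.expect_column_values_to_be_even("n").success)  # False: 7 is odd
```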

Conclusion

Great Expectations is a great library for data quality with loads of integrations. Do give it a try, but be prepared to spend some time splashing about in the documentation. Where I work, we have started working with it and are quite satisfied with its features. We have also extended its Profiler to use multithreading (as most expectations are IO-bound, being external database calls) and to batch expectations so that similar items are computed in one pass (e.g. computing the max of all numeric columns in one query rather than one query per column). Will deep dive into it soon!
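To give a flavour of the multithreading idea, here is a rough sketch (not our production code; run_checks_concurrently is a hypothetical helper, and the Pandas dataset just stands in for a real database-backed one):

```python
from concurrent.futures import ThreadPoolExecutor

import great_expectations as ge
import pandas as pd

# hypothetical helper: run independent, IO-bound expectations in parallel;
# `checks` is a list of (expectation_method_name, kwargs) pairs
def run_checks_concurrently(dataset, checks, max_workers=8):
    def run_one(check):
        name, kwargs = check
        return getattr(dataset, name)(**kwargs)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, checks))

ds = ge.from_pandas(pd.DataFrame({"id": [1, 2, 3], "price": [10, 20, 30]}))
results = run_checks_concurrently(ds, [
    ("expect_column_values_to_not_be_null", {"column": "id"}),
    ("expect_column_max_to_be_between", {"column": "price", "max_value": 100}),
])
print(all(r.success for r in results))
```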

Let me know in the comments if you would like additional info or another post with code snippets.
