cleanframes — data cleansing library for Apache Spark!

Dawid Rutowicz
Jun 25 · 4 min read

The last decade of computer engineering has exploded with solutions based on cloud computing, distributed systems, big data, streaming, and artificial intelligence. Today’s systems are data-intensive: volumes keep growing while storing and processing them gets cheaper. In short, data is at the centre of our interests. One of the issues in this area is data quality — why is it so important?

Nowadays, data is everywhere and drives companies and their operations. If we make decisions based on unreliable or incomplete information, the impact can be huge and the consequences severe. Data correctness deserves a discipline of its own, and one of its processes is known as data cleansing — the process of removing or correcting corrupt records.

cleanframes is a small library for Apache Spark that makes data cleansing automated and enjoyable. Let’s walk through the problem step by step to examine what it takes to make data quality as reliable as possible! 🤘


Introduction

Let’s introduce some sample data and, for simplicity’s sake, keep the set small:

col1,col2,col3
1,true,1.0
lmfao,true,2.0
3,false,3.0
4,true,yolo data
5,true,5.0

and a domain model that represents it, defined as:
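A minimal sketch of such a model (the field names and the choice of Float for the third column are assumptions):

case class Example(
  col1: Option[Int],      // "1", "lmfao", "3", ...
  col2: Option[Boolean],  // "true" / "false"
  col3: Option[Float]     // "1.0", "yolo data", ...
)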

If you wonder why the case class uses Options everywhere, the answer is simple: Scala primitives are value types and cannot be null. Options let us set the nullability of the dataframe schema columns to true.

Let’s load the data and check what Spark does with it:
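A sketch of such a load, reading the CSV with an explicit schema so Spark tries to parse each column into its typed value (the file name and schema details are assumptions):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{BooleanType, FloatType, IntegerType, StructField, StructType}

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

// Typed schema: Spark will try to convert every cell to the declared type.
val schema = StructType(Seq(
  StructField("col1", IntegerType, nullable = true),
  StructField("col2", BooleanType, nullable = true),
  StructField("col3", FloatType,   nullable = true)
))

val frame = spark.read
  .option("header", "true")
  .schema(schema)
  .csv("example.csv")

val result = frame.as[Example].collect()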

and collecting it yields the following result:
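With Spark’s permissive CSV handling, a single unparsable cell nulls out the whole row, so the collected array looks roughly like:

// Rows 2 and 4 contain malformed cells, so every column in those rows is lost.
Array(
  Example(Some(1), Some(true),  Some(1.0f)),
  Example(None,    None,        None),
  Example(Some(3), Some(false), Some(3.0f)),
  Example(None,    None,        None),
  Example(Some(5), Some(true),  Some(5.0f))
)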

As you can see, due to the malformed values, the second and fourth lines don’t retain correct data — the entire rows are rejected. It would be much better if we could ignore only the invalid cells and keep the correct ones.

Our goal is to represent the input data as follows:
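That is, only the unparsable cells should be dropped while the valid values in the same rows survive — roughly:

// Only the malformed cell becomes None; the other values in the row are kept.
Array(
  Example(Some(1), Some(true),  Some(1.0f)),
  Example(None,    Some(true),  Some(2.0f)),  // "lmfao" is not a valid Int
  Example(Some(3), Some(false), Some(3.0f)),
  Example(Some(4), Some(true),  None),        // "yolo data" is not a valid Float
  Example(Some(5), Some(true),  Some(5.0f))
)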

How can we achieve this? 🤔


Pure Spark SQL API

Apache Spark comes with a set of APIs for working with dataframes that can be used for graceful parsing:
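A sketch of that manual approach, reading the raw columns as strings and casting each one defensively with when and cast from org.apache.spark.sql.functions (the exact shape of the original snippet is an assumption):

import org.apache.spark.sql.functions.{col, when}
import org.apache.spark.sql.types.{BooleanType, FloatType, IntegerType}

// Read everything as strings first, then cast each column on its own.
val raw = spark.read.option("header", "true").csv("example.csv")

val cleanedManually = raw
  .withColumn("col1",
    when(col("col1").cast(IntegerType).isNotNull, col("col1").cast(IntegerType)))
  .withColumn("col2",
    when(col("col2").cast(BooleanType).isNotNull, col("col2").cast(BooleanType)))
  .withColumn("col3",
    when(col("col3").cast(FloatType).isNotNull, col("col3").cast(FloatType)))
// `when` without `otherwise` yields null for non-matching cells,
// which maps to None in the Option fields of Example.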

These lines do the job just fine; however, that is a lot of code for such a simple task — and there are only three columns to take care of! Typically the column count is much greater, which only means the boilerplate grows linearly 😞


Can we do better?

Well, of course we can! You could always write helper methods and reuse them across the project, like this:
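For example, a hypothetical castOrNull helper (the name and shape are illustrative) factors the repeated cast out, reusing the raw frame from the previous sketch:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, when}
import org.apache.spark.sql.types.{BooleanType, DataType, FloatType, IntegerType}

// Keep a cell only if it casts cleanly to the target type, otherwise null it.
def castOrNull(frame: DataFrame, name: String, target: DataType): DataFrame =
  frame.withColumn(name,
    when(col(name).cast(target).isNotNull, col(name).cast(target)))

val cleaned = castOrNull(
  castOrNull(
    castOrNull(raw, "col1", IntegerType),
    "col2", BooleanType),
  "col3", FloatType)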

The question is: is this good enough? We have achieved some reusability, but did we really give it our best shot? Frankly speaking, chaining these calls is awkward and error-prone. They could be hidden behind an implicit class to imitate a DSL:
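A hypothetical sketch of such a DSL, wrapping the castOrNull helper from above in an implicit class:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{BooleanType, FloatType, IntegerType}

object CleaningSyntax {
  // Extension methods so the casts read as a fluent chain.
  implicit class CleaningOps(frame: DataFrame) {
    def toInt(name: String): DataFrame     = castOrNull(frame, name, IntegerType)
    def toBoolean(name: String): DataFrame = castOrNull(frame, name, BooleanType)
    def toFloat(name: String): DataFrame   = castOrNull(frame, name, FloatType)
  }
}

import CleaningSyntax._
val cleanedWithDsl = raw.toInt("col1").toBoolean("col2").toFloat("col3")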

There is some progress; nonetheless, it is mostly cosmetic. We still need to do this manually and repetitively throughout the code base. There is room left for improvement 🤘


cleanframes

cleanframes is a library that aims to automate data cleansing in Spark SQL with the help of generic programming. Just add two imports and call the clean method:
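The exact import paths below are an assumption and may differ between library versions; the key point is that two imports bring the predefined instances and the clean syntax into scope:

import cleanframes.instances.all._  // predefined cleaning instances (assumed path)
import cleanframes.syntax._         // the `clean` extension method (assumed path)

// One call instead of a withColumn per field:
val cleanedFrame = raw.clean[Example]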

This code is equivalent to the one listed in the “Pure Spark SQL API” section, so there is no performance difference between the two.

What the heck has just happened? 😧

The clean method expands the code through implicit resolution based on the case class’s fields. The Scala compiler applies a specific transformation to each field according to its type. cleanframes comes with predefined implementations that become available via a simple library import (the instances import shown above). I recommend checking their source code on GitHub.

It’s worth mentioning that these transformations use the Spark SQL functions package, so they are Catalyst-friendly.

The full solution code looks like the following:
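Putting the pieces together, an end-to-end sketch (with the same assumed import paths, file name, and field types as above):

import org.apache.spark.sql.SparkSession
import cleanframes.instances.all._  // assumed import path
import cleanframes.syntax._         // assumed import path

case class Example(
  col1: Option[Int],
  col2: Option[Boolean],
  col3: Option[Float]
)

object CleanframesDemo extends App {
  val spark = SparkSession.builder
    .master("local[*]")
    .appName("cleanframes-demo")
    .getOrCreate()

  import spark.implicits._

  val cleaned = spark.read
    .option("header", "true")
    .csv("example.csv")   // read every column as a raw string
    .clean[Example]       // cell-by-cell cleansing derived from the case class
    .as[Example]

  cleaned.show()
}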

And you’re ready to go with transformations! 😇

What are the pros and cons?

It’s time to examine the advantages and disadvantages of the library.

Pros:

  • boilerplate is done for you by the compiler
  • common project transformations can be defined in one place and reused
  • no additional overhead compared to the same code written manually
  • changes in case classes are automatically reflected and applied in transformations during compilation

Cons:

  • implicits

Scala-based libraries have an advantage over their Java counterparts: they rely on the type class approach rather than the reflection API. This brings compile-time safety and reduces runtime overhead.

Implicits are a magic word that can scare some people, but hey! This is a straightforward case where all you do is parse simple types! There is no endo-mumbo-jumbo-functor business in this scope.

Don’t freak out, be reasonable! 👻


Summary

This article is only an introduction and a simple demo of cleanframes usage. The library is capable of doing much more.

The follow-up article explains features such as:

  • overriding a default transformation with a custom one
  • adding support for new types
  • defining transformation using Spark UDFs
  • transforming nested case classes
  • handling case classes with fields of the same type

“cleanframes — data cleansing library for Apache Spark! (part 2)” presents all of them one by one, with code examples and explanations.

I hope that cleanframes can help you with this tedious but necessary work and cut down the boilerplate code. The project has recently been released under the Apache license, and I would appreciate contributions of all kinds: feature proposals, issues, pull requests, you name it! And of course: give it a try! 😊

Written by Dawid Rutowicz
Software developer: https://twitter.com/dawrutowicz