Making data cleaning simple with the library

The library is a tool that simplifies and speeds up data preparation prior to any analysis step in Spark. It provides tools to map, visualize, and transform data for iterative analysis. Quality metrics, metadata, and summary statistics give a data scientist a visceral sense of data preparation progress. The library is available from Python and Scala notebooks in IBM Data Science Experience.

Why we built

Before a data scientist can derive value from data, the data must be cleaned up and converted into a unified form that an algorithm can consume. As a result, data scientists commonly report spending about 80% of their time on data preparation.

Too much time is spent handcrafting and tweaking data. Often referred to as data wrangling, data munging, or data janitor work, it is a tedious but necessary step in making sense of any data.

As platform and tool providers, we often hear this sentiment from our customers, and if you’re a data scientist, you probably experience the pain of data preparation regularly.

We built to ease the pain of data cleansing. For data scientists who want to be productive in Python or Scala, does the heavy lifting, enabling better data pipelines and helping them scale to very large data sets by leveraging the power of Spark.

The Sparkling features

As we work with customers, we often find that their data is scattered across a variety of containers and subcontainers holding heterogeneous data. To run an analysis, customers first had to categorize the data into folders containing similar types of data. Such categorization is a tedious and daunting task, especially for voluminous data, where the bulk of business value lies.

This challenge led us to our first sparkling feature: automated discovery of file types.

Automated discovery of file types

Whether you are working with data in Swift, S3, GPFS, or HDFS, discovers file types and returns a Spark data frame that represents the most frequently occurring data types. The task of mapping the right files to the right readers, or of categorizing data into folders or containers of the same type, is done for you. This simplifies the programming task significantly.

Often with Spark, you have to figure out the right package name to use with the right data.

df ="com.databricks.spark.csv").load("dirContainingCsv")

's discovery package figures out the most commonly occurring file types and returns a data frame:

df ="").load("sparklingdata/data/AllDataDir")

In addition, you can list all the file types discovered, by occurrence, and choose to read a specific type selectively. In cases where the discovered file types are missing or wrong, you can override the discovery and explicitly provide the types to use for reading.

This feature is available for the major file formats: CSV, TSV, Parquet, Avro, and JSON. In addition, PDF and document file formats are also supported.
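As a rough illustration of the discovery idea, this sketch tallies file types in a directory listing and maps the dominant type to a Spark data source name. The function names and the extension-based detection are simplifying assumptions for illustration, not's actual API:

```python
import os
from collections import Counter

# Illustrative mapping from a detected file type to a Spark data source.
READER_FOR = {
    ".csv": "com.databricks.spark.csv",
    ".json": "json",
    ".parquet": "parquet",
    ".avro": "com.databricks.spark.avro",
}

def discover_dominant_type(paths):
    """Return the most frequently occurring file extension in paths."""
    counts = Counter(os.path.splitext(p)[1].lower() for p in paths)
    return counts.most_common(1)[0][0]

def reader_for(paths):
    """Map the dominant file type to a Spark data source name."""
    return READER_FOR[discover_dominant_type(paths)]
```

With this in hand, a mixed directory of mostly CSV files would resolve to the CSV reader, which is the categorization work the library performs for you.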

Automated discovery of data types

Our customers also need visibility into the data: not just a view of a sample of the raw data, but visibility into its more interesting aspects, such as:

  • What are the data column types?
  • How many fields match the data type?
  • How many fields are mismatches?
  • Which fields are matches?
  • Which fields are mismatches?

 offers several features that automatically detect types, either by scanning a sample of the data or by looking at the entire data set. Most libraries handle well the case where all fields in a column contain the same type. In that case, the column can be converted to the detected type without any “data loss.”

In the big data world though, this is often not the case. Take an illustrative example of a column containing data representing age, with values such as 22, 14, 80, age-41, 75, age-47, 40.

Most libraries would recommend converting this column to a string to include all values. However, a string column for age data would not help downstream analysis.'s features for handling bad data follow the principle of data cleaning with data inclusion: type detection recommends the most frequently occurring type seen in the data. In the above case, the recommendation for the age column would be integer.
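The most-occurring-type recommendation can be sketched as a simple tally. This is an illustrative simplification, not's implementation, and the function name is hypothetical:

```python
from collections import Counter

def detect_majority_type(values):
    """Recommend the most frequently occurring primitive type among
    raw string values, per the data-inclusion principle above."""
    def parse_type(v):
        # Try the narrowest numeric parse first, fall back to string.
        for caster, name in ((int, "integer"), (float, "double")):
            try:
                caster(v)
                return name
            except ValueError:
                pass
        return "string"

    counts = Counter(parse_type(v) for v in values)
    return counts.most_common(1)[0][0]
```

For the age example above, five of the seven values parse as integers, so the recommended type is integer rather than string.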

Visibility into mismatched data

Most libraries provide the option to manually convert the type of the age column above to integer, which turns the mismatched strings into nulls, leading to data loss. Instead, gives visibility into the mismatched data, creating an opportunity to fix it. In the above case, the mismatched data comprises all the entries starting with “age-”.

In Figure 1, the values marked with >< in the Age column, in the lower half of the figure, indicate data mismatches.

Figure 1: View recommended types and mismatched values

Once the data scientist has visibility into the mismatched data, he or she can transform the data to fix it. In the above case, calling a substitute function on the column to replace “age-” with an empty string cleans the data, resulting in all integer values. The data scientist can then act on's recommendation to convert the column to integer without losing any data. The process is iterative, transparent, and quick.

 also provides several features that let the data scientist supply data set or column specific information to improve quality. Examples include providing locale information to ensure accurate type conversion of numeric and date values, or providing custom formats to include in the automatic date and timestamp detection and conversion.
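The substitute-then-convert step for the Age column can be sketched in plain Python. In Spark itself, this would typically be a regexp_replace on the column followed by a cast; the helper name here is hypothetical:

```python
import re

def clean_age(values):
    """Replace the stray 'age-' prefix with an empty string, then
    convert to int: the substitute-then-convert step described above."""
    return [int(re.sub(r"^age-", "", v)) for v in values]
```

After this cleanup, every value is a proper integer, so converting the column to integer loses no data.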

Automatic interpretation of business types

When it comes to detecting types, goes one step further by interpreting business types such as person names, addresses, phone numbers, and organization names. It uses IBM technologies such as text analytics to detect these types.

In Figure 2 below, see automatically detected business types such as person names, addresses, and organizations.

Figure 2: Business type detection

We often see cases where customers require that the quality of the data be tuned to acceptable levels based on specific use cases. A data scientist may be working on a customer scenario where he or she needs at least 90% of the data in a column to be person names in order for the column to be treated as type Person.

Alternatively, a column detected as Organization might need only 80% of its data to be organizations. These quality thresholds can be provided on a per-column basis or for the whole data set to control the type detection. For example:

  • A column C1 with 95% person names is detected as type Person, when the user provides a threshold of 90%.
  • A column C2 with 75% organization names can be treated as a String, when the user provides a threshold of 80%.
  • A column C3 containing 100% addresses is detected as Address type.
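The threshold rule in these examples can be sketched as a small decision function. The matcher predicate stands in for a real detector (such as a person-name annotator), and all names here are hypothetical:

```python
def detected_type(values, business_type, matches, threshold=0.9):
    """Treat the column as business_type only if the fraction of values
    accepted by the matcher meets the threshold; otherwise fall back
    to plain String, mirroring the per-column examples above."""
    fraction = sum(1 for v in values if matches(v)) / len(values)
    return business_type if fraction >= threshold else "String"
```

A column where 75% of values match Person would thus be detected as Person with a 70% threshold but fall back to String with a 90% threshold.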

Extended type system

's mission is to make data ready for analysis. That analysis may require using SQL, performing prescriptive analysis, or building predictive models. In such cases, detecting the type of an unstructured text column and allocating it to a Person or Address field is insufficient: these fields need defined structures for further analysis.'s extended type system supports such scenarios. Each semantic type is represented as a Spark struct type. In the above scenario, the Person type contains firstName, middleName, and lastName, and the Address type contains street name, city, state, and zip.
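Assuming field names that follow the description above (the library's actual definitions may differ), the two struct types could be written with Spark's own schema classes like this:

```python
from pyspark.sql.types import StructType, StructField, StringType

# Sketch of the extended business types as Spark struct types; field
# names follow the prose description and are not the library's exact
# definitions.
person_type = StructType([
    StructField("firstName", StringType(), True),
    StructField("middleName", StringType(), True),
    StructField("lastName", StringType(), True),
])

address_type = StructType([
    StructField("streetName", StringType(), True),
    StructField("city", StringType(), True),
    StructField("state", StringType(), True),
    StructField("zip", StringType(), True),
])
```

Because each semantic type is an ordinary Spark struct, downstream tools can address individual sub-fields (for example, person.firstName) in SQL or transformations.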

Figure 3 below shows the types the columns are converted to after applying the threshold.

Figure 3: Automatic conversion to extended business types

As a result,'s extended type system allows downstream analysis over unstructured text columns.

Figure 4 below shows a sample analysis that uses a Spark SQL command to find all first names from the state of California, where the state name is spelled in various ways.

Figure 4: SQL analysis over unstructured text
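To show the shape of such a query without a Spark cluster, this sqlite3 sketch flattens the struct fields (such as name.firstName and address.state) into plain columns; the table name, column names, and data are illustrative:

```python
import sqlite3

# In-memory table standing in for a data frame with extended types.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (firstName TEXT, state TEXT)")
conn.executemany("INSERT INTO people VALUES (?, ?)", [
    ("Alice", "CA"),
    ("Bob", "California"),
    ("Carol", "NY"),
    ("Dan", "Calif."),
])

# Match the several ways California appears in the data.
rows = conn.execute(
    "SELECT firstName FROM people "
    "WHERE state IN ('CA', 'California', 'Calif.')"
).fetchall()
```

The query returns Alice, Bob, and Dan, despite the three different spellings of the state.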

Column distributions and key statistics

Circling back to the key principle of discovery and visibility into the data,'s data description feature provides column distributions and statistics on the entire data set. Data scientists building predictive models find the column distributions and key statistics, such as standard deviation, min, max, and top n values, especially useful. Column distributions not only give an overview of the data, they also provide insight into its quality.

Data scientists get insight into aspects of each column, such as:

  • Frequency and occurrence of nulls in the fields
  • Frequency and occurrence of unique values for categorical fields
  • Bins representing numerical ranges and distributions per range for numerical values
  • Bins representing date, timestamp ranges, and distributions per range for date and timestamp values
  • Statistics on numeric columns such as max, min, range, stdDev, kurtosis, skewness and others
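A few of those statistics can be computed with the standard library alone, as a minimal stand-in for the full profile (which also covers kurtosis, skewness, and top-n values):

```python
import statistics

def column_stats(values):
    """Compute a few of the key statistics listed above (min, max,
    range, standard deviation) for a numeric column."""
    return {
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "stdDev": statistics.stdev(values),
    }
```

Running this over a cleaned numeric column gives the kind of summary that the profiling views render as charts.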

As revealed earlier, the Age column contains some mismatched values. Figure 5 below shows the mismatched values, converted to null for the integer column, indicating potentially poor quality.

Figure 5: Visualize column statistics for quality assessment

As shown in Figure 6, after the data is fixed, the visualizations confirm the improved quality (nulls replaced by expected values of the right type, in this case integer).

Figure 6: Confirm improved quality using profiling

 stores all the discovered metadata from type inference and column distributions in the Spark data frame's metadata. Its render functions help visualize this metadata as charts and graphs.
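As an illustration of per-column metadata (in Spark, each StructField carries a metadata dictionary), this sketch uses hypothetical key names, not's actual metadata layout:

```python
# Hypothetical per-column metadata, of the kind type inference and
# profiling could attach to a data frame's schema.
age_metadata = {
    "inferredType": "integer",
    "matchCount": 5,
    "mismatchCount": 2,
}

def mismatch_ratio(meta):
    """Fraction of values that failed to match the inferred type;
    a render function could chart this figure per column."""
    total = meta["matchCount"] + meta["mismatchCount"]
    return meta["mismatchCount"] / total
```

A falling mismatch ratio across iterations is exactly the kind of progress signal the profiling views surface.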

In addition to the visualizations seen so far on numeric values, Figure 7 below shows visualizations of categorical content.

Figure 7: View blood pressure and drug values distribution

The libraries are available for Python and Scala notebooks. Data scientists can use the discovery and insight features to iterate over the data preparation process. Throughout the process, they can use the matplotlib or Brunel libraries to visualize their data, and use the metadata to aid the iterative data preparation process.

Discover the power of!

The library gives data scientists an invaluable tool to jump-start the data preparation process, along with statistical and qualitative measures of progress on the data set. It simplifies the unavoidable and tedious step of working with structured or unstructured data prior to any Spark analysis, and, together with its metrics, provides an intuitive way to leverage the power of data for Spark.

Take a tour with a sample notebook

Take a tour of cleaning structured, semi-structured, and unstructured data sets with by using the sample notebook. Follow these steps:

  1. Open the Analyzing data by using the library features sample notebook, but do not run it yet.
  2. Add this code as the first cell in the notebook:

from pyspark import SparkContext, SparkConf

sc.stop()

conf = SparkConf().set("", "false")

sc = SparkContext(conf=conf)

3. Run the notebook.

4. When you run the notebook multiple times, saving the data frame gives an error because a file already exists at the destination location. In the Save the DataFrame section, comment out the line that saves the data frame, or change the destination location.

Originally published at on September 15, 2016.