Data quality and EDA with Optimus

Argenis Leon
7 min readNov 19, 2021

--

Photo by Jasper Wilde on Unsplash

This post is the second of a series of posts about Optimus.

Welcome back!

Last time we introduced Optimus, how to install it and describe essential features and functions to get started (you can check it out in this link: Easy data processing at scale with Optimus | by Argenis Leon | Nov 2021 | Medium). Now it is time to dive a little deeper into the fascinating library of Optimus.

Data quality

As we glimpsed last time, Optimus can detect a wide range of data types within the dataset. Optimus can then count the number of values in a column that match a specific profiler data type; this gives you the ‘quality of the data. In Optimus, we use the term quality to express three data characteristics:

  1. Match: number of values that match the data type being inferred.
  2. Mismatch: number of values that differ from the data type being inferred.
  3. Missing: number of missing values.

For example, if the profiler data type in a column is Email, Optimus will count the number of values in a column that does the following:

  • Match the email format, such as: “dog@wuff.com”.
  • Do not match the email format: “dog@wuff”.
  • It will also count the null values.

Here’s an example of how the qualitycommand works:

As we can see, the quality function returns a dictionary, with the column names as the keys, containing matches, mismatches, missing values, and profiler_dtype, which is the abstract type inferred by Optimus. As we so in the introductory post of this series, Optimus can infer various data types; here’s the full list:

  • Integer
  • String
  • Email
  • URL
  • Gender
  • Boolean
  • US zip code
  • Credit card number
  • Time and date format
  • Object
  • Array
  • Phone number
  • Social security number
  • HTTP code

Now, it may be the case that the data type inferred by Optimus is not the one you need in your dataset and that the mismatches are, in fact, the data type you intend the column to have, so if you want to change the data type inferred by the library in one or more columns to have fewer mismatches in the data quality, Optimus has the flexibility for this to be possible with the function set_dtype(). For example, if we would want to change the data type of the column named “Zipcode” from ‘zip_code’ to ‘int,’ we write the code:

df = df.cols.set_dtype("Zip code", "int")

Moreover, if we would want to change the data type of multiple columns, we write:

df = df.cols.set_dtype({"Zip code": "int", "Credit card number": "str"})

It’s important to notice that particular data types are internally treated as strings but are constrained by a format, such as emails, URLs, and some DateTime values.

It is also possible to expand the data types supported by Optimus. We will see more about this topic in the following posts.

Data Profiling

Optimus has useful functions that return stats about the dataset. This function is profile and here’s how it works, the following code will produce a Python dictionary that contains stats about every column in “df”:

As you can see, profile will return a Python dictionary where you can find specifics about every column and stats about the whole DataFrame.

Talking about DataFrame stats, you can always use profile.summary() to get the following:

  • cols_count: Number of columns in the DataFrame.
  • rows_count: Number of rows in the DataFrame.
  • dtypes_list: List of data types in the DataFrame.
  • total_count_dtypes: Count of data types in the DataFrame.
  • missing_count: Number of missing values in the DataFrame.
  • p_missing: Percentage of missing values in the DataFrame.

The following example illustrates how it works:

Using profile.columns() will return valuable information about every column in the DataFrame through two dictionary keys: stats and dtype.

The stats key contains the following info:

  • match: Number of values in the columns that match the profiler_dtype.
  • missing: Number of missing values.
  • mismatch: Number of values in the column that do not match the profiler_dtype, excluding null values.
  • inferred_data_type: Data type inferred by Optimus.
  • count_uniques: Number of unique values in the dataframe.
  • frequency: Top n values in descending order (if applicable).
  • hist: Density of value in every bin (if applicable).

Optimus will calculate a frequency or a histogram depending on the datatype. For example, it will calculate a histogram for numeric data types and a frequency for string datatypes. Let’s look at an example:

So, with these profiling functions, it is more than easy to explore the dataset you are working with, extract its quality data and, obtain helpful information about it for transforming and understanding the various relations that you may encounter through analysis.

String, numeric and statistical functions

Optimus also has a variety of string and numeric functions that are helpful for every operation you can think of when doing some data exploration and transformation; most of them are pretty simple but powerful. Let’s take a look.

String

The string function is self-descriptive. Let’s see all the functions and some brief descriptions of each. We’ll look at examples of just a few of them since some operations are pretty similar:

For example, if we’d want to replace all the lower case letters with upper case, we can write:

We can also create a new column with the output of the function instead of replacing the original:

All the string functions can be applied to any data type. For example, you can call df.cols.padto an integer column, and Optimus will transform the column’s data type to string before applying the pad function:

Optimus tries to be less strict and converts the data type instead of raising an error for operating columns that could be incompatible. This behavior helps concatenate the string values with the numeric values of another; we can always use the nest and unnest functions.

Nesting and unnesting

When working with dates, names, addresses, or any reasonably complex string data, merging and splitting can be helpful. Optimus provides two functions for this: nest and unnest. Let’s look at some examples:

On the other hand, we also have numeric functions, divided into two groups: mathematical and trigonometric. These functions applied to a single column or several columns will return the same column in the dataframe. As the functions are self-explanatory, we are going to illustrate with two simple examples.

Mathematical

Optimus provides plenty of mathematical functions used in many contexts for data wrangling and data transformation; these functions are also helpful for quantitative analysis. The following are the ones supported by Optimus:

For example, if we’d want to calculate the exponential of the column “depth”, we could write the following:

Optimus also offers many trigonometric functions that could be useful for data analysis. Let’s talk about those.

Trigonometric

Optimus supports the most basic ones (sine, cosine, and tangent) and their respective inverse and hyperbolic functions. If needed, you can also include the other trigonometric functions by using their formulas induced by the basic ones shown here.

For example, to calculate the cosine of all the values contained in column “Credit Card number” in df, we can write the following:

Statistical

Optimus provides a wide range of statistical functions that operate the same way as the previously shown function; they are applied to all the values contained in one or more columns. All these functions are handy for Exploratory Data Analysis (EDA) or simplified statistical data analysis. There are four ways to categorize EDA:

  • Single variable, non-graphical
  • Single variable, graphical
  • Multivariate, non-graphical
  • Multivariate, graphical

In this post, we will only review the single variable area. For non-graphical methods, we have the following set of functions:

For example, to calculate the interquartile range of the distribution generated by the column “x” in “df2”, we could write the following:

In the same matter, Optimus also has functions for graphical methods:

We use the function plot to plot these functions applied to a set of columns or a single column. Here’s an example of a box plot of the distribution generated by the column “y” in “df2”:

Remember that the statistical functions shown in this post are just for single variable analysis, but Optimus has a whole set for multivariate analysis.

User-defined Functions (UDF)

In Optimus, you can also define your own functions. To apply a UDF, you can use the df.cols.apply method. For example, to establish a function that adds 7 to our whole column, we run the following code:

Now we will apply our new function to the column of a dataframe already defined:

As you can see, the output was the same dataframe but with the values of the “depth” column increased by seven.

It’s vital to notice that when you write your own UDF, you must handle the data that’s passed to the function. Because we are using the Pandas engine, the value passed to the function is a Pandas series that supports the plus operator executed in a vectorized way.

So, now you have tasted a bit more of Optimus through its very intuitive functions for data analysis: quantitative or qualitative. You also learned how to define a personalized function and apply it to a dataset.

In the next post, we will explore some other areas of Optimus, such as data aggregation, filtering and grouping, and string clustering.

See you next week!

Do you want to learn more about Optimus?

If you want a deep dive explanation about the +100 functions in Optimus, consider reading our book available on Amazon https://www.amazon.com/Data-Processing-Optimus-Supercharge-preparation-dp-1801079560/dp/1801079560/ref=mt_other?_encoding=UTF8&me=&qid=1628683187.

--

--