Easy data transformations with Optimus

Argenis Leon
9 min read · Nov 26, 2021


Photo by Paxson Woelber on Unsplash

This post is the third of a series of posts about Optimus.

Welcome back!

Last time we explored some of the essential functions for qualitative and quantitative data analysis. We went over how to define your own functions within Optimus and apply them to any dataset (you can check it out at this link: https://medium.com/@argenisleon/data-quality-and-eda-with-optimus-32e511629d32). This post will cover some aspects of data transformation: data aggregation, filtering by grouping, and string clustering.

Aggregations

When working with a large dataset, we need efficient summaries to learn more about it. For this we have aggregations: functions that combine the values of multiple rows into a single value that tells us something about a column. Optimus offers two ways to apply aggregations:

  • Using the already known cols accessor
  • Using the agg function

Optimus supports many aggregations, among them min, max, mean, and standard deviation.

As we’ve seen in the first two posts of this series, the cols accessor is a handy tool when working with dataframes. Let’s take a look at both aggregation options.

Say we want to know the minimum number of won games and home runs among the players displayed:

Or the standard deviation of all home runs displayed:

In contrast with traditional dataframe technologies, Optimus returns a Python dictionary containing the column’s name and value instead of a dataframe object, which is very convenient when asking for a single aggregation. We can also use the agg function when we want a set of aggregations.

With this method, you can calculate one or multiple aggregations simultaneously, with less code. For example, if we want to know the maximum and minimum of won games, along with the mean of home runs, we can write the following code:
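As a rough illustration of the shape of these results (with made-up players and numbers, and a stand-in `agg` helper rather than the real Optimus API), the dictionary-style output looks like this:

```python
import statistics

# Toy dataset standing in for the baseball data used in the post.
rows = [
    {"Name": "Player A", "Won": 94, "Home runs": 54},
    {"Name": "Player B", "Won": 88, "Home runs": 47},
    {"Name": "Player C", "Won": 81, "Home runs": 12},
]

FUNCS = {"min": min, "max": max, "mean": statistics.mean, "std": statistics.stdev}

def agg(rows, spec):
    """Apply {column: aggregation(s)} and return a flat dict of results."""
    out = {}
    for col, names in spec.items():
        for name in ([names] if isinstance(names, str) else names):
            out[f"{name}_{col}"] = FUNCS[name]([r[col] for r in rows])
    return out

# One aggregation yields a small dictionary rather than a dataframe:
print(agg(rows, {"Won": "min"}))  # {'min_Won': 81}
# Several aggregations at once, in the spirit of the agg function:
print(agg(rows, {"Won": ["max", "min"], "Home runs": "mean"}))
```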

These aggregations will be performed on the whole dataset, without any groups being taken into account. So now, let’s learn what to do if we want to limit the results to some groups.

Filtering by grouping

When we want to analyze more specific areas of a dataset, we need tools to filter down to the parts we want to explore. In other words, sometimes we want to apply aggregations to a dataset while taking into account a common set of values in another column. Here we can use a practice called grouping, via the groupby parameter.

For example, imagine finding the maximum number of home runs hit by each player displayed in df. For this, the agg function supports a groupby parameter, which groups rows by the values in the “Name” column before calculating the maximum value of the “Home runs” column.

After getting this Python dictionary, you can access any player’s value with the following command:
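As a sketch of what that grouped result and lookup look like (hypothetical names and values, and a stand-in helper rather than the Optimus implementation):

```python
# (name, home runs) pairs invented for illustration.
games = [
    ("Babe Ruth", 54), ("Babe Ruth", 59),
    ("Lou Gehrig", 47), ("Lou Gehrig", 27),
]

def max_by_group(pairs):
    """Maximum value per key -- the dictionary shape a grouped max returns."""
    out = {}
    for name, home_runs in pairs:
        out[name] = max(home_runs, out.get(name, home_runs))
    return out

maxima = max_by_group(games)
print(maxima["Babe Ruth"])  # 59
```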

These aggregation and grouping functions give you a lot of power to wrangle your data: you can shape it the way you need to extract quantitative insights and reformat your dataset for different uses.

String clustering

String clustering refers to grouping different values that might be alternative representations of the same thing. An excellent example is the strings “NYC” and “New York City”: both refer to the same place.

When wrangling data, we frequently find columns whose values look similar but are not identical. To handle this issue, Optimus gives you handy techniques to detect which strings are similar, group them, and suggest the value that best represents each group. We will explore these techniques in this section.

String clustering may be one of the most underrated data-cleansing functions. To explain what string clustering is for, we can refer to the definition from OpenRefine, which says:

“[It] finds groups of different values that might be alternative representations of the same thing.”

For example:

In df2, we see that the first three names refer to one specific person (Walter White), while the last two may refer to another (Gustavo Fring). Optimus gives you the tools to apply different string-clustering methods, suggest the value that best represents each group, and then replace the values to achieve a cohesive representation of the data.

Optimus allows us to use different string-clustering methods, from some fast and less accurate methods such as fingerprinting to more advanced ones such as Levenshtein distance. Which one you will use depends on your use case, the size of the data at hand, and the computing power available to you.

All these methods tend to be a combination of operations over strings already defined in Optimus, such as lowercase or remove_special_symbols. In Optimus, we divide the methods available for that practice into two groups:

  • Key collision methods.
  • Nearest-neighbor methods.

In this post, we’ll be reviewing some of the most used string clustering methods, but there are a ton more (many of them explained in our book!).

Some key collision methods

Key collision methods are based on the idea of creating a reduced and meaningful representation of a value (a key) and putting equal ones together in buckets. Optimus has implemented three methods of this sort: fingerprinting, n-gram fingerprinting, and phonetic fingerprinting.

Fingerprint

The fingerprinting method is the least likely to generate false positives, which is why Optimus defaults to this.

The process that generates a key from a string value is outlined here and must be followed in this order:

  1. Remove leading and trailing whitespace (for example, from “ Walter White ” to “Walter White”).
  2. Change all characters to their lowercase representation (for example, from “Gus Fring” to “gus fring”).
  3. Remove all punctuation and control characters (characters that help to give form to the text but cannot be seen, such as a tab or a carriage return, among others).
  4. Update extended western characters with their American Standard Code for Information Interchange (ASCII) representation (for example, from “güs” to “gus”).
  5. Divide the text string into individual tokens for every word after whitespace (for example, from “walter white” to [“walter”, “white”]).
  6. Sort the tokens and remove duplicates (for example, from [“gus”, “fring”] to [“fring”, “gus”], because “f” comes before “g” alphabetically).
  7. Finally, join the tokens back together with a single space. The result would be “fring gus”.
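The seven steps above can be sketched in plain Python (this mirrors the recipe, not Optimus’s internal code):

```python
import string
import unicodedata

def fingerprint(value):
    """Plain-Python sketch of the fingerprint key described above."""
    # 1-2. Strip surrounding whitespace and lowercase.
    value = value.strip().lower()
    # 3. Drop punctuation and control characters (Unicode category "C").
    value = "".join(c for c in value
                    if c not in string.punctuation
                    and unicodedata.category(c)[0] != "C")
    # 4. Fold extended western characters to their ASCII representation.
    value = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode()
    # 5-7. Tokenize on whitespace, sort, deduplicate, and rejoin.
    return " ".join(sorted(set(value.split())))

print(fingerprint("  Güs Fring "))  # fring gus
print(fingerprint("White, Walter") == fingerprint("walter WHITE"))  # True
```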

Now that we know how the fingerprinting method works, let’s look at an example using df2 through Optimus.

As we can see, the number of distinct strings went from five to three, so we reduced our dataset’s inconsistency in some measure. The result gives us the clusters in a dictionary. It’s also possible to apply the fingerprint method to a specific column:

A more sophisticated variant, called the n-gram fingerprinting method, is explained in detail in our book; let’s review the basics next.

N-gram fingerprinting

To understand this method, we should first talk about n-grams. An n-gram is a sequence of n items; in this case, a sequence of n characters. For example, the 2-grams of the string “heisenberg” would be: [“he”, “ei”, “is”, “se”, “en”, “nb”, “be”, “er”, “rg”].

As you can see, to build the 2-grams you take the first two characters of the string, then slide the window forward one character at a time, repeating until you reach the end of the string.
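That sliding-window construction takes only a couple of lines:

```python
def char_ngrams(text, n=2):
    """All character n-grams of a string, via a sliding window."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("heisenberg"))
# ['he', 'ei', 'is', 'se', 'en', 'nb', 'be', 'er', 'rg']
```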

The algorithm to do an n-gram fingerprint follows these steps:

  1. First, change all characters to their lowercase representation.
  2. Remove all punctuation, whitespace, and control characters (characters that help to give form to the text but cannot be seen, such as a tab or a carriage return, among others).
  3. Create all string n-grams.
  4. Sort the n-grams and remove duplicates.
  5. Join the sorted n-grams back together as a string.
  6. Finally, convert extended western characters to their ASCII representation.
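As with the plain fingerprint, the recipe can be sketched in a few lines of plain Python (again, an illustration of the steps rather than Optimus’s own code):

```python
import string
import unicodedata

def ngram_fingerprint(value, n=2):
    """Sketch of the n-gram fingerprint steps described above."""
    # 1-2. Lowercase, then drop punctuation, whitespace, and control chars.
    value = "".join(c for c in value.lower()
                    if not c.isspace()
                    and c not in string.punctuation
                    and unicodedata.category(c)[0] != "C")
    # 3-4. Build all character n-grams, then sort and deduplicate.
    grams = sorted({value[i:i + n] for i in range(len(value) - n + 1)})
    # 5-6. Join back into a single string and fold to ASCII.
    joined = "".join(grams)
    return unicodedata.normalize("NFKD", joined).encode("ascii", "ignore").decode()

# Whitespace and case differences disappear in the key:
print(ngram_fingerprint("Walter White") == ngram_fingerprint("walterwhite"))  # True
```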

In short, this method splits a string into small pieces of length n. In practice, large values of n offer no advantage over the plain fingerprinting method, but 2-grams and 1-grams can find clusters that the earlier method couldn’t, even between strings with only minor differences, at the cost of more false positives.

Let’s take a look at how this method works:

We can see that n-gram fingerprinting can be more powerful than the earlier method: as an extension of fingerprinting, it can match pairs of strings separated by minor differences, such as typos.

Phonetic fingerprinting

Phonetic fingerprinting is a method to encode strings into a representation that better matches the way they are pronounced. The main goal is to bucket similar-sounding words and sentences. For example, “Reuben Meza” and “Ruben Mesa” share the same phonetic fingerprint for English pronunciation, but they have different fingerprints for both the preceding regular and n-gram fingerprinting methods.

There are several phonetic methods that Optimus supports, such as the following:

  • Soundex
  • Metaphone
  • Double Metaphone
  • Match Rating Codex
  • New York State Identification and Intelligence System (NYSIIS)
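Metaphone’s rule set is too long to reproduce here, but the idea behind phonetic keys can be shown with the simplest method on the list, Soundex. This is a sketch of standard American Soundex, not Optimus’s implementation:

```python
def soundex(name):
    """Standard American Soundex: first letter plus three digits."""
    mapping = {}
    for digit, letters in [("1", "bfpv"), ("2", "cgjkqsxz"), ("3", "dt"),
                           ("4", "l"), ("5", "mn"), ("6", "r")]:
        for ch in letters:
            mapping[ch] = digit
    name = "".join(c for c in name.lower() if c.isalpha())
    first = name[0].upper()
    encoded, prev = [], mapping.get(name[0], "")
    for ch in name[1:]:
        if ch in "hw":                 # h and w are transparent
            continue
        code = mapping.get(ch, "")     # vowels get "" and reset prev
        if code and code != prev:
            encoded.append(code)
        prev = code
    return (first + "".join(encoded) + "000")[:4]

print(soundex("Reuben"), soundex("Ruben"))  # R150 R150
print(soundex("Meza"), soundex("Mesa"))     # M200 M200
```

Note how the similar-sounding pairs from the Metaphone example above collide under Soundex too, even though their regular and n-gram fingerprints differ.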

Nevertheless, in this post, we’re only going to explore the fastest and most accurate method of the above: Metaphone.

Metaphone is a phonetic algorithm created by Lawrence Phillips in 1990 that indexes words by their English pronunciation. It smooths over minor spelling variants and inconsistent pronunciations to generate an encoding that groups similar words and names. The algorithm itself is pretty lengthy, consisting of a long list of comparisons and replacements; if you want to dig in, you can learn more about it at https://en.wikipedia.org/wiki/Metaphone.

In Optimus, you would apply string clustering using Metaphone like this:

As we can see, the Metaphone method keys on the way words are pronounced in English. This makes it very convenient when values contain many typos or phonetic misspellings.

Nearest-neighbor methods

A nearest-neighbor method takes a parameter that represents a distance threshold between strings; any pair of strings whose distance falls within that threshold will be grouped together.

For k-Nearest Neighbor (kNN) methods, Optimus implements Levenshtein distance. Let’s see how this works.

The Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions, or substitutions) needed to turn one word into the other. As an example, let’s look at the steps to transform the string “AABBCC” into “ABZ”. Let’s refer to “AABBCC” as Str1 and “ABZ” as Str2. The method proceeds as follows:

  1. First, delete one letter “A” from Str1 (from “AABBCC” to “ABBCC”).
  2. Next, delete one letter “B” from Str1 (from “ABBCC” to “ABCC”).
  3. Then, delete one letter “C” from Str1 (from “ABCC” to “ABC”).
  4. Finally, replace the last letter “C” in Str1 with the “Z” from Str2 (from “ABC” to “ABZ”).
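The distance itself is computed with the classic dynamic-programming algorithm; here is a minimal sketch (not Optimus’s code):

```python
def levenshtein(a, b):
    """Edit distance where insertions, deletions, and substitutions cost 1."""
    prev = list(range(len(b) + 1))   # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                    # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

print(levenshtein("AABBCC", "ABZ"))  # 4, matching the four steps above
```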

Optimus will calculate the Levenshtein distance between all the strings in the columns and select the ones with the shortest distance, as follows:

We got the same result as with the fingerprint method, using a different technique. The Levenshtein approach can be more accurate than fingerprinting because it compares the strings character by character instead of reducing them to a key, which leads to more precise groupings. As a rule of thumb: the fingerprint and n-gram fingerprint methods are fast but less accurate, Metaphone is fast and accurate (for English pronunciation), and Levenshtein distance is slower but accurate.

Now you know how to use Optimus to make a better quantitative analysis of your datasets with aggregations and filtering by grouping, and how to efficiently reduce the number of inconsistent values in your dataset with string clustering.

In the next post, we will explore how to handle outliers in your data and how to plot it.

See you next week!

Do you want to learn more about Optimus?

If you want a deep-dive explanation of the 100+ functions in Optimus, consider reading our book, available on Amazon: https://www.amazon.com/Data-Processing-Optimus-Supercharge-preparation-dp-1801079560/dp/1801079560/ref=mt_other?_encoding=UTF8&me=&qid=1628683187.
