Imputing Missing Values for Categorical Attributes, at Scale

AI@Compass · Published in Compass True North · Jan 4, 2022

Rushi Bhatt, Panos Ipeirotis, Bala Kishore, Sunil Patil

TL;DR: In this blog post, we discuss how to impute missing values at scale using Spark, CUBE, and frequent itemsets. We show how to “Spark-ify” the inference of functional dependencies and demonstrate how our approach leverages “for free” the parallelization and scalability inherent in the Spark computational framework.

One of the common issues in many databases is the large number of empty values.

While sometimes these values are genuinely unknown or uncertain, we can often infer them simply by looking at other entries in the database that contain the same information with more attributes filled in.

For example, in our transaction database, consider the following entries for real-estate agent activity:

In this case, all these entries correspond to the same agent, and it is possible to infer the missing values for email, phone, and company by looking at the other entries in our database.

The key idea is to extract rules (aka functional dependencies) from the data that have the following structure:

IF attr_A="a" AND attr_B="b" AND ... attr_X = "x"
THEN attr_Target ="t"

For example, in the agent profiles context, a rule may look like this:

IF email="xxx@yyy.com" AND "mobile"="555-555-5555"
THEN "license_num=007"

Of course, when we have hundreds of millions of such entries, we want to automate the process of inferring such rules, using techniques that scale both algorithmically and on our existing computational infrastructure, which relies heavily on Spark for data processing.

In this blog post, we discuss how to infer functional dependencies at scale to impute missing values, and we show how to “Spark-ify” the process so that we can leverage the parallelization and scalability inherent in the Spark computational framework.

Imputing Missing Values using Inferred Functional Dependencies

Rules, Confidence, and Probabilities

To calculate the rules, we need to calculate conditional probabilities of the form:

P(T=t | A="a", B="b", ..., X="x")

which is the probability that the target attribute T has the value t when the values of the other attributes are A="a", B="b", and so on. We call this value the confidence of the rule.

Following a pure frequentist approach, we calculate the confidence of the rule as:

confidence = count(A="a", B="b", ..., X="x", T=t) / count(A="a", B="b", ..., X="x")

The frequency of the full attribute-value combination in the numerator defines the support of the rule.

In our scenario, we want to extract rules with very high confidence (often requiring confidence to be 1.0) and reasonably high support to avoid extracting high-confidence but noisy rules.

In practice, though, when filling in missing values, we want to calculate the probability of the target variable having the value t, given A="a", B="b", etc., but also conditional on the target variable having a non-empty value. In other words, we want:

P(T=t | A="a", B="b", ..., X="x", T is not missing)

To calculate the confidence in the presence of missing values, we therefore need:

confidence = count(A="a", B="b", ..., X="x", T=t) / ( count(A="a", B="b", ..., X="x") - count(A="a", B="b", ..., X="x", T missing) )

Calculating Frequencies of Attribute-Value Combinations

As we can see above, the critical thing that we need to calculate is the frequency of attribute-value sets (aka itemsets). We can calculate frequencies of attribute combinations within Spark using the CUBE operator, which effectively does a GROUP BY A, B, C ... N across all possible 2^N attribute combinations.
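As a rough sketch (not the exact production query), the itemset computation could look like the Spark SQL below. We abbreviate to three attributes, assume the source table is called agent_profiles, and assume that missing values have already been recoded to the literal "N/A"; the derived itemset_size and missing_values columns are the ones used later when joining itemsets.

SELECT
    email, mobile, license_num,
    COUNT(*) AS count,
    -- number of attributes that participate in this itemset
    3 - GROUPING(email) - GROUPING(mobile) - GROUPING(license_num) AS itemset_size,
    -- number of participating attributes whose value is the "missing" marker
    (CASE WHEN email = 'N/A' THEN 1 ELSE 0 END +
     CASE WHEN mobile = 'N/A' THEN 1 ELSE 0 END +
     CASE WHEN license_num = 'N/A' THEN 1 ELSE 0 END) AS missing_values
FROM agent_profiles          -- hypothetical source table name
GROUP BY CUBE(email, mobile, license_num)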

This query is reasonably fast: in our database of almost 200 million records, it generates around 400 million itemsets (all combinations of attribute-value pairs that appear in the data) and runs in approximately 10 minutes. The output of this query is a dataframe called “itemsets” that looks like this:

Each row in this table is an “itemset,” which is a set of attribute-value pairs. For example:

  • email="klrw...@kw.com"
    license_num="2715100.."
    count = 9942

means that the combination of these two attribute-value pairs for email and license_num appears 9942 times in the original dataset.

Now, consider these two itemsets:

  • email="klrw...@kw.com"
    name="chris arms..."
    license_num="2715100.."
    count = 9822
  • email="klrw...@kw.com"
    name="N/A"
    license_num="2715100.."
    count = 68

In this case, we have itemsets with three elements (email, name, and license_num). In the second itemset, the value for the “name” attribute is missing. Using these two itemsets together with the earlier one, we can calculate the support and confidence of the following high-confidence rule:

IF email="klrw...@kw.com" AND license_num="2715100.."
THEN name="chris arms..."

(support 9822, confidence=9822/(9942-68)=99.47%)

The confidence value means that we can fill the empty value for name with the value chris arms... and be 99.47% confident that we are correct.

Let’s also look at a related itemset for the remaining 100%–99.47% = 0.53% of the confidence:

email="klrw...@kw.com"
name="fred arms..."
license_num="2715100.."
count = 52

So, the high-confidence rule that we inferred earlier can be used not only to “fill” the missing value in the 68 occurrences where the “name” attribute is missing, but also to hint that the name fred arms... is an alias for the name chris arms... when it appears in the context of email="klrw..@kw.com" and license_num="2715100..".

Efficient Calculation of Rule Confidence using Spark

Remember that the confidence of each rule is:

confidence = count(A="a", B="b", ..., T=t) / ( count(A="a", B="b", ...) - count(A="a", B="b", ..., T missing) )

In the itemset table, we have the different itemsets and their frequencies.

Remember, the itemset table has roughly 400 million rows.

To calculate the confidence, we need to pair up rows that agree on all attribute values except for one.

The following expression achieves the goal succinctly: it takes each possible equality condition for the attributes, casts the result to an integer (1 if true, 0 otherwise, with COALESCE mapping NULL comparisons to 0), and sums them up, requiring the sum to equal the number of items in the smaller itemset. Since itemsets cannot match on the “N/A” value, this condition guarantees that the matching itemsets have identical attribute-value pairs for exactly n-1 attributes.

-- count how many attributes the two itemsets agree on
COALESCE(CAST((T1.email = T2.email) AS int), 0) +
COALESCE(CAST((T1.mobile = T2.mobile) AS int), 0) +
COALESCE(CAST((T1.phone = T2.phone) AS int), 0) +
COALESCE(CAST((T1.contact_name = T2.contact_name) AS int), 0) +
COALESCE(CAST((T1.license = T2.license) AS int), 0) +
COALESCE(CAST((T1.mris_agent_id = T2.mris_agent_id) AS int), 0)
  = T1.itemset_size                          -- T1 matches on all of its attributes
AND T1.itemset_size = T2.itemset_size - 1    -- T2 has exactly one extra attribute
AND T1.missing_values = 0                    -- T1 contains no "N/A" values
AND T2.missing_values = 1                    -- T2 has exactly one "N/A" value (the target)

Query optimization? Sorry…

Unfortunately, this way of writing the join condition works very poorly in practice for even medium-sized datasets.

Spark cannot optimize a join condition that uses the COALESCE and CAST functions together with the summation condition. Spark treats the join condition as an opaque predicate, much as it would a UDF, and it ends up evaluating a cross-join on a table with almost half a billion rows. That is a no-go.
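One way to see this (sketched below with an abbreviated, three-attribute version of the condition and the itemsets table from above) is to ask Spark for the physical plan; with no usable equi-join keys, the plan typically falls back to a nested-loop or cartesian-style join:

EXPLAIN
SELECT T1.email, T2.email
FROM itemsets T1
JOIN itemsets T2
  ON COALESCE(CAST((T1.email = T2.email) AS int), 0)
   + COALESCE(CAST((T1.mobile = T2.mobile) AS int), 0)
   + COALESCE(CAST((T1.license_num = T2.license_num) AS int), 0)
   = T1.itemset_size
-- the physical plan shows a BroadcastNestedLoopJoin or CartesianProduct
-- rather than a SortMergeJoin on join keys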

Rewrite the Join Condition using Disjuncts

We can rewrite the condition to avoid using functions in the join predicate. We can generate a long disjunctive query like the one below, which requires matching on n-1 out of the n attributes of the itemset:

(
  (T1.email = T2.email AND T1.name = T2.name AND T1.license = T2.license)
  OR
  (T1.email = T2.email AND T1.name = T2.name AND T1.mobile = T2.mobile)
  OR
  ...
  OR
  (T1.name = T2.name AND T1.license = T2.license AND T1.mobile = T2.mobile)
)
AND T1.itemset_size = T2.itemset_size - 1
AND T1.missing_values = 0 AND T2.missing_values = 1

Unfortunately, even the big disjunctive query with the OR conditions is not something the optimizer can handle well. The query still takes ages to run.

How to Optimize?

One solution is to replace the single join that has multiple OR-separated conditions with a union of queries, where each query uses only a conjunction (AND conditions) as its join condition.

The trick above gets us 90% of the way to our goal, but we end up creating too many queries to combine with a union. In the end, we have 2^N queries that we need to merge using the UNION operator, with N being the number of attributes in our table.

Instead of creating 2^N queries, we can rely on “approximate” join conditions and create only N queries, one per attribute. Rather than building AND conditions on multiple attributes, we use one attribute at a time: we create one join on email alone, using the equijoin condition T1.email = T2.email, and repeat for every attribute, ending up with N joins, where N is the number of attributes. We then take the UNION ALL of these dataframes.

When joining on only a single attribute, the query returns many “false matches,” as it only guarantees a match on a single attribute, not on n-1 attributes as we wanted. However, we can now add back, as a non-join filter condition, the exact expression (with the COALESCE and CAST) that we used earlier; it is evaluated after the approximate, relaxed equijoin and removes the false matches.

This approach yields a query that Spark optimizes trivially, using standard equijoins. When the cardinality of the attributes is high (i.e., there are many different values), the joins run very fast and can be easily distributed among many workers.

(Note: You may still want to use combinations of attributes as the join condition when some attribute-value pairs have a high frequency. For example, when we used the attribute company, there were company names that appeared hundreds of thousands of times, so the JOIN ON company equijoin returned many matches, the majority of them false. In such cases, it is better to combine attributes that have high-frequency attribute-value pairs with additional attributes in the equijoin queries, or to skip such attributes altogether for the joins.)

So, the final join query ends up looking something like the following, which is a format that Spark can optimize easily:
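Here is a rough sketch of a single branch of that union, abbreviated to three attributes (the column aliases are illustrative; the actual query repeats the same pattern for each attribute and unions the results):

-- Approximate equijoin on a single attribute (email), followed by the exact
-- "match on all of T1's attributes" filter that removes the false matches.
SELECT
    T2.email, T2.mobile, T2.license_num,   -- the itemset containing the "N/A" target
    T1.count AS antecedent_count,          -- frequency of the size n-1 itemset
    T2.count AS count_missing              -- frequency of the itemset with the "N/A" target
FROM itemsets T1
JOIN itemsets T2
  ON T1.email = T2.email                   -- fast, distributable equijoin
WHERE COALESCE(CAST((T1.email = T2.email) AS int), 0)
    + COALESCE(CAST((T1.mobile = T2.mobile) AS int), 0)
    + COALESCE(CAST((T1.license_num = T2.license_num) AS int), 0)
    = T1.itemset_size                      -- T1 matches on all of its attributes
  AND T1.itemset_size = T2.itemset_size - 1
  AND T1.missing_values = 0
  AND T2.missing_values = 1
-- UNION ALL the analogous branches that join on mobile, on license_num, and so on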

In our database, with 200 million rows and 400 million itemsets, the complete process of calculating the confidence of all rules (without any limit on support or confidence) takes about 15 minutes.

Generated Rules

Using the techniques described above, we can compute support and confidence for all rules. In our dataset, we have about 80 million generated rules.

Of course, many of them have very low support or low confidence, so they are not helpful for our purposes. Below you can see a visualization that shows the number of rules (y-axis) that have certain levels of support (x-axis) and their respective confidence (color).

Storing and Applying Rules

We store our rules again as a table, which looks like this:

So, the line with the itemset:

email='darrell@...srealty.net'
mobile='86429533..'
license_num='111..'
target='license_num'
confidence=1.0
support=1043
count_missing=2565

represents the rule

IF email='darrell@...srealty.net' AND mobile='86429533..'
THEN license_num='111..'

and we know that the confidence of the rule is 100%, with support 1043, and with 2565 rows where the email and mobile attributes match but the license_num is null. (Remember, we can quickly get these statistics from the itemsets table.)

One of the advantages of storing the rules in a table is that applying them to the original dataset is just another join, like the one described above, with all the scalability advantages that this provides.
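For instance, applying the stored rules whose antecedent is the (email, mobile) pair to fill in missing license numbers might look roughly like the sketch below (the table names agent_profiles and rules and the agent_id column are illustrative; one such join is issued per antecedent shape):

SELECT
    a.agent_id,                        -- hypothetical row identifier
    r.license_num AS imputed_license_num
FROM agent_profiles a
JOIN rules r
  ON  r.target     = 'license_num'
  AND r.confidence = 1.0               -- only fully confident rules
  AND r.support    > 100               -- with strong support
  AND a.email      = r.email
  AND a.mobile     = r.mobile
WHERE a.license_num IS NULL            -- rows whose license number is missing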

Results

In our initial effort, our goal was to fill in missing values, using only highly confident rules (i.e., 100% confidence in the training set) that have strong support (more than 100 supporting tuples). How did it work?

  • 20M filled-in emails out of 50M missing
  • 5M filled-in phones out of 30M missing
  • 52M filled-in mobile phones out of 100M missing
  • 2M filled-in contact names out of 5M missing
  • 32M filled-in license numbers out of 90M missing

Next Steps

Of course, this is just the easy first step. The complete process involves clustering together entries that correspond to the same real-world entity (for example, the case of “Chris Arms…” and “Fred Arms…” mentioned earlier), identifying erroneous entries, and many other tasks, which we will address in future blog posts.

The AI@Compass Team is building the AI to support the first end-to-end tech platform for real estate. This is our story.