Conquer Your Data Beast: Mastering Data Silos with Databricks ARC and Splink🏆

Sunny Sharma
Published in Mphasis Datalytyx · 9 min read · Mar 26, 2024

Following on from my last blog post (Raiders of the Last Data Silos: Unleashing the Power of Databricks Lakehouse Federation 🚀 | by Sunny Sharma | Mphasis Datalytyx | Feb, 2024 | Medium), this article will take the next step to tame the data beast!

Do you ever feel like your data is a tangled mess, a monstrous beast refusing to be tamed? Information sprawls across databases, applications, and the cloud, making it a struggle to extract valuable insights. But fear not, weary data warriors, for two powerful tools stand ready to vanquish your data wrangling challenges: Databricks ARC and Splink!

Spotlight on Databricks ARC💡

Our first stop is Databricks ARC (Automated Record Connector), a solution accelerator that’s like a Swiss Army knife for data linking. It connects data from different sources, democratizes data, automates data linking, generates golden records, improves data quality, identifies missing data, and performs probabilistic data de-duplication. In essence, ARC is your one-stop-shop for all things data linking.

Further info here: Automated Record Linking | Databricks

Getting to Know Splink🏛️

Next, we meet Splink, an open-source library developed by the Ministry of Justice (MoJ) for probabilistic record linkage at scale. Splink is like the Usain Bolt of data linking — it’s fast, accurate, and scalable. It can link a million records on a laptop in around a minute. Plus, it’s open source, so you can peek under the hood and see how it works.

Further info here: moj-analytical-services/splink: Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends (github.com)

Databricks ARC: Your Data Matching Sherpa🗺️

Imagine Databricks ARC as your friendly guide through the data jungle. It automates the nitty-gritty tasks that Splink demands, leaving you free to focus on the bigger picture. Think of it as the “easy button” for data deduplication & data matching, streamlining the process and getting you to valuable insights faster.

How Databricks ARC Accelerates Data Quality Initiatives🏎️

One of the key strengths of Databricks ARC is its ability to accelerate data quality initiatives. But how does it do this? Let’s dive in!

  1. Automated Data Linking: ARC automates the process of data/record linking, which is a crucial step in improving data quality. By automating this process, ARC reduces the time and effort required to link data, thereby accelerating data quality initiatives.
  2. Machine Learning Optimization: ARC uses Hyperopt, a powerful tool for tuning machine learning models, to optimize the choice of similarity or distance algorithm. This ensures that the best algorithm is used for data linking, thereby improving the quality of the linked data.
  3. Integration with Databricks and Spark: ARC leverages the distributed computing capabilities of Spark to efficiently search the hyperparameter space. This allows ARC to quickly and efficiently improve data quality across large datasets.
  4. Model Tracking with MLflow: ARC uses MLflow for automatic tracking of the search results. This allows ARC to keep track of which algorithms and hyperparameters yield the best results, thereby continuously improving data quality.
  5. Relationship to Splink: ARC’s linking engine is the UK Ministry of Justice’s open-sourced entity resolution package, Splink. It builds on the technology of Splink by removing the need to manually provide parameters to calibrate an unsupervised de-duplication task, which requires both a deep understanding of entity resolution and good knowledge of the dataset itself.

Databricks ARC uses a combination of automation, machine learning optimization, and integration with powerful tools like Databricks, Spark, and MLflow to accelerate data quality initiatives.

Here’s a code example using Databricks ARC to deduplicate a dataset:


import arc
from arc.autolinker import AutoLinker

# Enable ARC - not needed in DBR 12.2 LTS ML but just in case!
arc.sql.enable_arc()

# Load your dataframe
df = spark.read.table("my_db.table")

# Create an AutoLinker instance
autolinker = AutoLinker()

# Link the data to find near-duplicates
linked_df = autolinker.auto_link(data=df)

# Explore the linked dataframe
linked_df.show()

Here’s a code example using Databricks ARC to deduplicate two datasets and then link them:

import arc
from arc.autolinker import AutoLinker

# Enable ARC - not needed in DBR 12.2 LTS ML but just in case!
arc.sql.enable_arc()

# Load two Spark DataFrames of data to be deduplicated
data1 = spark.read.table("my_catalog.my_schema.my_duplicated_data1")
data2 = spark.read.table("my_catalog.my_schema.my_duplicated_data2")

# Create an AutoLinker object
autolinker = AutoLinker()

# De-duplicate the first dataset
deduplicated_data1 = autolinker.auto_link(data1)

# De-duplicate the second dataset
deduplicated_data2 = autolinker.auto_link(data2)

# Link the de-duplicated data
linked_data = autolinker.auto_link(deduplicated_data1, deduplicated_data2)

# Explore the linked dataframe
linked_data.show()

Here’s a simple code example using Splink and DuckDB to deduplicate a dataset:

from splink.duckdb.linker import DuckDBLinker
import splink.duckdb.comparison_library as cl
import splink.duckdb.comparison_template_library as ctl
from splink.duckdb.blocking_rule_library import block_on
from splink.datasets import splink_datasets

df = splink_datasets.fake_1000

settings = {
    # De-duplicate within a single dataset (as opposed to linking two datasets)
    "link_type": "dedupe_only",
    # Only generate candidate pairs that agree on first name or on surname
    "blocking_rules_to_generate_predictions": [
        block_on("first_name"),
        block_on("surname"),
    ],
    # How each column should be compared and scored
    "comparisons": [
        ctl.name_comparison("first_name"),
        ctl.name_comparison("surname"),
        ctl.date_comparison("dob", cast_strings_to_date=True),
        cl.exact_match("city", term_frequency_adjustments=True),
        ctl.email_comparison("email", include_username_fuzzy_level=False),
    ],
}

linker = DuckDBLinker(df, settings)

# Estimate the u probabilities (how often non-matching records agree by chance)
linker.estimate_u_using_random_sampling(max_pairs=1e6)

# Estimate the m probabilities with expectation maximisation,
# training on two different blocking rules
blocking_rule_for_training = block_on(["first_name", "surname"])
linker.estimate_parameters_using_expectation_maximisation(blocking_rule_for_training, estimate_without_term_frequencies=True)

blocking_rule_for_training = block_on("dob")
linker.estimate_parameters_using_expectation_maximisation(blocking_rule_for_training, estimate_without_term_frequencies=True)

# Score every candidate pair, then cluster pairs above a 0.95 match probability
pairwise_predictions = linker.predict()

clusters = linker.cluster_pairwise_predictions_at_threshold(pairwise_predictions, 0.95)
clusters.as_pandas_dataframe(limit=5)

If you decide to use either Databricks ARC or Splink then you should become familiar with Jaccard, Jaro-Winkler, and Levenshtein comparisons:

Jaccard Similarity:

  • Use Jaccard similarity when you’re interested in the proportion of shared features between two sets. It’s ideal for situations where the order of elements doesn’t matter.

Common applications:

  • Comparing sets of keywords or tags
  • Identifying duplicate documents with similar content
  • Recommending products based on user purchase history

Jaro-Winkler Similarity:

  • Use Jaro-Winkler similarity when you want to compare short strings like names, product titles, or addresses that might have typos or minor variations. It considers character transpositions (swaps) and gives more weight to matches at the beginning of the strings.

Common applications:

  • De-duplicating customer records with slight variations in names
  • Fuzzy matching in search engines to handle typos
  • Identifying similar product names across different vendors

Levenshtein Distance:

  • Use Levenshtein distance when you need to know the minimum number of edits (insertions, deletions, substitutions) required to transform one string into another. It’s suitable for comparing strings of any length, but doesn’t inherently account for typos or transpositions.

Common applications:

  • Spell checking and autocorrection
  • Identifying similar code snippets with minor changes
  • Comparing DNA sequences for genetic analysis

Choosing the right method:

  • Consider the type of data you’re comparing: sets of elements, short strings, or longer sequences.
  • If order matters, Jaccard similarity might not be suitable.
  • If you expect typos or minor variations, Jaro-Winkler or Levenshtein might be better choices.
  • Jaro-Winkler is generally faster than Levenshtein for short strings.
  • Experiment and compare different methods to see which one performs best for your specific use case.
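
To make those trade-offs concrete, here is a minimal, self-contained Python sketch of the three measures. These are illustrative pure-Python implementations for building intuition only; Splink and ARC generate their own backend-specific (SQL) versions of these comparisons:

def jaccard_similarity(a, b):
    """Proportion of shared elements between two sets (order ignored)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def levenshtein_distance(s1, s2):
    """Minimum number of insertions, deletions and substitutions."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def jaro_similarity(s1, s2):
    """Jaro similarity: rewards common characters, penalises transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matches1, matches2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(i + window + 1, len2)):
            if not matches2[j] and s2[j] == c:
                matches1[i] = matches2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    transpositions, k = 0, 0
    for i in range(len1):
        if matches1[i]:
            while not matches2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    transpositions /= 2
    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3

def jaro_winkler_similarity(s1, s2, scaling=0.1, max_prefix=4):
    """Jaro score, boosted when strings agree on their first few characters."""
    jaro = jaro_similarity(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2 or prefix == max_prefix:
            break
        prefix += 1
    return jaro + prefix * scaling * (1 - jaro)

print(jaccard_similarity({"data", "quality", "linking"}, {"data", "linking", "splink"}))  # 0.5
print(levenshtein_distance("jonathan", "jonothan"))                                       # 1
print(round(jaro_winkler_similarity("martha", "marhta"), 3))                              # 0.961

Splink's comparison and comparison-template libraries (as used in the DuckDB example above) wrap these kinds of measures with sensible levels and thresholds, and ARC then searches over them for you.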

As I hope you now realize, Databricks ARC does the experimentation and comparisons for you and logs the runs in MLflow!

It does this by leveraging Hyperopt, a powerful tool for tuning machine learning models, to optimize the choice of similarity or distance algorithm.

Here’s how it works:

1 — Hyperparameter Optimization: Hyperopt is used to optimize a model’s accuracy over a space of hyperparameters. In the context of ARC, these hyperparameters could include the choice of similarity or distance algorithm and their specific parameters.

2 — Bayesian Optimization: Hyperopt uses Bayesian optimization, meaning it intelligently learns which combinations of values work well as it goes, and focuses the search there. This allows ARC to efficiently find the best similarity or distance algorithm for a given task.

3 — Integration with Databricks and Spark: Hyperopt integrates with Apache Spark for parallel hyperparameter search. This allows ARC to leverage the distributed computing capabilities of Spark to efficiently search the hyperparameter space.

4 — Model Tracking with MLflow: Hyperopt also integrates with MLflow for automatic tracking of the search results. This allows ARC to keep track of which algorithms and hyperparameters yield the best results.

Databricks ARC uses Hyperopt to efficiently and intelligently choose the best similarity or distance algorithm for a given task.
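
To make the mechanics concrete, here is a simplified, hypothetical sketch of the Hyperopt pattern ARC builds on: define a search space, define an objective to minimise, and let the TPE algorithm explore it. The search space and the evaluate_linkage_quality helper below are illustrative assumptions, not ARC's actual internals; on Databricks you could swap Trials for SparkTrials to distribute the search and have runs logged to MLflow.

import random
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK

def evaluate_linkage_quality(comparison, threshold):
    # Hypothetical stand-in for a real evaluation (e.g. F1 against labelled pairs);
    # random here purely so the sketch runs end to end.
    return random.random()

# Hypothetical search space: which string comparison to use and how strict to be.
# ARC's real search space (blocking rules, comparison levels, etc.) is richer.
search_space = {
    "comparison": hp.choice("comparison", ["jaccard", "jaro_winkler", "levenshtein"]),
    "threshold": hp.uniform("threshold", 0.7, 0.99),
}

def objective(params):
    score = evaluate_linkage_quality(params["comparison"], params["threshold"])
    return {"loss": -score, "status": STATUS_OK}  # Hyperopt minimises the loss

trials = Trials()  # on Databricks, SparkTrials() distributes trials across the cluster
best = fmin(
    fn=objective,
    space=search_space,
    algo=tpe.suggest,  # Tree-structured Parzen Estimator: the Bayesian-style search
    max_evals=50,
    trials=trials,
)
print(best)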

In the Splink and DuckDB example you will have seen references to Blocking Rules.

What are these Blocking Rules and how does Databricks ARC use them?

Databricks ARC uses blocking rules to determine the possible comparison space and reduce the number of pairs to compare. This is a crucial step in the process of data de-duplication and linking.

In traditional systems, users provide SQL-like statements to determine the blocking rules. However, ARC simplifies this process. Instead of requiring the user to provide these statements, the user only needs to provide a parameter for the maximum number of pairs they’re willing to compare.

ARC then identifies all possible blocking rules within that boundary and optimizes for the best one. This approach removes the burden and complexity of choosing blocking rules from the user and automates the process, identifying the best set of parameters tuned to your dataset.

This makes ARC a powerful tool for probabilistic data de-duplication and linking, as it removes the need for any labelled data or subject matter expertise in entity resolution. It’s part of Databricks’ efforts to provide simple, automated, and scalable solutions for data management.
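
As a small illustrative sketch: in Splink, a blocking rule is a SQL-like condition a candidate pair must satisfy before it is compared at all, while in ARC you only supply a cap on the number of candidate pairs (the comparison_size_limit parameter name follows the ARC documentation, but treat it as an assumption and check the current API). The autolinker and df below are reused from the ARC example earlier in this article.

from splink.duckdb.blocking_rule_library import block_on

# Splink: a blocking rule is a SQL-like predicate over a candidate pair (aliased l and r).
# block_on("surname") generates "l.surname = r.surname", i.e. only records that share
# a surname are ever compared.
rule = block_on("surname")

# ARC: no rules to write. You cap the comparison space and ARC searches for blocking
# rules that keep the number of candidate pairs within that budget.
autolinker.auto_link(
    data=df,
    comparison_size_limit=500_000,  # assumed parameter name: max candidate pairs ARC may generate
)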

OK, let’s say I convinced you to investigate Databricks ARC and Splink further. But how are decisions made and how do you see them?

Let’s inspect how the ML model is making its decisions.

The match_weights_chart() shows how much importance ARC is putting on each comparison within each column. Immediately this lets you see which attributes are impacting your model most — dark green bars are comparisons ARC is using as strong signs of a match. Dark red bars are the opposite — strong signs of not a match.

autolinker.match_weights_chart()

We can also inspect individual records’ matching decisions.

These waterfall charts give a granular breakdown of how ARC is making its decisions.

records_to_view = autolinker.best_predictions_df.as_record_dict(limit=100)
autolinker.best_linker.waterfall_chart(records_to_view, filter_nulls=False)

Here’s an example of a good match:

Here’s an example of a non-match:

Here’s an example that would need further manual investigation:

Accessing the underlying Splink object from ARC

The above charts highlight two of Splink’s visualization features. Splink is a richly featured tool with a lot of power and customizability. It is possible to access this directly through the autolinker.best_linker attribute.

This enables a workflow whereby ARC is used to jumpstart a deduplication project — use ARC to build a baseline model through automation, and then manually iterate and improve, increasing your efficiency by using machines to do the initial, easier work.

linker = autolinker.best_linker
linker.m_u_parameters_chart()
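
From there, the full Splink API is available for manual iteration. Here is a brief sketch of what that follow-on work might look like, using standard Splink 3 methods, shown as one possible workflow rather than a prescribed ARC step:

# Continuing from linker = autolinker.best_linker above

# Re-score candidate pairs with the trained model
predictions = linker.predict()

# Chart records that have no plausible match anywhere in the data
linker.unlinkables_chart()

# Persist the trained model so it can be reloaded and refined later
linker.save_model_to_json("splink_model.json", overwrite=True)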

In conclusion, Databricks ARC and Splink make a formidable team. The choice between either tool depends on where you are in your data journey and which data platform you are using.

Databricks ARC vs. Splink: A Summary Table⚖️

I would like to thank Robin Linacre (Robin Linacre | LinkedIn) and the MOJ Team for creating Splink. I am truly standing on the shoulders of giants here. IMHO Splink is better, faster, and cheaper (can you beat free?) than some other commercial products I have seen in the past! Please take a look at Robin Linacre’s blog for even more detailed information.

Also, thanks to Robert Whiffin (Robert Whiffin | LinkedIn) and the Databricks ARC team for making me aware of Splink in the first place and making it easier to use with Databricks.

Please note that the opinions above are the author’s own and not necessarily those of my current employer. This blog article is intended to generate discussion and dialogue with the audience. If I have inadvertently hurt your feelings in any way, then I’m sorry.
