Cobbling Together a User ID From Multiple Sources

How We Unify User Data Across Multiple Platforms

Charlotte Meng
Dandy Engineering, Product & Data Blog
6 min read · Dec 9, 2022


With all the various SaaS solutions emerging in today’s market, many businesses find themselves juggling 10+ tools in their tech stacks. At Dandy, our customer experience (CX) department is spread across four different platforms (Paylocity, Playvox, Kustomer, and Dialpad), all of which ingest A LOT of data. Ideally, we’d like to leverage this data to answer questions like “How many people should we staff at which hours of the day?” or “How productive is each person from week to week?”

But before we can run analyses on this data, we need a way to unify all the user data generated by these services. People’s names, emails, and even IDs change over time, making it difficult to keep track of entities spread across different platforms. This presents us with an interesting data problem: how do we leverage user data from four different data streams and tie them all to a unifying, universal ID?

The Data We Have

Depending on the platform, we have user data in one of two formats:

  1. Event logs

A fact table that logs every event triggered by users:
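
For illustration, here’s a hypothetical slice of an event log (the event_type column and all values are made up; the other columns match the record_log query below):

user_id | user_name | user_email          | timestamp           | event_type
u_001   | Sid Sloth | sid.sloth@dandy.com | 2022-01-03 09:02:11 | ticket_opened
u_001   | Sid Sloth | sid.sloth@dandy.com | 2022-01-03 09:15:47 | ticket_closed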

  2. User tables

A dimension table that keeps one record for every user, giving us information such as when the record was created and when it was updated:
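
Again for illustration, a hypothetical user table row might look like this:

user_id | user_name | email               | created_at          | updated_at
u_001   | Sid Sloth | sid.sloth@dandy.com | 2021-11-01 08:00:00 | 2022-01-03 09:02:11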

Prepping the Base Data

Prepping Event Logs

To get this table into a state suitable for our purposes, we need to turn it into a historical snapshot. So, we run the following query:

WITH platform AS (
    -- one CTE like this for each platform that gives us event logs
    SELECT
        user_id
        , user_name
        , user_email AS email  -- aliased so every platform can later be joined on email
        , MIN(timestamp) AS updated_at  -- when this (id, name, email) combination first appeared
    FROM record_log
    GROUP BY 1, 2, 3
)

This gives us one record for every distinct combination of user_id, user_name, and user_email, along with the timestamp at which that combination was first seen. One tricky thing to note here is that we want the minimum and not the maximum timestamp, because we want to identify when the changed information first appeared. With that, we’ve constructed a historical snapshot of sorts from the event log.
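
For instance, run over the sample event log above, and supposing Sid later switched to a new email address, the prepped output would hold one row per distinct combination, stamped with when it was first seen (values are again illustrative):

user_id | user_name | email                | updated_at
u_001   | Sid Sloth | sid.sloth@dandy.com  | 2022-01-03 09:02:11
u_001   | Sid Sloth | s.slothman@dandy.com | 2022-06-15 10:30:00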

Prepping User Tables

User tables are fine as is, though on their own they miss any changes that happen to them over time. DBT easily enables us to snapshot changes to our tables, so we’ll be using that later on to capture changes to the final table.

Unifying Disparate Sources

The first problem we face when creating a unified table is deciding which field to join everything together on. Since our users are required to sign in to each service with their company email address, email is the natural join key. But various life circumstances lead to things like names and email addresses changing; we’ll handle edge cases like email changes in the next section.

, everything_together AS (
    SELECT
        COALESCE(paylocity.email, kustomer.email, dialpad.email, playvox.email) AS email
        , paylocity.user_id AS paylocity__user_id
        , paylocity.user_name AS paylocity__user_name
        , kustomer.user_id AS kustomer__user_id
        , kustomer.user_name AS kustomer__user_name
        , dialpad.user_id AS dialpad__user_id
        , dialpad.user_name AS dialpad__user_name
        , playvox.user_id AS playvox__user_id
        , playvox.user_name AS playvox__user_name
        -- pass the per-platform timestamps through for the deduping logic later
        , paylocity.updated_at AS paylocity_updated_at
        , kustomer.updated_at AS kustomer_updated_at
        , dialpad.updated_at AS dialpad_updated_at
        , playvox.updated_at AS playvox_updated_at
        -- the earliest timestamp we have ever seen for this email, on any platform
        , LEAST(
            COALESCE(dialpad.created_at, '3000-01-01'::TIMESTAMP_TZ)
            , COALESCE(kustomer.created_at, '3000-01-01'::TIMESTAMP_TZ)
            , COALESCE(paylocity.updated_at, '3000-01-01'::TIMESTAMP_TZ)
            , COALESCE(dialpad.updated_at, '3000-01-01'::TIMESTAMP_TZ)
            , COALESCE(playvox.updated_at, '3000-01-01'::TIMESTAMP_TZ)
        ) AS aggregate_created_at
    FROM paylocity
    FULL OUTER JOIN kustomer USING (email)
    FULL OUTER JOIN dialpad USING (email)
    FULL OUTER JOIN playvox USING (email)
)

Not every user exists on every platform, so we utilize a full outer join to bring all these tables together.

We also built an “aggregate created at” date in this query. Some rows will have nulls for almost every created_at/updated_at field, and that’s okay. Since LEAST returns NULL if any of its arguments is NULL, we coalesce each null date field to a far-future sentinel of January 1st, 3000, which any real timestamp then wins out against.
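
Here’s a quick sketch (with made-up dates) of why the sentinel matters; without the COALESCE, a single missing platform would null out the whole expression:

SELECT
    LEAST(COALESCE(NULL::TIMESTAMP_TZ, '3000-01-01'::TIMESTAMP_TZ), '2022-05-02'::TIMESTAMP_TZ) AS with_sentinel  -- 2022-05-02
    , LEAST(NULL::TIMESTAMP_TZ, '2022-05-02'::TIMESTAMP_TZ) AS without_sentinel  -- NULL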

Creating a Temporary ID

Because the creation date of a user is frozen in time, it gives us an immutable basis for a universal ID.

Let’s ignore for now the possibility that the everything_together table contains redundant records for the same user with different aggregate_created_at timestamps. We’ll handle this with deduping logic later on.

To create the universal_id, we first create a temp_id by running a row_number() on every record, ordered by the aggregate_created_at timestamp.

, generate_uuid AS (
    SELECT
        *
        , ROW_NUMBER() OVER (ORDER BY aggregate_created_at ASC) AS temp_id
    FROM everything_together
)

Another approach is to run dense_rank() or rank() instead of row_number(), which is a valid decision if you’re faced with new records inserting above older ones as new user data flows in: row_number() breaks ties on aggregate_created_at arbitrarily, so the numbers it assigns aren’t stable between runs. Because this is the case with our data at Dandy, I found that the solution was to use dense_rank() over the aggregate_created_at timestamp instead.
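
If you go that route, it’s a one-line change to the CTE above:

, generate_uuid AS (
    SELECT
        *
        -- ties on aggregate_created_at now share an ID instead of being numbered arbitrarily
        , DENSE_RANK() OVER (ORDER BY aggregate_created_at ASC) AS temp_id
    FROM everything_together
)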

Linking All Lookalike Users Back to ONE Universal ID

To link all user records of the same person to the same Universal ID, we employ the following strategy:

  1. Grab the minimum temp_id across all users with the same Kustomer/Paylocity/Dialpad/Playvox ID. We call this dedupe_id_1.
  2. Grab the minimum dedupe_id_1 across all users with the same Kustomer/Paylocity/Dialpad/Playvox name. We call this dedupe_id_2.
  3. Filter to just the newest record across all records with the same dedupe_id_1 or dedupe_id_2, and re-alias dedupe_id_2 as universal_id:

, find_lookalike_id AS (
    SELECT
        *
        , COALESCE(
            kustomer__user_id
            , dialpad__user_id
            , playvox__user_id
            , paylocity__user_id
        ) AS aggregate_id
        -- smallest temp_id among all records sharing this platform ID
        , MIN(temp_id) OVER (PARTITION BY aggregate_id) AS dedupe_id_1
    FROM generate_uuid
)

, find_lookalike_name AS (
    SELECT
        *
        , COALESCE(kustomer__user_name, playvox__user_name, dialpad__user_name, paylocity__user_name) AS aggregate_name
        -- smallest dedupe_id_1 among all records sharing this name
        , MIN(dedupe_id_1) OVER (PARTITION BY aggregate_name) AS dedupe_id_2
    FROM find_lookalike_id
)

, final AS (
    SELECT
        *
        , dedupe_id_2 AS universal_id
        -- the most recent timestamp we have seen for this record, on any platform
        , GREATEST(
            COALESCE(kustomer_updated_at, '1900-01-01'::TIMESTAMP_TZ)
            , COALESCE(dialpad_updated_at, '1900-01-01'::TIMESTAMP_TZ)
            , COALESCE(paylocity_updated_at, '1900-01-01'::TIMESTAMP_TZ)
            , COALESCE(playvox_updated_at, '1900-01-01'::TIMESTAMP_TZ)
        ) AS aggregate_updated_at
    FROM find_lookalike_name
    -- keep only the newest record for each universal_id
    QUALIFY 1 = ROW_NUMBER() OVER (
        PARTITION BY dedupe_id_2
        ORDER BY aggregate_updated_at DESC
            , kustomer_updated_at DESC
            , dialpad_updated_at DESC
            , paylocity_updated_at DESC
            , playvox_updated_at DESC
    )
)
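
To make the linking concrete, here’s a hypothetical pair of records for a user whose email change the join couldn’t merge. They share a Kustomer ID, so dedupe_id_1 pulls them together, both rows land on the same universal_id, and the QUALIFY then keeps only the newer one:

temp_id | email                | kustomer__user_id | aggregate_name | dedupe_id_1 | universal_id
4       | sid.sloth@dandy.com  | k_123             | Sid Sloth      | 4           | 4
9       | s.slothman@dandy.com | k_123             | Sid Sloth-Man  | 4           | 4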

And that is how we create a universal ID that’s immune to changes in user IDs, names, and email addresses.

DBT Snapshotting for a True Historical Record

The last step is to implement historical snapshotting of this final table, so that we have a true historical record of every change in ID, name, or email address for every user.

In DBT, we do this by creating a snapshot file in the snapshots folder of our DBT repository and inserting the following code snippet (snapshots also need a target_schema configured; here we assume one named snapshots):

{% snapshot snapshot_universal_user %}

{{
    config(
        target_schema='snapshots',
        strategy='check',
        unique_key='universal_id',
        check_cols=[
            'email'
            , 'kustomer__user_id'
            , 'playvox__user_id'
            , 'dialpad__user_id'
            , 'paylocity__user_id'
            , 'kustomer__user_name'
            , 'playvox__user_name'
            , 'dialpad__user_name'
            , 'paylocity__user_name'
        ]
    )
}}

SELECT * FROM {{ ref('universal_user_table') }}

{% endsnapshot %}

What makes this snapshot work is the Universal ID. By making it the unique key, we can reliably track changes to all the other IDs, names, and emails. In keeping a historical record, we are able to unify historical user data with current user data, so we have a way of knowing that Sid Sloth’s records are the same as Sid Sloth-Man’s records too. This snapshotting will also capture changes to the dimensional User Tables, making it redundant to snapshot those tables before ingesting them into this model.
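
As an illustration, once Sid’s changes flow through, the snapshot table might hold rows like these (dbt adds the dbt_valid_from/dbt_valid_to columns; the values are made up):

universal_id | kustomer__user_name | email                | dbt_valid_from | dbt_valid_to
4            | Sid Sloth           | sid.sloth@dandy.com  | 2022-01-03     | 2022-06-15
4            | Sid Sloth-Man       | s.slothman@dandy.com | 2022-06-15     | NULL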

This approach is certainly not foolproof, and you may need to bake in extra code to handle edge cases that arise in the data you work with. The code I’ve presented here is not meant to be an out-of-the-box solution but rather a conceptual template. Hopefully, it will go on to provide the foundation for a universal user ID that empowers your business to analyze user-related data from different sources.

Many thanks to Emily Brantner, Michael Suey, Mirte Kraaijkamp, and Jeff for their thorough feedback during the writing process of this post.
