Marketing Attribution Data Model for a B2B Company

Vladimir Kobzev
7 min read · Apr 1, 2024


My previous Medium post covers what marketing attribution is, what is unique about B2B, and how it might be helpful.

Here is the big picture at maximum complexity, for the case where you use more than one web analytics event source simultaneously: GA4 and Mixpanel, additional custom spreadsheets (for outbound, offline conferences, and other input), and 20+ paid ad channels.

Marketing Attribution Data Transformation Lineage

Contents:

  • Events data extraction
  • Mapping events to the Revenue Big Table
  • Events from entities
  • Matching sales|revenue data from the CRM
  • Attribution metrics calculation

Data Extraction

Google Analytics.

By the way: if you use GA4, please don’t forget to set up the user ID. Guide

I like to build attribution on ClickHouse. If you do too, here is the type casting you need for this extraction, and here is the code for preparing the data to send to ClickHouse. Alternatively, you can use Airbyte as a connector to BigQuery, or go via GCS.
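To make the casting concrete, here is a minimal sketch of loading the GA4 BigQuery export into ClickHouse. The column names follow the GA4 export schema; the target and staging tables are hypothetical examples.

```sql
-- Illustrative ClickHouse target for raw GA4 events (names follow the
-- GA4 BigQuery export schema; adjust to your pipeline).
CREATE TABLE ga4_events
(
    event_date      Date,
    event_timestamp DateTime64(6, 'UTC'),   -- GA4 exports microseconds since epoch
    event_name      LowCardinality(String),
    user_id         String,                 -- empty when the user ID isn't set
    user_pseudo_id  String
)
ENGINE = MergeTree
ORDER BY (event_date, event_timestamp);

-- The casts: GA4 exports event_date as 'YYYYMMDD' and event_timestamp
-- as Int64 microseconds, so both need converting on the way in.
INSERT INTO ga4_events
SELECT
    toDate(concat(substring(event_date, 1, 4), '-',
                  substring(event_date, 5, 2), '-',
                  substring(event_date, 7, 2)))  AS event_date,
    fromUnixTimestamp64Micro(event_timestamp)    AS event_timestamp,
    event_name,
    ifNull(user_id, '')                          AS user_id,
    user_pseudo_id
FROM ga4_events_staging;  -- hypothetical staging table with the raw export
```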

Mixpanel

Use the Raw Events Export endpoint. You can export it to BigQuery or ClickHouse.
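For illustration, here is a minimal Node sketch of pulling one day of raw events from that endpoint (endpoint and auth follow Mixpanel’s raw export docs; note that EU projects use data-eu.mixpanel.com instead).

```javascript
// Minimal sketch: pull one day of raw events from Mixpanel's export API.
// Node 18+, no extra dependencies. The API secret comes from the environment.
const API_SECRET = process.env.MIXPANEL_API_SECRET;

async function exportDay(date) {
  const url = `https://data.mixpanel.com/api/2.0/export?from_date=${date}&to_date=${date}`;
  const res = await fetch(url, {
    headers: {
      Accept: 'text/plain',
      Authorization: 'Basic ' + Buffer.from(API_SECRET + ':').toString('base64'),
    },
  });
  if (!res.ok) throw new Error(`Mixpanel export failed: ${res.status}`);
  // The endpoint returns newline-delimited JSON, one event per line.
  const body = await res.text();
  return body.trim().split('\n').filter(Boolean).map(JSON.parse);
}

exportDay('2024-04-01').then((events) => console.log(events.length, 'events'));
```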

Offline|Sales Events Extraction

For Salesforce, HubSpot, Zoho CRM, NetSuite, or whatever other CRM you use:

Before moving on to mapping events into One Big Table, I need to explain why it is so cool and how it saves hours, days, weeks, and maybe years of engineering.

Introducing the “Revenue Big Table”: the Batman to fight the marketing attribution nightmare.

Challenge. Let me introduce a classic data engineering nightmare: “The sales team decided to add a field in Salesforce that needs to be used in the output calculations.” That’s it. Passing that field from the beginning to the end of the pipeline causes an enormous amount of work.

Solution. Enter the genius idea of the Activity Schema. I think “Revenue Big Table” is a good name for it in the attribution use case. Example rows:

Revenue Big Table: a few example rows

Here is the data contract for the Revenue Big Table structure:

Event Id

Data quality: ensure that the Event ID is unique; many downstream joins and deduplications rely on it. Make this test a priority.

Activity. The field that defines business logic.

Every activity type usually has its own typical features. For example, “Web View” carries source info, and “Sales Qualified” carries an “opportunity amount.”

Data quality monitoring practice. It’s good practice to keep an explicit list of “expected activities” and alert when a new name appears. But please don’t restrict or filter out unknown names: that reduces QA possibilities a lot. Just imagine how useful it is to discover that the data is incorrect because GA4 changed its spec or the sales team renamed a stage.

Prospect ID.

Standard options: domain, email, CRM_account_id, or ga4_user_id.

It might also be a more complicated synthetic ID: for example, a hash with fallback logic such as “if there is a CRM_account_id, use it; if not, use ga4_user_id.”
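As a sketch, that fallback can be a single ClickHouse expression (column and table names are illustrative):

```sql
-- Synthetic prospect ID: prefer the CRM account id, fall back to the GA4 user id.
SELECT
    toString(cityHash64(coalesce(nullIf(crm_account_id, ''), ga4_user_id))) AS prospect_id
FROM events_staging;  -- hypothetical staging table
```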

Data quality: there should be no rows with an empty Prospect ID.

Features

The best structure for features is a Map(key, value).

ClickHouse has it, but BigQuery doesn’t; BigQuery has arrays of structs instead. I wrote simple code for a dbt function (macro) that accesses a feature by key. I decided to avoid working with JSON here: because of global climate change, AI asked me to reduce string parsing as much as reasonably possible.
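As a sketch of such a macro, assuming features are stored as Map(String, String) in ClickHouse and as ARRAY<STRUCT<key STRING, value STRING>> in BigQuery (the macro name is illustrative):

```sql
-- dbt macro: renders an expression that reads one feature by key,
-- dispatching on the warehouse adapter.
{% macro feature(column, key) %}
    {% if target.type == 'clickhouse' %}
        {{ column }}['{{ key }}']
    {% else %}
        (select f.value from unnest({{ column }}) as f where f.key = '{{ key }}' limit 1)
    {% endif %}
{% endmacro %}

-- Usage in a model:
-- select {{ feature('features', 'opportunity_amount_usd') }} as opportunity_amount_usd
-- from {{ ref('revenue_big_table') }}
```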

By the way, Snowflake and Databricks do have a map type.

Features data quality: check that the set of feature keys present matches the activity type. For example, a “Web Event” shouldn’t have a “Sales Engineer” feature.

Source Table Name

This is a valuable field for debugging. And since debugging is king, we need it.
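Putting the contract together, an illustrative ClickHouse definition of the Revenue Big Table could look like the following. Note that the event timestamp column is my own addition here (the attribution window logic later in the post needs one); all names are examples, not a spec.

```sql
-- Illustrative ClickHouse DDL for the Revenue Big Table described above.
-- Column names are examples; adapt them to your own data contract.
CREATE TABLE revenue_big_table
(
    event_id          String,                  -- must be unique (test it!)
    activity          LowCardinality(String),  -- e.g. 'Web View', 'Sales Qualified'
    prospect_id       String,                  -- domain, email, CRM account id, or a synthetic hash
    event_ts          DateTime('UTC'),         -- added as an assumption: attribution needs event time
    features          Map(String, String),     -- activity-specific key/value features
    source_table_name LowCardinality(String)   -- where this row came from, for debugging
)
ENGINE = MergeTree
ORDER BY (prospect_id, event_ts);
```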

Marketing Channels Mapping

  • Search Organic. Splitting it into “Non-Brand Organic” and “Brand Organic” is good in almost all cases. Brand organic usually brings traffic to only a few pages, so the logic for detecting brand organic is to check against this short list of landing pages.
  • Move channel mapping out of the data model code, because marketers need to review it and sometimes correct the details.

Here is a working example of marketing channel mapping in a Google spreadsheet, with code for BigQuery and ClickHouse.
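As a sketch of what the generated mapping logic can look like (ClickHouse syntax; the rules and values here are illustrative, the real ones come from the spreadsheet):

```sql
-- Illustrative channel mapping; marketers maintain the real rules in a spreadsheet.
SELECT
    multiIf(
        medium = 'cpc'     AND match(source, 'google'),           'Paid Search',
        medium = 'organic' AND landing_page IN ('/', '/pricing'), 'Brand Organic',     -- short list of brand landings
        medium = 'organic',                                       'Non-Brand Organic',
        medium = 'email',                                         'Email',
        'Other'
    ) AS marketing_channel
FROM web_events;  -- hypothetical events table with source/medium/landing_page
```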

Mapping source tables to Revenue Big Table

After all this preparation is done, GPT writes the mapping itself quite well. You need to provide:

  • A few rows of the table you need to map.
  • The Revenue Big Table data contract.
  • The features mapping, like “opp_sum” -> “opportunity_amount_usd”. This is the only thing to check carefully, because it has to be coordinated across the various sources (see the sketch after this list).
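For illustration, the resulting mapping of a hypothetical crm_opportunities table into the contract could look like this:

```sql
-- Sketch: mapping a hypothetical crm_opportunities table into the Revenue Big Table.
INSERT INTO revenue_big_table
SELECT
    concat('crm-opp-', opportunity_id)                   AS event_id,          -- unique per source row
    'Sales Qualified'                                    AS activity,
    coalesce(nullIf(account_domain, ''), contact_email)  AS prospect_id,
    stage_changed_at                                     AS event_ts,
    map('opportunity_amount_usd', toString(opp_sum))     AS features,          -- note the key rename
    'crm_opportunities'                                  AS source_table_name
FROM crm_opportunities
WHERE stage = 'Sales Qualified';
```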

Events From Entities. Reverse Data Engineering.

At first, after moving to the Activity Schema idea, I felt excellent. But then one of the needed input tables wasn’t about events. I added one join on top, then another non-event table appeared. Finally, I realized the Joker had won: there was no value in a single data contract anymore.

But, sometime later, I decided to map entities as events.

In data engineering, this corresponds to the concept of accumulating snapshots in Kimball dimensional modeling.
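A sketch of the idea: a hypothetical opportunity entity with one timestamp column per stage is unpivoted into one event row per reached stage (ClickHouse ARRAY JOIN; all names are illustrative):

```sql
-- Unpivot a hypothetical opportunity entity into per-stage events.
SELECT
    concat('opp-', opportunity_id, '-', stage) AS event_id,
    stage                                      AS activity,
    account_domain                             AS prospect_id,
    stage_ts                                   AS event_ts
FROM crm_opportunities
ARRAY JOIN
    ['Marketing Qualified', 'Sales Qualified', 'Proposal'] AS stage,
    [mql_at, sql_at, proposal_at]                          AS stage_ts  -- Nullable(DateTime) columns
WHERE stage_ts IS NOT NULL;  -- keep only stages the opportunity actually reached
```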

Match Sales|Revenue data from CRM with web data

Use cases when you need it:

  • Three months after a web event with a traffic source, you get an event in the CRM with a revenue amount.
  • Or you have many stages (Marketing Qualified, Sales Qualified, Proposal, …) and need to attribute marketing channels to these events as goals.

There are two options to solve these use cases:

  1. Send the web analytics user_id to the CRM.
  2. Send the CRM user_id to web analytics.

Sending the web analytics user_id to the CRM

This case is much more complicated to implement than the second one. But since this option looks more intuitive, it is worth covering.

Pushing web analytics user_id to CRM

Here is how it could be implemented:

  1. First, you need to get the web analytics user_id. Here is the JavaScript code that gets the Google Analytics 4 user_id.
  2. Then, you need to add this web analytics user ID to the CRM event. Here is a code example for adding the Google user_id as a property to a HubSpot contact (both steps are sketched below).
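A minimal browser-side sketch of both steps, assuming the gtag snippet and the HubSpot tracking code are installed. Note that it reads the GA4 client_id via the documented gtag('get', …) call (the user_id is something your site sets itself), and it assumes a custom HubSpot contact property named ga4_client_id has been created:

```javascript
// Assumption: your app knows the visitor's email at this point.
const currentUserEmail = 'user@example.com';

// Step 1: read the GA4 client_id from the gtag instance
// (replace G-XXXXXXX with your GA4 measurement ID).
gtag('get', 'G-XXXXXXX', 'client_id', function (clientId) {
  // Step 2: attach it to the HubSpot contact via the tracking code's
  // identify call. 'ga4_client_id' is a custom contact property you
  // must create in HubSpot first; the email ties it to the contact.
  window._hsq = window._hsq || [];
  window._hsq.push(['identify', {
    email: currentUserEmail,
    ga4_client_id: clientId,
  }]);
  window._hsq.push(['trackPageView']); // identify is sent with the next event
});
```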

Why could it be tricky? In this particular case with HubSpot, it’s pretty straightforward. For Salesforce, however, these frontend API requests might require a custom solution that depends on your specific API configuration, and that can take significant time.

Sending CRM user_id to web analytics tracker

CRM user_id to web analytics

If your user email plays the role of crm_user_id, then it might not be a good idea to push it to web analytics as-is. I recommend a hashed email instead (first of all, for privacy compliance).

Here is what you need to set it up:

  1. Hash the email (or another CRM user ID). Code example
  2. Send the CRM user_id to the data layer. Example for WordPress Gravity Forms
  3. Catch the user_id from the data layer. Example of doing it via Google Tag Manager and Google Analytics 4 (steps 1 and 2 are sketched after this list)
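A browser sketch of steps 1 and 2, hashing with the Web Crypto API and pushing to the GTM data layer; the event and variable names here are my own examples, pick whatever fits your GTM setup:

```javascript
// Step 1: hash the email with SHA-256 (Web Crypto API), normalized first.
async function sha256Hex(text) {
  const bytes = new TextEncoder().encode(text.trim().toLowerCase());
  const digest = await crypto.subtle.digest('SHA-256', bytes);
  return Array.from(new Uint8Array(digest))
    .map((b) => b.toString(16).padStart(2, '0'))
    .join('');
}

// Step 2: push the hash to the GTM data layer, where step 3 picks it up.
async function pushHashedEmail(email) {
  window.dataLayer = window.dataLayer || [];
  window.dataLayer.push({
    event: 'crm_user_identified',         // GTM trigger name (your choice)
    hashed_email: await sha256Hex(email), // data layer variable for step 3
  });
}

// e.g. call it from your form's submit handler:
// pushHashedEmail('user@example.com');
```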

Variable setup: go to https://tagmanager.google.com/

Google Tag Manager: setting up the hashed email as a data layer variable

Then push the email hash to GA4. In this screenshot, we named the hashed email user_id (I think that was the wrong idea).

Pushing the hashed email to Google Analytics 4: setup

Attribution metrics calculation

Standard requirements for a marketing attribution model:

  • Attribution window. Interactions that are too old don’t need to be attributed. In our case, we don’t attribute old paid visits from Facebook.
  • Exclude events after the sales interaction started. For example, a click from a call-invitation email would just be noise in decision making (both rules are sketched after this list).
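As a sketch, both requirements are filters on the Revenue Big Table before any credit is assigned. The goals table, the 'Sales Qualified' marker, and the 90-day window are all illustrative:

```sql
-- Sketch: touchpoints eligible for attribution. 'Sales Qualified' marks the
-- start of the sales interaction for a prospect; 90 days is an example window.
WITH first_sales_touch AS (
    SELECT prospect_id, min(event_ts) AS first_sales_ts
    FROM revenue_big_table
    WHERE activity = 'Sales Qualified'
    GROUP BY prospect_id
)
SELECT t.event_id, t.prospect_id, t.event_ts, g.goal_ts
FROM revenue_big_table AS t
JOIN goals AS g USING (prospect_id)                 -- hypothetical goal events to attribute
LEFT JOIN first_sales_touch AS s USING (prospect_id)
WHERE t.activity = 'Web View'
  AND t.event_ts >= g.goal_ts - INTERVAL 90 DAY     -- attribution window
  AND t.event_ts <= g.goal_ts
  AND (s.first_sales_ts IS NULL OR t.event_ts < s.first_sales_ts)
SETTINGS join_use_nulls = 1;  -- so prospects with no sales touch come through as NULL
```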

Here is the dbt model lineage that calculates attribution metrics from the Revenue Big Table complying with the data contract.

There is a Google spreadsheet with all the parameters that need to be set up:

  • marketing_channel_features
  • marketing_and_sales_goals
  • gsheet_marketing_attribution_coded_params

Vladimir.

P.S. Like it? Please star my GitHub account. It motivates me a lot.
