A Glimpse into the World of Data Prep 3.0 aka Recipes!

Sayantani Mitra · Published in CRM Analytics · Sep 20, 2021 · 15 min read

Recipes have been an integral part of Tableau CRM (TCRM) since the days when Tableau CRM was called Wave! More recently, they received a complete overhaul and are now called Data Prep 3.0. A beauty to work with! Dataflows still serve us really well, but Data Prep 3.0 takes things to the next level with various types of joins, aggregations, predictions, clustering, and more, in addition to what we can already do in dataflows.

Starting Parts of Data Prep 3.0/Recipe

Let’s take a look at the various parts that comprise Data Prep 3.0!

To start a Data Prep 3.0 recipe, we need access to create/manage Data Manager via Salesforce permission sets. Tableau CRM Admins usually have this permission by default. To create a new recipe, click Data Manager in Analytics Studio, go to Dataflows & Recipes in the tab that opens, click the Recipes tab, and then Create Recipe. This opens a new tab for the recipe, unlike dataflows, which open as a separate page in the same tab.

How to Open Data Prep 3.0/Recipes

Next, let’s check every part of Data Prep 3.0 that’s available starting with the menu.

Left Side Top Menu

When we open a new Recipe, the left side top menu looks as above.

  • Add Dataset: This allows us to add any dataset that exists in Tableau CRM. It can come from Salesforce synced objects, other datasets already in Tableau CRM (from CSVs or from other dataflows and recipes), or even from connections to non-Salesforce databases.
  • Undo/Redo: Like most platforms and even Tableau CRM dashboards, this allows us to undo and redo the steps that were just performed.
  • Upload/Download: We can download the JSON for an existing recipe, edit it, and re-upload it here. At times, this saves a lot of time!
  • Got Feedback?: Simply provide feedback to Salesforce regarding Data Prep 3.0.
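As a sketch of why the Upload/Download option saves time: the downloaded recipe JSON can be edited offline in a few lines of Python. The node structure below is a hypothetical, abbreviated stand-in, not the exact recipe schema, and the dataset names are made up.

```python
import json

# Hypothetical, abbreviated recipe JSON -- the real download is much
# larger; the node/field names here are assumptions for illustration.
recipe_text = """
{
  "version": "58.0",
  "nodes": {
    "LOAD_DATASET0": {"label": "Trips", "parameters": {"dataset": "Trips__c_stage"}}
  }
}
"""

recipe = json.loads(recipe_text)

# Bulk edits, such as re-pointing every node from a staging dataset to the
# production one, are much faster in the JSON than clicking through the UI.
fixed = json.loads(json.dumps(recipe).replace("Trips__c_stage", "Trips__c"))

print(fixed["nodes"]["LOAD_DATASET0"]["parameters"]["dataset"])  # Trips__c
```

The edited JSON can then be re-uploaded through the same menu.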

Now, let’s look at the right side top menu.

Right Side Top Menu
  • Search Nodes: Once we add nodes to the recipe, we can search them here. Very similar to how we search nodes in dataflows.
  • Not Saved: This part shows when the recipe was last saved.
  • Versioning: Data Prep 3.0, like dataflows, now supports version control! We can also add a description of the changes made to the recipe!
  • Save: This provides us with 2 options:
    1. Save but don’t run
    2. Save as a new recipe
  • Save and Run: This lets us save the recipe and then run it immediately instead of waiting for a scheduled job. Very useful for ad-hoc runs or first-time recipe runs.
  • ? Help: Takes us to the help documentation from Salesforce.
  • Back to Data Manager: Takes us to Data Manager on the Recipes page (no new tab is opened). Do not do this until the recipe has been saved!

Next, let’s look at the center of this page.

Center of Data Prep 3.0/Recipe

This is where our Data Prep 3.0/Recipe nodes are created; it is our playground! Adding datasets, filtering, transformations, creating the final dataset: everything happens in this space!

Creating a Recipe

Adding a dataset:

We start by clicking Add Input Data, highlighted in blue in the center of the page. This opens up the whole gamut of datasets available for use!

Parts of the Input Data

Let’s look at each of these parts. Most are self-explanatory. Starting from the top left and moving anti-clockwise:

  • Type of Connection: Filters what we are looking for: all datasets, regular TCRM datasets (including CSVs), or connected objects such as Salesforce objects, AWS, Google, etc.
  • Datasets: The whole gamut of datasets available based on Type of Connection selected.
  • Location: Where is the dataset stored — which TCRM app or Connected Source.
  • Created Date: When was the dataset first created.
  • Created By: Who created the dataset.
  • Last Modified Date: When was the dataset last modified (last successful run of the dataset).
  • Last Modified By: Who last modified the dataset (last successful run of the dataset). If it was a scheduled run, this will be the Integration User else the name of the person who ran the dataflow/recipe/csv manually.
  • Select Fields: Lists all available fields that can be used in the recipe.
  • Search Columns: Search for the fields that are needed. Instead of scrolling through hundreds of fields, this lets us search for the fields we are looking for (label names, not API names, when available).
  • Sample Size: Number of rows we want to preview. By default it is 2000, but it can be set to 5000 or 10000 rows. Once the dataset is selected, the total number of rows in the dataset is also shown right beside this, for datasets already in TCRM (non-connected sources).
  • Search data by Name: Search the dataset we want by name. Helps when there are too many datasets available.

Let’s select the Trips__c dataset from SFDC_Local (Salesforce Local Connection) and select the fields. Once selected, click next. The dataset will be added to the recipe.

One dataset added

This is what a dataset icon looks like in Data Prep 3.0/Recipe. Click on the icon and we will find an Input window open up.

Input Node Details

Again starting from the Top left in anti-clockwise direction:

  • Change Node Name: Click on the pencil icon next to the Node Name. This is very useful when we have multiple joins or transformations to explain what each of these are. We can also add a description for the node.
Change Node Name
  • Left pane: This shows the dataset details like the label name of the dataset followed by API Name, Location, Created Date and Modified Date. There is a Replace link next to the Dataset name. This lets us change the dataset, if needed or add/remove fields from the dataset. This opens up the first Add dataset pane that we saw above. Note: Replacing the input data can affect downstream nodes in the recipe.
  • Previewed Data: The preview shows the data and the names of the columns associated. This helps to see what type of data is stored in the fields.
  • Expand/Collapse: This window initially opens in the bottom half of the screen. But clicking on that icon makes the window expand to full screen. Clicking on that again, will bring it back to the bottom half of the screen.
  • Column Details: Next to the Expand/Collapse is the Column Details. This shows the details of the column profile for the sample data like Valid Values, Missing Values, Zeros and a histogram of the values. Attributes on the other hand shows the API Name of the column/field, the type of column and precision/scale when applicable.
Column Profile
Column Attributes
  • Hide columns from Preview: Select a column and click this icon to hide it from the preview. This does not mean the column will not be part of the dataset; it will still be there. We just will not see it in the preview.
  • Columns: Right next to Preview at the top is the Columns tab. This shows the column/field details in a list format without the data preview. It includes the Field Name, API Name, Type of Column, and Source of the Column (useful for understanding where a column came from once we start joining and transforming the dataset).
Column Details

Clicking on the empty white space takes us back to the recipe design page (closes this window).

Dataset/Object in the Recipe

The annotation on top of the above image indicates that there is a description associated with the node. This comes from the part when we edited the name and description of the Input Node above.

Joining datasets:

Now that we have successfully added a dataset, we will join another one. In this case, Taxi__c. To do this, click on the plus sign next to the Trips dataset.

Different Types of Nodes available

Click Join from the dropdown. This will take us back to the Add dataset page. Select Taxi__c dataset and follow the same method as if adding a new dataset to the recipe. The difference here is that on clicking Next, we see a slightly different window open up which shows the types of joins available.

Join Node

Let’s break this down one at a time starting with the left pane.

  • This shows us that Trips is joining to Taxi__c using a Lookup. This works for us because one Trip can look up to only one Taxi. The differences between the various joins are listed below:
Types of Joins:
    1. Lookup: Include all rows from the recipe data (Trips) and only matching rows from Taxi__c (the newly added dataset). If multiple rows are found, return one matching row. This is useful when we know that the relationship between the two datasets is one to one.
    2. Left: Include all rows from the recipe data (Trips) and only matching rows from Taxi__c (the newly added dataset). Include all matching rows when multiple records are found. This is useful when we know that the relationship between the two datasets is one to many. Note: This may cause data explosion!
    3. Right: Include all rows from Taxi__c (the newly added dataset) and only matching rows from the recipe data (Trips). Include all matching rows when multiple records are found. This is useful when we know that the relationship between the two datasets is many to one. Note: This may cause data explosion!
    4. Inner: Include only matching rows from the recipe data (Trips) and Taxi__c (the newly added dataset). Include all matching rows when multiple records are found. This is useful when we want only the rows that find a match in both datasets. Rows on the left (Trips) that have no matching record in the right dataset (Taxi__c) fall off. Note: This may cause data explosion!
    5. Outer: Include all rows from the recipe data (Trips) and Taxi__c (the newly added dataset), regardless of whether they have matches. Include all matching rows when multiple matches are found. This is useful when we want all the rows from both datasets. Rows on the left (Trips) with no matching record in the right dataset (Taxi__c) remain in the final dataset. Note: This may cause data explosion!
  • Join keys: We will use Taxi from the Trips dataset and Record Id from the Taxi dataset as the join keys. When needed we can add multiple keys!
Join Keys
  • API Name Prefix for Right Columns: The columns from the right dataset are prefixed with the API name shown in the left pane. We can update this prefix to our liking. We chose to call it Taxi instead of Taxi__c, so the right columns are now named Taxi.Id, etc. The Source column shows that these fields come from the Salesforce Local Taxi object.
  • Look Up Multiple Values: Returns and consolidates all matched rows if multiple matches are found. This does not cause a data explosion and is very useful when dealing with security predicates!
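The difference between a Lookup and a Left join can be sketched in plain Python (an illustration only, not the recipe engine): when the right side has duplicate matches, a Left join multiplies rows, while a Lookup returns at most one match per row and never explodes.

```python
# Sample data: taxi "A" deliberately has two matching rows on the right.
trips = [
    {"trip": "T1", "taxi": "A"},
    {"trip": "T2", "taxi": "B"},
]
taxis = [
    {"id": "A", "company": "Acme Cabs"},
    {"id": "A", "company": "Acme Cabs (duplicate)"},
    {"id": "B", "company": "Blue Taxi"},
]

def left_join(left, right, lkey, rkey):
    # Left join: every left row, repeated once per matching right row.
    out = []
    for l in left:
        matches = [r for r in right if r[rkey] == l[lkey]]
        for m in (matches or [{}]):
            out.append({**l, **m})
    return out

def lookup_join(left, right, lkey, rkey):
    # Lookup: every left row, but at most ONE matching right row,
    # so the output row count always equals the input row count.
    out = []
    for l in left:
        matches = [r for r in right if r[rkey] == l[lkey]]
        out.append({**l, **(matches[0] if matches else {})})
    return out

print(len(left_join(trips, taxis, "taxi", "id")))    # 3 rows: "A" exploded
print(len(lookup_join(trips, taxis, "taxi", "id")))  # 2 rows: one per trip
```

This is why Lookup is the safe default when the relationship is known to be one to one.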

Once done, click Apply. Now we have successfully joined 2 datasets!

Joined Objects

Note: The red circle with a cross on the Join node means that the recipe will not run because there is an error. In this case, we don’t yet have an output node. Thus we see this symbol.

Transform Node:

Though this needs a blog post of its own, we will get a gist of what a Transform node is here.

The Transform node allows us to add a new formula column to the dataset (a ComputeExpression in a dataflow)! It also allows us to do multiple relative-row calculations (a ComputeRelative in a dataflow). We can also rename an API name or label, bucket values, flatten hierarchies, run clustering, perform sentiment analysis, create pivots, and a lot more!

To create a Transform node, we again click on the + sign next to the node we want to attach it to (in our case, the Join node) and select Transform. Again a window opens up and we see the following:

Transform Node — Dimensions

The very top shows all the types of transformations that can be done, and this changes with the type of column. The above image is for dimensions. Click on Start Time and we see only the transforms for datetime columns.

Transform Node — DateTime

And this is what the Transform node shows for measures. Click on Miles.

Transform Node — Miles

Now click on fx in the formula field to see the huge gamut of transformations we can create. The ones on top, like Floor, Bucket, etc., are shortcuts if that is what we want to do. fx also shows DateTime and Now functions. But we will select Custom, which opens a new left-side panel for custom formulas.

Custom Formula

Again starting from the top left in anti-clockwise direction:

  • Function: The various functions available to create a new column. Select a function and click the plus sign. This adds the function to the formula pane on the right, along with a snippet showing which parameters the function requires.
  • Columns: Right next to Function is Columns. Similar to Function, we can search for a column/field and add the column using the + sign next to it.
  • Output Type: This is the type of Output required — Text, Number, Date, DateTime.
  • Default Value: This is optional but can be filled up with a default value that matches the Output Type when needed.
  • Column Label: This is the name of the new column that will be created.
  • Multiple Row Formula: This opens up lag/lead etc. for ComputeRelative type of functions.

Let’s create a formula now with Label Name: Type of Distance

Custom Formula

Note that Recipes use SQL, as opposed to the SAQL used in dataflows. Thus the following points should be taken into account:

  • Texts are surrounded by single quotes NOT double
  • We use single = instead of ==
  • If the field comes from a joined object, like Taxi.Id, then it should be enclosed in double quotes (“Taxi.Id”), NOT single quotes (‘Taxi.Id’).
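A minimal sketch of these three syntax rules, using Python’s built-in SQLite, which is close enough for illustration (the recipe formula engine itself is not SQLite, and the table and column names here are made up):

```python
import sqlite3

# In-memory table standing in for the joined recipe data.
# "Taxi.Id" contains a dot, so it must be double-quoted as an identifier.
con = sqlite3.connect(":memory:")
con.execute('CREATE TABLE trips ("Taxi.Id" TEXT, Miles REAL)')
con.executemany("INSERT INTO trips VALUES (?, ?)",
                [("A", 2.0), ("B", 15.0), ("C", None)])

rows = con.execute('''
    SELECT "Taxi.Id",
           CASE
               WHEN Miles IS NULL THEN 'No Distance Provided'  -- single-quoted text
               WHEN Miles >= 10   THEN 'Long Distance'
               ELSE 'Short Distance'
           END AS Type_of_Distance
    FROM trips
    WHERE "Taxi.Id" = 'B'   -- single =, not ==; double quotes on the joined field
''').fetchall()

print(rows)  # [('B', 'Long Distance')]
```

Single quotes for text literals, a single `=` for comparison, and double quotes around dotted field names like "Taxi.Id" are the habits to build when moving from SAQL.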

Once done, click Apply and a new column will be created.

Custom Formula Column

Once a column has been created, click on the column and select Edit Attributes to change the API Name of the column. Note that this should be done only after all derived columns have been created.

Edit Attribute

This opens another window where we can update the label of the field/column and the API Name.

Edit Label and API Names
Recipe after the Transform Node

Filter Node:

Again, click on the + sign next to the Transform Node and select Filter. We will filter out anything that says, “No Distance Provided”.

Filter Node

Click on the + sign below Filters on the left side. Select Type of Distance from the Columns list. Then select the operator “does not equal” and the value “No Distance Provided” from the dropdown. Note that this dropdown is based on the 2000 rows of sample data, so if a value that should be present is missing, just type it into the “Enter a text value” box to enter it manually.
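What the Filter node does here can be sketched in a couple of lines of plain Python (sample rows and field names are made up):

```python
rows = [
    {"trip": "T1", "type_of_distance": "Short Distance"},
    {"trip": "T2", "type_of_distance": "No Distance Provided"},
    {"trip": "T3", "type_of_distance": "Long Distance"},
]

# "does not equal" filter, as configured in the Filter node above.
kept = [r for r in rows if r["type_of_distance"] != "No Distance Provided"]

print([r["trip"] for r in kept])  # ['T1', 'T3']
```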

Filtered
We don’t have “No Distance Provided” anymore in this dataset

Note that after adding each node, we have to click apply for the node to be added to the recipe.

Aggregate Node:

Click on the + sign next to the Filter node and select Aggregate.

Aggregate Node

For the left pane, starting at the top: we add the measures we want to aggregate, then the dimensions we want to group the rows by, and optionally the dimensions we want to group the columns by.

In this case we will aggregate the total fare collected by each Taxi.

Aggregate Function

Then we group by the Taxi Id.

Group Rows By Taxi Id

And we want to Group the columns by Type of Distance.

Group Columns by Type of Distance

But once we aggregate the data, we lose all the other details we may want to see. Thus, we can simply sum or average all measures and group the rows by all relevant columns.
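The row-group/column-group combination is effectively a pivot with an aggregate, which can be sketched in plain Python (made-up sample data, not the recipe engine):

```python
from collections import defaultdict

# Sample trip rows: sum Fare, group rows by taxi, group columns
# by Type of Distance -- a simple pivot.
rows = [
    {"taxi": "A", "type": "Short Distance", "fare": 10.0},
    {"taxi": "A", "type": "Long Distance",  "fare": 40.0},
    {"taxi": "B", "type": "Short Distance", "fare": 12.0},
    {"taxi": "B", "type": "Short Distance", "fare": 8.0},
]

# pivot[row_group][column_group] -> SUM(fare)
pivot = defaultdict(lambda: defaultdict(float))
for r in rows:
    pivot[r["taxi"]][r["type"]] += r["fare"]

print(pivot["A"]["Long Distance"])   # 40.0
print(pivot["B"]["Short Distance"])  # 20.0
```

Each row group becomes one output row, with one measure column per distinct column-group value, which is exactly why the aggregate output loses the trip-level detail.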

Aggregation after adding more measures and groups
This is the list of columns that we created in the aggregate node

We can add more join nodes or transform nodes after this as well. In this case we join the Taxi Affiliation to get the name of the Company.

Join Node after Aggregate

Append:

We are not appending more data here, but if we had to, this is the node to use. To do so, select the + sign next to the node we want to append data to, then select the dataset and fields. Once done, we will be asked to match each incoming column to an existing column. The column names in the two objects can differ; we map them here. The one thing to keep in mind is that the column types must match! We cannot match a Date column to a DateTime column, or a Measure column to a Dimension column.
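The append-with-column-mapping idea can be sketched in plain Python (all names here are illustrative, and in the real node the mapped columns must have matching types):

```python
# Two inputs with different column names for the same information.
current = [{"trip_id": "T1", "miles": 2.0}]
incoming = [{"ride": "T2", "distance": 15.0}]

# The mapping we would configure in the Append node: incoming -> existing.
column_map = {"ride": "trip_id", "distance": "miles"}

# Rename the incoming columns, then stack the rows.
appended = current + [
    {column_map[k]: v for k, v in row.items()} for row in incoming
]

print(len(appended))           # 2
print(appended[1]["trip_id"])  # T2
```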

Discovery Predict: This also needs a blog of its own and here is a great one!

Output Node:

Click on the + sign next to the aggregate node and select Output. This again opens a window.

Output Node

For the left pane of the Output Node starting from the top:

  • Write To: Where do we want to write the recipe to? Dataset in TCRM or an Output Connection like Salesforce or Snowflake or a csv.
  • Dataset Name: Name of the dataset. This changes to Connection Name when we choose an output connection. For CSV output, the file is saved to the Salesforce database and can be downloaded later using the public API.
  • API Name: All datasets in TCRM need an API Name. This should ideally be same as the dataset name for ease of use later.
  • App to Save to: Where do we want to save the dataset to in TCRM?
  • SF Sharing: If we use Salesforce Sharing then this will be used.
  • Security Predicate: Add the security predicate for row level security here.
Output Node

This is the final recipe which we can now Save and Run!

Final Recipe

We can find this dataset in the App!

Final Dataset in the App

Create the dashboards/lens that we want to from this recipe. Here is an example of Rates for Short vs Long Distance by Taxi Affiliation.

Example Dashboard
