Understanding Dataflows
Dataflows! Dataflows!
Let me start by saying that I feel lucky and more empowered now that Salesforce has decided to increase the number of rows we get in Einstein Analytics from 100 million to 1 billion!! Woohoo! Good times! Especially if you work with orgs that have many datasets created for various purposes, each running into millions of rows, with some crossing 100 million!
The dataflow is a nifty tool if you know what you want and how you want to do it. When done right, it will save you a lot of time writing bindings and SAQL in Einstein Analytics (EA). Of course, there are still times when you will need to join multiple datasets and write multiple SAQL queries and bindings in the dashboard. Until the increase in the number of rows was announced, I tried NOT to create more datasets, working with more bindings and SAQL instead. With this change, I now create a dataset whenever writing the SAQL would be more than a day's job while the dataflow is maybe a million rows and can be created in less than an hour!
The first things to know about dataflows:
- What are the different things we can do when creating a dataflow?
- What are the steps involved or is there a flow we should follow?
- What will break the dataflow or our lens and dashboards?
What are the different things we can do when creating a dataflow?
Each of the violet / purple boxes is called a “Node”. These nodes, combined together, form a dataset (or a register). The nodes are connected to each other by grayish lines or Flow Lines (in traditional process flows, flow lines have arrows on one or both ends showing where the data is flowing from and to; we don't have that yet in Einstein Analytics). Let's move on to the different types of nodes that are possible in the dataflow:
Some of the names in the image above (Parts of a Dataflow) are self-evident, like Name of the Dataflow and Run Dataflow (which changes to Update Dataflow when you change anything in the dataflow). Let's break down each piece of the puzzle above and see what we get.
- Name of the Dataflow: When we start creating a dataflow either using the Dataset Builder or from Create Dataflows in the Data Manager, we have to give the dataflow a name. And this name is not editable! :( We can create multiple datasets within a single dataflow.
- Dataset Builder: We can use this tool to create our datasets. The benefit of using it is that we don't have to think about which object can be joined with which object and what the join keys can be. It shows us almost everything.
- Digest Data: As we can see, there are 3 ways of digesting data into Einstein Analytics via the dataflow, viz., sfdcDigest, Digest and Edgemart. sfdcDigest, as the name suggests, is used when we want to get data from Salesforce objects. Digest, on the other hand, is used to get data from API connections or connectors that already exist in EA. Last but not least, Edgemart is used when we want to reuse an already created dataset, whether it was built in another dataflow or from a CSV file.
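Under the hood, every dataflow is stored as JSON, which you can download and upload from the dataflow editor. As a rough sketch (all node, object, connection and dataset names below are placeholders), the three digest options look something like this:

```json
{
  "Extract_Opportunity": {
    "action": "sfdcDigest",
    "parameters": {
      "object": "Opportunity",
      "fields": [
        { "name": "Id" },
        { "name": "AccountId" },
        { "name": "Amount" },
        { "name": "StageName" }
      ]
    }
  },
  "Digest_External_Orders": {
    "action": "digest",
    "parameters": {
      "connectionName": "My_External_Connection",
      "object": "Orders",
      "fields": [
        { "name": "OrderId" },
        { "name": "OrderAmount" }
      ]
    }
  },
  "Edgemart_Existing_Dataset": {
    "action": "edgemart",
    "parameters": {
      "alias": "My_Existing_Dataset"
    }
  }
}
```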
- Append / Augment: Append basically adds rows of data to an existing node, while Augment (or join) adds columns to the data. Think of it this way: when we have 2 CSV files with the same column or field names and we want to merge them together, we call it appending the data. Thus Append adds to the existing data.
- Augment: On the other hand, when we want to join 2 CSV files on a common column and pull all the relevant information from both files into a single one using VLOOKUP, we call that an augment or join. When we augment in a dataflow, we need 2 nodes: the left source (what we want to join to) and the right source (what we are joining). The left and right sources (nodes) each have their own key, called the left key and right key respectively. Relationship is where we put the name of the join, which becomes part of the field names in the dataset, say 'oppty-accounts.Amount__c', where 'oppty-accounts' is the relationship name for that node and 'Amount__c' is the field in that node. We can give it whatever name we like, but I personally follow the left source-right source convention, even if I shorten the names. Right fields: when we do a VLOOKUP in Excel, we don't have to pull in the fields from the sheet where we are performing the join, but we do have to mention all the field names we want from the sheet we are looking up to. Similarly, we don't have to select the fields on the left source but we do for the right source. Lastly, the Operation: is it a Single Value Lookup or a Multi Value Lookup? Use Single Value Lookup as much as possible. In some cases, like security, when that is not possible, we use Multi Value Lookup. The caveat to Multi Value Lookup is that if there are dates on both sources, it fails. Also important to note: all augments in the current dataflow are left joins only!
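In the dataflow JSON, an append and an augment node look roughly like the sketch below; the augment mirrors the left/right source, keys, relationship, right fields and operation described above (node names, keys and fields are placeholders):

```json
{
  "Append_Orders": {
    "action": "append",
    "parameters": {
      "sources": ["Extract_Orders_2019", "Extract_Orders_2020"]
    }
  },
  "Augment_Oppty_Account": {
    "action": "augment",
    "parameters": {
      "left": "Extract_Opportunity",
      "left_key": ["AccountId"],
      "right": "Extract_Account",
      "right_key": ["Id"],
      "relationship": "oppty-accounts",
      "right_select": ["Name", "Industry", "Amount__c"],
      "operation": "LookupSingleValue"
    }
  }
}
```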
- Transformations: The Transformations section contains computeExpression, computeRelative, dim2mea and flatten.
- computeExpression: We use this to create a new node with calculated fields or flags. Say we use a field in our dashboard multiple times and it is derived from 3 other fields; we can perform that calculation here and save time in creating and loading the dashboard. We can create text, numeric and date fields. Merge with source (Image: computeExpression (Part 1)) helps to retain all the fields that are present in the flow (yet to be a dataset) up to that point. The Name of the field can be anything we choose and the Label is auto-populated from the Name; no spaces or “-” are allowed in the Label. We can use simple SAQL expressions to create our formula field. It can be as simple as just the number “1”, or (amount1__c + amount2__c + amount3__c), or (case when amount1__c > 0 then “abc” else “xyz” end). When we use numbers, precision and scale are important and should always be filled in, else the dataflow run will fail. The default value, on the other hand, is our choice.
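A computeExpression node in the dataflow JSON looks roughly like this, using the example formulas from above (the field names, precision and scale are placeholders):

```json
{
  "Compute_Flags": {
    "action": "computeExpression",
    "parameters": {
      "source": "Augment_Oppty_Account",
      "mergeWithSource": true,
      "computedFields": [
        {
          "name": "Total_Amount__c",
          "label": "Total_Amount__c",
          "type": "Numeric",
          "precision": 18,
          "scale": 2,
          "saqlExpression": "amount1__c + amount2__c + amount3__c",
          "defaultValue": "0"
        },
        {
          "name": "Amount_Flag__c",
          "label": "Amount_Flag__c",
          "type": "Text",
          "saqlExpression": "case when amount1__c > 0 then \"abc\" else \"xyz\" end"
        }
      ]
    }
  }
}
```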
- computeRelative: This is similar in nature to computeExpression, but the calculations are done on the same field or column across different rows, particularly the current, first, previous or next rows. Using this, we can calculate how a value or amount changed between rows based on how we sort and partition the records. We select the source, the field we want to partition by, and the sort order, ascending or descending (this becomes important when we use the Offset Function). Then, like computeExpression, we add a field, give it a name and label, choose whether to use SAQL or an existing source field, and set the Offset Function and a default value. The Offset Function is where we say which row we want the value from: the current, next, first or previous row.
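As a sketch, a computeRelative node that pulls the previous row's Amount within each Account might look roughly like this (the partition, order and field names are just examples):

```json
{
  "Compute_Previous_Amount": {
    "action": "computeRelative",
    "parameters": {
      "source": "Extract_Opportunity",
      "partitionBy": ["AccountId"],
      "orderBy": [
        { "name": "CloseDate", "direction": "asc" }
      ],
      "computedFields": [
        {
          "name": "PreviousAmount",
          "label": "PreviousAmount",
          "expression": {
            "sourceField": "Amount",
            "offset": "previous()",
            "default": "0"
          }
        }
      ]
    }
  }
}
```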
- dim2mea: When we want to change a dimension to a measure, this is an easy tool to use. Say we have years that are being brought in as text but we also want them as a numeric field, to calculate the number of years from a certain year. This is very easy compared to the other transformations. All we need is the node name, source node, dimension (the text field in the source), measure (the name of the new numeric field), a default value if any, and the measure type (which can currently only be long). That's it!
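For the year example above, a dim2mea node would look roughly like this (field and node names are placeholders):

```json
{
  "Year_To_Measure": {
    "action": "dim2mea",
    "parameters": {
      "source": "Extract_Opportunity",
      "dimension": "Fiscal_Year__c",
      "measure": "Fiscal_Year_Num__c",
      "measureDefault": "0",
      "measureType": "long"
    }
  }
}
```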
- Flatten: I am not going to say anything about this here because I haven't yet had a situation where I had to use it. So, I will point you to the Salesforce documentation instead.
- Filter: This is a common one and can be used right after the digest node (filters inside the digest node are possible, but we need to be careful when doing that), or after any other node for that matter, as long as it comes before we register the dataset, for obvious reasons!
- Another easy one: give the node a name, select the source node, check Use SAQL if we want to write the filter that way, and then just add the SAQL filter expression. Here we can use "AND" and "OR" statements as well; when using "OR", both sides of the OR have to be within parentheses, else the dataflow fails. If we don't use SAQL, it falls back to the old way of filtering in the dataflow, which is presented here.
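A SAQL-based filter node, with both sides of the OR wrapped in parentheses as mentioned above, looks roughly like this (field names and values are placeholders; note that in SAQL syntax OR is written as || and AND as &&):

```json
{
  "Filter_Open_Or_Won": {
    "action": "filter",
    "parameters": {
      "source": "Augment_Oppty_Account",
      "saqlFilter": "(StageName == \"Closed Won\") || (IsClosed == \"false\" && Amount > 0)"
    }
  }
}
```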
- Slice Dataset: This allows us to keep or drop fields based on what we want in the final dataset. That way we can create multiple datasets with different fields if required. Ideally, we use "Keep" when we want a very small subset of the fields in the dataflow and "Drop" when dropping a few fields is easier than listing all the fields we want to keep.
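In JSON, a sliceDataset node looks roughly like this; as far as I can tell, "Keep" corresponds to mode "select" and "Drop" to mode "drop" (source and field names are placeholders):

```json
{
  "Slice_Fields": {
    "action": "sliceDataset",
    "parameters": {
      "source": "Compute_Flags",
      "mode": "drop",
      "fields": [
        { "name": "Internal_Notes__c" },
        { "name": "Legacy_Id__c" }
      ]
    }
  }
}
```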
- Update: I will not talk much about this either because I haven't had a use case for it yet. But it isn't uncommon to need to update the value of one field with another, which happens more often when you are using CSV files. More info can be found here.
- Register: Now that our dataflow has been created, we need to register it to be able to use it in lenses and dashboards. It is also here that we add the security predicate if we have one. As with any node, we need to give the register node a name, the last source node of the dataflow from which we will register the dataset, an alias and a name (these are the names of the dataset that we will see in the Einstein Analytics UI), the Sharing Source if we have one, and the Security Predicate if we have one.
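A register node with an optional sharing source and security predicate might look roughly like this (the alias, name and predicate are placeholders):

```json
{
  "Register_Oppty_Accounts": {
    "action": "sfdcRegister",
    "parameters": {
      "source": "Slice_Fields",
      "alias": "Oppty_Accounts",
      "name": "Oppty Accounts",
      "rowLevelSharingSource": "Opportunity",
      "rowLevelSecurityFilter": "'OwnerId' == \"$User.Id\""
    }
  }
}
```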
- Export: This is for use in Einstein Discovery and I will not be talking about it here. Einstein Discovery needs its own blog. :)
What are the steps involved or is there a flow we should follow?
Having looked at what each node does, the minimum steps to create a simple dataflow are:
- Digest the data
- Filter — may or may not be required
- Augment the digest nodes — very important to make the right joins to get the right data.
- computeExpression / computeRelative as the case may be
- Slice the dataset if required
- Register!
- Update the dataflow
- Run!!
That’s it. We are done! Once we run the dataflow, we check whether there are any warnings or failures. Solve any problems and we are ready to start creating dashboards and lenses.
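Putting those steps together, a minimal end-to-end dataflow JSON, two digests, one augment and one register, might look roughly like this (all names are placeholders):

```json
{
  "Extract_Opportunity": {
    "action": "sfdcDigest",
    "parameters": {
      "object": "Opportunity",
      "fields": [ { "name": "Id" }, { "name": "AccountId" }, { "name": "Amount" } ]
    }
  },
  "Extract_Account": {
    "action": "sfdcDigest",
    "parameters": {
      "object": "Account",
      "fields": [ { "name": "Id" }, { "name": "Name" } ]
    }
  },
  "Augment_Oppty_Account": {
    "action": "augment",
    "parameters": {
      "left": "Extract_Opportunity",
      "left_key": ["AccountId"],
      "right": "Extract_Account",
      "right_key": ["Id"],
      "relationship": "oppty-accounts",
      "right_select": ["Name"],
      "operation": "LookupSingleValue"
    }
  },
  "Register_Oppty_Accounts": {
    "action": "sfdcRegister",
    "parameters": {
      "source": "Augment_Oppty_Account",
      "alias": "Oppty_Accounts",
      "name": "Oppty Accounts"
    }
  }
}
```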
What will break the dataflow or our lens and dashboards?
- Changing the augments (left source becomes right and vice-versa).
- Changing the relationship name in the augment nodes
- Adding a new node to an existing dataflow. The way to avoid breaking things is to add the new node just before we register the dataset, if possible.
- Losing access to any field because it was deleted or its API name was changed in Salesforce or in the ingested data.
- Not propagating (or wrongly propagating) fields through the nodes.
- Wrong computeExpression / computeRelative
- Security Predicate not updated when there is one.
At the End:
Now that we have seen the various digestions, transformations and other features of the dataflow, I would like to say that this is a very general rundown of its various parts, explained my way to make it easier for everyone without too much jargon. The best way to learn is to play around with it and use the resources out there, like the videos by Peter Lyon and others.