EA Certification Study Guide Part 1: Data Layer

6 min readApr 13, 2019

This portion of the study guide will cover resources and notes for the Data Layer section of the Einstein Analytics and Discovery Consultant exam.

Source: Einstein Analytics Tech Lounge, 4/4/2019

Sources:

Trailhead Badge
-Analytics Data Integration Basics
Help documentation
-Data Integration Guide
Learning Map
-Connect External Data
-Dataflow or Recipe?
-Get Started with Recipe Editor
-Add Data in a Recipe
-Analytics REST API Overview
Documentation in Exam Guide (that wasn’t already covered above)
-Create and Run More Dataflows
-Avoid Data Drift with Periodic Full Sync
-Sync Salesforce Data Incrementally in Data Sync
Einstein Analytics Training Videos
-Video 1: Basic Terms
-Video 2: Building Datasets
-Video 3: Data Preparation with Recipes
External Resources
-My Most Used computeExpression
-Flag Nulls in Your Datasets
-Einstein Analytics Recipes: Second Round
-Let’s Play Salesforce: Dataflow Basics Part 1, Dataflow Basics Part 2
-Einstein Analytics: Add and Append in a Recipe
-Einstein Analytics: Using Append Transformation
-Einstein Analytics: Filter Dataset Across Object
-Einstein Analytics: One-to-Many Relationship with the Power of 1
-Einstein Analytics: computeRelative Getting Started
-Einstein Analytics: Filter Syntax
-Data Integration via the Force.com and Wave Analytics APIs
-Relative success : Using Relative Compute Nodes to Process First and Last Values and Count Rows

Notes:

*I recommend you read/watch each of the sources and take your own notes. My summaries below are not inclusive of what will appear on the exam.

Bring Data into Analytics

There are multiple ways to bring data into Einstein Analytics:

Direct Salesforce connection
— All Salesforce data syncs are run incrementally by default but incremental sync isn’t supported for all Salesforce objects (Ex. Profile, User)
— When you remove an object or field from all dataflows it’s sync is not automatically turned off, but when you add a field to a dataflow its added to the sync
CSV file uploads
— Not automated but there is a tool on Github to fake this
External Data Connectors, options include:
— Data warehouse connectors like Amazon Redshift or Google BigQuery
— Enterprise Application connectors like Netsuite or SF Marketing Cloud
— Data Services connectors like Heroku Postgres or AWS RDS
— File-Based Data Source like Amazon S3
— Website Analytics connectors like Google Analytics
— Excel 2013 connector (null handling not supported)
External data API capabilities
— Create/update/query: Datasets, lenses, XMD information, Analytics apps, dashboards, sharing inheritance, user feature availability, data recipes, ‘eclair’ geo map charts
— Manipulate replicated datasets and work with data connectors
— Based on the Chatter REST API

Create Datasets and Prepare Data

Once data is pulled into Einstein Analytics through connections, you combine this data and prepare it through Dataflows or Data Recipes.
— Dataflow: Set of instructions that specifies what data to extract from Salesforce objects or other datasets
— Data recipe: A more user-friendly interface that allows you to combine and prepare your data. Some limitations however
Regardless of whether you use a dataflow or a data recipe, your end result will be a dataset
— Dataset — Collection of related data that is stored in a denormalized, yet highly compressed form.
— Types of fields in the dataset: Date, Dimension (qualitative value), Measure (quantitative value)

Other Terminology:

JSON — Javascript Object Notation, set of instructions that run to extract data and create datasets.
— The language used to describe the actions performed in the data flow.
Root object — lowest object in the hierarchy you are extracting.
Dataset builder — interface that allows you to load Salesforce data in a structured manner

Dataflow transformations

Append: Combine rows from multiple datasets into a single dataset
Augment: Add columns from another related dataset.
— If there is a null is the left dataset key, EA appends the right columns and inserts a null for the dimensions and 0 for the measures
— Will also match empty left and right keys, but if the left key isn’t unique EA will create one column per appearance of the key. Can utilize LookupSingleValue to get rid of the duplication but the matching row is randomly selected.
— LookupMultiValue will return results from all matched values into a comma-separated field. *Never use on date fields, will cause odd results
— If no match is found for the left key, the left columns will be populated the right null
computeExpression: Add derived fields to a dataset (using a SAQL expression). Use cases:
— Create a custom sorting on Stage Name, bucket fields together, rename field values
— Create a “Power of 1” solution within EA if you don’t have permissions inside Salesforce
computerRelative — Used to analyze trends by adding derived fields that are based on other rows in a source. Use case:
— Calculate number of days an Opportunity remains in each stage
Delta: Calculates changes in the value of a measure column over a period of time
Digest: Extracts synced and connected data in a dataflow
dim2mea: Creates a new measure based on a dimension and adds it to a new column while preserving the dimension
Edgemart: Connect dataflow to an existing dataset, use this to reference it then augment/transform from there
Export: Creates a data file and a schema file from the data. After running Einstein Discovery users can access these files through the public API.
Filter: Removed records from an existing dataset (can also be done inside sfdcDigest).
— Has a special syntax or can opt to use SAQL
sfdcDigest: Generates a dataset based on a Salesforce object; runs a SOQL query and is thus subject to SOQL limits.
— Can apply a filter using JSON or SOQL to improve performance
— Certain objects and fields are unsupported in EA. Ex: LoginHistory
sfdcRegister: Registers a dataset to make it available for queries.
sliceDataset: Removes fields from a dataset in your dataflow leaving you with a subset to use in a new dataset of transformations.
— Allows you to create multiple datasets in the same flow with different fields
— Can also hide fields with XMD but using sliceDataset will improve performance
Update: Updates a specified field values in an existing dataset based on data from another dataset (i.e., the lookup dataset)
Overriding metadata generating by a transformation: Can optionally override object and field attributes so that is appears different in the dataset.

Data Recipe Transformations

Key features of recipes (all can be done in dataflows as well)
— Combining data sources with composite keys
— Bucketing fields; especially helpful with relative ranges
— Customize the label of fields (and API names of recipe generated fields)
Add vs. Append
— Add: Vlookup style, more columns but no additional rows
— Append: More rows, no additional columns
Dataflow or data recipe?
— Flow best for predictable results, use to combine multiple data connections
— Recipes better for single source datasets where smart data cleaning is required

Special Scenarios:

Null Handling — Can be done in a data flow or JSON
— When grouping by a field that has nulls, EA remove them from the results by default; not an ideal solution!
— Dataflow solution: Use computeExpression and a case statement to replace null values
— JSON solution: Rather than adding extra nodes you can add a default value to the JSON field parameters. Example:

Date Handling — EA automatically generates extra fields for each date in the dataset to allow for extra calculations.
— Example: Create Date will become Create_Date_Week, Create_Date_Year, Create_Date_Fiscal, etc.
— Can configure the dataflow to handle fiscal periods that differ from the calendar year by setting the fiscalMonthOffset attribute to a value other than 0
— Can also set which fiscal year to use when the calendar changes using the isYearEndFiscalYear attribute
— Week Number is defaulted to start each week on Sunday but can be changed if necessary

Dataflow Limits

Total number of rows from all registered datasets cannot exceed 100 million rows per platform license
Can run a maximum of 60 dataflow jobs within a rolling 24-hour period
Dataflow subtab inside Monitor contains the last 7 days of dataflow jobs
Can have up to 35 separate dataflows
Datasets can contain a maximum of 5,000 fields
Max of 5 trended datasets per user, 100,000 rows per snapshot, 5M in entire dataset, 40M across all snapshots
Can sync a maximum of 100 objects
Max file size from External Data API is 40 GB, no more than 50 external jobs can run in a rolling 24-hour period

Next: Part 2 — Dashboard Design