Let’s FILL in some date gaps!
Einstein Analytics has many extremely useful functions (f(ₓ)) that we should all be aware of and use at the right time. Some of the common ones are:
- abs()
- avg()
- coalesce()
- count()
- max()
- min()
- number_to_string()
- round()
- sum()
- Date functions
Some of the other not so common but awesome ones are:
- ceil()
- fill()
- first()
- floor()
- len()
- lower()
- substr()
- upper()
Each of these can form its own post! But in this one, we will learn everything there is to know about fill(). When working with data, one of the issues we face at times is missing data or data/date gaps. Even the end of the year or end of quarter data gaps just before the end of the fiscal time period is a date gap. Most of the time, we are asked to fill these gaps with the previous year or dummy value that we or the person responsible for the final numbers wants to fill in. How about, instead of letting someone, a human, decide what the number should be, we let the Einstein Analytics platform decide it for us? How cool would it be to let our users know that this value didn’t come out of our head but there is data and seasonality and a lot of statistics involved behind it?!
fill() adds the missing months in the selected time-period with nulls. One point to note here is, that fill() works only with missing dates. You cannot ask for an Account to be filled with this functionality. On the other hand, time-series fills in the missing values for the missing dates and predicts the future. We can choose to not predict the future and use only to fill in gaps. This is also called estimations. An important point to remember is that estimations and predictions are as good as our data is. If our data is messy, we will get bad estimations and predictions. Hence it is necessary to use good/clean data as much as possible before we estimate the missing dates or predict the future. We will take up time-series in the next blog-post. Meanwhile you can refer time-series doc in Salesforce or this blog by Rikke Hovgaard.
How to fill()?
We will use our taxi data to fill in the gaps with nulls so that when viewing in a chart, the dates remain! This is how the data looks for 2 Taxi Companies on a chart with missing dates. We can see that there is no data for 2017–01–02 and 2017–01–03 for the first company and 2017–01–02 to 2017–01–04 for the second.
Now, all we need to do is turn this query to SAQL and use the fill() function to add in the missing dates! We can use this method for any measure field. In order to keep it simple, we will use count of rows (count()).
Turning the query from compact mode to SAQL, turns into the following:
q = load "Taxi_Trips";
q = filter q by 'Company' in ["1085 - 72312 N and W Cab Co", "3141 - 87803 Zip Cab"];
q = filter q by date('Trip_Start_Timestamp_Year', 'Trip_Start_Timestamp_Month', 'Trip_Start_Timestamp_Day') in [dateRange([2017,1,1], [2017,1,10])];
q = group q by ('Company', 'Trip_Start_Timestamp_Year', 'Trip_Start_Timestamp_Month', 'Trip_Start_Timestamp_Day');
q = foreach q generate 'Company' as 'Company', 'Trip_Start_Timestamp_Year' + "~~~" + 'Trip_Start_Timestamp_Month' + "~~~" + 'Trip_Start_Timestamp_Day' as 'Trip_Start_Timestamp_Year~~~Trip_Start_Timestamp_Month~~~Trip_Start_Timestamp_Day', count() as 'count';
q = order q by ('Company' asc, 'Trip_Start_Timestamp_Year~~~Trip_Start_Timestamp_Month~~~Trip_Start_Timestamp_Day' asc);
q = limit q 2000;
We start with breaking the date format further down so that instead of YYYY-MM-DD it shows as YYYY, MM, DD.
q = load "Taxi_Trips";
q = filter q by 'Company' in ["1085 - 72312 N and W Cab Co", "3141 - 87803 Zip Cab"];
q = filter q by date('Trip_Start_Timestamp_Year', 'Trip_Start_Timestamp_Month', 'Trip_Start_Timestamp_Day') in [dateRange([2017,1,1], [2017,1,10])];
q = group q by ('Company', 'Trip_Start_Timestamp_Year', 'Trip_Start_Timestamp_Month', 'Trip_Start_Timestamp_Day');
q = foreach q generate 'Company' as 'Company', 'Trip_Start_Timestamp_Year' as 'Trip_Start_Timestamp_Year', 'Trip_Start_Timestamp_Month' as 'Trip_Start_Timestamp_Month', 'Trip_Start_Timestamp_Day' as 'Trip_Start_Timestamp_Day', count() as 'count';
-- q = order q by ('Company' asc, 'Trip_Start_Timestamp_Year~~~Trip_Start_Timestamp_Month~~~Trip_Start_Timestamp_Day' asc);
q = limit q 2000;
Note that we are not deleting the order statement because we will use this in a later query. We have just commented it out for now.
Now, we add in the fill() function after the foreach/aggregate statement.
q = load "Taxi_Trips";
q = filter q by 'Company' in ["1085 - 72312 N and W Cab Co", "3141 - 87803 Zip Cab"];
q = filter q by date('Trip_Start_Timestamp_Year', 'Trip_Start_Timestamp_Month', 'Trip_Start_Timestamp_Day') in [dateRange([2017,1,1], [2017,1,10])];
q = group q by ('Company', 'Trip_Start_Timestamp_Year', 'Trip_Start_Timestamp_Month', 'Trip_Start_Timestamp_Day');
q = foreach q generate 'Company' as 'Company', 'Trip_Start_Timestamp_Year' as 'Trip_Start_Timestamp_Year', 'Trip_Start_Timestamp_Month' as 'Trip_Start_Timestamp_Month', 'Trip_Start_Timestamp_Day' as 'Trip_Start_Timestamp_Day', count() as 'count';
q = fill q by (dateCols=('Trip_Start_Timestamp_Year', 'Trip_Start_Timestamp_Month', 'Trip_Start_Timestamp_Day', "Y-M-D"), partition='Company');
-- q = order q by ('Company' asc, 'Trip_Start_Timestamp_Year~~~Trip_Start_Timestamp_Month~~~Trip_Start_Timestamp_Day' asc);
q = limit q 2000;
Breaking down the fill() statement:
- We start with the command fill. This is a standalone statement just like foreach, order etc.
- datecols is written in the format (Date_Year, Date_Month, Date_Day, “Y-M-D”). This format can be changed to (Date_Year, Date_Month, “Y-M”) or (Date_Year, Date_Quarter, “Y-Q”) or simply (Date_Year, “Y”).
- partition: The final piece of the puzzle! The data in this case is partitioned by ‘Company’. There can be only one dimension column/field that can be used for this purpose. In order to circumvent this, we can contenate the various dimension columns to form one column and use that as our patition.
Notice the null rows in the above image. That’s what fill() does!
Now, we will change the date format back to YYYY-MM-DD. To do that we just use the foreach statement that Einstein Analytics uses.
q = load "Taxi_Trips";
q = filter q by 'Company' in ["1085 - 72312 N and W Cab Co", "3141 - 87803 Zip Cab"];
q = filter q by date('Trip_Start_Timestamp_Year', 'Trip_Start_Timestamp_Month', 'Trip_Start_Timestamp_Day') in [dateRange([2017,1,1], [2017,1,10])];
q = group q by ('Company', 'Trip_Start_Timestamp_Year', 'Trip_Start_Timestamp_Month', 'Trip_Start_Timestamp_Day');
q = foreach q generate 'Company' as 'Company', 'Trip_Start_Timestamp_Year' as 'Trip_Start_Timestamp_Year', 'Trip_Start_Timestamp_Month' as 'Trip_Start_Timestamp_Month', 'Trip_Start_Timestamp_Day' as 'Trip_Start_Timestamp_Day', count() as 'count';
q = fill q by (dateCols=('Trip_Start_Timestamp_Year', 'Trip_Start_Timestamp_Month', 'Trip_Start_Timestamp_Day', "Y-M-D"), partition='Company');
q = foreach q generate 'Company', 'Trip_Start_Timestamp_Year' + "~~~" + 'Trip_Start_Timestamp_Month' + "~~~" + 'Trip_Start_Timestamp_Day' as 'Trip_Start_Timestamp_Year~~~Trip_Start_Timestamp_Month~~~Trip_Start_Timestamp_Day', 'count';
q = order q by ('Company' asc, 'Trip_Start_Timestamp_Year~~~Trip_Start_Timestamp_Month~~~Trip_Start_Timestamp_Day' asc);
q = limit q 2000;
As we can see, we are just using the same columns and except for the date field, we are just pulling in the columns from the first foreach statement. And then we open up the order statement which we had earlier commented out.
Even after we join the various date columns into one, we still have the rows that didn’t have data as null! This is the beauty of fill().
Now, all we do is change this back to the chart. And viola! This is how our final chart looks like.
Final Note
This is an extremely useful tool for users you want to see the data for all months irrespective of there being any data present! It helps them to visualize what they are missing easily and either fill it in with relevant data or ask the Einstein Analytics developers to use time-series to fill in those gaps!
Reference on fill: