Pre-ordering and filtering using the Google Analytics API — a starter guide
Chapter 4 of our Google Analytics API Guide for Absolute Beginners
Welcome back everyone to another installment of our “Google Analytics Reporting API for Beginners”.
In the previous installment of this series, we ended our discussion with a few basic parameters of the Google Analytics Reporting API request, such as setting how many entries you want returned and which view ID to query.
Today, we are going to discuss two important fields of the Google Analytics API that deserve an in-depth look if you want to get full business value from them:
- How to pre-order your results so they come back in an order to your liking.
- How filtering works in the reporting API and how to use metric and dimension filters to exclude unnecessary traffic.
With today’s article, we will have covered all basic and intermediate usage of the Google Analytics API, leaving only advanced usages such as Pivoting, Cohort Analysis, and Segmentation.
Those three areas will each receive dedicated attention from us, since they require not only an explanation of the request techniques, but also the theoretical data analytics concepts behind how you can apply them to your business.
We will jump right into where we left off last time, so if you are unfamiliar with our structure or need a primer, I recommend reviewing the last article first (linked below).
Let’s begin!
How to pre-order your data in your request
Parameter: orderBys
Input: a list of orderBy objects
Sorting your data is a fast and easy way to identify the pages and/or channels that contribute the most and the least to your business.
It can also make the visualizations you produce from your Google Analytics data much more appealing and easier to read for non-tech-savvy users.
While you can technically sort and clean up your data locally after you get it back from the Google Analytics API, Google Analytics offers an option to sort the data by specific columns automatically via the orderBys field.
This feature is particularly helpful when you are consuming (the technical term for using) data directly in your front-end dashboard as it is returned from the Google Analytics server, without any additional manipulation — or simply if you are not comfortable with, or want to save the time of, an additional data-cleaning step after you get your data.
The parameter requires you to input a list of orderBy objects, structured as shown below.
{
  "fieldName": string,
  "orderType": enum(OrderType),
  "sortOrder": enum(SortOrder)
}
fieldName is the column you want to sort your data on; this can be either a metric or a dimension column, as explained in the previous article.
The orderType tells Google Analytics what kind of value you want to sort your column on, and can be one of the following:
- ORDER_TYPE_UNSPECIFIED: treated the same as VALUE; this is what the API falls back to if you leave the field empty,
- VALUE: sorts on the value of that column, such as the session counts for the ga:sessions metric; this is the option you usually want when sorting,
- DELTA: if there are two date ranges in the request (remember that you can submit up to two date ranges in a single request?), you can let Google sort a column (such as ga:sessions) by the difference between those two date ranges. This is particularly helpful when you are trying to identify major drop-offs in your traffic between two periods (note: only compare weekly or monthly periods, or your result will be heavily influenced by seasonality).
- SMART: another uncommon order type that only applies to metric columns representing a ratio, such as "ga:bounceRate". I am not going into the details of how it is formulated, but just know that this sorting method is designed to give ratios with large denominators and numerators behind them more weight than ratios with smaller denominators and numerators behind them (check out the graphic below for an illustration).
- HISTOGRAM_BUCKET: recall that when you are selecting your dimensions, you can add a histogramBuckets option to convert continuous numbers, such as session counts, into smaller chunks such as "0–10, 10–20, 20+". By default, those buckets are sorted in dictionary order, such as "0–10, 10–20, 110–200, 20–30, 200–300", and you can switch them to numeric order ("0–10, 10–20, 20–30", and so on) by using this sorting option (see the sketch just after this list).
- DIMENSION_AS_INTEGER: in most cases, dimensions can be sorted directly using the default (dictionary) method. However, some dimensions, such as a user ID, contain numbers that GA does not recognize as numeric values; you can nevertheless sort them numerically with this option.
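To make HISTOGRAM_BUCKET sorting a bit more concrete, here is a minimal sketch of the relevant part of a request body, written as a Python dictionary. The view ID and bucket boundaries are placeholders, and ga:sessionCount is used as the bucketed dimension purely for illustration:

# Sketch: bucket ga:sessionCount and sort the buckets numerically
# instead of in dictionary order.
report_request = {
    "viewId": "XXXXXXXX",  # placeholder: use your own view ID
    "dateRanges": [{"startDate": "30daysAgo", "endDate": "yesterday"}],
    "metrics": [{"expression": "ga:users"}],
    "dimensions": [
        # histogramBuckets turns the continuous dimension into ranges
        # such as "1-9", "10-99", "100+".
        {"name": "ga:sessionCount", "histogramBuckets": ["1", "10", "100"]}
    ],
    "orderBys": [
        {
            "fieldName": "ga:sessionCount",
            "orderType": "HISTOGRAM_BUCKET",  # numeric bucket order
            "sortOrder": "ASCENDING",
        }
    ],
}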
After going through all potential sort types, let’s explore the last field of this parameter — sortOrder.
This field is rather simple and only contains two values (excluding SORT_ORDER_UNSPECIFIED, which you will never explicitly set anyway): ASCENDING and DESCENDING.
I know you probably know what this means, but just for good measure, ASCENDING means that you want the lowest number (whatever ordering method you choose) to be the first record, and highest to be the last record.
Whereas for descending order, you want the exact opposite to be the case.
Finally, GA accepts a list of orderBy objects in its request, and the order in which you place them in that list will actually affect the result you get.
This is because GA always applies the orderBy rules in order and discards any later rule that conflicts with an earlier one, so entries earlier in the list receive higher priority than entries later in the list.
For example, if you apply a default value sort to date, then bounceRate, then sessions, your data will only be sorted by date: GA will find the later orderBy objects (bounceRate, then sessions) in conflict with the first rule (assuming your dates are unique) and will therefore ignore them.
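To see what this looks like in practice, here is a rough sketch of an orderBys list inside a request body, written as a Python dictionary. The view ID is a placeholder, and the specific metrics and dimensions are just illustrative choices:

# Sort pages by sessions (descending), then use bounce rate as a tie-breaker.
# Because orderBys are applied in order, the sessions rule takes priority.
report_request = {
    "viewId": "XXXXXXXX",  # placeholder
    "dateRanges": [{"startDate": "7daysAgo", "endDate": "yesterday"}],
    "metrics": [
        {"expression": "ga:sessions"},
        {"expression": "ga:bounceRate"},
    ],
    "dimensions": [{"name": "ga:pagePath"}],
    "orderBys": [
        {"fieldName": "ga:sessions", "orderType": "VALUE", "sortOrder": "DESCENDING"},
        {"fieldName": "ga:bounceRate", "orderType": "VALUE", "sortOrder": "ASCENDING"},
    ],
}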
How to filter your data in your request
Filtering is another data analytics technique that will come in incredibly handy when analyzing your internet traffic data.
For example, when you are analyzing your data from the referral channel, you will see that the majority of your rows contain only 1 or 2 sessions, from websites you have probably never heard of.
To make sure that you are only analyzing data that is relevant, you probably want to include only the referral sources with over 10 sessions, and the filtering functions of the Google Analytics API can help you do just that.
Very confusingly, Google Analytics provides you with two ways to do the exact same operation described above: either through a combination of the parameters metricFilterClauses and dimensionFilterClauses, or simply through the filtersExpression parameter, which uses a filter string to do the same job.
Don’t ask me why they do it this way because I have no idea. I can only speculate that they were using the string method in their previous API version 3.0, and just didn’t bother removing it since it wasn’t entirely broken.
Anyhow, let’s go over both ways, and you can decide which one you want to use.
Way 1: metricFilterClauses and dimensionFilterClauses
We are going to start with the new way of filtering introduced in the new version 4 of the reporting API.
While this method takes a bit more explaining and understanding, I personally believe it is a lot more powerful than the old way of filtering, which I will explain as way 2.
A potential source of confusion is the names of the two parameters: you may think one removes metrics and the other removes dimensions from your dataset.
In fact, both parameters are only used for removing rows from your dataset.
The only difference between the two is that dimensionFilterClauses removes rows purely based on the attributes of the dimensions, such as their names. This removal happens before any data are aggregated, so the totals you get from Google Analytics reflect the metrics after the filtering has occurred.
On the other hand, metricFilterClauses removes rows from your dataset based on the metric values of those rows, and can be used for removing dimension values that have too few sessions or pageviews. However, this filtering happens after the data have been aggregated, so the "totals" statistics in your response may differ from the sum of your rows.
Now let’s have a deep look into both parameters.
dimensionFilterClauses
Input: list of dimensionFilterClause Object
{
  "operator": enum(FilterLogicalOperator),
  "filters": [{object(DimensionFilter)}]
}
As shown, the dimension filter clauses contain two primary fields.
The first field, operator, only accepts two possible values: "AND" or "OR".
If "AND" is selected, all the filters in the "filters" field below must hold for a row to be selected.
On the other hand, if "OR" is selected, a row only needs to pass any one of the conditions listed in the "filters" field to pass this filter clause and be selected.
Now let's look into the filters field, which contains a list of DimensionFilter objects, illustrated below:
{
  "dimensionName": string,
  "not": boolean,
  "operator": enum(Operator),
  "expressions": [
    string
  ],
  "caseSensitive": boolean
}
In most cases, dimension filters are based on whether a specific dimension satisfies a certain textual requirement, such as containing a word, starting with a word, and so forth.
In this object, dimensionName is the name of the dimension that you want to filter your rows on.
Next is the "not" field, an inverse selector that, if set to true, selects everything that does NOT meet the conditions of this filter.
It is particularly useful for excluding domains that you don’t want the filter to include, such as your internal traffic.
operator specifies how the dimension should be matched against your expressions, and can be one of the following, mostly self-explanatory, options:
REGEXP, BEGINS_WITH, ENDS_WITH, PARTIAL, EXACT, NUMERIC_EQUAL, NUMERIC_GREATER_THAN, NUMERIC_LESS_THAN, IN_LIST.
REGEXP is the default option for this field, and it stands for regular expression, a very popular way of matching string patterns.
Essentially, a regular expression is a sequence of characters that describes a pattern that many strings could fall under.
For example, if I want to find all strings that start with AB and end with DE, I can use the regular expression "^AB.*DE$", which means: the string starts (^) with AB, then there can be zero or more (*) of any character (.), before the string ends ($) with DE.
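If you want to play with that pattern before dropping it into a filter, here is a tiny sketch using Python's built-in re module; it has nothing to do with the GA API itself and only illustrates the pattern:

import re

# The string must start with "AB" and end with "DE", with anything in between.
pattern = re.compile(r"^AB.*DE$")

print(bool(pattern.match("ABDE")))      # True: zero characters in between
print(bool(pattern.match("AB123DE")))   # True
print(bool(pattern.match("XABDE")))     # False: does not start with AB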
To learn more about regular expressions, see the link below:
Let's now move on to the next field, expressions, which accepts a list of strings.
This is where you place the content of your matching expression, whether that’s “mywebsite” to work with a “BEGINS_WITH” operator to find anything that starts with your website, or a more complicated regular expression string like the one we showed you above.
Notice here that unless you are using the "IN_LIST" operator, only the first entry of the list will be considered by the system. If you are using the "IN_LIST" operator, it will only select rows whose dimension value appears in the list.
Set the caseSensitive option to true if you want your filtering to be case sensitive; otherwise, set it to false.
Let’s summarize this parameter with a few notes on how all the filters work together when combined.
The structure of this parameter is slightly more confusing than the other parameters we have seen before, so let's review it once more.
Within this parameter, you can submit multiple dimensionFilterClause objects, each of which may contain multiple DimensionFilter objects.
Within each dimensionFilterClause, you can use the operator field to decide whether a row is selected only if all of its DimensionFilters are true ("AND"), or if at least one of them is true ("OR").
Going up another level, if you submit multiple dimensionFilterClause objects to this parameter, all of them have to pass as true for a row to be selected (an "AND" between clauses), and there is no way of changing this.
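Putting those pieces together, here is a sketch of how a dimensionFilterClauses entry might look inside a request body, again as a Python dictionary; the view ID and the "/blog" and "draft" values are purely illustrative:

# Keep only rows whose page path begins with "/blog", excluding anything
# that contains the word "draft" (the "not" field inverts the second filter).
report_request = {
    "viewId": "XXXXXXXX",  # placeholder
    "dateRanges": [{"startDate": "30daysAgo", "endDate": "yesterday"}],
    "metrics": [{"expression": "ga:sessions"}],
    "dimensions": [{"name": "ga:pagePath"}],
    "dimensionFilterClauses": [
        {
            "operator": "AND",  # both filters below must hold for a row to pass
            "filters": [
                {
                    "dimensionName": "ga:pagePath",
                    "operator": "BEGINS_WITH",
                    "expressions": ["/blog"],
                    "caseSensitive": False,
                },
                {
                    "dimensionName": "ga:pagePath",
                    "not": True,  # inverse selector: exclude matches
                    "operator": "PARTIAL",
                    "expressions": ["draft"],
                },
            ],
        }
    ],
}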
metricFilterClauses
Input: list of metricFilterClause Object
Now let's talk about metric filter clauses, which have a structure very similar to dimension filter clauses but focus primarily on filtering out rows that do not meet a certain metric threshold.
{
  "operator": enum(FilterLogicalOperator),
  "filters": [{object(MetricFilter)}]
}
Once again, the operator field can be either "AND" or "OR", indicating whether you need all of the filters below to be true, or just one of them, for a row to be selected.
Moving on to the filters field, which consists of MetricFilter objects that look like the following:
{
  "metricName": string,
  "not": boolean,
  "operator": enum(Operator),
  "comparisonValue": string
}
Once again, similar structure, so let’s just glance through all of those.
metricName is the name of the metric you want to set your threshold on, and if you want to select everything for which the condition is NOT true, set not to true.
As for the operators, since value comparison is now the primary task, they are different from those of the dimension filters, but equally self-explanatory:
EQUAL, LESS_THAN, GREATER_THAN, IS_MISSING
The comparisonValue field, while typed as a string, usually takes the string representation of a number. This number works together with the operator to decide which rows pass the filter.
The same rules of combination apply here as for dimensionFilterClauses:
Within each metricFilterClause, you can use "AND" or "OR" in the operator field to decide whether you need all of the filters or just one of them to be true. When multiple metricFilterClause objects are combined, however, they all have to be true (an "AND" is applied).
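As a quick sketch, here is how a metricFilterClauses entry might look in a request body, once again as a Python dictionary with placeholder values, keeping only traffic sources that drove more than 10 sessions:

# Keep only traffic sources that drove more than 10 sessions.
report_request = {
    "viewId": "XXXXXXXX",  # placeholder
    "dateRanges": [{"startDate": "30daysAgo", "endDate": "yesterday"}],
    "metrics": [{"expression": "ga:sessions"}],
    "dimensions": [{"name": "ga:source"}],
    "metricFilterClauses": [
        {
            "operator": "OR",  # irrelevant here, since there is only one filter
            "filters": [
                {
                    "metricName": "ga:sessions",
                    "operator": "GREATER_THAN",
                    "comparisonValue": "10",  # string representation of the number
                }
            ],
        }
    ],
}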
Way 2: filtersExpression
Please don’t hate me after reading this — the second way is actually much simpler than the first way.
Not that I am trying to make you read more, or that I am a sadist who wants to see you struggle to wrap your mind around both metric and dimension filters — I simply believe the first way of filtering is a lot more intuitive and clean than the one I am about to show you.
In the second way, we get rid of the complex nested objects and instead deal with a single parameter that accepts a single string — filtersExpression.
filtersExpression
Input: a single, lonely, string
The principle here, however, is the same.
While the format of the input is vastly different, the second way is surprisingly similar to the first one, just without the object structure to keep everything neatly organized.
Essentially, you are submitting a long string that represents all the filters you want to place on your data, with each individual filter taking the following format:
ga:name operator expression
The ga:name part is the name of the dimension or metric you want to filter on; the operator, just like before, is how you would like to filter it; and the expression is the content of the filter.
However, since this particular parameter is a leftover from version 3 of the GA Reporting API, all the operators are represented with symbols instead of the text enumerations used in the previous method.
There are too many operators and representations to cover here, but you can find a reference for all of them in the link below:
Let's put all of this discussion into an example.
If I want to filter out all URLs that contain the word "ugly", and those with under 10 sessions, here are the two filters I would create:
"ga:pagePath!@ugly" and "ga:sessions>=10", where !@ means "does not contain the substring".
To combine those two filters, we use the same "AND" and "OR" logic as before, but this time the "AND" operator is represented with ";", whereas "OR" is represented as ",".
The full filter string is then "ga:pagePath!@ugly;ga:sessions>=10", since we want both conditions to hold at the same time.
Notice that in the string the "OR" operator always takes precedence over the "AND" operator, so "ga:sessions>=10;ga:pagePath=@ugly,ga:pagePath=@bad" means we are including traffic with at least 10 sessions that contains (=@) either "ugly" or "bad" in its page path, despite the fact that the "AND" operator came first in the formula.
Also note that, when filtering, you might find your URL contains one of the operator characters, such as ";" or ",", which will confuse the parsing of your filter request.
To solve this, simply add a backslash before those characters (for example "\;") to show that you are using them literally rather than as operators.
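To show where filtersExpression sits in practice, here is a rough sketch of a full request sent with the Python client from the googleapiclient library. The analytics object is assumed to be an already authenticated analyticsreporting v4 service (we will build one in the next installment), and the view ID is a placeholder:

# Assumes `analytics` is an authenticated analyticsreporting v4 service object,
# e.g. built with googleapiclient.discovery.build("analyticsreporting", "v4", ...).
body = {
    "reportRequests": [
        {
            "viewId": "XXXXXXXX",  # placeholder
            "dateRanges": [{"startDate": "30daysAgo", "endDate": "yesterday"}],
            "metrics": [{"expression": "ga:sessions"}],
            "dimensions": [{"name": "ga:pagePath"}],
            # Exclude pages containing "ugly" AND keep pages with 10+ sessions.
            "filtersExpression": "ga:pagePath!@ugly;ga:sessions>=10",
        }
    ]
}

response = analytics.reports().batchGet(body=body).execute()

If you prefer the object-based filters from way 1, you would simply swap filtersExpression for dimensionFilterClauses or metricFilterClauses in the same spot of the request.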
Now let’s talk about why this filtering method sucks.
First of all, this string filtering method cannot accommodate many of the more complex filtering needs users may have.
If I want to do the following with my data:
- For all URLs with between 1,000 and 2,000 sessions, I want to include only those with a bounce rate under 50%
- For all URLs with between 2,000 and 3,000 sessions, I want to include only those with a bounce rate under 40%
The "string filtering" way is simply not able to accommodate this analytical need, as it does not offer a clear way for us to group the filter clauses together, whereas the first way we showed you can handle it via the metricFilterClauses option.
Secondly, the string way of filtering is hard to read and inherently confusing, which makes debugging and interpreting other people's filter queries a lot harder.
If you have a slightly more complex filtering need, your string becomes extremely long. To fully understand your query string, a person needs to spend a long time deciphering what you are trying to do, which makes passing down knowledge very difficult — a problem that is much less severe with the object structure of the first way.
Overall, even though I think this is an inferior way of handling filtering compared with the first way, its simplicity deserves to be applauded, and you can use this filtering method in your requests if your filtering need is not too complex and will not run into the issues explained above.
However, never use the two methods at the same time — that would just be naughty.
Alright, that wraps up another installment of this series. As explained in the introduction, everything we have covered so far should be sufficient for you to conduct most analyses using the Google Analytics API.
Next, we are going to move on to helping you actually submit requests to Google Analytics by creating your own Python authentication and request client, and processing your analytical requests using basic Python.
I am aware that we skipped three major topic areas of the request itself, namely segmentation, pivots, and cohort analysis.
Those skips are intentional, and I will not forget to cover them in a future installment.
However, I really want to get started with something that you can actually use in your day-to-day operations, instead of lingering too much on discussing theoretical topics such as those.
Anyhow, I will see you all in the next installment of the series!
This article was produced by Humanlytics. Looking for more content just like this? Check us out on Twitter and Medium, and join our Analytics for Humans Facebook community to discuss more ideas and topics like this!