A lot has already been said online about Google Analytics. So why this new blog? Mainly because, even with all the help available online, it still took us weeks to access the data. We realized there is still room for a blog that can “demystify” Google Analytics: what’s paid, what’s free, and how to use it to track your business. Drawing on our experience, we have put together modular chunks of code (available on our git here) to help you quickly set up your analytics platform using web traffic data.
Getting access to the data
It’s quite simple to access GA data if you know how to work with Google APIs. If you’re a newbie, or have no time to learn how that code runs, we are here to help you out! All you need is a JSON key file (for your website), our ga_connect.py (from our git repository), and a few simple steps:
- Create your key file, ‘client_secret.json’, which lets you connect to Google’s Core Reporting API. Follow this link, which will guide you through generating the ‘client_secret.json’ file for your website. You need a service account linked to your website; this link walks you through the steps needed to set one up.
- Place client_secret.json and ga_connect.py in the same folder as your Jupyter notebook.
- Use this piece of code to establish a connection and access GA data.
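Before connecting, it can save time to confirm that the key file really is a service-account key. The sketch below is only an illustration and is not part of ga_connect.py: the function names `check_service_account_key` and `load_key_file` are hypothetical, and the list of required fields is our assumption about what a typical Google service-account key contains.

```python
import json

# Fields we assume a Google service-account key file contains (assumption).
REQUIRED_KEY_FIELDS = ("type", "client_email", "private_key", "token_uri")


def check_service_account_key(key: dict) -> str:
    """Return the service-account email if `key` looks usable, else raise."""
    missing = [field for field in REQUIRED_KEY_FIELDS if field not in key]
    if missing:
        raise ValueError(f"key file is missing fields: {missing}")
    if key["type"] != "service_account":
        raise ValueError(f"expected a service-account key, got type={key['type']!r}")
    return key["client_email"]


def load_key_file(path: str = "client_secret.json") -> dict:
    """Read the JSON key file that sits next to the notebook."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Calling `check_service_account_key(load_key_file())` returns the service account's email address. That address usually needs to be added as a user with read access in your GA settings before any report request will succeed.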
Google lets you download whatever you can see on your GA dashboard. But there is a catch! The granularity of the accessible data depends on whether you hold a premium or a non-premium account. GA offers a premium membership that gives access to user-event-level information and comes with an annual fee of $150k paid to Google. Even with a non-premium account, though, we can access pretty much whatever we can see on the dashboard; all we need to define is the time range, the metrics, and the dimensions. Explore the ‘metrics and dimension explorer’ to get the hang of what is available. Use the get_df function to import the downloaded data directly into a pandas data frame.
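To make "time range, metrics, and dimensions" concrete, here is a minimal sketch of what a report request and its flattened response can look like. It assumes the request and response shapes of the Reporting API v4 `reports.batchGet` method; the helper names `build_report_request` and `report_to_rows` are hypothetical and are not the get_df function from our repository.

```python
def build_report_request(view_id, start_date, end_date, metrics, dimensions):
    """Build a request body for one view, one date range."""
    return {
        "reportRequests": [
            {
                "viewId": view_id,
                "dateRanges": [{"startDate": start_date, "endDate": end_date}],
                "metrics": [{"expression": m} for m in metrics],
                "dimensions": [{"name": d} for d in dimensions],
            }
        ]
    }


def report_to_rows(response):
    """Flatten a report response into a list of {column: value} dicts."""
    rows = []
    for report in response.get("reports", []):
        header = report["columnHeader"]
        dim_names = header.get("dimensions", [])
        metric_names = [
            entry["name"]
            for entry in header["metricHeader"]["metricHeaderEntries"]
        ]
        for row in report.get("data", {}).get("rows", []):
            record = dict(zip(dim_names, row.get("dimensions", [])))
            # Only one date range is requested, so use the first metrics entry.
            record.update(zip(metric_names, row["metrics"][0]["values"]))
            rows.append(record)
    return rows
```

The values come back as strings, so convert them to numbers before modelling. The list of dicts can be passed straight to `pandas.DataFrame(...)` if you want a data frame.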
Setting up your analytics platform
Google Analytics makes data available across a vast number of metric and dimension combinations. We were working for a client who was entering a new market and using their website to sell their products.
Any business can benefit from knowing how many customers to expect in the future. We cannot overstate the benefits of knowing this one number. Depending on the kind of business, it can reduce uncertainty in different areas and answer pressing questions like “How much inventory should I keep?” or “How do I allocate my resources?” (the sales force, in our case).
In our case, knowing the number of visitors to the website tomorrow may not help much, since the website is already up and running and designed to handle moderate traffic. Knowing the number of people who will “enroll” for the services in the future, however, lets us forecast our demand (and many other things), which in turn can answer a lot of important questions.
Forecasting conversions using GA
We worked on creating a ‘self-learning time series forecasting model’ to predict future enrollments. The entire model is a pipeline of two separate models: a seasonal ARIMA model, and a dynamic regression model that works on top of it to predict future enrollments. Web traffic data is very likely to show seasonality (weekly, monthly, yearly, etc.) in user flow, so we decided to use the seasonal ARIMA model as the first step in forecasting. We built a ‘self-learning’ SARIMA model by auto-tuning the hyperparameters of a seasonal ARIMA model. The following piece of code shows how a grid of hyperparameters can be defined using our code to configure the seasonal ARIMA model.
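As a rough illustration of what such a grid looks like, here is a minimal sketch that enumerates every combination of candidate values. The function name `sarima_configs` and the default value ranges are our assumptions for this example, not the exact grid in our repository. Each entry is shaped like the `order`, `seasonal_order`, and `trend` arguments accepted by the SARIMAX class in statsmodels.

```python
from itertools import product


def sarima_configs(p=(0, 1, 2), d=(0, 1), q=(0, 1, 2),
                   P=(0, 1), D=(0, 1), Q=(0, 1),
                   m=(7,), trend=("n", "c", "t", "ct")):
    """Return every (order, seasonal_order, trend) combination in the grid."""
    return [
        ((p_, d_, q_), (P_, D_, Q_, m_), t_)
        for p_, d_, q_, P_, D_, Q_, m_, t_
        in product(p, d, q, P, D, Q, m, trend)
    ]


configs = sarima_configs()
print(len(configs), "candidate models, first one:", configs[0])
```

Here `m=(7,)` assumes daily data with a weekly cycle. The grid grows multiplicatively, so it pays to keep each range small.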
A seasonal ARIMA model has parameters that need to be configured correctly to capture the order, seasonality, and trend in the data. We included plausible values for the 8 SARIMA parameters (the non-seasonal orders p, d, q; the seasonal orders P, D, Q and the seasonal period m; and the trend term) in the grid search, and the code returns the best 3 models based on RMSE using a walk-forward validation approach.
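Walk-forward validation is easy to get subtly wrong, so here is a minimal sketch of the idea. To keep it dependency-free, it scores a simple seasonal-naive forecaster instead of a fitted SARIMA model; that stand-in, the function names, and the toy series are all assumptions made for this illustration. In practice `forecast_fn` would fit a SARIMA model on `history` and return its one-step forecast.

```python
from math import sqrt


def seasonal_naive(history, period):
    """Stand-in forecaster: repeat the value seen one season ago."""
    return history[-period] if len(history) >= period else history[-1]


def walk_forward_rmse(series, n_test, forecast_fn):
    """Forecast the last n_test points one step at a time; return the RMSE."""
    history = list(series[:-n_test])
    squared_errors = []
    for actual in series[-n_test:]:
        predicted = forecast_fn(history)
        squared_errors.append((actual - predicted) ** 2)
        history.append(actual)  # reveal the true value before the next step
    return sqrt(sum(squared_errors) / len(squared_errors))


def best_configs(series, n_test, candidates, top_n=3):
    """Score each named forecaster and keep the top_n with the lowest RMSE."""
    scored = [
        (walk_forward_rmse(series, n_test, fn), name)
        for name, fn in candidates.items()
    ]
    return sorted(scored)[:top_n]


# Toy series with a cycle of length 3 (hypothetical data).
series = [10, 12, 14, 10, 12, 14, 10, 12, 14, 10, 12, 14]
candidates = {
    f"period={k}": (lambda history, k=k: seasonal_naive(history, k))
    for k in (1, 2, 3, 4)
}
print(best_configs(series, n_test=3, candidates=candidates))
```

The key point is that each forecast only sees data up to the previous day, which mirrors how the model would actually be used.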
For a time-series forecasting model, some variables are not part of the series itself but still influence it in one way or another. For example, a website might attract more traffic on weekdays than on weekends. Similarly, a website might be visited more often during the first and last days of the month; think of a website where payment is due at the start of each month. In our dynamic regression model, apart from the ‘best’ SARIMA model from the previous step, we also included these ‘exogenous’ variables, which can help better predict future user enrollment.
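The calendar effects described above can be turned into exogenous variables with a few lines of code. This is a minimal sketch: the function names and the choice of treating the first and last 3 days as the "start" and "end" of a month (`edge_days=3`) are assumptions for illustration, not the exact features in our notebook.

```python
import calendar
from datetime import date, timedelta


def calendar_features(day, edge_days=3):
    """Return weekday and start/end-of-month flags for one date."""
    days_in_month = calendar.monthrange(day.year, day.month)[1]
    return {
        "is_weekday": int(day.weekday() < 5),  # Monday=0 ... Sunday=6
        "is_month_start": int(day.day <= edge_days),
        "is_month_end": int(day.day > days_in_month - edge_days),
    }


def exog_rows(start, n_days, edge_days=3):
    """Build one feature row per day, aligned with a daily series."""
    return [
        calendar_features(start + timedelta(days=i), edge_days)
        for i in range(n_days)
    ]


for row in exog_rows(date(2020, 2, 28), 4):
    print(row)
```

The same function can generate features for future dates, which is what makes calendar variables convenient regressors: their future values are always known.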
The above code can be used to forecast any time series data. For SARIMA auto-tuning, define the grid and pass in the time series to be forecast as a pandas Series object. The list of exogenous variables can be altered to suit the context in which the forecasting is performed.
Once we know the forecast for future demand, understanding what drives that demand can be very valuable, especially for a nascent company. GA tracks a myriad of attributes about users: the channel they came through, their age, their gender, the device they are using, etc.
We use these ‘dimensions’ to estimate the effect of different user attributes and to answer questions like “are more females subscribing or more males?”, “which marketing campaign is driving more conversions?”, or more broadly “what’s working for us? what does our typical customer look like?”. If we can answer these questions, we can try to target ‘lookalike’ leads who are more likely to sign up, and hence make our marketing efforts more effective.
To estimate the importance of these dimensions, we built a model whose response variable was a derived metric, the ‘Pass-Through Rate’ (PTR), and whose regressors were the number of ‘Good Sessions’ on a particular day, split by different dimensions (refer to the Audience tab of your GA dashboard). This was the maximum level of granularity we could achieve with the free version of GA. We defined the PTR as the percentage of signups that were completed on that day. “Good sessions” were the count of sessions that didn’t bounce. GA defines a bounced session as one in which the user didn’t interact with the website at all, and the bounce rate is the ratio of bounced sessions to total sessions for that day. We defined the number of ‘good sessions’ as session_count * (1 - bounce_rate) at a daily level for each dimension split. For more details, please refer to our Jupyter notebook here.
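The ‘good sessions’ formula is simple, but the units of the bounce rate are an easy place to slip. Here is a minimal sketch; the function names and the row field names (`date`, `segment`, `sessions`, `bounce_rate`) are hypothetical and chosen only for this example.

```python
def good_sessions(session_count, bounce_rate):
    """Sessions that did not bounce; bounce_rate is a fraction in [0, 1]."""
    if not 0.0 <= bounce_rate <= 1.0:
        raise ValueError("bounce_rate must be a fraction between 0 and 1")
    return session_count * (1.0 - bounce_rate)


def daily_good_sessions(rows):
    """Map (date, segment) to good sessions for a list of daily rows."""
    return {
        (row["date"], row["segment"]): good_sessions(
            row["sessions"], row["bounce_rate"]
        )
        for row in rows
    }


rows = [
    {"date": "2020-01-01", "segment": "mobile", "sessions": 200, "bounce_rate": 0.25},
    {"date": "2020-01-01", "segment": "desktop", "sessions": 80, "bounce_rate": 0.5},
]
print(daily_good_sessions(rows))
```

If your export reports the bounce rate as a percentage (for example 25 rather than 0.25), divide it by 100 first; the range check above is there to catch exactly that mistake.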
Before putting all of them together in one model, we built a separate model for each dimension with PTR as the target variable, and used LASSO regularization for feature selection to reduce the number of regressors. Thanks to its interpretability and shrinkage properties, LASSO helps separate the features that significantly affect the response variable from those that don’t, by driving the betas of unimportant variables to 0. Running LASSO on the individual dimension-level models gave us a list of the metrics that are significant by themselves in explaining our PTR.
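To show how LASSO drives a weak coefficient to exactly 0, here is a small dependency-free sketch of LASSO fitted by coordinate descent. It is an illustration of the technique, not the code from our notebook: the function names and the toy data are hypothetical, and it assumes the feature columns and the target are already centred (so there is no intercept). The objective is (1/2n)·||y − Xb||² + alpha·||b||₁, which is also the form that scikit-learn's `Lasso` minimises.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero; return exactly 0 inside the threshold band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_sweeps=100):
    """Fit LASSO coefficients for centred, non-constant columns of X."""
    n, k = len(X), len(X[0])
    beta = [0.0] * k
    for _ in range(n_sweeps):
        for j in range(k):
            # Correlation of feature j with the residual that excludes it.
            rho = sum(
                X[i][j]
                * (y[i] - sum(X[i][m] * beta[m] for m in range(k) if m != j))
                for i in range(n)
            ) / n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, alpha) / scale
    return beta


# Toy data: y depends strongly on the first feature, weakly on the second.
X = [[-2, 1], [-1, -2], [0, 2], [1, -2], [2, 1]]
y = [3 * a + 0.1 * b for a, b in X]
print(lasso_coordinate_descent(X, y, alpha=0.5))
```

On this toy data the strong feature keeps a (slightly shrunken) coefficient, while the weak one is set to exactly zero. That is the behaviour we relied on to prune regressors.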
Although LASSO helps select features, ranking feature importance across dimensions, where features can interact with each other, is better done with a RandomForest regressor. Refer to this video and this link to understand how RandomForest works and how variable importance is calculated. Since our metrics were day-level summaries of web traffic, we accounted for the autocorrelation in the data separately, by adjusting PTR for the effect of lag before reporting the variable importance.
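There are several ways to account for lag. One common option, shown here purely as an assumption and not necessarily the adjustment used in our notebook, is to attach the previous day's PTR to each row so the model can absorb the day-to-day carry-over. The function name `add_lag_feature` and the `ptr` field name are hypothetical.

```python
def add_lag_feature(rows, target="ptr", lag=1):
    """Attach the target's value from `lag` days earlier to each daily row.

    The first `lag` rows are dropped because they have no earlier value.
    """
    lag_name = f"{target}_lag{lag}"
    return [
        {**rows[i], lag_name: rows[i - lag][target]}
        for i in range(lag, len(rows))
    ]


daily = [
    {"date": "2020-01-01", "ptr": 0.10},
    {"date": "2020-01-02", "ptr": 0.20},
    {"date": "2020-01-03", "ptr": 0.30},
]
for row in add_lag_feature(daily):
    print(row)
```

The rows must already be sorted by date with no missing days, otherwise the "previous" value will not be the previous day's.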
Put together, this simple analysis can give a website owner actionable insights. Business insights can be drawn very easily from the most important variables in our model. For example, good sessions that came through the AT&T network, from Apple devices, via any TV/radio/print media marketing campaign, or via the Facebook social campaign had the most positive impact on our PTR, while visitors in the 35–44 age group had a significant negative impact on it. From here, we can refine our marketing efforts to boost what is working and improve what is going wrong!
After completing this exercise, we got a sense of how beautifully these ideas are intertwined: what we can affect today is only part of what we want to improve. But if we learn to identify and improve that part today, we can affect the whole of tomorrow, both directly and indirectly.
We hope you found this blog useful! Drop your comments if you face any difficulty and we will try our best to help you.