Building your own Google Analytics. A detailed step-by-step guide
This is the second part of our three-part story about analytics systems. Here you can find Part 1, which covers client analytics services.
Data is the new oil. Or the new coal. You name it. Beyond doubt, fast and correct data processing, followed by sound interpretation of the results, can lead almost any kind of business to success. Analytics systems are getting loaded with dozens of parameters, while the number of tracked triggers and user events grows rapidly. As a result, companies provide their analytics departments with loads of new information to study and convert into the right decisions.
In a nutshell, the significance of analytics systems should not be underestimated. These days almost every enterprise already uses some such system, chosen from a wide range of options: from lightweight tools for online shops (selling tasty donuts, for example) to complicated ones, like those developed specially for analyzing Elon Musk’s rocket launches. If you are a newcomer to this analytical world, which one should you choose as a company? Let’s start by exploring the two most common types of analytics systems: client analytics services and server-side analytics.
Client analytics services
A client analytics service is a service that an enterprise connects to its website or app via the official SDK, integrates into the codebase, and configures with trigger events. This approach has an obvious disadvantage: the gathered data cannot always be processed the way you would like, since every solution comes with its own limitations. For instance, in one system it is not easy to run a MapReduce task, while in another you cannot run a custom model at all. On top of that, the service fee is recurring and, generally speaking, high.
There is a wide range of client solutions on the market; however, sooner or later analysts admit that there is no universal service suitable for every unique task, while costs keep growing. Having reached this point, companies often decide to create their own analytics system with all the custom settings their decision makers need. If you would like to learn more about the specifics of working with different client services, read our previous article on this topic.
Server-side analytics is an analytics service deployed within the company, on its own servers and usually by its own means. In this model, all user events are stored on that server, which lets developers try different databases for storage and then choose the best one (or even several). Still want to process your data with other analytics services? Sure, that use case is also possible.
Server-side analytics can be deployed in several ways. The first way is to choose open source utilities, deploy them on your own servers and thoroughly develop business logic.
The second way is to use a SaaS solution (made by Amazon, Google, Azure) instead of deploying your own. Part 3 of our article will cover the SaaS option in more detail.
How to build analytics step by step
If you don’t want to integrate client analytics services and prefer to create your own, you first have to design the architecture. Below we describe the steps of this ‘invention’ process and list the instruments that can be used.
1. Data Query
Just like with client analytics, data scientists and analysts first put together all the events of interest and prepare a list for subsequent data gathering. These events usually happen in a specific order, which forms a so-called ‘events scheme’.
Next, imagine that your app (a website, a mobile or desktop app) has a number of regular users and is hosted on your servers. To transfer events from all these devices to the servers safely, an intermediate layer is needed. Also, different event queues may be produced if you track more than one app.
In our example, where we have lots of data producers and consumers (our devices and servers), Kafka helps to connect them. The consumers are described below, as they are the main actors of the next step; for now, let’s concentrate on the data producers.
For instance, if our app supports two operating systems, it is better for each to have its own event stream. Producers send their events to Kafka, where each event is appended to the end of the appropriate topic.
At the same time, Kafka allows consumers to read and process these streams later in mini-batches. Kafka is a convenient instrument that can be easily scaled to your needs (for example, by the location of events).
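To make the producer side concrete, here is a minimal sketch of routing and serializing events per platform. The topic names and event fields are hypothetical, and the `producer` object is only assumed to expose kafka-python’s `send(topic, value=...)` method, so the sketch stays library-agnostic:

```python
import json
import time

# Hypothetical topic names: one stream per platform, as described above.
TOPICS = {"ios": "events_ios", "android": "events_android"}

def topic_for(platform: str) -> str:
    """Route an event to its platform-specific Kafka topic."""
    return TOPICS[platform]

def serialize_event(event: dict) -> bytes:
    """Kafka stores raw bytes, so events are serialized to JSON."""
    event.setdefault("ts", int(time.time()))  # stamp arrival time if absent
    return json.dumps(event, sort_keys=True).encode("utf-8")

def send_event(producer, event: dict) -> None:
    """Push one event onto the end of its platform's topic."""
    producer.send(topic_for(event["platform"]), value=serialize_event(event))
```

With kafka-python, `producer` would be created as `KafkaProducer(bootstrap_servers="localhost:9092")`; keeping the serialization in pure functions makes it easy to swap the transport later.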
Kafka can be configured to be very efficient in terms of throughput, but its infrastructure management comes at a cost. Working with one shard is fine, but, as usual, things get more complicated at scale. You would probably not run a production setup on a single physical shard, because the architecture needs to be fault tolerant.
Besides, there is one more well-known solution, RabbitMQ. However, we have never used it in production for a client analytics queue (if you have, please tell us about your experience).
Before moving on to the next step, we should mention one more possible additional layer: raw logs storage. It is optional and becomes useful only if something goes wrong and the Kafka queues end up empty (something that should never happen), or if you use something other than Kafka. Storing raw logs doesn’t require a sophisticated or expensive solution: you just need them written somewhere in the right order (at least on an HDD).
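A raw logs layer really can be this simple. The sketch below, with hypothetical function names, appends one JSON event per line to a plain file and replays them in order, which is enough to refill an empty queue:

```python
import json
from pathlib import Path

def append_raw_log(path: Path, event: dict) -> None:
    """Append one event per line (JSON Lines), preserving arrival order."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event, sort_keys=True) + "\n")

def read_raw_logs(path: Path):
    """Replay events in their original order, e.g. to refill an empty queue."""
    with path.open(encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)
```

In production you would rotate these files by date and size, but the ordering guarantee is the only hard requirement.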
2. Stream Processing
After the events have been prepared and put into a queue, we move on to the processing part. Here we’ll describe the two most common options.
The first is to connect Spark Streaming (which also comes from the Apache ecosystem) for processing. The data lives in HDFS, a fault-tolerant distributed file system that keeps file replicas. Spark Streaming is a comfortable tool that processes data in mini-batches and scales well. However, it might be a bit hard to maintain.
The second option is to develop your own processor: create a Python app, build it inside a Docker container, and subscribe to the Kafka streams. As triggers arrive at the processor, event processing starts. This method requires the Python apps to run permanently.
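The core of such a custom processor is a small consume loop. In this sketch `consumer` is only assumed to be an iterable of messages with a `.value` bytes payload, matching kafka-python’s `KafkaConsumer` interface, and `handle` is whatever processing function comes next in the pipeline:

```python
import json

def process_stream(consumer, handle) -> None:
    """Drain messages from a Kafka consumer and pass decoded events to `handle`.

    `consumer`: any iterable of messages exposing a `.value` bytes attribute
    (kafka-python's KafkaConsumer fits).  Undecodable messages are skipped.
    """
    for message in consumer:
        try:
            event = json.loads(message.value)
        except (ValueError, AttributeError):
            continue  # skip garbage that is not valid JSON
        handle(event)
```

Because the loop takes the consumer as a parameter, the same code runs against a real `KafkaConsumer("events_ios", bootstrap_servers=...)` in the container and against a plain list of fake messages in tests.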
Let’s assume we have chosen one of the options above and move on to the processing itself. Processors should start with an event validity check, so as to drop garbage and invalid events. We usually use Cerberus for this validation task.
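To show what such a check amounts to, here is a minimal hand-rolled validator with a hypothetical event schema; Cerberus expresses the same idea declaratively (`Validator(schema).validate(event)`) with much richer rules:

```python
# Hypothetical event schema: field name -> (expected type, required flag).
EVENT_SCHEMA = {
    "name": (str, True),
    "platform": (str, True),
    "ts": (int, True),
    "value": (float, False),
}

def is_valid(event, schema=EVENT_SCHEMA) -> bool:
    """Drop events with missing required fields, wrong types, or unknown keys."""
    if not isinstance(event, dict):
        return False
    for key, (typ, required) in schema.items():
        if key not in event:
            if required:
                return False
        elif not isinstance(event[key], typ):
            return False
    # Reject events carrying fields the schema does not know about.
    return all(key in schema for key in event)
```

Invalid events are best counted and sampled into the raw logs rather than silently discarded, so schema drift in the clients is noticed early.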
After that, you are ready to map the data: events from different sources should be normalized and standardized before being written to the database together.
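Normalization is typically a field-renaming pass onto one shared schema. The platform names and field mappings below are invented for illustration; in practice each SDK has its own quirks to map away:

```python
# Hypothetical field mappings: each platform names the same fields differently.
FIELD_MAP = {
    "ios": {"event_name": "name", "timestamp": "ts"},
    "android": {"evt": "name", "time": "ts"},
}

def normalize(platform: str, raw: dict) -> dict:
    """Map platform-specific field names onto one shared schema."""
    mapping = FIELD_MAP[platform]
    out = {"platform": platform}
    for src, dst in mapping.items():
        if src in raw:
            out[dst] = raw[src]
    return out
```

Running normalization before validation (or validating twice) is a design choice; doing it here means the database only ever sees one canonical event shape.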
3. Database
The third step covers how to store your previously normalized events. Once the analytics system is up, you will run loads of queries against this database, so it is important to choose the most convenient one.
If your event data follows some fixed schema, it is worth considering ClickHouse or another column-oriented database. Aggregations will be extremely fast. But one day the fixed schema may become a disadvantage: if unusual events suddenly appear, you won’t be able to add them without additional preprocessing. Still, to stress it one more time, the speed of these solutions is great.
For unstructured data you may choose a NoSQL database, say, Apache Cassandra. It is distributed, replicates data across nodes, allows you to run several instances, and is fault tolerant as well.
One more option is to try a lighter database such as MongoDB. It is rather slow and better suited to medium data volumes, but it is very simple and, for this reason, a nice choice to start with.
4. Aggregation service
Having stored the events in the best available database, you probably want to finally extract some important information from all the gathered data and write it back. Globally, the objective of this processing is to build dashboards and calculate metrics. A common example is aggregating events into user profiles: events are aggregated and written to the database again, this time in an aggregated form (for example, as user profiles). Moreover, this aggregation system can have filters: suppose you want to enrich user profiles with only certain types of events rather than every new one. Then a filter should be connected to the events coordinator so that it sends only a subset of events to the aggregation service.
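As a minimal sketch of the aggregation step, the function below folds a stream of events into per-user profiles. The profile shape (event counts plus last-seen timestamp) is an assumption for illustration; real profiles would carry whatever your analysts need:

```python
from collections import defaultdict

def aggregate_profiles(events):
    """Fold raw events into per-user profiles: counts per event name
    plus the timestamp of the user's most recent event."""
    profiles = defaultdict(lambda: {"events": defaultdict(int), "last_ts": 0})
    for e in events:
        profile = profiles[e["user_id"]]
        profile["events"][e["name"]] += 1
        profile["last_ts"] = max(profile["last_ts"], e["ts"])
    return profiles
```

In production this fold runs incrementally (per mini-batch) and merges into the stored profile rather than recomputing from scratch.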
The results of these custom aggregations are also stored in the database. Next, it is possible to plug an external analytics service into our system. If your market analyst wants insights into user behavior without writing any SQL queries, you can integrate Mixpanel. But since Mixpanel is rather pricey, we would like to send it only selected events. To do this, we create another coordinator within our server-side analytics that sends specific events or aggregations to external analytics services, APIs or advertising platforms.
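The forwarding coordinator reduces to a whitelist filter. The event names below and the `send` callback are hypothetical; `send` stands in for whatever client pushes events to the external service:

```python
# Hypothetical whitelist: only these event types are worth the external fee.
FORWARDED_EVENTS = {"purchase", "signup"}

def forward_selected(events, send) -> int:
    """Pass only whitelisted events to an external service via `send`;
    return how many were forwarded."""
    count = 0
    for e in events:
        if e.get("name") in FORWARDED_EVENTS:
            send(e)
            count += 1
    return count
```

Keeping the whitelist in one place makes the external bill predictable: adding an event type to the paid service is a one-line, reviewable change.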
The final step is the creation and integration of a front end for our system. A nice, plain example is Redash, a GUI for databases that helps users create panels. The steps are simple:
- Making an SQL query
- Receiving a table
- Creating a ‘new visualization’ to get a pretty plot that can be exported.
Visualizations update automatically, so you can set up your own dashboards and track them. Redash is free when self-hosted; otherwise, prices start from $49/month for the SaaS version.
Having gone through all the steps described above, you will have created your own server-side analytics system. Be warned that the road is not simple at all, since your team has to tune everything itself. So we recommend weighing your analytics needs against the resources you are ready to allocate to the task, and then deciding whether you really need to build such a sophisticated system.
If you find the costs too high, there is a way to build cheaper server-side analytics, which we describe in Part 3 of this series.
Thank you for reading! Please, ask us questions, leave your comments and stay tuned! Find us at https://potehalabs.com