How we implemented a funnel and user journey tracker service with ElasticSearch

Talha Ocakçı · Commencis · Aug 16, 2018 · 10 min read

“Real-time response” is the key phrase for hard engineering questions. Even if a question has already been “solved” by the community, implementing it on our own really challenges us to select the correct technology. How do we make the response real-time with millions, maybe billions, of data points?

We were given the requirement of implementing a funnel, used for finding the conversion and churn rates of users at each step, with the steps defined dynamically by the user, in a given time period. Each member of the team came up with a different approach.

A funnel query should filter on event attributes and device properties, but only on their values at the moment the event was generated. It should also filter on user properties: current values, which may change at any time.

Another challenge is that steps do not have to occur successively. Any number of other events may occur between steps, and you need to skip them.

We wanted to include all three requirements:

  • real-time
  • dynamic funnel querying, with no a priori aggregation of data
  • skipping the intermediate events
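To make these requirements concrete, here is a rough sketch of what a dynamically defined funnel carries. The field names are hypothetical, not our actual API:

    # Hypothetical funnel definition: ordered steps with per-step filters,
    # a filter on current user properties, and a time window.
    funnel = {
        "steps": [
            # event attributes and device properties are matched against
            # their values at the moment the event was generated
            {"event": "pageView", "attributes": {"viewId": "login"}},
            {"event": "buttonClick", "attributes": {"buttonId": "register"},
             "device": {"os": "iOS"}},
        ],
        "user_properties": {"plan": "premium"},  # current values, may change anytime
        "time_window": {"from": "2018-08-01", "to": "2018-08-15"},
        "skip_intermediate_events": True,        # steps need not be consecutive
    }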

Excluding the real-time requirement would have prevented lots of headaches: we could mail the results “later on” and allow ourselves much more time.

Excluding the dynamic funnel querying would allow us to aggregate the streaming data into predefined funnels and let us respond in real-time.

That dilemma made us think hard about picking an appropriate technology. Nothing can be done purely in memory, since billions of data points cannot be loaded into memory. That’s why we need a persistent data store.

But: How can persistent data — with millions of data points — be queried in real-time after sorting the data according to creation times?

First, that made us think about time-series databases. And then graph databases, because the data looks like a bunch of branches consisting of a bunch of branches consisting of a bunch of…

Trying to find the correct technology

This is where a company has to make a costly decision: spend the human resources to implement the very same thing with several technologies, and accept that N-1 of the implementations will be thrown into the trash bin, draining the budget.

Thanks to Commencis, we could try three different technologies. We abandoned two of them over time and picked the third: ElasticSearch with custom script implementations.

Our fourth alternative was the Hadoop ecosystem, but we did not rely on that ecosystem, since we did not have enough experience with it at the time.

Let me tell you our trial and error story.

Trying and giving up on a time-series DB

At first sight, a funnel implementation is a one-liner for time-series data: put the events into the DB and everything will come out sorted. Looks easy and reasonable. But wait: billions of data points from millions of users will end up on the same line.

And to check whether a user moved on to an event, you may need to traverse all the lines. All the tables…

First attempt at constructing an optimized DB structure

Let’s symbolize the user-event pairs as below. The first value is the user ID and the second is the event ID. Event attributes, device properties, and user data are omitted for simplicity.

1-A  2-A  3-A  1-B  2-A  1-B  4-A  4-B  1-C

User activities as a time-series dataset

Let’s find the users that realized the A and C events, regardless of the time window and the events between them.

User 1 realized the events A B B C in this series, so they completed both A and C. To figure this out, we traversed the events of all users: we could not jump to the 1-C event without hitting every event before it.
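In code, this first attempt boils down to a single pass over the global, time-ordered stream, tracking each user’s progress. A minimal sketch of the idea, not our actual implementation:

    # Single pass over a time-ordered stream of (user_id, event) pairs:
    # every event of every user must be visited, none can be skipped.
    def users_completing(stream, steps):
        progress = {}              # user_id -> index of the next expected step
        completed = set()
        for user_id, event in stream:
            i = progress.get(user_id, 0)
            if i < len(steps) and event == steps[i]:
                progress[user_id] = i + 1
                if i + 1 == len(steps):
                    completed.add(user_id)
        return completed

    stream = [(1, "A"), (2, "A"), (3, "A"), (1, "B"), (2, "A"),
              (1, "B"), (4, "A"), (4, "B"), (1, "C")]
    print(users_completing(stream, ["A", "C"]))   # {1}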

Second attempt

We tried to break the line into users by selecting and separating the activities.

Grouping the events by user id

1: A B B C

2: A A

So, one of the series can simply be shown as below for just one user:

1: A B B C

Activities of user 1

We need to traverse all the lines to check all the users. If a user completed the desired series, we add them to our result list. But nothing has really changed, because we still need to traverse all the events without being able to skip any activity.
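Per user, the check is a plain subsequence test. It naturally skips intermediate events, but it still visits every single activity; a sketch:

    # Is `steps` an ordered subsequence of this user's events? The iterator
    # is consumed in order, so intermediate events are skipped logically,
    # yet every activity is still read.
    def user_completed(events, steps):
        it = iter(events)
        return all(step in it for step in steps)

    print(user_completed(["A", "B", "B", "C"], ["A", "C"]))  # True  (user 1)
    print(user_completed(["A", "A"], ["A", "C"]))            # False (user 2)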

Third attempt

We needed to avoid traversing all the events that are unrelated to the events in the funnel steps. That’s why we tried to group events by event name this time. The keys would be event names, and the values would be the user ID and other metadata of each activity.

A: 1 2 3 2 4

B: 1 1 4

C: 1

Now, since our funnel consists of events A and C, we may eliminate all other events in order to minimize the dataset.

This attempt seemed successful. But when we tried to write the DB queries, we found that combining and sorting all the data of events A and C was required. Again, we needed a sort operation first and then a linear search per user. This may cause N*M iterations over the dataset, where N is the user count and M is the event count.

Time complexity is somewhere between O(N) and O(N*M), which may not result in a real-time query when the data grows quickly.
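A sketch of why this still hurts: for every user, the per-event lists have to be re-combined and re-sorted before the order can be checked. The timestamps below are made up for illustration:

    # Event name -> occurrences as (timestamp, user_id), as stored in attempt three.
    index = {
        "A": [(100, 1), (101, 2), (103, 3), (105, 2), (107, 4)],
        "C": [(110, 1)],
    }

    def users_completing(index, steps):
        completed = set()
        users = {u for s in steps for _, u in index.get(s, [])}
        for user in users:                                  # N users...
            last = -1
            for step in steps:                              # ...times M-ish scans
                times = sorted(t for t, u in index.get(step, []) if u == user)
                nxt = next((t for t in times if t > last), None)
                if nxt is None:
                    break
                last = nxt
            else:
                completed.add(user)
        return completed

    print(users_completing(index, ["A", "C"]))              # {1}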

Trying and abandoning Neo4J

Watching users visit web pages and perform mobile activities, I realized the events seemed to fit some type of graph DB. So I tried to implement different graph structures.

First attempt

I used event names as nodes and realizations as edges. I would find whether a user completed all the events or, if they left the sequence, where they left it.

The chart below simply shows the history of the activities of user 1. The timestamps can be thought of as Unix epoch timestamps. Simply put, the user realized activity A at time 100 as the first activity of the session, then activity A at time 108, etc…

We can find the users that realized A, B, and C successively with pseudo queries like these:

A -> [userid:x] -> B

A -> [userid:x] -> B -> [userid:y] -> C where x = y

Getting the difference of the second query from the first query will yield the user IDs that left the sequence on step C.

The query is that simple. This is one of the pros of a graph DB. Cool. But our requirement was to skip the intermediate events, which is done by a query like the one below. Note the “*” characters, which represent skipping the intermediate events.

A -> [userid:x]* -> C -> [userid:y]* -> D where x = y

Even if the “skip” character seems harmless, it simply causes a full scan of the table without any indexes, which can be a disaster. This is because a graph DB is good at querying nodes, not edges. My second attempt would be to swap nodes and edges.

Second attempt

As with the time-series DB, a tweak was required to prevent a full scan of the graph DB. That’s why I used each event realization as a node, and the metadata as edges. This resulted in billions of nodes.

Occurrence of A -> [time and user data] -> Occurrence of B

The resulting graph structure can be seen below. There is an isolated graph per user. If you look at the graph carefully, you will see it is exactly the same as our time-series approach.

We may query the users that performed the A and B events successively with a query such as this:

select user from (event name = A) -> user:User -> (event name = B)

This is aligned with our second time-series attempt. All other events are eliminated, but, again, a linear scan of the table has to be performed. So, clearly, we did not leverage the graph query features.
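For the record, that pseudo query maps to real Cypher roughly as follows. This is a sketch that assumes occurrences are nodes with event and user properties, chained by NEXT edges; it is not our production query:

    from neo4j import GraphDatabase

    # Hypothetical model: (:Occurrence {event, user})-[:NEXT]->(:Occurrence {...})
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

    CYPHER = """
    MATCH (a:Occurrence {event: 'A'})-[:NEXT]->(b:Occurrence {event: 'B'})
    WHERE a.user = b.user
    RETURN DISTINCT a.user AS user
    """

    with driver.session() as session:
        users = [record["user"] for record in session.run(CYPHER)]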

Conclusions so far

Working with these two technologies, through trial and error we figured out:

  • Eliminating the events that are not related to events inside the funnel is required.
  • A linear scan is required at the last step. This may have a high time complexity and also memory complexity, since we need to sort the events in memory.
  • We need a well-designed data structure. Inserting each event according to our structure is a must. We cannot use the raw data for querying.
  • Once we used an intermediate data structure and performed a linear scan, the fancy technologies did not help us any further; they only made managing the stack more cumbersome.

Reverting back

Our conclusions led us back to re-using our existing technology stack. What we would do now is:

  • create an intermediate document to store data in order to query the funnel in real time
  • query that intermediate document with as low time complexity as possible by eliminating the useless events
  • select the right approach within ElasticSearch

Implementing a funnel with ElasticSearch

ElasticSearch is a document-based data store built for really fast search and data aggregation on top of its index structure.

Documents are mostly JSONs and are stored in indexes. As the indexes grow, search performance decreases.

ElasticSearch scripts are used to run some operations on each document, one by one.

ElasticSearch scripts are really fast as long as you don’t hit the source of the document. Hit and load only the attributes into memory, since the attributes may be indexed for fast access.
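The difference shows up in script queries. A minimal sketch with the Python client; the index and field names are made up, and userId is assumed to be an indexed keyword field:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # The script reads the indexed doc values via doc['...'] (fast) instead
    # of params._source, which would load and parse the whole document.
    query = {
        "query": {
            "bool": {
                "filter": {
                    "script": {
                        "script": {
                            "lang": "painless",
                            "source": "doc['userId'].value == params.uid",
                            "params": {"uid": "user1"},
                        }
                    }
                }
            }
        }
    }

    response = es.search(index="user-journeys", body=query)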

First attempt

Our first script attempt used the source() method to load the whole document into memory and then access its attributes. Our intermediate document held all the events of a single user for all time.

Intermediate document sample for user 1 activities
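Reconstructed from the description, such a document looked roughly like this (field names are illustrative):

    # Hypothetical per-user intermediate document: one ever-growing document
    # holding every event of user1 with its timestamp, in time order.
    user1_doc = {
        "userId": "user1",
        "events": [
            {"name": "pageView", "viewId": "login", "timestamp": 1534412400},
            {"name": "buttonClick", "buttonId": "register", "timestamp": 1534412460},
        ],
    }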

That meant a linear search over all events of each user. The JSON says user1 triggered two events at different times, in this time order.

This JSON will grow over time, and maybe a year of data will be stored in a single document. That would make query performance disastrous.

That’s why we broke these documents into one-day chunks. This was the single “planning” restriction of our funnel implementation: if a user triggers an event at 23:59 today and another at 00:01 tomorrow, we will not detect that the funnel was completed. This was a drawback for our real-time requirement.

That’s why we decided to further partition the pageView and buttonClick events into smaller parts. We did it by appending the crucial attributes (identifier attributes) to the event names. For instance, if the clicked button is the “register” button, the event name becomes “buttonClick-register”, as below:

We chunked the events into even smaller parts.

That eliminated the repetition of the viewId attribute, and we successfully partitioned pageView.

If a user viewed 1000 pages that particular day, and the funnel was querying the transition from login to home-screen, we would be dealing with only a few events.

We tried our query against 1M events, and the query time was 40 seconds. This was far, far away from our expectations. Finding the real cause was cumbersome; it took us three days.

The problem was using the source() method and parsing JSON. To overcome it, we needed to use plain strings instead of JSON objects, as below. Now each attribute corresponds to a very specific activity, and its value is a string containing all realization times joined with dashes. With this structure, we pay neither the JSON parsing overhead nor the cost of the source() method.
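A day’s document for a user then looked roughly like this; the names are illustrative. Each key is one very specific activity, and the value packs all of its realization times into one plain string:

    # Hypothetical flattened daily document: one indexed attribute per
    # specific activity, with all realization epoch times joined by dashes.
    user1_day_doc = {
        "userId": "user1",
        "day": "2018-08-16",
        "pageView-login": "1534412400-1534415000",
        "pageView-home-screen": "1534412500",
        "buttonClick-register": "1534412460",
    }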

Here we overcame another limitation of ElasticSearch: An attribute must be smaller than 32KB to be indexed. If we had not partitioned the events, we would have easily exceeded the 32KB limit.

Now we were able to index the attributes in the documents and access the values without the source() method in the script. This reduced the time by 95%.

After that, to be able to parse the attributes with regular expressions, we enabled regular expression support in our ElasticSearch configuration and wrote the script that finds the funnel. I am not showing the details of the script, but it is simply a linear search over the related event attributes.
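The core of that check, rendered in Python for readability (the real thing runs inside an ElasticSearch script): take the first realization of one step and look for any later realization of the next step.

    # Funnel transition check on two dash-joined timestamp strings, exactly
    # the kind of linear scan the script performs, and only on the related
    # attributes.
    def transition_completed(step1_times: str, step2_times: str) -> bool:
        first = min(int(t) for t in step1_times.split("-"))
        return any(int(t) > first for t in step2_times.split("-"))

    print(transition_completed("1534412400-1534415000", "1534412500"))  # True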

These small structure changes reduced the query time to one to two seconds. This may be counted as real-time or near real-time.

The drawback is that we cannot answer a funnel query such as “after getting a push notification, which users are directed to any page”. We lost the ability to query “any page”.

But when we think about the real use cases, this one does not seem reasonable. Why would a customer want to know the transition to just any page? Useless.

Conclusions

Finding the correct technology may require spending time and money on attempts that end up in the trash bin. We spent weeks on time-series and graph databases. Even though this strains the budget, it is a great technology assessment for future projects. For instance, we now know to check the linearity of the data before going with a graph DB.

Some compromises on the requirements must be refused. We need to test our limits to succeed. If we hadn’t tried to reach real-time, we would have stayed in our comfort zone.

Some limitations on the features must be accepted to unlock the requirements. In our case: “events must occur within the same day for a specific user”, so day transitions will fail to be detected. A second limitation was that “any page” queries will not be answered.

Each technology has limitations that follow from its architecture, and that architecture will shape your thinking process. In our case, the 32KB limitation was the key point.

A data structure may adapt to different technologies. In our case, partitioning by user, by event name, and by time resulted in the same performance enhancements.

I thank Can Elmas and Commencis for not making us feel under budget pressure throughout our attempts.

You can also find me on Twitter: www.twitter.com/talhaocakci


I may be explaining some things incorrectly in my posts. In fact, I am definitely explaining something incorrectly.