Analyzing Mobile Users Activity with ViyaDB

Michael Spector
Oct 19, 2017 · 7 min read
Image for post
Image for post

Not long ago, I worked on a very interesting and technologically challenging use case. Let’s say you have a stream of events, where each event represents a mobile application user activity. Such activity can start even before an application gets installed on a device, usually, whenever a user is “engaged” with a mobile application by viewing or clicking on an advertisement. A significant event, of course, is the install, which acts as a “starting point” for all the subsequent events that might be generated throughout application usage.

Now, if we group our users by installation times and by the advertisement sources they came from, we can spot some interesting patterns. For example, some users tend to stay with an application (or produce more revenue through it) if they come from a specific source. A particular time of day/week might generate more installs etc. Marketers care about these patterns as they help them identify where to invest their money next time to get the best ROI.

How do all this translate from the marketing talk to a technical challenge? Here are several chalanges that make data engineer’s life less easy:

When it comes to choosing a database capable of handling this kind of workload, there are several conclusions that we can make:

There’s a single writer (usually, events are ingested from multiple sources, then written to some kind of streaming system that optionally pre-aggregates them into small batches from where they can be written into a relevant database instance).

The Solution

We’ve tried several “out of the box” solutions when I originally worked on this problem, but none of them was really great — either because it did more than required (e.g. ACID compliance through write ahead log, which slowed down the ingestion process significantly), or it provided pure analytical performance because it was designed for OLTP workloads from the beginning. From another perspective, re-inventing a wheel can be justified sometimes, especially when custom wheel solves a specific problem better than all the wheels in the world:

Image for post
Image for post

The first implementation was very naive: a vector of records stored in RAM contiguously, which enabled both fast updates and full scans. Even though the filtering operations consisted of a dynamic code full of branches, it was machine-sympathetic enough to give a performance boost by 50% over the most modern “out of the box” solutions.

The next version was written from scratch as part of the open source ecosystem that I’m trying to establish. It has better data modeling (column based vs. row/record based), run-time code generation mechanisms allowing avoiding CPU branch mispredictions and improve cache locality, more supported data types, etc.

Let’s play with it a bit!

Querying

I didn’t have any real data to test the database against, but I wrote a simple simulator that generates mobile application user events. Let’s generate 100 millions of events:

docker run --log-driver=none --rm -ti -e EVENTS_NUMBER=10000000 \ -e OUTPUT_FORMAT=tsv viyadb/events-generator:latest > data.tsv

Now, let’s start a ViyaDB instance, and load all the events:

docker run -p 5000 : 5000 --rm -ti \ -v $(pwd):/tmp/viyadb viyadb/viyadb:latest

To create a table, save table definition in table.json file:

{
"name": "activity",
"dimensions": [
{"name": "app_id"},
{"name": "user_id"},
{
"name": "event_time", "type": "time",
"format": "millis", "granularity": "day"
},
{"name": "country"},
{"name": "city"},
{"name": "device_type"},
{"name": "device_vendor"},
{"name": "ad_network"},
{"name": "campaign"},
{"name": "site_id"},
{"name": "event_type"},
{"name": "event_name"},
{"name": "organic", "cardinality": 2},
{"name": "days_from_install", "type": "ushort"}
],
"metrics": [
{"name": "revenue" , "type": "double_sum"},
{"name": "users", "type": "bitset", "field": "user_id"},
{"name": "count" , "type": "count"}
]
}

To learn more about ViyaDB data modeling please refer to the documentation.

Then send a CREATE TABLE command to the database via HTTP endpoint:

curl -d @table.json http://localhost:5000/tables

To load the data, create a load descriptor, and save it in load.json file:

{
"table": "activity",
"format": "tsv",
"type": "file",
"columns": [
"app_id", "user_id", "event_time", "country", "city",
"device_type", "device_vendor", "ad_network", "campaign",
"site_id", "event_type", "event_name", "organic",
"days_from_install", "revenue"
],
"file": "/tmp/viyadb/data.tsv"
}

Then send a load command to the database:

curl -d @load.json http://localhost:5000/load

After loading all of the 100 millions of events we will find that this single database instance process occupies about 11 GB of RAM, which is not very much for a single partition.

While the data is loading, we can start querying it already. For this purpose, we can use Zeppelin with ViyaDB interpreter installed. Note that first queries take more time because of the compilation step, but all subsequent queries of the same type run significantly faster.

Finding top 10 applications by install count

Image for post
Image for post

Query times: first time — 1.6 secs, second time — 74 ms.

Application event types statistics

Image for post
Image for post

Query times: first time — 1.7 secs, second time 40 ms.

Simple retention query

The following query shows number of unique sessions since install date grouped by ad network:

Image for post
Image for post

Query times: first time — 1.9 secs, second time — 60 ms.

As we can see, the fact that we’re loading new data at about 300K records / sec have very little effect on query performance time.

What about query concurrency? The following chart shows how single database instance attached to one CPU core scales when using different read concurrency level (from 1 to 6):

Image for post
Image for post

Clustering

Creating a clustered solution that makes everyone happy is not trivial, and frankly no generic clustering solution works great for every business. Instead of finding a way to please everyone a business-specific clustering infrastructure can be developed in a more straightforward manner. For this to work we have to have several constraints (more correctly, degrees of freedom) which open up options :

Keeping these points in mind, one shared-nothing approach that can be developed is:

A sample diagram of such approach can be found here. It would be interesting for me to see whether such design be adopted by any company, and what can be improved.

Next Steps

I’m planning to continue evolve the open source ecosystem around ViyaDB. There are three directions for development that will help make it more useful:

I hope this project becomes useful by filling the niche no other solution I foud, really solves. And, as they say in such cases, please come and help the project by providing your opinion or by contributing!


Originally published at https://www.linkedin.com on October 19, 2017.

ViyaDB

ViyaDB Development Blog

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch

Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore

Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store