Analyzing 1 Billion of User Events with ViyaDB

Published in

ViyaDB

4 min readFeb 26, 2018

ViyaDB is rapidly developing open source in-memory OLAP database that facilitates ad-hoc queries on unordered data (in contrast to time-series data). This short post explains how to load 1 billion of user events into ViyaDB cluster, and run some SQL queries on them.

But first, a couple of words about the anatomy of ViyaDB cluster. ViyaDB implements “shared nothing” architecture, where every worker holds a specific partition replica, and all the workers are coordinated using Consul.

This can be presented as follows:

To read more about ViyaDB clustering, see this page: http://viyadb.com/clustering

Installing ViyaDB Cluster

First, you must have Consul installed somewhere.

Put the following JSON under “viyadb/clusters/cluster001/config” key:

{
  "replication_factor": 1,
  "workers": 32,
  "tables": ["events"]
}

Put the following JSON under “viyadb/tables/events/config” key:

{
  "name": "events",
  "dimensions": [
    {"name": "app_id"},
    {"name": "user_id", "type": "uint"},
    {
      "name": "event_time", "type": "time",
      "format": "millis", "granularity": "day"
    },
    {"name": "country"},
    {"name": "city"},
    {"name": "device_type"},
    {"name": "device_vendor"},
    {"name": "ad_network"},
    {"name": "campaign"},
    {"name": "site_id"},
    {"name": "event_type"},
    {"name": "event_name"},
    {"name": "organic", "cardinality": 2},
    {"name": "days_from_install", "type": "ushort"}
  ],
  "metrics": [
    {"name": "revenue" , "type": "double_sum"},
    {"name": "users", "type": "bitset", "field": "user_id", "max": 4294967295},
    {"name": "count" , "type": "count"}
  ],
  "partitioning": {
    "columns": [
      "app_id"
    ]
  }
}

Then, let’s launch four AWS EC2 instances of type r4.2xlarge. This instance type has 8 vCPU, which will allow us to have 8 workers per host (32 workers in total). This will translate into having 32 partitions if using a single replica per partition.

ViyaDB is built against g++-7, that’s why it’s easier to use one of Ubuntu Artful (17.10) AMI provided by Canonical. Make sure, all TCP ports of ViyaDB nodes are open to each other. Also, make sure there’s enough disk space for downloading and extracting several tens of gigabytes of JSON files representing user events.

Once all four machines are running, proceed with the following steps on every one of the nodes.

Install g++-7 by running the following commands:

apt-get update
apt-get install g++-7
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-7 60 --slave /usr/bin/g++ g++ /usr/bin/g++-7

Create file /etc/viyadb/store.json with the following contents (replace consul-host with real Consul hostname):

{
  "supervise": true,
  "workers": 8,
  "cluster_id": "cluster001",
  "consul_url": "http://consul-host:8500",
  "state_dir": "/var/lib/viyadb"
}

Download ViyaDB package, and extract it:

curl -sS -L https://github.com/viyadb/viyadb/releases/download/v1.0.0-beta/viyadb-1.0.0-beta-linux-x86_64.tgz | tar -zxf -

Run ViyaDB instance:

cd /opt/viyadb-1.0.0-beta-linux-x86_64
./bin/viyadb /etc/viyadb/store.json

Loading Data

There’s utility under bin/ directory of ViyaDB package called vsql, which can be used for sending SQL queries to ViyaDB cluster.

Open the SQL shell utility, and connect it to one of ViyaDB hosts:

/opt/viyadb-1.0.0-beta-linux-x86_64/bin/vsql viyadb-host 5555

We have 1 billion of user events stored in gzipped TSV format in AWS S3. To load them, issue the following query:

COPY events (app_id, user_id, event_time, country, city, device_type, device_vendor, ad_network, campaign, site_id, event_type, event_name, organic, days_from_install, revenue, count
) FROM 's3://viyadb-test/events/input/'

Querying

Disclaimer: all queries were recorded when running for a second time, to exclude query compilation time from the total query running time.

Count different event types for a given list of applications:

ViyaDB> SELECT event_type,count FROM events WHERE app_id IN ('com.skype.raider', 'com.adobe.psmobile', 'com.dropbox.android', 'com.ego360.flatstomach') ORDER BY event_type DESCevent_type   count
----------   -----
session      117927202
install      263444
inappevent   200466796
impression   58431
click        297Time taken: 0.530 secs

Find 5 top cities in USA by install count for a specific application:

ViyaDB> SELECT city,count FROM events WHERE app_id='com.dropbox.android' AND country='US' AND event_type='install' ORDER BY count DESC LIMIT 5;city           count
----           -----
Clinton        28
Farmington     20
Madison        18
Oxford         18
Highland Park  18Time taken: 0.171 secs

Find top 10 ad networks for a specific application:

ViyaDB> SELECT ad_network, count FROM events WHERE app_id='kik.android' AND event_type='install' AND ad_network <> '' ORDER BY count DESC LIMIT 10;ad_network    count
----------    -----
Twitter       1257
Facebook      1089
Google        904
jiva_network  96
yieldlove     79
i-movad       66
barons_media  57
cpmob         50
branovate     35
somimedia     34Time taken: 0.089 secs

Find user retention (number of distinct user sessions) for the first week, per three major ad networks:

ViyaDB> SELECT ad_network, days_from_install, users FROM events WHERE app_id='kik.android' AND event_time BETWEEN '2015-01-01' AND '2015-01-31' AND event_type='session' AND ad_network IN('Facebook', 'Google', 'Twitter') AND days_from_install < 8 ORDER BY days_from_install, users DESCad_network  days_from_install  users
----------  -----------------  -----
Twitter     0                  33
Google      0                  20
Facebook    0                  14
Twitter     1                  31
Google      1                  18
Facebook    1                  13
Twitter     2                  30
Google      2                  17
Facebook    2                  12
Twitter     3                  29
Google      3                  14
Facebook    3                  11
Twitter     4                  27
Google      4                  13
Facebook    4                  10
Twitter     5                  26
Google      5                  13
Facebook    5                  10
Twitter     6                  23
Google      6                  12
Facebook    6                  9
Twitter     7                  22
Google      7                  12
Facebook    7                  9Time taken: 0.033 secs

Conclusion

As you can see, it’s pretty easy to set up ViyaDB cluster, and use it for calculating different aggregations over your data. Visit http://viyadb.com for more examples and documentation.

Thanks!

Analyzing 1 Billion of User Events with ViyaDB

Installing ViyaDB Cluster

Loading Data

Querying

Conclusion

Written by Michael Spector