Analytics for billions of records — Part 1

fritz · Life At Moka · Mar 30, 2020

At Moka, we believe that data is one of the most important resources for empowering businesses to grow. Thousands of our merchant partners rely on our platform and ecosystem to run their businesses every single day. As a result, we process several million transactions each day.

We are fortunate to have the opportunity to work with billions of records every day. While the number itself is already massive and a headache for any data engineer, doing analytics on top of it is even scarier. We need a robust data platform to enable analytics at that scale.

This series of posts tells the story of our journey to solve these problems. Please grab a cup of coffee and enjoy.

Billions of things

1 billion is a massive number. But how big is it, really, in everyday terms? To give some context and a sense of scale, let’s do a quick exercise.

1 billion seconds is approximately 31.7 years. Less than 1 billion seconds ago, in 1989, Sir Tim Berners-Lee invented the World Wide Web.

1 billion minutes is approximately 1,901 years. Construction of the Pantheon began around 1 billion minutes ago, in the age of the Roman Empire, around 119 AD.

1 billion hours is approximately 114,080 years. If we go back 1 billion hours, we are still in the late Ice Age, with mammoths roaming the earth.

1 billion meters is almost 3 times the distance from the Earth to the Moon.

1 billion kilometers is more than six times the distance from the Earth to the Sun.

1 billion dollars. This viral video gives enough context on just how much 1 billion dollars is.

Analysing 1 billion records

Enough with the money. The facts about 1 billion are quite fascinating, aren’t they? But what if we actually have 1 billion records of data and want to do some analysis on them? What should we do to complete the task?

One could argue that to accomplish this we should implement a full-blown Big Data technology stack. That is not entirely wrong, but we would rather avoid it if we can. Honestly, we are not ready for that commitment yet. Big Data, which has become synonymous with Hadoop and its rich ecosystem, could surely help, but it comes with a price and a whole lot of responsibility to run it properly, especially for a mission-critical application in production.

But what if I told you there is a way we could still do analysis on 1 billion records without breaking the bank?

From Russia with love

ClickHouse is a powerful analytical database engine built for large-scale analytics on Big Data. It was originally developed for Yandex.Metrica, one of the largest web analytics platforms in the world, which stores more than 13 trillion records and ingests over 20 billion events each day. If billions are not cool enough, try trillions. ClickHouse was later released as an open-source project under the Apache 2.0 license.

Many companies around the world take advantage of what ClickHouse offers and have successfully run it in production for years, powering mission-critical applications that are very hungry for data.

“ClickHouse не тормозит.”

The quote comes from the ClickHouse core developers and literally translates to “ClickHouse does not brake”, in other words, it does not slow down.

ClickHouse is one of the fastest analytical databases available on the market today, and many independent benchmarks back that up. It even shows comparable performance to a GPU database powered by thousands of cores¹.

What makes ClickHouse so great is its true column-oriented storage combined with parallel, vectorized query execution. This enables ClickHouse to analyse and crunch not just billions but trillions of records at the speed of thought. This feature deserves a blog post of its own, and we will go deeper into that topic in another post. Stay tuned.
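To make this a bit more concrete: because storage is columnar, a query only reads the columns it actually references. Below is a minimal sketch against the trips_mergetree table we load later in this post; treat the comments as rough illustration, since the exact bytes read depend on the column types and compression.

-- Reads only the narrow passenger_count column, so very few bytes per row are scanned.
SELECT avg(passenger_count) FROM trips_mergetree

-- Scans the same 1.3 billion rows, but reads more bytes because it touches two wider numeric columns.
SELECT avg(fare_amount + tip_amount) FROM trips_mergetree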

Please check out the complete feature list on the official ClickHouse site here.

Without further ado, let’s get our hands dirty and try it ourselves by doing analysis on billions of records. We are about to put the ClickHouse claim to the test.

Setting up ClickHouse

A vanilla ClickHouse install is about as easy as installing a database server can get. You only need to run fewer than 10 commands to get it up and running. Any Linux server will do, CentOS in our test, and you are good to go.

$ sudo yum install yum-utils
$ sudo rpm --import https://repo.yandex.ru/clickhouse/CLICKHOUSE-KEY.GPG
$ sudo yum-config-manager --add-repo https://repo.yandex.ru/clickhouse/rpm/stable/x86_64
$ sudo yum install clickhouse-server clickhouse-client
$ sudo service clickhouse-server start
$ clickhouse-client

If you think the steps above are still too many, try running it on Docker.

$ docker run -d --name some-clickhouse-server --ulimit nofile=262144:262144 yandex/clickhouse-server
$ docker run -it --rm --link some-clickhouse-server:clickhouse-server yandex/clickhouse-client --host clickhouse-server
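Once clickhouse-client is connected, a trivial query makes a handy (optional) sanity check that the server responds:

SELECT version()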

Our ClickHouse server is up and running. We are ready to load some data.

Preparing the dataset

The sample dataset we will use for our test drive is the famous New York Taxi dataset. It has around 1.3 billion records in total.

The raw New York Taxi CSV data is around 250GB in total. Please make sure you have enough storage to process and store it; 500GB is a good starting point.

We thoroughly followed the detailed instructions on the ClickHouse docs page on how to load the taxi data into our ClickHouse database.

… and 2 days later, the PostgreSQL dump was ready to be loaded into ClickHouse. Loading and preparing the data from that dump took around another 3 hours before it was ready to use.
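To give a rough idea of the target table, here is a minimal sketch of a MergeTree table in the spirit of the one the docs create. The column list is heavily abbreviated; the real schema carries dozens more columns, so follow the docs for the actual DDL and load steps.

CREATE TABLE trips_mergetree
(
    pickup_date Date,
    pickup_datetime DateTime,
    passenger_count UInt8,
    fare_amount Float32,
    tip_amount Float32
    -- ... plus many more trip columns in the real schema
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(pickup_date)
ORDER BY pickup_datetime

The data exported from PostgreSQL is then inserted into this table.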

If you don’t have that much time, you can use the prepared dataset instead. It still takes quite a while to download 130GB of data, but that is a lot faster than 2 days.

$ curl -O https://clickhouse-datasets.s3.yandex.net/trips_mergetree/partitions/trips_mergetree.tar
$ sudo tar xvf trips_mergetree.tar -C /var/lib/clickhouse
$ sudo service clickhouse-server restart

Finally, all the data is loaded into ClickHouse, and we are ready to do analysis on billions of records.

Fasten your seatbelt

The New York Taxi data is sitting nicely in our ClickHouse server. But how many records does our dataset actually contain? Are you as intrigued as we were? Let’s find out.

FYI, we are using an EC2 m4.xlarge instance in AWS as our testbed. It has 4 vCPUs and 16GB of RAM with 500GB of SSD EBS storage. Your results may vary from ours.

SELECT count()
FROM trips_mergetree
┌────count()─┐
│ 1298979494 │
└────────────┘
1 rows in set. Elapsed: 0.002 sec.

So the dataset really does contain almost 1.3 billion records, 1,298,979,494 to be exact, and no, you are not reading it wrong: counting all of them took ClickHouse just 0.002 seconds. A bare count() like this is answered almost entirely from table metadata, since ClickHouse keeps a row count for every data part, which is why it is practically instant.

Let’s try a real query: how about finding the average passenger count across all the data?

SELECT avg(passenger_count)
FROM trips_mergetree
┌─avg(passenger_count)─┐
│   1.6672353351253133 │
└──────────────────────┘
1 rows in set. Elapsed: 1.573 sec. Processed 1.30 billion rows, 1.30 GB (825.91 million rows/s., 825.91 MB/s.)

It is amazing that averaging over 1.3 billion records takes ClickHouse less than 2 seconds. Notice that only about 1.30 GB was read, roughly one byte per row, because the query touches nothing but the passenger_count column.

Once upon a time, your boss asks you to find out the average tip percentage for trips in 2015, broken down by month, because he or she wants to see whether there is a trend in tipping. While your boss is still explaining the request, you are already writing the SQL and firing it at your ClickHouse server.

SELECT
    toMonth(pickup_date) AS month,
    round(avg(fare_amount), 2) AS avg_fare,
    round(avg(tip_amount), 2) AS avg_tip,
    round((avg_tip / avg_fare) * 100, 2) AS tip_percent
FROM trips_mergetree
WHERE toYear(pickup_date) = 2015
GROUP BY month
┌─month─┬─avg_fare─┬─avg_tip─┬─tip_percent─┐
│     1 │    11.89 │    1.78 │       14.97 │
│     2 │    12.33 │    1.61 │       13.06 │
│     3 │    12.65 │    1.63 │       12.89 │
│     4 │    12.82 │    1.64 │       12.79 │
│     5 │     13.2 │    1.67 │       12.65 │
│     6 │    13.12 │    1.68 │        12.8 │
│     7 │    12.96 │    1.63 │       12.58 │
│     8 │    12.98 │    1.61 │        12.4 │
│     9 │    13.22 │    1.67 │       12.63 │
│    10 │    13.19 │    1.71 │       12.96 │
│    11 │    13.05 │    1.71 │        13.1 │
│    12 │    13.12 │    1.68 │        12.8 │
└───────┴──────────┴─────────┴─────────────┘
12 rows in set. Elapsed: 1.863 sec. Processed 165.35 million rows, 1.65 GB (88.76 million rows/s., 887.62 MB/s.)

While your boss is still wondering whether it is reasonable to ask for the result by the end of the day, less than 2 seconds later you report that the average tip works out to more than 10% of the fare in every single month, and you email the details. Your boss is amazed by how fast you got the answer. Your boss is happy, you get a raise, you are happy. Everybody is happy. The end.
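One small side note on the query above: GROUP BY does not guarantee the order of the result rows, so if you want the months to always come back in calendar order, it is worth appending an explicit ORDER BY, for example:

SELECT
    toMonth(pickup_date) AS month,
    round((avg(tip_amount) / avg(fare_amount)) * 100, 2) AS tip_percent
FROM trips_mergetree
WHERE toYear(pickup_date) = 2015
GROUP BY month
ORDER BY month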

Closing

Based on our simple test, we proved that ClickHouse really does not slow down. It crunches billions of test records remarkably fast, opening up a way to do analytics on billions of records at the speed of thought. It is also relatively easy to get started with compared to other analytical database engines.

At Moka, we have incorporated this beast to power the analytics features in our product, serving analytics to thousands of our merchants.

Disclaimer

There are many options out there for solving the problem of analysing billions of records, as we described in this blog. ClickHouse solves our problem well and suits our use case. Your use case may differ and may call for a different solution. Please always refer to the ClickHouse docs for more detail.

Reference

If you are interested in joining a team that solves problems like this, dealing with massive datasets and large-scale data platforms, do not hesitate to drop us a message. Please share this blog if you find it useful, and thank you for reading.

career.mokapos.com
