Solving BigData with BigQuery (1/3): Intro

— Or how I got started with writing technical blogs

The Thai version of this article is also available at Cleverse Blog :)

Managing and operating big data infrastructure sounds like a pain — from the time-consuming installation process and streaming large volumes of data in, to querying those massive data sets. Not to mention the delicate tuning and maintenance required.

Introducing BigQuery, a fabulous tool that we use as our backbone at Cleverse, my workplace, to process and analyze big data.

Cleverse, a tiny startup
with a goal of analyzing enormous amounts of data
Where should we begin?

It began one day when my boss came up with a new data product. He had gained access to a large set of data and set us on a mission to extract meaningful insights from it.

The project began with a team of a few developers: three, including me. Almost everyone had zero experience with any data project, so we took the opportunity to explore many data analytics tools on this dataset.

Photo taken by me :)

I — Dealing with a big dataset

Initially, we started with Elasticsearch, mostly because it can store both structured and unstructured data. It also supported every query we had in mind and provided a means for visualization with Kibana. However, we gave up on it due to rising costs and performance issues.

Afterward, we took a step back to explore alternative SQL/NoSQL solutions: traditional databases, namely MySQL, PostgreSQL, MongoDB, and Redis. To sum it up, these are some of the issues we had with them:

  1. Various performance issues
    Due to their structure, these databases require a vast amount of resources and deliver worse performance as the dataset grows. Without months of preparation or costly infrastructure, the server may fall over under a dozen complex queries. They also got slower and slower as we put in more data, to the point where they were no longer workable. These issues can be mitigated by improving the underlying infrastructure, such as adding RAM, using faster disks, or creating replica sets. But well, at that point, the increasing costs were not worth it.
  2. Expensive to set up, maintain, and monitor
    My time also counts as a cost. It’s no joke when my boss told me the server was down, during his ongoing demonstration, while I was enjoying an overseas trip on my holidays. Not to mention the time I woke up in the middle of the night to monitor the server’s performance and prayed it would be alright because clients overseas were trying it out. There’s also the time needed to study the installation process and best practices for each tool. Then it gets even harder to decide which infrastructure should back them: a high-spec machine may handle almost every use case, but since we wanted to build a product at a reasonable price tier, that was not a good choice. For us, a team of a few developers, these issues were time-consuming and kept us from developing the other parts we actually wanted to build.
  3. In the end, it rejects all my data insertions
    We need to continuously stream large chunks of data into the database all day long, with duplicate rows appearing from time to time. At some point the database could not take in any more data, and we had to slow the pipeline down, or even give up on some data. Initially, we tried to solve this by increasing the specs of the machines hosting the databases and by sharding, splitting the database into smaller instances to spread the load. Then one day the data got so large that we couldn’t keep the database functioning at all.
  4. Predicting the costs (and the usage loads) is hard
    Before launching any new feature, we have to estimate the costs and loads to see whether it’s worth it. But it’s hard to predict those beforehand — not without historical usage statistics, an experienced data engineer, or a Pokémon. At the time I was given only two pieces of information: the characteristics of the queries and the parts of the data they touch. It took me a while to measure those loads and come up with the usage costs for just one feature.

After months of trial and error, I gave up and asked Google.

I wish Google Assistant could help me like this in real life. sigh…

It then introduced me to a new tool: Google BigQuery.

II — Introducing Google BigQuery

A fast, highly scalable, cost-effective and fully-managed enterprise data warehouse for analytics at any scale

Google BigQuery is an enterprise data warehouse built on top of the Google Cloud Platform. It’s serverless: no installation required. BigQuery works great with data of any size, from a tiny Excel spreadsheet to petabyte-scale datasets, and can execute a complex query over that data within a few seconds.

Usage

BigQuery supports SQL-like queries, which makes it easy to use, just like any basic database we all know. It’s accessible via its web UI, command-line tool, or client libraries (available for C#, Go, Java, Node.js, PHP, Python, and Ruby). On unsupported platforms, we can use it by sending JSON requests to its REST API.
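
For example, here’s a minimal sketch of running a query through the Python client library. The project, dataset, and table names below are just placeholders, not real ones from our project:

    # Minimal sketch: run a query with the google-cloud-bigquery client library.
    # "my-project", "my_dataset", and "events" are placeholder names.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # picks up your GCP credentials

    sql = """
        SELECT name, COUNT(*) AS total
        FROM `my-project.my_dataset.events`
        GROUP BY name
        ORDER BY total DESC
        LIMIT 10
    """

    query_job = client.query(sql)   # starts the query job
    for row in query_job.result():  # waits for the job to finish, then iterates rows
        print(row["name"], row["total"])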

Serverless — Leave it to Google

BigQuery is built on top of Google’s Dremel technology and the Google Cloud Platform. It requires virtually no installation, maintenance, or monitoring, and there’s no need to hire additional people to take care of those, which makes it a good choice for a small team like ours. Furthermore, it scales to your data automatically with no additional cost for these features: performance stays the same as our data grows larger and larger.

Connectivity and Integration

BigQuery also integrates well with other analytics tools. For instance, Hadoop and Spark can read or write data directly from BigQuery using Cloud Dataproc and Cloud Dataflow. Google also provides automatic data transfers from other Google services, such as DoubleClick, AdWords, and YouTube Analytics. It also supports popular BI tools like Tableau, MicroStrategy, Looker, and Google Data Studio with no further setup required, so you can create reports and dashboards straight away.

When to use BigQuery?

Due to its implementation, BigQuery supports DML actions, but it isn't suitable for update/delete operations, so it shouldn't be used as an online transaction processing (OLTP) store. On the other hand, BigQuery works great as a data warehouse (OLAP) for historical data, such as event logs, analytical data, or time series data. ( — Further reading)
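
Since the data is append-only in practice, our interactions with it are mostly “stream rows in, query them later”. A minimal sketch of streaming event-log rows in with the Python client library (the table and field names here are made up, and the table is assumed to already exist with a matching schema):

    # Minimal sketch: append event-log rows with the streaming insert API.
    # Table and field names are hypothetical; the table must already exist.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")
    table_id = "my-project.my_dataset.event_logs"

    rows = [
        {"event": "page_view", "user_id": "u123", "ts": "2019-01-01T00:00:00Z"},
        {"event": "click", "user_id": "u456", "ts": "2019-01-01T00:00:05Z"},
    ]

    errors = client.insert_rows_json(table_id, rows)  # streaming insert
    if errors:
        print("Some rows were rejected:", errors)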

Based on my experience,
BigQuery ingests all these large data streams with virtually no constraints,
computes all my complex queries within a few seconds,
and accepts parallel requests with ease.
We’ve encountered almost no failures since we began using BigQuery!
In short, it’s fabulous!

Not to mention its costs… whoops…

III — BigQuery: cost-effective?

BigQuery pricing data from https://cloud.google.com/bigquery/pricing

One of the things I love about BigQuery is its pricing model, which is pay-as-you-go. You only pay for as much (data) as you use, with no minimums. There’s also a free tier available for everyone. Its costs are mainly based on these criteria:

  • Amount of data in storage — $0.02 per GB, per month
  • Amount of data inserted (streaming inserts) — $0.05 per GB
  • Total amount of data processed by each query — $5 per TB
    (In my case, 90%+ of the costs come from this)

What’s so great about this? It lets me predict the costs easily based on the two things I mentioned earlier: the characteristics of the queries and the parts of the data they touch. No need to think about hard things like server costs or additional infrastructure monitoring costs. When a new feature arrives, I can estimate its price just by looking at the queries and calculating how much data they need to scan.
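
You can even ask BigQuery itself before running anything: a dry run reports how many bytes a query would scan without executing it. A rough sketch with the Python client library (the table name and the $5/TB on-demand rate are assumptions for illustration):

    # Rough sketch: estimate a query's cost up front with a dry run.
    # The table name and the $5/TB rate are assumptions for illustration.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

    job = client.query(
        "SELECT user_id, COUNT(*) FROM `my-project.my_dataset.event_logs` GROUP BY user_id",
        job_config=job_config,
    )

    tb_scanned = job.total_bytes_processed / 1e12
    print(f"~{tb_scanned:.3f} TB scanned, ~${tb_scanned * 5:.2f} per run")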

This pricing model makes BigQuery a perfect choice when you have large datasets, don’t want to spend much on monitoring and management, and need the system up and running fast.

BigQuery may seem like a cheap and perfect tool, but wait,

Based on those numbers, it may cost you up to $5,000 if you run 2,000 queries over 500 GB of data each: 2,000 × 0.5 TB × $5/TB ≈ $5,000!

Well, there’s a workaround to lower the costs, possibly down to just $50, which I’ll talk about in the later parts of this series :)

… to be continued in:

- Part 2/3 : BigQuery usage, constraints, and use cases
- Part 3/3 : BigQuery costs optimization

--


Sarin Achawaranont
Cleverse, a Web3 Focused Venture Builder

Software engineer @ Cleverse, Thailand. Also a machine learning enthusiast and photographer.