In my previous post, I talk about Google Cloud Storage and CloudSQL and today I want to talk about BigQuery. If you remember, my current project is involving those 3 services and you might wonder why don’t I make a single article talking about those 3 services altogether. Well, it is simply because if I make a single article about BigQuery, Google Cloud Storage and CloudSQL, it’s gonna be 15 minutes or more article, which honestly I am to lazy to read that kind of article on Medium. So I figure that I need to make separate article for BigQuery. I mean I’m sure that I’m not the only one who confused about what is so useful from a database that takes at least 2 second to process simple select query, right? To make it more practical I will also include a simple tutorial on how you access BigQuery using NodeJS. If you more interested in CloudSQL and Google Cloud Storage you can read my previous article here.
BigQuery is not free
Yes let’s face the hardest truth from BigQuery which is it’s not free. Every time you run query, Google will charge you based on how big the data involved in that query. For further detail you can see this page. But basically because it is not free you need to make sure your query is efficient to avoid unnecessary charge. The simplest rule of thumb is that never use * to select the data you need unless it’s truly necessary. Please do habituate yourself to use query like “SELECT column1,column3,… FROM table” since it reduce the amount of data that your query need to access in order to finish the request.
BigQuery is not a transactional database
Transactional database is just a fancy term for database like MySQL, PostgreSQL, MongoDB etc that we use store and access data in live production. Actually, it is possible to use BigQuery as a transactional database but you need a very patient user that doesn’t complain about how your apps is keep wasting their precious 2 second every time it load data. Here is a simple query that I run on BigQuery dashboard:
As you can see it takes 1.6 second to run such a simple query on a 88.2 KB table with 481 rows. You need to remember that it’s gonna be worst if you access it on NodeJS (2s) or PHP(6s). So, yeah, good luck on finding such a patient client.
Therefore if you want to use BigQuery, you better prepare other database as a transactional database to store the result of calculation or process that you get from BigQuery. That exactly why in the current project I use CloudSQL but you can use any database actually.
BigQuery is for big data
I think you can guess from its name “BigQuery” that the Big part here is talking about Big Data. But the question is how big Big Data is and what benefit we can get when we use BigQuery? Well, for the first question it is the ultimate question for every data science and the answer is different for each person and environment. But for BigQuery ask a simple question to yourself such as Do you need to store for a long time and do a complex operation on 10GB+ data? The real benefit of BigQuery is how it doesn’t kill itself when you run a query on such a big data. I mean just try to aggregate 10GB+ data on MySQL and see what happen.
BigQuery is for complex query
If you use BigQuery just for simple SELECT query, you truly waste the potential of BigQuery and more importantly you really waste your money. I mean look this example of how you create Neural Network from pure BigQuery query. I’m not saying that it is efficient and better to create Neural Network using only BigQuery but at least you can get how powerful the query can be. BigQuery offer various function that will help you implement complex calculation easier. For further detail you can see this page. The best part of this, is that for such a complex query, BigQuery performance is far outcompete conventional database. Yes indeed for simple query BigQuery is slower but the bigger the data is and the more complex the query you run, BigQuery will show its true power compared to conventional database.
BigQuery is “fully managed”
When we talk about what fully managed is actually means we will not find any universal definition since it is so broad. But in BigQuery “fully managed” means that developer doesn’t need to take care of the infrastructure and database administration. You don’t need to think about how to deploy the cluster, how to configure compression and encryption or how to setup your disk when you need to scale your data. As what mention here, scaling in BigQuery is as simple as a matter of using BigQuery more. It is certainly good deal for developer because now we can focus more on building the query. For your information, Google also doing improvement and maintenance for BigQuery infrastructure that user use every week, but you don’t need to worry about the downtime because you will not even notice when they do the maintenance. I also don’t know how they do it, so let’s just say that it’s one of those magic thing that Google Engineer can do.
Send query from NodeJS to BigQuery
Now it’s time for us to do some practical implementation by using NodeJS to send a query to NodeJS. First, you need to get the JSON credential file from GCP dashboard in APIs and Services dashboard, for further explanation you can visit my previous post. After that you need to install node package for BigQuery called @google-cloud/bigquery using this following command:
npm install --save @google-cloud/bigquery
There are 2 method to access BigQuery from NodeJS which are synchronous and asynchronous. The difference is that for asynchronous method NodeJS will create a job on the GCP while for synchronous NodeJS will access the BigQuery directly so it need to wait until the query is processed. This following code is the implementation of both synchronous and asynchronous access to BigQuery from NodeJS:
Before we use the code above we need to prepare the table that we want to access. For now let’s just copy the table from public data that google provide. For this article I will use table called bigquery-public-data:hacker_news.full. In case you don’t know how to copy the table, you just need to go to the public data table that you want to copy and click “copy table” then define where to copy the table as follows:
Here I already create dataset called machine_learning beforehand, so in case you have no dataset yet, you need to create one.
After you copy the table, let’s get back to Node.js and let me show you how to send simple SELECT query to BigQuery using this following code:
After you run the code above, you will see the following result:
As you can see that for asynchronous access, Node.js will create a job and will display the result after the job is completed. The result from querying a BigQuery table will be return as an array of object in Node.js. If you want to run the synchronous access you just need to change asyncQuery into syncQuery. The main difference is like what I’ve mention previously, for synchronous access doesn’t create a job for BigQuery so it needs to wait and block the next process until the Node.js get the return from BigQuery.